A repository of ready-to-run Spark jobs and notebooks.
This setup lets you run PySpark notebooks/jobs directly inside VS Code, connected to the Spark cluster. You get an isolated environment without needing to manually install Python, PySpark, or Jupyter locally.
Make sure you have the following installed:
- Docker
- Docker Compose
- VS Code
- Dev Containers extension (optional, for VS Code)
project-root/
├── .devcontainer/
│   └── devcontainer.json
├── docker-compose.yml
├── requirements.txt
├── jobs/
│   └── example_job.py
└── notebooks/
    └── example_notebook.ipynb
- jobs/: Python scripts using PySpark
- notebooks/: Jupyter notebooks
- requirements.txt: Python dependencies (shared by all jobs and notebooks)
Using the VS Code Dev Container:
- Open the project in VS Code
    code .
- Reopen in Dev Container:
  Press fn + F1 → search “Dev Containers: Rebuild and Reopen in Container”.
  VS Code will build the container based on docker-compose.yml and devcontainer.json.
- Run PySpark jobs
    python /home/jovyan/jobs/example_job.py
- Run notebooks
  - Open any .ipynb file in notebooks/.
  - The pre-configured kernel has PySpark ready, so you can run cells directly.
  - Spark logs are accessible in the VS Code output panel and in the Spark web UI.
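As an illustration of what a script under jobs/ (or a notebook cell) might contain, here is a minimal sketch. It is not the actual contents of example_job.py: the helper `word_lengths`, the app name, and the sample data are all hypothetical. It also assumes that the container's Spark configuration (or Spark's local default) supplies the master URL, so none is set in code.

```python
"""Minimal PySpark job sketch (hypothetical; not the repo's example_job.py)."""


def word_lengths(words):
    """Pair each word with its length; plain Python, so it works without Spark."""
    return [(w, len(w)) for w in words]


def main():
    # PySpark lives in the container, not necessarily on the host,
    # so fail with a hint instead of a traceback when it is missing.
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("PySpark is not installed here; run this inside the container.")
        return

    # Assumption: no master URL is set; the environment's Spark config decides.
    spark = SparkSession.builder.appName("example_job_sketch").getOrCreate()
    try:
        rows = word_lengths(["spark", "jupyter", "docker"])
        df = spark.createDataFrame(rows, ["word", "length"])
        df.orderBy("length").show()
    finally:
        # Always release the session, even if the job fails.
        spark.stop()


if __name__ == "__main__":
    main()
```

Keeping the data preparation in a plain function makes it easy to check without starting a Spark session.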
Using Docker Compose directly:
- Build and start containers
    docker-compose up -d
  This sets up:
  - Spark master + Spark history server
  - Jupyter notebook/lab
  - Mounts for jobs, notebooks, and logs
- Access Jupyter Lab
  - Open your browser: http://localhost:8888
  - Use the token specified in docker-compose.yml (e.g., yourtoken123).
- Run PySpark scripts inside the container
    docker exec -it pyspark bash
    python /home/jovyan/jobs/example_job.py
- Spark UI
  - Spark Master UI: http://localhost:8080
  - Spark Application UI: http://localhost:4040
  - History Server: http://localhost:18080
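To confirm from the host that the services came up, a small check like the following may help. It is a sketch using only the Python standard library: the port numbers come from the URLs listed above, while the function names and display labels are made up for this example.

```python
import socket

# Ports taken from the URLs in this README; labels are for display only.
SERVICES = {
    "Jupyter Lab": 8888,
    "Spark Master UI": 8080,
    "Spark Application UI": 4040,
    "Spark History Server": 18080,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host="localhost"):
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name:<22} {'up' if up else 'not reachable'}")
```

The Spark Application UI on port 4040 is served only while a Spark application is running, so "not reachable" there is expected when no job or notebook session is active.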
Edit requirements.txt with any packages you need, e.g.:
pandas
numpy
matplotlib
findspark
Then install the new dependencies:
- If using Dev Container:
  Press fn + F1 → search “Dev Containers: Rebuild and Reopen in Container”
- If using Docker Compose directly:
    docker exec -it jupyter pip install -r /home/jovyan/requirements.txt
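After installing, you may want to verify that everything listed in requirements.txt is actually present in the container's Python environment. The sketch below uses the standard-library `importlib.metadata`. The helper names are hypothetical, and the parsing is deliberately simple: it handles plain names, version specifiers, extras, and comments, not the full requirements syntax.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare distribution names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        # Keep the leading name; extras and version specifiers are ignored.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(lines):
    """Return the names that have no installed distribution in this environment."""
    missing = []
    for name in requirement_names(lines):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path as used in the pip install command above.
    path = "/home/jovyan/requirements.txt"
    try:
        with open(path) as fh:
            print("Missing:", missing_packages(fh) or "none")
    except OSError:
        print(f"Could not read {path}; run this inside the container.")
```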