Example of simple ETL job
The job transfers data from Postgres to Redshift.
- Docker
- Docker Compose
- Redshift
- S3

- Create bucket etl (or set your bucket name in config.py)
- Fill in etl.env with your AWS and Redshift credentials
- Run make deploy
To execute the whole job, run make run_etl
The ETL job consists of the following steps, each of which can be triggered separately:
- Creating tables in source database (postgres): make create_tables
- Seeding source database with random data (seed size can be adjusted in config.py): make seed_db
- Exporting data from source database to local csv file: make export_csv
- Uploading file to s3 bucket: make upload_s3
- Copying data from s3 to Redshift: make copy_rs
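
The seed, export, and copy steps can be illustrated with a minimal Python sketch. This is not the repository's code. The function names, the (id, name, rating) column layout, and the use of an IAM role in the COPY statement are assumptions made for illustration only; the actual job reads its credentials from etl.env and may authenticate differently.

```python
import csv
import io
import random
import string


def make_random_rows(n, seed=None):
    """Generate n (id, name, rating) rows; the column layout is hypothetical."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        rows.append((i, name, round(rng.uniform(1, 5), 1)))
    return rows


def rows_to_csv(rows, header):
    """Serialize rows to CSV text with the header as the first line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_copy_statement(table, bucket, key, iam_role):
    """Build a Redshift COPY statement that loads a CSV file from S3.

    IGNOREHEADER 1 skips the header line written by rows_to_csv.
    """
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
```

In a real run, the CSV text would be written to a local file, uploaded to the bucket (for example with boto3's upload_file), and the COPY statement executed over a Redshift connection.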
Simple example of word count job using PySpark
The easiest way to demonstrate running a PySpark script is to run it in a PySpark Docker container with Jupyter Notebook.
- Start the container: make pyspark
- Go to http://127.0.0.1:8888, followed by the token shown in the terminal instructions
- Open the notebook work/wordcount.ipynb
- Copy the input data file into the ./pyspark folder
- Run the code in the notebook
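
For orientation, a word count in PySpark typically looks like the sketch below. It is not the contents of wordcount.ipynb. The function names and the tokenization rule (lowercase, then runs of letters, digits, and apostrophes) are assumptions chosen for illustration.

```python
import re
from operator import add


def tokenize(line):
    """Split a line into lowercase words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(sc, path):
    """Count word occurrences in a text file using the RDD API.

    sc is an existing SparkContext; path is the input file location.
    """
    counts = (
        sc.textFile(path)               # one record per line
        .flatMap(tokenize)              # one record per word
        .map(lambda word: (word, 1))    # pair each word with a count of 1
        .reduceByKey(add)               # sum the counts per word
    )
    return dict(counts.collect())
```

Inside the notebook, a context can be obtained with `from pyspark import SparkContext` and `sc = SparkContext.getOrCreate()`, after which `word_count(sc, "input.txt")` returns a dictionary of counts. The file name here is a placeholder for whatever input file was copied into the ./pyspark folder.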
Directory sql_and_jq contains:
- examples of DDL and a SELECT query for the apps table
- an example of a JSON transform script using jq