Complete replication of the NYC Taxi pipeline on Amazon Web Services with step-by-step instructions.
| GCP Component | AWS Equivalent | Notes |
|---|---|---|
| Cloud Storage (GCS) | S3 | Raw parquet, dashboard JSON |
| BigQuery RawBronze | Glue Catalog + S3 (Parquet) | Or Athena external tables |
| BigQuery CleanSilver | Glue Catalog + S3 (Parquet) | |
| BigQuery PreMlGold | Glue Catalog + S3 (Parquet) | |
| BigQuery PostMlGold | Glue Catalog + S3 (Parquet) | |
| Dataproc (Spark) | EMR or Glue Spark | EMR for full control, Glue for serverless |
| Cloud Composer | MWAA or Step Functions | MWAA for Airflow, Step Functions for simple DAGs |
| Cloud Functions | Lambda | TLC ingestion, export |
| Cloud Run | ECS Fargate / Lambda | ML stages (04a, 04b) |
```text
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              AWS NYC TAXI PIPELINE                               │
└─────────────────────────────────────────────────────────────────────────────────┘
 [TLC URLs]          [S3 Raw]               [Glue/Athena]          [S3 Dashboard]
      │                  │                        │                       │
      ▼                  ▼                        ▼                       ▼
  Lambda/ECS       s3://nyc-taxi-          Bronze → Silver         s3://nyc-taxi-
  (Stage 00)       raw-XXX/                → Gold (EMR/            dashboard-XXX/
                   taxi_type/              Glue Spark)             data/*.json
                   year/file.parquet            │                       │
                                                ▼                       ▼
                                           Glue Catalog           Static Website
                                           + S3 Parquet           + CloudFront
```
- AWS account with billing enabled
- AWS CLI v2 installed and configured (`aws configure`)
- IAM user/role with: `AmazonS3FullAccess`, `AWSGlueConsoleFullAccess`, `AmazonEMRFullAccessPolicy_v2`, `AWSLambda_FullAccess`, `IAMFullAccess` (or equivalent scoped policies)
```bash
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export PROJECT_PREFIX="nyc-taxi"
export RAW_BUCKET="${PROJECT_PREFIX}-raw-${AWS_ACCOUNT_ID}"
export DASHBOARD_BUCKET="${PROJECT_PREFIX}-dashboard-${AWS_ACCOUNT_ID}"
export AIRFLOW_BUCKET="${PROJECT_PREFIX}-airflow-${AWS_ACCOUNT_ID}"
```

```bash
aws s3 mb s3://${RAW_BUCKET} --region $AWS_REGION
aws s3 mb s3://${DASHBOARD_BUCKET} --region $AWS_REGION
aws s3 mb s3://${AIRFLOW_BUCKET} --region $AWS_REGION

# Optional: Enable versioning for raw data
aws s3api put-bucket-versioning --bucket $RAW_BUCKET --versioning-configuration Status=Enabled
```

```bash
aws glue create-database \
  --database-input '{"Name":"nyc_taxi","Description":"NYC TLC taxi data - Bronze, Silver, Gold"}'
```

Create a service role and instance profile for EMR. Use the IAM console or:
```bash
# EMR needs: EMR_DefaultRole, EMR_EC2_DefaultRole
# Create a custom role for S3/Glue access if needed
aws iam create-role --role-name NYCTaxi-EMR-ServiceRole --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [{"Effect": "Allow", "Principal": {"Service": "elasticmapreduce.amazonaws.com"}, "Action": "sts:AssumeRole"}]
}'

# Attach policies: AmazonElasticMapReduceRole, AmazonS3FullAccess, AWSGlueConsoleFullAccess
aws iam attach-role-policy --role-name NYCTaxi-EMR-ServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
aws iam attach-role-policy --role-name NYCTaxi-EMR-ServiceRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name NYCTaxi-EMR-ServiceRole \
  --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
```

Create `lambda/ingest_tlc.py`:
```python
import boto3
import requests
from datetime import datetime
from dateutil.relativedelta import relativedelta

BUCKET = "nyc-taxi-raw-ACCOUNT_ID"  # Replace
TAXI_TYPES = ["yellow", "green", "fhv", "fhvhv"]
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def handler(event, context):
    s3 = boto3.client("s3")
    taxi_type = event.get("taxi_type", "yellow")
    year_month = event.get("year_month")  # "2024-01"
    if not year_month:
        # Default: two months back (TLC publishes with a lag)
        end = datetime.now().replace(day=1) - relativedelta(months=2)
        year_month = end.strftime("%Y-%m")
    y, m = year_month.split("-")
    url = f"{TLC_BASE}/{taxi_type}_tripdata_{y}-{m}.parquet"
    key = f"{taxi_type}/{y}/{taxi_type}_tripdata_{y}-{m}.parquet"
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.content, ContentType="application/octet-stream")
    return {"status": "ok", "key": key}
```
```bash
# Package (add a requests layer or include the requests package in the zip)
# -j stores ingest_tlc.py at the zip root so the handler path ingest_tlc.handler resolves
zip -j ingest.zip lambda/ingest_tlc.py

aws lambda create-function \
  --function-name nyc-taxi-ingest \
  --runtime python3.11 \
  --handler ingest_tlc.handler \
  --zip-file fileb://ingest.zip \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/lambda-execution-role \
  --timeout 120 \
  --memory-size 512
```

Create a state machine that loops over taxi types and year-months, invoking the Lambda for each; a sketch follows.
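As a rough sketch (the function name comes from above; the state machine name, IAM role ARN, and input shape are illustrative assumptions), a Map state can fan out over `{taxi_type, year_month}` pairs and invoke the ingest Lambda for each:

```python
import json
import boto3

# Illustrative ASL definition: a Map state iterates over the (taxi_type, year_month)
# pairs supplied in the execution input and invokes the ingest Lambda for each one.
definition = {
    "StartAt": "IngestEachMonth",
    "States": {
        "IngestEachMonth": {
            "Type": "Map",
            "ItemsPath": "$.months",  # e.g. [{"taxi_type": "yellow", "year_month": "2024-01"}, ...]
            "MaxConcurrency": 4,
            "Iterator": {
                "StartAt": "InvokeIngest",
                "States": {
                    "InvokeIngest": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": "nyc-taxi-ingest",
                            "Payload.$": "$",  # pass the {taxi_type, year_month} item through
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="nyc-taxi-ingest-fanout",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-lambda-role",  # assumed role able to invoke the Lambda
)
```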
| GCP Code | AWS Adaptation |
|---|---|
| `spark.read.parquet("gs://bucket/path")` | `spark.read.parquet("s3://bucket/path")` |
| `df.write.format("bigquery").option("table", ...)` | `df.write.format("parquet").mode("overwrite").save("s3://bucket/glue_db/table/")` + create Glue table |
| `google.cloud.bigquery` | Use Glue Catalog or Athena for metadata; store data in S3 Parquet |
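A minimal sketch of the adapted PySpark read/write (bucket names and paths are placeholders, not the repository's actual values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze").getOrCreate()

# Read raw TLC parquet from S3 instead of GCS
df = spark.read.parquet("s3://nyc-taxi-raw-ACCOUNT_ID/yellow/2024/")

# Write Bronze output as Parquet on S3 instead of a BigQuery table, then
# register the location in the Glue Catalog (crawler or create_table, below)
(df.write.mode("overwrite")
   .parquet("s3://nyc-taxi-raw-ACCOUNT_ID/nyc_taxi/bronze/yellow/year=2024/month=01/"))
```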
Run a Glue Crawler or create tables manually:

```python
# Run in Glue Studio or as a script
import os
import boto3

RAW_BUCKET = os.environ["RAW_BUCKET"]  # e.g. nyc-taxi-raw-<account-id>
glue = boto3.client("glue")

glue.create_table(
    DatabaseName="nyc_taxi",
    TableInput={
        "Name": "raw_bronze_yellow_2024_01",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "vendor_id", "Type": "bigint"},
                {"Name": "tpep_pickup_datetime", "Type": "timestamp"},
                # ... full schema
            ],
            "Location": f"s3://{RAW_BUCKET}/nyc_taxi/bronze/yellow/year=2024/month=01/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
        "PartitionKeys": [],
    },
)
```

```bash
# Create cluster
CLUSTER_ID=$(aws emr create-cluster \
--name "nyc-taxi-pipeline" \
--release-label emr-6.15.0 \
--applications Name=Spark Name=Hadoop Name=Hive \
--ec2-attributes KeyName=your-key,SubnetId=subnet-xxx,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
--use-default-roles \
--log-uri s3://${RAW_BUCKET}/emr-logs/ \
--query 'ClusterId' --output text)
# Wait for cluster ready
aws emr wait cluster-running --cluster-id $CLUSTER_ID
# Submit PySpark job (upload the adapted script to S3 first)
# Adapt pipeline/01_gcs_to_bronze.py: replace GCS/BigQuery with S3/Glue, save as 01_s3_to_bronze.py
aws s3 cp pipeline/01_s3_to_bronze.py s3://${RAW_BUCKET}/scripts/

aws emr add-steps --cluster-id $CLUSTER_ID --steps '[
  {
    "Name": "S3-to-Bronze",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "command-runner.jar",
      "Args": ["spark-submit", "--deploy-mode", "cluster",
               "s3://'${RAW_BUCKET}'/scripts/01_s3_to_bronze.py"]
    }
  }
]'
```

Alternatively, create a Glue Spark job with the same script: no cluster to manage, pay per DPU-hour.
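A hedged sketch of the Glue-job alternative via boto3 (the job name, IAM role, worker sizing, and script path are assumptions):

```python
import boto3

glue = boto3.client("glue")

# Register the adapted Spark script as a serverless Glue job
glue.create_job(
    Name="nyc-taxi-bronze",
    Role="arn:aws:iam::123456789012:role/NYCTaxi-Glue-Role",  # assumed role with S3 + Glue access
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://nyc-taxi-raw-ACCOUNT_ID/scripts/01_s3_to_bronze.py",
        "PythonVersion": "3",
    },
    WorkerType="G.1X",
    NumberOfWorkers=4,
)

# Trigger a run; Arguments are exposed to the script as --key values
glue.start_job_run(JobName="nyc-taxi-bronze", Arguments={"--RAW_BUCKET": "nyc-taxi-raw-ACCOUNT_ID"})
```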
For the ML stages (04a, 04b), a few options:

- Lambda with a container image for XGBoost/TensorFlow; increase memory to 10 GB and timeout to 15 min.
- ECS Fargate: build a Docker image with Python, pandas, xgboost, tensorflow, and boto3; push it to ECR; run it as a Fargate task triggered by Step Functions or EventBridge (see the sketch below).
- SageMaker Processing job for batch ML; a good fit for 04b (anomaly detection).
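A minimal sketch of launching the ML container as a one-off Fargate task (cluster, task definition, container name, and network IDs are placeholder assumptions):

```python
import boto3

ecs = boto3.client("ecs")

# Run the ML-stage container (04a/04b) once on Fargate
ecs.run_task(
    cluster="nyc-taxi-ml",
    launchType="FARGATE",
    taskDefinition="nyc-taxi-04a-forecast",  # task definition pointing at the ECR image
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxx"],
            "securityGroups": ["sg-xxx"],
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {"name": "ml", "environment": [{"name": "RAW_BUCKET", "value": "nyc-taxi-raw-ACCOUNT_ID"}]}
        ]
    },
)
```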
```python
# Query Athena for aggregated data
import boto3
import json

athena = boto3.client("athena")
s3 = boto3.client("s3")

query = "SELECT pickup_date, pickup_hour, SUM(trips) AS trips FROM nyc_taxi.pre_ml_yellow_hourly GROUP BY 1, 2"
result = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://bucket/results/"},
)
# Poll for completion, fetch results, write JSON to the dashboard bucket
```

```text
s3://nyc-taxi-dashboard-XXX/
├── index.html
├── data/
│   ├── time_series/
│   │   ├── yellow_daily.json
│   │   └── ...
│   ├── anomalies/
│   └── metadata/
│       └── dashboard_metadata.json
```
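A hedged sketch of the poll-and-export step that the comment above leaves open (`run_query_to_json` is a hypothetical helper; bucket names and output keys follow the layout above):

```python
import json
import time

import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")


def run_query_to_json(query: str, output_key: str, dashboard_bucket: str) -> None:
    """Run an Athena query, wait for it, and publish the rows as dashboard JSON."""
    qid = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": f"s3://{dashboard_bucket}/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} ended in state {state}")

    # First result page only (paginate for larger outputs); the first row is the header
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    header = [c.get("VarCharValue") for c in rows[0]["Data"]]
    records = [dict(zip(header, [c.get("VarCharValue") for c in row["Data"]])) for row in rows[1:]]

    s3.put_object(
        Bucket=dashboard_bucket,
        Key=output_key,  # e.g. "data/time_series/yellow_daily.json"
        Body=json.dumps(records).encode(),
        ContentType="application/json",
    )
```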
```bash
# Create environment (takes ~30 min)
aws mwaa create-environment \
  --name nyc-taxi-airflow \
  --airflow-version 2.7.0 \
  --source-bucket-arn arn:aws:s3:::${AIRFLOW_BUCKET} \
  --dag-s3-path dags \
  --requirements-s3-path requirements.txt \
  --execution-role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/MWAA-role \
  --network-configuration "SubnetIds=subnet-xxx,subnet-yyy,SecurityGroupIds=sg-xxx"
```

```python
# dags/nyc_taxi_pipeline.py
from airflow import DAG
from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
from datetime import datetime, timedelta
default_args = {"owner": "airflow", "retries": 2, "retry_delay": timedelta(minutes=5)}
with DAG(
"nyc_taxi_pipeline",
default_args=default_args,
schedule="0 2 * * *",
start_date=datetime(2024, 1, 1),
catchup=False,
) as dag:
ingest = LambdaInvokeFunctionOperator(
task_id="ingest_tlc",
function_name="nyc-taxi-ingest",
payload='{"taxi_type":"yellow","year_month":"2024-01"}',
)
bronze = EmrAddStepsOperator(
task_id="bronze",
job_flow_id="{{ var.value.emr_cluster_id }}",
aws_conn_id="aws_default",
steps=[...],
)
bronze_sensor = EmrStepSensor(...)
    ingest >> bronze >> bronze_sensor >> ...
```

For a simpler linear flow, use Step Functions with Lambda + EMR `RunJobFlow`.
```bash
aws s3 website s3://${DASHBOARD_BUCKET} --index-document index.html --error-document index.html

aws s3api put-bucket-cors --bucket $DASHBOARD_BUCKET --cors-configuration '{
  "CORSRules": [{
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": []
  }]
}'
```

Create a CloudFront distribution with the S3 bucket as origin for faster global access and HTTPS.
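Note that S3 static website hosting serves over HTTP and requires public reads. A sketch of the public-read setup, assuming you are not serving the bucket exclusively through CloudFront with Origin Access Control (in that case keep the bucket private instead):

```python
import json
import os

import boto3

s3 = boto3.client("s3")
bucket = os.environ["DASHBOARD_BUCKET"]

# Allow public bucket policies (Block Public Access is on by default for new buckets)
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Read-only policy so the dashboard HTML/JSON can be fetched anonymously
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```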
| Variable | Example | Description |
|---|---|---|
| AWS_REGION | us-east-1 | Region for all resources |
| RAW_BUCKET | nyc-taxi-raw-123456789 | S3 raw parquet |
| DASHBOARD_BUCKET | nyc-taxi-dashboard-123456789 | Dashboard JSON + static site |
| GLUE_DATABASE | nyc_taxi | Glue database name |
| EMR_CLUSTER_ID | j-XXXXXXXX | Active EMR cluster (if long-running) |
Actual costs depend on data volume, cluster size, and run frequency. Estimate them with a test-phase run before going to production:

- Test run batch: use only 2 specific months of data (e.g. `2024-01` and `2024-02`) for the full pipeline.
- Run ingest → Spark (Bronze/Silver/Gold) → ML → export end-to-end.
- Monitor billing in AWS Cost Explorer for S3, EMR, Glue, Athena, Lambda, MWAA.
- Scale the estimates to full data volume and schedule.

Cost tips: use EMR spot instances, Glue serverless Spark, and S3 Intelligent-Tiering.
- Phase 1: S3 buckets, Glue database, IAM roles
- Phase 2: Lambda ingest, test with one month
- Phase 3: Adapt pipeline scripts (S3/Glue), run EMR job
- Phase 4: Deploy 04a/04b (ECS or SageMaker)
- Phase 5: Export Lambda/task to S3
- Phase 6: MWAA or Step Functions orchestration
- Phase 7: Static website, CORS, CloudFront
| GCP (pipeline/) | AWS Equivalent |
|---|---|
| `google.cloud.storage` | `boto3.client("s3")` |
| `gs://bucket/path` | `s3://bucket/path` |
| `google.cloud.bigquery` | Athena `start_query_execution` or Glue Catalog |
| `df.write.format("bigquery")` | `df.write.parquet("s3://...")` + Glue `create_table` |
| `DataprocSparkSession` | EMR Spark (standard `SparkSession`) |
| `pipeline_utils.config` | Env vars: `RAW_BUCKET`, `DASHBOARD_BUCKET`, `GLUE_DATABASE` |
| Issue | Solution |
|---|---|
| Lambda timeout on ingest | Increase the timeout (the function is created with 120s) and memory; monthly parquet files can be large |
| EMR cluster fails to start | Check subnet, security group, IAM instance profile |
| Glue table not found | Run crawler or create table manually; ensure S3 path exists |
| Athena query slow | Partition tables by year/month; use columnar format (Parquet) |
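For the partitioning tip in the last row, a rough sketch (table and partition names are assumed, not from the repository) that registers a new year/month partition so Athena can prune scans:

```python
import boto3

athena = boto3.client("athena")

# Register a new year/month partition after the Spark job writes
# s3://.../nyc_taxi/silver/yellow/year=2024/month=01/ (layout is illustrative).
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE nyc_taxi.clean_silver_yellow "
        "ADD IF NOT EXISTS PARTITION (year='2024', month='01')"
    ),
    ResultConfiguration={"OutputLocation": "s3://nyc-taxi-raw-ACCOUNT_ID/athena-results/"},
)
```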