An automated ETL pipeline that fetches sales data from Square POS API, transforms it, and loads it into PostgreSQL for business analytics.
This project automates the monthly collection and processing of sales data from 3 Square POS locations. The ETL pipeline runs on GitHub Actions, fetching transaction data, transforming it into a normalized format, and loading it into a partitioned PostgreSQL database hosted on AWS RDS.
Purpose: Enable business intelligence and data visualization in tools like Tableau by maintaining a clean, queryable sales database.
Square POS API (3 Locations)
β
[GitHub Actions]
Monthly Trigger (1st @ 3 AM CT)
β
ETL Pipeline
ββ Fetch: Square Orders API
ββ Transform: JSON β Database Schema
ββ Load: Bulk Insert to PostgreSQL
β
AWS RDS PostgreSQL
ββ locations (reference)
ββ sales (partitioned by location)
β
Tableau / BI Tools
-
Square Service (
app/services/square_service.py)- Authenticates with Square API
- Fetches completed orders with pagination
- Handles rate limiting and retries
-
ETL Service (
app/services/etl_service.py)- Transforms Square JSON to database schema
- Converts timestamps (UTC β America/Chicago)
- Extracts date components (month, day_of_week)
- Validates data before insertion
-
Database Service (
app/services/database_service.py)- Manages PostgreSQL connections
- Performs bulk inserts
- Queries location mappings
- Handles transactions and rollbacks
-
Monthly ETL Script (
app/scripts/monthly_etl.py)- Orchestrates the complete ETL process
- Processes all 3 locations
- Logs progress and statistics
- Runs via GitHub Actions
- Python 3.11: Core language
- FastAPI: REST API framework
- PostgreSQL 15: Relational database with partitioning
- AWS RDS: Managed database hosting (Free Tierfor for now)
- Square API v2: Orders API for transaction data
- GitHub Actions: CI/CD automation
- SQLAlchemy: ORM for database operations
analytics-etl/
βββ app/
β βββ services/
β β βββ square_service.py # Square API integration
β β βββ etl_service.py # Data transformation logic
β β βββ database_service.py # PostgreSQL operations
β βββ utils/
β β βββ config.py # Environment configuration
β β βββ date_utils.py # Timezone helpers
β βββ scripts/
β βββ monthly_etl.py # ETL orchestration
βββ .github/
β βββ workflows/
β βββ monthly_etl.yml # Automation workflow
βββ requirements.txt
βββ .env.example
βββ README.md
CREATE TABLE locations (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
square_id TEXT UNIQUE NOT NULL
);Stores the 3 physical locations and their Square IDs.
CREATE TABLE sales (
sale_id BIGSERIAL,
item_name TEXT NOT NULL,
sale_price NUMERIC(12,2) NOT NULL, -- Base price (excludes tax)
qty INT NOT NULL,
sale_timestamp TIMESTAMPTZ NOT NULL, -- Chicago timezone
month TEXT NOT NULL, -- YYYY-MM format
day_of_week TEXT NOT NULL, -- Monday, Tuesday, etc.
item_category TEXT,
location_id INT NOT NULL,
PRIMARY KEY (sale_id, location_id),
FOREIGN KEY (location_id) REFERENCES locations(id)
) PARTITION BY LIST (location_id);Partitioning: One partition per location (sales_location_1, sales_location_2, sales_location_3) for optimized queries.
Indexes:
idx_sales_timestamp- Date range queriesidx_sales_month- Monthly aggregationsidx_sales_item_name- Product analysis
| Square Field | Database Column | Transformation |
|---|---|---|
line_item.name |
item_name |
Direct mapping |
line_item.base_price_money.amount |
sale_price |
Divide by 100 (cents β dollars) |
line_item.quantity |
qty |
Direct mapping |
sale_timestamp |
UTC β America/Chicago | |
| Extract from timestamp | month |
Format as 'YYYY-MM' |
| Extract from timestamp | day_of_week |
'Monday', 'Tuesday', etc. |
line_item.variation_name |
item_category |
Direct mapping or 'N/A' |
| Lookup from locations | location_id |
Join on square_id |
- Pricing: Only
base_price_moneyis stored. Taxes, tips, and service charges are excluded. - Timezone: All timestamps are stored in America/Chicago (handles CST/CDT automatically).
- Granularity: Each line item in an order becomes a separate database row.
- Order State: Only
COMPLETEDorders are fetched.
- Python 3.11+
- Square Developer Account with API access token
- AWS account with RDS PostgreSQL instance
- GitHub repository with Actions enabled
# Clone repository
git clone https://github.com/rromero09/analytics-etl.git
cd analytics-etl
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your credentials# Database Configuration (Production)
PROD_DB_HOST=your-rds-endpoint_here
PROD_DB_PORT=5432 #default port for postgresql
PROD_DB_NAME=db_name_here
PROD_DB_USER=github_actions_user
PROD_DB_PASSWORD=your_secure_password
# Square API
SQUARE_ACCESS_TOKEN= your_token_here
# Application
ENVIRONMENT=production # or 'development' for local
LOG_LEVEL=INFOThe ETL pipeline runs automatically on the 1st of each month at 3 AM Chicago time.
Required GitHub Secrets (Settings β Secrets and variables β Actions):
| Secret | Description |
|---|---|
AWS_ROLE_ARN |
IAM Role ARN for OIDC authentication |
AWS_REGION |
AWS region |
DB_HOST |
RDS endpoint |
DB_NAME |
Database name |
DB_USER |
Database username |
DB_PASSWORD |
Database password |
SQUARE_ACCESS_TOKEN |
Square API token |
- Go to:
GitHub Repo β Actions β Monthly Sales ETL - Click "Run workflow"
- Options:
- Run mode:
production(all data) ortest(200 orders) - Location:
allor specific location (1, 2, 3) - Dates: Leave empty for previous month
- Run mode:
- Checkout code - Clone repository
- Setup Python - Install Python 3.11
- Install dependencies - Install requirements.txt
- Configure AWS credentials - OIDC authentication
- Test database connection - Verify connectivity
- Run ETL - Execute monthly_etl.py
- Summary - Report statistics
View execution details:
- Go to
Actionstab in repository - Click on workflow run
- View logs for each step
- Automatic emails on workflow failure
- Configure in
GitHub Settings β Notifications β Actions
-- Check recent sales
SELECT * FROM sales
ORDER BY sale_timestamp DESC
LIMIT 10;
-- Monthly summary
SELECT
month,
location_id,
COUNT(*) as order_count,
SUM(sale_price * qty) as total_revenue
FROM sales
GROUP BY month, location_id
ORDER BY month DESC;- Instance: db.t3.micro (Free Tier)
- Region: us-east-2
- Storage: 20 GB (Free Tier)
- Backups: 7-day retention (automated)
- Method: OpenID Connect (OIDC)
- Benefits: No long-lived credentials, automatic token rotation
- Setup: IAM Identity Provider + Role with RDS permissions
- Security group allows GitHub Actions IPs
- IAM role-based access
- Encrypted connections (SSL/TLS)
- Execution Time: ~30-60 seconds per month (production)
- Test Mode: ~5-10 seconds (200 orders)
- API Requests: ~15-20 per location per month
- Database Rows: ~3,000-10,000 per month (all locations)
- GitHub Actions Cost: $0.00 (within free tier)
β
Automated: Runs monthly without manual intervention
β
Multi-Location: Processes 3 locations in parallel
β
Timezone-Aware: Handles CST/CDT conversions
β
Partitioned: Location-based partitioning for fast queries
β
Reliable: Retry logic, error handling, rollback on failure
β
Observable: Detailed logs and statistics
β
Secure: AWS OIDC, no hardcoded credentials
MIT License - See LICENSE file for details.
Rafael Romero
GitHub