Agri Yield Pipeline

End-to-end agricultural monitoring system: satellite NDVI (Sentinel-2 via Google Earth Engine), real-time NOAA/USDA ingestion, Kafka streaming, InfluxDB/PostgreSQL storage, and ML yield prediction for Missouri corn production.

Takeaways

Data (2001–2023, 97 MO counties, 1,658 county-years):

NDVI → yield signal is real but heterogeneous. Pearson r = 0.52 between peak summer NDVI and corn yield statewide; GBR CV R² is 0.681 statewide vs 0.785 on the homogeneous 6-county NW subset. A single statewide regression under-predicts on the glacial-till corn belt and over-predicts on Ozark pasture and Bootheel rice paddies.
July is the hinge month. ndvi_july is the single most important feature (0.28), followed by drought_flag (0.15), prcp_may_aug (0.13), and tmax_july_mean (0.13). Pollination-window stress dominates annual yield variance.
The known drought years 2012 and 2022 show up cleanly in the statewide trend lines for corn, soy, and sorghum, a sanity check that MODIS, GHCND, and NASS are aligned.
GBR beats Ridge by ~3× on R² (0.681 vs 0.208). The 97 county one-hot features are too coarse for a linear model to recover per-region intercepts.
Top corn counties in the 2015–2023 average cluster along the Missouri River bottom (Atchison, Holt, Nodaway, Andrew, Buchanan) — consistent with published USDA rankings.

Pipeline:

Baseline runs end-to-end from cached parquets with no Kafka / Postgres / InfluxDB. dash_baseline.py is the proof: python dash_baseline.py → http://127.0.0.1:8050.
Kaleido + remote-geojson hangs in headless Chromium; the matplotlib centroid-bubble fallback (scripts/generate_choropleth_static.py) is reliable and renders in < 1 s.
The 8-commodity USDA pull (data/real/usda_allcrops_county_missouri_2001_2023.parquet, 7,890 rows, 116 counties) feeds both the ML baseline and the companion career-ops MO-geo job scorer — same parquet, two consumers.

Product strategy

This repo is two products built on one parquet cache, not a single product.

Layer	Role	Data source	Cadence
L1 — Historical	training / ground truth	USDA NASS yields + NOAA GHCND + Daymet + MODIS MOD13Q1 NDVI, 2001–2023	yearly refresh (NASS lag ~12 mo)
L2 — Live fields	inference / per-customer ops	Sentinel-2 NDVI + Sentinel-1 SAR via GEE	weekly per field

L1 is the defensible corpus — 23 years of county-level truth that takes time to assemble. L2 is the real-time anomaly stream scored against L1's expected trajectory. The story no single-layer competitor can tell:

"Field X is 1.6σ below expected NDVI trajectory for DOY 195 given Boone County's 2001–2023 history."

Why Google Earth Engine is the substrate for L2:

Geographic portability — the same export_fields_ndvi.py runs Iowa, São Paulo, or Punjab by editing fields.yml. No per-region API integrations, no per-country license negotiation.
Multi-sensor at one seam — S2 (optical), S1 (radar/moisture), Landsat 8/9 TIRS (thermal/ET), MODIS LST, SMAP (soil moisture), CHIRPS (rainfall) — all indexable by the same ee.Geometry + date range. Adding thermal-stress pings later is ~40 LoC.
Cheap for sparse queries — noncommercial is free; commercial per-compute-hour pricing rounds to pennies for "20 fields × weekly 1 km² exports." Cost only matters at whole-country monitoring scale, which is the opposite of this use case.

Where GEE is the wrong tool. Parametric ag-insurance triggers or irrigation valve control need <1 h latency; S2 post-acquisition lag is 6–24 h and S1 is 2–3 days. That's a Planet Labs (daily 3 m) problem, not a GEE problem — don't bolt sub-hour latency onto this pipeline.

Tenant silo options:

Per-tenant silo — one fields.yml + one data/fields/<tenant>/ tree per customer. Clean data boundaries, straightforward billing. What this repo scales to naturally.
Global silo + tenant_id column — one fields table keyed by tenant, one geotiff prefix per field globally. Unlocks cross-tenant ML: train one yield model on 10,000 fields across crops/regions instead of 20 fields per tenant. Higher-ceiling ML play.

ML Results

Statewide real-data baseline — MODIS MOD13Q1 NDVI (250 m, 16-day composite) + daily weather + USDA NASS county yields, 97 Missouri counties, 2001–2023, 1,658 county-years.

Variant	Weather source	Features	CV R²	CV RMSE (bu/ac)
A. Ridge (baseline)	NOAA GHCND, KC MCI (1 station)	106	0.208	30.5
B. GBR + KC single station	NOAA GHCND, KC MCI (1 station)	106	0.681	19.4
C. GBR + Daymet per county	Daymet 1-km per centroid	106	0.610	21.4
D. GBR + Daymet per county + year FE	Daymet 1-km + year dummies	129	0.713	18.4

Why the naïve per-county swap (C) hurt R² and the fix (D) helps. The KC-only feature set has 0% within-year variance — every county's 2012 row carried the same drought_flag/tmax_july_mean, so GBR was implicitly learning a year fixed effect through the weather features. Swapping in Daymet splits weather variance ~50/50 between spatial and temporal (so 2012 is a moderate drought in the Ozarks, a severe drought along the Missouri River), which dissolves the year-as-macro-signal and the model loses its implicit year effect. Adding explicit year dummies (variant D) restores the year effect and lets GBR use real local weather deviations on top — best of both worlds, +3 points of R² over the KC baseline.
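The "0% within-year variance" point can be checked with a plain variance decomposition. The following is a minimal sketch, assuming rows shaped as (county, year, value); the function name and the toy numbers are illustrative and do not come from the repo.

```python
from collections import defaultdict
from statistics import fmean


def within_year_share(rows):
    """Fraction of a feature's total variance that lies within years.

    rows: iterable of (county, year, value) tuples.
    Returns 0.0 when every county carries the same value in a given year.
    """
    rows = list(rows)
    grand_mean = fmean(v for _, _, v in rows)
    by_year = defaultdict(list)
    for _, year, value in rows:
        by_year[year].append(value)

    total_ss = sum((v - grand_mean) ** 2 for _, _, v in rows)
    # Within-year sum of squares: spread of counties around their year mean.
    within_ss = sum(
        (v - fmean(vals)) ** 2 for vals in by_year.values() for v in vals
    )
    return within_ss / total_ss if total_ss else 0.0


# Toy data (made up): one station broadcast to every county ...
single_station = [
    ("Boone", 2011, 31.0), ("Dent", 2011, 31.0),
    ("Boone", 2012, 35.0), ("Dent", 2012, 35.0),
]
# ... versus per-county values that also differ within a year.
per_county = [
    ("Boone", 2011, 31.0), ("Dent", 2011, 29.0),
    ("Boone", 2012, 36.0), ("Dent", 2012, 32.0),
]

print(within_year_share(single_station))  # 0.0
print(round(within_year_share(per_county), 2))
```

A share of zero means the feature is a pure function of the year, which is why a tree model can use it as a stand-in for a year effect; once the share is well above zero, explicit year dummies are needed to carry that signal.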

Counties with the biggest residual improvement under Daymet are exactly the ones whose climate differs most from KC: Madison (bias −30 → −15), Wright (+13 → +7), Dent (−33 → −28).

Models. Ridge is L2-penalized linear regression (sklearn.linear_model.Ridge); adding α‖β‖² to the loss stabilizes coefficients when features are correlated — ndvi_june / ndvi_july / ndvi_mean_growing all move together, and the 97 county dummies are near-collinear with the intercept. It's the baseline sanity check. GBR is the Gradient Boosting Regressor (sklearn.ensemble.GradientBoostingRegressor, 300 trees × depth 3, lr 0.05, subsample 0.8) — an ensemble that fits each new tree to the residuals of the prior ensemble. It wins by ~3× here (R² 0.681 vs 0.208) because it captures the non-linear NDVI→yield saturation, the drought × NDVI interaction, and per-county intercepts — all of which a linear model either can't represent or has its coefficients shrunk away.
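In symbols, the two models described above can be written as follows. This is a sketch in standard notation: X, y, β, ν, h_m and S_m are symbols introduced here, and squared-error loss is assumed (scikit-learn's default for GradientBoostingRegressor), in which case fitting the negative gradient is the same as fitting residuals.

```latex
\begin{aligned}
\hat{\beta}_{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2 \\
F_m(x) &= F_{m-1}(x) + \nu\, h_m(x), \qquad m = 1,\dots,300,\quad \nu = 0.05 \\
h_m &\approx \arg\min_{h}\; \sum_{i \in S_m} \bigl(y_i - F_{m-1}(x_i) - h(x_i)\bigr)^2
\end{aligned}
```

Here each h_m is a depth-3 regression tree and S_m is the random 80% row subsample drawn for tree m (subsample 0.8).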

Top GBR features (statewide): ndvi_july (0.28), drought_flag (0.15), prcp_may_aug (0.13), tmax_july_mean (0.13), ndvi_mean_growing (0.11). Going from 6 homogeneous NW counties (R²=0.785) to all 97 yield-bearing counties (R²=0.681) reflects real agroclimatic heterogeneity: the Bootheel rice paddies, Ozark pasture, and glacial-till corn belt don't share a single NDVI→yield slope.

Corn yield map — avg 2015-2023

Interactive (plotly): figures/real/choropleth_corn_yield.html.

NDVI vs corn yield (full state)

Pearson r = 0.52 between peak summer NDVI and corn yield across 1,658 observations.

Multi-crop coverage

USDA NASS coverage for 8 commodities (CORN, SOYBEANS, WHEAT, SORGHUM, COTTON, RICE, OATS, HAY) written to data/real/usda_allcrops_county_missouri_2001_2023.parquet (7,890 rows, 116 counties).

Per-county Daymet × yield / NDVI correlations

23 years of growing-season weather vs yield / canopy, one dot per county (85 counties with ≥10 paired years). July heat hurts corn yield in 84/85 counties (mean r=−0.63), May-Aug rain helps in 81/85 (mean r=+0.33), and canopy heat-stress tracks in 84/85 (mean r=−0.61). The near-universal sign confirms the features aren't cosmetic — they're the real weather-response signal the statewide GBR is exploiting.
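The per-county correlations and sign counts above can be reproduced with a few lines of standard-library Python. This is a sketch only: the row layout (county, tmax_july_mean, corn_yield) and the function names are assumptions for illustration, and the demo data is synthetic.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def per_county_r(rows, min_years=10):
    """rows: (county, tmax_july_mean, corn_yield) per county-year.

    Only counties with at least `min_years` paired observations are kept,
    mirroring the >=10 paired-years filter described above.
    """
    paired = defaultdict(list)
    for county, x, y in rows:
        if x is not None and y is not None:
            paired[county].append((x, y))
    out = {}
    for county, pairs in paired.items():
        if len(pairs) >= min_years:
            r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
            if r is not None:
                out[county] = r
    return out


# Synthetic demo: yield falls linearly with July heat in county "A";
# county "B" has too few years and is dropped.
rows = [("A", 30 + i, 180 - 4 * i) for i in range(12)]
rows += [("B", 30 + i, 150 + i) for i in range(3)]
result = per_county_r(rows)
negative = sum(1 for r in result.values() if r < 0)
print(result, f"{negative}/{len(result)} counties negative")
```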

GBR feature importance + per-county residuals

Reading the residuals chart. Each horizontal bar is the mean of actual − predicted corn yield for one county across all years. Green (positive) → GBR under-predicts (actual is higher than modeled). Red (negative) → GBR over-predicts (actual is lower). Bar length is the bias magnitude in bu/acre. Across the 97 counties: 16 are within ±2 bu/acre, 44 within ±5, and only 7 exceed ±10; the statewide mean residual is −1.3 bu/acre, so there is no meaningful global offset.
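The per-county bias behind each bar is just a grouped mean of residuals. A minimal sketch, assuming rows of (county, actual, predicted) in bu/acre; the helper names and the numbers are illustrative and are not the repo's predictions.

```python
from collections import defaultdict
from statistics import fmean


def county_bias(rows):
    """Mean of (actual - predicted) per county, in bu/acre.

    rows: iterable of (county, actual_yield, predicted_yield).
    Positive bias means the model under-predicts that county.
    """
    residuals = defaultdict(list)
    for county, actual, predicted in rows:
        residuals[county].append(actual - predicted)
    return {county: fmean(vals) for county, vals in residuals.items()}


def count_within(bias, tolerance):
    """Number of counties whose |bias| is at most `tolerance` bu/acre."""
    return sum(1 for b in bias.values() if abs(b) <= tolerance)


# Illustrative numbers only.
rows = [
    ("Buchanan", 190.0, 181.0), ("Buchanan", 170.0, 162.0),
    ("Christian", 100.0, 112.0), ("Christian", 95.0, 105.0),
    ("Boone", 150.0, 149.0), ("Boone", 140.0, 141.0),
]
bias = county_bias(rows)
print(bias)
print(count_within(bias, 2), count_within(bias, 5))
```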

The geographic pattern is the real story:

Region	Bias direction	Representative counties
Missouri River bottom (loess)	green / under-predicted	Buchanan +8.5, Platte +7.3, Lafayette +7.2, Chariton +7.6, Ray +7.6
Bootheel alluvial plain	green / under-predicted	Scott +7.7, New Madrid +7.6
Ozark plateau	red / over-predicted	Christian −11, Dallas −10, Texas −8.8, Wayne −8.3
Single-year outliers (n=1)	large red tails	Dent −33, Madison −30, Pulaski −21

GBR is helpful everywhere (it still beats Ridge ~3× on every subset), but its predictions are shrunk toward the statewide mean in the tail regions. The Missouri River bottom genuinely out-yields what NDVI + weather alone predict (richer soils, irrigation); the Ozarks genuinely under-yield (corn is marginal acreage). The negative correlation between per-county mean residual and residual std (−0.22) says the over-predicted counties also tend to have higher variance, another sign that distinct agro-regions are being blended into a single fit. Two fixes worth exploring: (1) per-region GBR (river bottom / Ozark / Bootheel / N-Missouri), or (2) additional features (soil class, elevation, irrigated-acre fraction) so the model can separate "low NDVI because Ozark pasture" from "low NDVI because drought stress". The current 97 county dummies are a crude substitute.

Real-data baseline

python3.11 -m venv .venv && .venv/bin/pip install -r requirements.txt
# 6-county quick start (~5 min):
.venv/bin/python scripts/fetch_real_data.py
# Statewide MO — NDVI for every county with USDA yields (~6 h, one-time):
.venv/bin/python scripts/fetch_all_counties.py
# All 8 MO commodities (CORN, SOYBEANS, WHEAT, SORGHUM, COTTON, RICE, OATS, HAY):
.venv/bin/python scripts/fetch_usda_all_crops.py
# Per-county Daymet 1-km daily weather (~4 min, 115 counties):
.venv/bin/python scripts/fetch_daymet_per_county.py
# Train + figures (includes the three-way KC vs Daymet vs Daymet+YearFE comparison):
.venv/bin/python scripts/train_real.py
.venv/bin/python scripts/train_real_daymet.py
.venv/bin/python scripts/generate_full_figures.py
.venv/bin/python scripts/generate_choropleth_static.py
# Per-county Daymet x yield / NDVI correlation choropleth (85 counties):
.venv/bin/python scripts/generate_daymet_correlation_maps.py

MODIS NDVI comes from the ORNL DAAC REST endpoint (no auth). NOAA GHCND daily weather and USDA NASS yields require the respective tokens. Reruns hit the parquet cache; each NDVI parquet is 250 m × 250 m × 16-day, 2001–2023.

Per-field monitoring (Sentinel-2 NDVI + Sentinel-1 SAR)

For a small, known set of AOIs (3–20 fields, ~ha-scale), the right pattern is export once, cache locally — nothing in the loop has to reach back to Earth Engine once the TIFFs are on disk. Covers the agronomist use case: per-field NDVI anomalies for stress flags, VV backscatter as a soil-moisture proxy, and a quick acreage estimate from the NDVI mask.

One-time GCP setup (free tier is fine):

Create / pick a GCP project; enable the Earth Engine API.
Register the project for Earth Engine non-commercial use.
IAM → Create service account → grant role Earth Engine Resource Viewer.
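Create a JSON key for the SA and save it as ee-service-account.json in the repo root (on the author's machine: /Users/aurascoper/agri_yield_pipeline/ee-service-account.json; adjust the path to your own checkout). The file is .gitignored.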

Add to .env:

GCP_PROJECT=agri-yield-pipeline
EE_SERVICE_ACCOUNT=<name>@<project>.iam.gserviceaccount.com
EE_SA_KEY_FILE=/Users/aurascoper/agri_yield_pipeline/ee-service-account.json

If EE_SERVICE_ACCOUNT is unset, src/ee_auth.init_ee() falls back to user OAuth (earthengine authenticate).
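The fallback described above might look roughly like this. It is a hypothetical stand-in, not the repo's src/ee_auth.init_ee(): pick_auth_mode is a name invented here, and the only Earth Engine calls used are ee.ServiceAccountCredentials and ee.Initialize from the earthengine-api package.

```python
import os


def pick_auth_mode(env=None):
    """Return 'service_account' when EE_SERVICE_ACCOUNT is set, else 'oauth'."""
    env = os.environ if env is None else env
    return "service_account" if env.get("EE_SERVICE_ACCOUNT") else "oauth"


def init_ee(env=None):
    """Hypothetical sketch of the init logic; not the repo's code."""
    import ee  # imported lazily so the mode check works without earthengine-api

    env = os.environ if env is None else env
    project = env.get("GCP_PROJECT")
    mode = pick_auth_mode(env)
    if mode == "service_account":
        credentials = ee.ServiceAccountCredentials(
            env["EE_SERVICE_ACCOUNT"], env["EE_SA_KEY_FILE"]
        )
        ee.Initialize(credentials, project=project)
    else:
        # Relies on credentials stored earlier by `earthengine authenticate`.
        ee.Initialize(project=project)
    return mode
```

Keeping the mode decision in its own function makes it testable without network access or an installed Earth Engine client.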

Fields quickstart:

cp fields.yml.example fields.yml   # then edit lat/lon/buffer for each AOI
.venv/bin/python scripts/export_fields_ndvi.py   # Sentinel-2 NDVI @ 10 m
.venv/bin/python scripts/export_fields_sar.py    # Sentinel-1 VV dB @ 10 m
.venv/bin/python dash_baseline.py                # → Fields tab

Per field, you get:

data/fields/<name>/ndvi_<YYYY-MM-DD>.tif + ndvi_series.parquet
data/fields/<name>/sar_<YYYY-MM-DD>.tif + sar_series.parquet

The Fields tab shows NDVI/VV tile previews, the DOY z-score against the field's own prior-years baseline (flags stress when NDVI z < −1 or VV anomaly > 2 dB dry), and a rough vegetated-hectares estimate from NDVI > 0.3. Rerun the exporters weekly; they only pull new scenes.
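The three numbers the Fields tab reports can be approximated as below. This is a sketch under stated assumptions: the function names are invented here (this is not dash_baseline.py), pixels are taken as 10 m × 10 m, and a positive VV anomaly is assumed to mean "drier than usual".

```python
from statistics import fmean, stdev

PIXEL_AREA_M2 = 10 * 10  # Sentinel-2 NDVI exported at 10 m


def vegetated_hectares(ndvi_pixels, threshold=0.3):
    """Rough area estimate: pixels above the NDVI threshold times pixel area."""
    count = sum(1 for v in ndvi_pixels if v is not None and v > threshold)
    return count * PIXEL_AREA_M2 / 10_000  # 10,000 m^2 per hectare


def field_z(current_ndvi, prior_year_ndvi):
    """z-score of today's NDVI against the same DOY in prior years."""
    if len(prior_year_ndvi) < 2:
        return None  # not enough history for a standard deviation
    sigma = stdev(prior_year_ndvi)
    if sigma == 0:
        return None
    return (current_ndvi - fmean(prior_year_ndvi)) / sigma


def is_stressed(ndvi_z=None, vv_anomaly_db=None):
    """Flag when NDVI z < -1 or the VV anomaly exceeds 2 dB on the dry side.

    Assumption: vv_anomaly_db is positive when the field is drier than usual.
    """
    ndvi_flag = ndvi_z is not None and ndvi_z < -1
    vv_flag = vv_anomaly_db is not None and vv_anomaly_db > 2
    return ndvi_flag or vv_flag


pixels = [0.1, 0.25, 0.35, 0.6, 0.7, None]  # made-up tile values
print(vegetated_hectares(pixels))  # 3 pixels above 0.3 -> 0.03 ha
z = field_z(0.50, [0.62, 0.66, 0.64])
print(round(z, 2), is_stressed(ndvi_z=z))
```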

Stress alerts — Layer 1 grounds Layer 2

scripts/field_stress_alerts.py scores each field's recent Sentinel-2 NDVI against its nearest MO county's 2001–2023 MOD13Q1 baseline (same DOY ±7 days across 23 years). Severity: info < warn (|z|≥1.5) < stress (|z|≥2.0). Pipes to JSON for Slack/email/dashboards:

.venv/bin/python scripts/field_stress_alerts.py --lookback 21

Example ping:

{
  "field": "boone_cafnr_field_1",
  "date": "2024-10-22",
  "ndvi": 0.525,
  "county": "Boone",
  "county_doy_mu": 0.605,
  "z": -1.62,
  "severity": "warn",
  "message": "boone_cafnr_field_1: NDVI 0.53 on 2024-10-22 is 1.6σ below the 2001–2023 Boone County mean of 0.60 for DOY 296."
}

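Cross-sensor caveat: field NDVI is Sentinel-2 @ 10 m, county baseline is MODIS MOD13Q1 @ 250 m. The z-score normalizes scale, but the baseline is still biased: a 250 m MODIS pixel mixes cropland with whatever surrounds it, and the two sensors' bands differ, so the county mean is not a like-for-like reference for a single field. Upgrade path: once a field accumulates ≥3 years of S2 history, swap to that field's own DOY baseline (see dash_baseline.field_baseline_z).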
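The scoring rule in this section (same DOY ±7 days pooled across years, then severity thresholds at 1.5 and 2.0) can be sketched as follows. The function names and the history layout (year, doy, ndvi) are assumptions for illustration; this is not the code in scripts/field_stress_alerts.py, and the sample history is made up.

```python
from statistics import fmean, stdev


def doy_window(history, doy, half_width=7):
    """Baseline NDVI values within +/- half_width days of `doy`, years pooled.

    history: iterable of (year, doy, ndvi) composites for one county.
    Day-of-year distance wraps around the year boundary.
    """
    def dist(a, b):
        d = abs(a - b)
        return min(d, 365 - d)

    return [ndvi for _, d, ndvi in history if dist(d, doy) <= half_width]


def severity(z):
    """info < warn (|z| >= 1.5) < stress (|z| >= 2.0)."""
    if abs(z) >= 2.0:
        return "stress"
    if abs(z) >= 1.5:
        return "warn"
    return "info"


def score(field_ndvi, history, doy):
    """Return baseline mean, z and severity, or None without enough history."""
    baseline = doy_window(history, doy)
    if len(baseline) < 2:
        return None
    mu, sigma = fmean(baseline), stdev(baseline)
    if sigma == 0:
        return None
    z = (field_ndvi - mu) / sigma
    return {"county_doy_mu": round(mu, 3), "z": round(z, 2), "severity": severity(z)}


# Made-up county history: only the first two rows fall inside DOY 296 +/- 7.
history = [(2021, 289, 0.58), (2022, 297, 0.60), (2023, 305, 0.70), (2023, 200, 0.85)]
print(score(0.50, history, doy=296))
```

With the example ping above, a z of −1.62 falls in the warn band under these thresholds.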

Baseline dashboard (no Docker, no Postgres)

.venv/bin/python dash_baseline.py
# → http://127.0.0.1:8050

Reads directly from the cached parquets — county map with year-range slider, per-crop statewide trend, per-county NDVI time series, per-county yield series, and the per-county Daymet correlation panel. The production dash_app.py still requires PostgreSQL + InfluxDB (see Docker Services).

Deploy to Render (free tier)

render.yaml is already in the repo; the Statewide MO tab is the deployable product (the Fields tab needs the local GEE cache, so it gracefully shows an "empty" notice when data/fields/ is absent).

Push the repo to GitHub (already done).
https://dashboard.render.com → New → Blueprint → point at this repo.
Render auto-detects render.yaml, provisions a Python 3.11 web service, runs gunicorn dash_baseline:server.
First boot ~3–5 min; subsequent deploys <60 s.

No secrets needed for Statewide-only. To also serve the Fields tab live, set EE_SERVICE_ACCOUNT, EE_SA_KEY_FILE_CONTENTS (base64), and GCP_PROJECT in Render's env vars and add a build-step that writes the key to disk — but that's only worth it for a real tenant.
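The build step that writes the key to disk could be a short script like the one below. This is a sketch, assuming the env var holds the base64 of the service-account JSON; write_sa_key and the destination path are hypothetical names, not files in the repo.

```python
import base64
import json
import os
from pathlib import Path


def write_sa_key(dest, env=None):
    """Decode EE_SA_KEY_FILE_CONTENTS (base64) and write it to `dest`.

    Returns the path written, or None when the variable is absent
    (Statewide-only deploys need no key).
    """
    env = os.environ if env is None else env
    encoded = env.get("EE_SA_KEY_FILE_CONTENTS")
    if not encoded:
        return None
    raw = base64.b64decode(encoded)
    json.loads(raw)  # fail early if the decoded payload is not valid JSON
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw)
    path.chmod(0o600)  # keep the key readable by the service user only
    return path
```

EE_SA_KEY_FILE would then point at whatever destination path the build step used.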

Prerequisites

Git (for cloning the repo)
Docker & Docker Compose (Compose file format 3.8+)
Python 3.9+ (for running local scripts)
Google Earth Engine CLI (earthengine)

Clone & Setup

git clone https://github.com/aurascoper/agri_yield_pipeline.git
cd agri_yield_pipeline

Environment Configuration

Create a .env file in the project root with the following variables:

NOAA_API_TOKEN=<your_noaa_api_token>
USDA_API_KEY=<your_usda_api_key>
INFLUXDB_URL=http://localhost:8086
INFLUXDB_TOKEN=<your_influxdb_token>
INFLUXDB_ORG=<your_org>
INFLUXDB_BUCKET=<your_bucket>
POSTGRES_USER=user
POSTGRES_PASSWORD=password
POSTGRES_DB=alerts
INFLUXDB_INIT_USERNAME=admin
INFLUXDB_INIT_PASSWORD=password
# (Optional) Kafka settings:
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_GROUP_ID=processor-group
# (Optional) Redis URL:
REDIS_URL=redis://localhost:6379/0
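A quick pre-flight check for the .env file can catch missing values or placeholders that were never filled in. The sketch below uses only the standard library; treating every variable not marked "(Optional)" as required is an assumption made here, and the helper names are illustrative.

```python
# Assumption: everything in the template not marked "(Optional)" is required.
REQUIRED = [
    "NOAA_API_TOKEN", "USDA_API_KEY",
    "INFLUXDB_URL", "INFLUXDB_TOKEN", "INFLUXDB_ORG", "INFLUXDB_BUCKET",
    "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB",
    "INFLUXDB_INIT_USERNAME", "INFLUXDB_INIT_PASSWORD",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values, required=REQUIRED):
    """Required keys that are absent, empty, or still a <placeholder>."""
    return [
        k for k in required
        if not values.get(k)
        or (values[k].startswith("<") and values[k].endswith(">"))
    ]


sample = """
NOAA_API_TOKEN=<your_noaa_api_token>
USDA_API_KEY=abc123
# (Optional) Kafka settings:
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
"""
print(missing_vars(parse_env(sample)))
```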

Authenticate with Google Earth Engine:

earthengine authenticate

Docker Services

Start core services:

docker-compose up -d

Services launched:

Zookeeper & Kafka
Redis
PostgreSQL
InfluxDB (initialized with your .env settings)
Stream Processor
Dash Dashboard

Check status and logs:

docker-compose ps
docker-compose logs -f

Real-Time Data Ingestion

Fetch NOAA weather and USDA yield data into InfluxDB and Kafka:

python src/data_ingestion/live_ingestor.py \
  --start-date 2021-01-01 \
  --end-date 2021-12-31 \
  --station-id GHCND:USW00003952 \
  --year 2021
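For reference, the four flags in the command above map naturally onto argparse. This is a sketch of how such a CLI could be parsed, not the actual parser in live_ingestor.py; marking every flag as required and validating the date order are assumptions made here.

```python
import argparse
from datetime import date


def build_parser():
    parser = argparse.ArgumentParser(description="NOAA/USDA ingestion (sketch)")
    # date.fromisoformat rejects malformed dates with a clear argparse error.
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    parser.add_argument("--station-id", required=True,
                        help="NOAA station, e.g. GHCND:USW00003952")
    parser.add_argument("--year", type=int, required=True,
                        help="USDA yield year")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if args.end_date < args.start_date:
        raise SystemExit("--end-date must not precede --start-date")
    return args


args = parse_args([
    "--start-date", "2021-01-01", "--end-date", "2021-12-31",
    "--station-id", "GHCND:USW00003952", "--year", "2021",
])
print(args.start_date, args.end_date, args.station_id, args.year)
```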

Stream Processing

Consumes raw Kafka topics (weather, yield), enriches data, and writes to:

Kafka output topic (enriched-yield)
InfluxDB (for dashboard queries)
Redis (for alerts cache)

Run via Docker Compose (already started above):

docker-compose up -d stream-processor

Or locally:

python src/processing/stream_processor.py

Dashboard Access

The live Dash dashboard is available at: http://localhost:8050

To run locally:

python src/visualization/live_dashboard.py

API Endpoints

Backend API with FastAPI (NDVI & weather):

uvicorn api.main:app --reload

GET /ndvi/
GET /weather/

Troubleshooting Tips

Ensure .env is correctly configured and contains all required variables.
Verify no port conflicts on 5432, 6379, 8086, 8050, and 9092.
Use docker-compose ps and docker-compose logs <service> for diagnostics.
Access InfluxDB UI at http://localhost:8086 (use credentials from .env).
Test Redis with redis-cli -u redis://localhost:6379/0.
Confirm Google Earth Engine authentication: earthengine authenticate.

License

Released under the MIT License.
