AgriSatAI is a distributed ETL pipeline and predictive analytics platform for agriculture. It processes multi-spectral satellite imagery and weather data to predict crop health and forecast yields using ensemble machine learning models.
Live App: agrisatai.streamlit.app
graph LR
subgraph "1. Cloud Storage (MinIO/S3)"
C["raw/ bucket"]
D["processed/ bucket"]
E["models/ bucket"]
end
subgraph "2. PySpark ETL Pipeline"
F["Extract"]
G["Transform (NDVI, Features)"]
H["Load (Parquet/CSV)"]
end
subgraph "3. scikit-learn ML"
I["Ensemble Classifier (Health)"]
J["RF Regressor (Yield)"]
end
subgraph "4. Streamlit Dashboard"
K["Interactive Map"]
L["Model Metrics"]
end
C --> F --> G --> H --> D
D --> I & J --> E
D & E --> K & L
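The Transform stage in the diagram computes vegetation indices from the spectral bands. As a rough illustration of what that involves, here is a minimal pure-Python sketch of the standard NDVI and EVI formulas. The function names and the per-pixel list representation are assumptions for illustration; the project itself runs this step in PySpark, and its actual implementation may differ.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); None when the denominator is zero."""
    denom = nir + red
    if denom == 0:
        return None
    return (nir - red) / denom


def evi(nir, red, blue):
    """EVI with the commonly used coefficients (G=2.5, C1=6, C2=7.5, L=1).

    Inputs are assumed to be surface reflectance values in the 0..1 range.
    """
    denom = nir + 6.0 * red - 7.5 * blue + 1.0
    if denom == 0:
        return None
    return 2.5 * (nir - red) / denom


def ndvi_band(nir_band, red_band):
    """Apply NDVI pixel by pixel to two equally sized bands (hypothetical helper)."""
    return [ndvi(n, r) for n, r in zip(nir_band, red_band)]


# Illustrative values only: dense vegetation reflects strongly in NIR.
sample = ndvi_band([0.5, 0.3, 0.0], [0.1, 0.3, 0.0])
```

Returning `None` for a zero denominator is one possible convention; it leaves the decision about how to fill or drop such pixels to a later step.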
- Distributed Data Processing: Uses PySpark to process raw multi-spectral GeoTIFF satellite data, calculate NDVI/EVI, and engineer features at scale.
- Ensemble Machine Learning: Achieves high accuracy (>90%) on crop health classification using a soft-voting ensemble of Random Forest, Gradient Boosting, and SVM models.
- S3-Compatible Cloud Storage: Integrates boto3 to store raw data, processed Parquet files, and serialized models in MinIO (a drop-in replacement for AWS S3).
- Interactive Dashboard: A full-stack Streamlit app featuring Folium geospatial maps, pipeline DAG visualizations, and real-time inference simulators.
- Performance Profiling: Built-in runtime decorators to benchmark Spark's distributed processing against single-threaded Pandas execution.
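The runtime decorators mentioned in the last feature could look roughly like the sketch below. It is a generic timing decorator built only on the standard library; the names `timed`, `timings`, and `mean_single_threaded` are hypothetical and are not taken from the repository.

```python
import functools
import time


def timed(label, results):
    """Decorator factory: append each call's wall-clock runtime to results[label]."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped function raises.
                results.setdefault(label, []).append(time.perf_counter() - start)
        return wrapper
    return decorator


timings = {}


@timed("pandas_baseline", timings)
def mean_single_threaded(values):
    """Stand-in for a single-threaded computation to be benchmarked."""
    return sum(values) / len(values)


mean_single_threaded([0.2, 0.4, 0.6])
```

Decorating the Spark and Pandas variants of the same step with different labels would collect comparable timings in one dictionary, which a dashboard could then plot side by side.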
- Clone the Repository:

      git clone https://github.com/TechieSamosa/AgriSatAI.git
      cd AgriSatAI

- Set Up Python Environment:

      python -m venv .venv
      .\.venv\Scripts\activate     # Windows
      source .venv/bin/activate    # Mac/Linux
      pip install -r requirements.txt

- Start MinIO Storage (Optional): Ensure Docker Desktop is running, then execute:

      docker-compose up -d

Run the following commands sequentially to execute the full data engineering and ML pipeline:
Generate 4-band (Red, Green, Blue, NIR) synthetic GeoTIFF files that simulate satellite imagery:

    python scripts/generate_data.py

Extract the raw imagery, calculate NDVI, handle missing data, and write the results to Parquet:

    python scripts/run_etl.py

Train the ensemble classifier and yield regressor, saving the artifacts to disk:

    python scripts/run_training.py

Open the interactive Streamlit dashboard to view the map, pipeline DAG, and predictions:

    streamlit run src/dashboard/app.py

- Overview & Map: KPI metrics and an interactive Folium field map with health statuses.
- Pipeline & Data Quality: Sankey diagram of the ETL DAG, data completeness scores, and PySpark profiling benchmarks.
- Predictive Models: Live sliders to simulate field conditions and output real-time predictions.
- Seasonal Trends & Alerts: Alert system for critically stressed fields and NDVI growth trends.
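The health predictions shown in the dashboard come from a soft-voting ensemble. Soft voting averages the per-class probabilities of the member models and picks the class with the highest average, rather than counting one vote per model. The sketch below shows the idea in plain Python; the class labels and probability values are made up for illustration, and the project presumably uses scikit-learn (for example `VotingClassifier` with `voting="soft"`) rather than hand-written code.

```python
def soft_vote(probabilities_per_model, weights=None):
    """Average per-class probabilities across models and return (label, averages)."""
    if weights is None:
        weights = [1.0] * len(probabilities_per_model)
    total = sum(weights)
    classes = probabilities_per_model[0].keys()
    averaged = {
        c: sum(w * p[c] for w, p in zip(weights, probabilities_per_model)) / total
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical per-model probabilities for a single field.
random_forest = {"healthy": 0.55, "stressed": 0.45}
gradient_boosting = {"healthy": 0.55, "stressed": 0.45}
svm = {"healthy": 0.10, "stressed": 0.90}

label, averaged = soft_vote([random_forest, gradient_boosting, svm])
```

In this example two of the three models lean slightly towards "healthy", so a hard majority vote would pick that class. The third model's high confidence pulls the averaged probability towards "stressed", which is the behaviour that distinguishes soft voting.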
Contributions are welcome! Please open an issue or submit a pull request for new features, bug fixes, or optimizations.
This project is licensed under the MIT License. See the LICENSE file for details.