Airflow for processing Salesforce data using the Simple-Salesforce Rest-api
This project demonstrates the integration of Salesforce data using Apache Airflow. The project fetches data from Salesforce using the Salesforce REST API, processes it, and stores it in CSV files, automtate this script to run every 1 min.
The goal of this project is to automate the data extraction and processing of Salesforce data using Apache Airflow. The project is divided into several steps, each implemented as an Airflow task:
Here is the visual representation of the task flow:

- make_api_call: Establishes a connection to Salesforce using credentials from a configuration file.
- query_data: Executes a Salesforce query to retrieve data.
- process_results: Processes the retrieved data and extracts specific attributes.
- download_data_to_tm_file: Downloads processed data to a CSV file.
- to_st_file: Cleans and formats data for a specific use case.
- to_etl_file: Appends cleaned data to an ETL CSV file.
- Python 3.x
- Apache Airflow (installed and configured)
- Simple Salesforce library (
pip install simple-salesforce) - Docker (if using Dockerized Airflow)
- Configuration YAML file (
configuration_files/config.yaml) with Salesforce credentials.
- Clone this repository to your local machine:
git clone https://github.com/pavan-forlooper/Data_engineering_Airflow_salesforce_rest-api.git
-
Install the required packages: pip install simple-salesforce
-
Configure your Salesforce credentials in the
configuration_files/config.yamlfile.
-
Run the Apache Airflow webserver and scheduler.
-
Upload the DAG file (
main_DAG.py) to your Airflow DAGs directory. -
Access the Airflow web interface (usually at http://localhost:8080) and enable the
salesforce_data_processingDAG. -
The DAG will run based on the specified schedule and execute the defined tasks.
- Modify the Salesforce query in the
sql_files/soql_query.sqlfile as needed. - Adjust other file paths and task logic as required by your use case.
The DAG is scheduled to run every minute. You can modify the schedule_interval parameter in the main_DAG.py file to change the frequency.