AI-Powered Media Bias Analysis for Electoral Coverage

An end-to-end Natural Language Processing (NLP) pipeline designed to identify and quantify ideological bias in online news articles covering the 2025 Delhi Assembly Elections. This project moves beyond simple sentiment analysis to provide a multi-faceted view of media framing, source attribution, and thematic focus.

The core contribution is a novel methodology for analyzing party-specific sentiment by aggregating sentence-level scores, which avoids the neutralization effect common in full-article averaging and provides a more accurate, granular view of media slant.

Key Features

Automated Content Scraping: A robust scraper to fetch article text from a list of URLs, with error handling and logging.
Advanced NLP Preprocessing: A configurable text cleaning pipeline to prepare data for analysis.
Multi-Dimensional Bias Analysis: Measures media bias across several key dimensions (see below).
Comparative Sentiment Analysis: Runs both VADER and TextBlob on the same articles so that their sentiment scores can be compared side by side.
Unsupervised Topic Modeling: Utilizes Scikit-Learn's LDA and NMF to uncover latent themes and frames in election coverage.
Named Entity Recognition (NER): Employs spaCy to identify and analyze the sources (people, organizations) cited in articles.
Statistical Validation: Uses SciPy for statistical tests (ANOVA, Chi-squared) to validate the significance of observed differences.
Automated Pipeline: A master script (run_pipeline.py) executes the entire workflow from scraping to analysis and visualization.

Analysis Dimensions

This project quantifies bias by focusing on five key dimensions derived from media analysis research:

Language and Wording: Frequency analysis of ideologically charged terms.
Tone and Sentiment: Comparative sentiment analysis of articles and specific political entities (parties, candidates).
Framing of Issues: Unsupervised topic modeling (LDA/NMF) to identify how different outlets frame the same events.
Source Attribution: NER to extract and analyze the types of sources (e.g., government, opposition, independent) each outlet cites.
Omission (Proxy): A keyword-based proxy to track which outlets cover specific key events or controversies.
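
As a rough illustration of the omission proxy described above, the sketch below computes, for each outlet, the share of its articles that mention at least one keyword tied to an event. The event names, keyword lists, and the `coverage_by_outlet` helper are hypothetical placeholders and are not taken from the project's `config.py`.

```python
from collections import defaultdict

# Illustrative placeholders only; the project's real keyword lists are
# assumed to be defined in config.py.
EVENT_KEYWORDS = {
    "air_quality": ["air pollution", "smog"],
    "water_supply": ["water supply", "tanker"],
}


def coverage_by_outlet(articles, event_keywords=EVENT_KEYWORDS):
    """Return {outlet: {event: share of the outlet's articles mentioning it}}.

    `articles` is an iterable of (outlet, text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for outlet, text in articles:
        totals[outlet] += 1
        lowered = text.lower()
        for event, keywords in event_keywords.items():
            # Plain substring matching: simple, but it can over-match.
            if any(kw in lowered for kw in keywords):
                hits[outlet][event] += 1
    return {
        outlet: {event: hits[outlet][event] / n for event in event_keywords}
        for outlet, n in totals.items()
    }


sample = [
    ("Outlet A", "Smog and air pollution dominated the campaign."),
    ("Outlet A", "The rally drew a large crowd."),
    ("Outlet B", "Candidates sparred over tanker deliveries."),
]
coverage = coverage_by_outlet(sample)
```

A low share for one outlet on an event that other outlets cover heavily is only a hint of omission, since keyword matching cannot tell whether an article actually discusses the event.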

Tech Stack

| Component | Technologies & Libraries |
| --- | --- |
| Data Scraping | `Python`, `newspaper3k`, `requests` |
| Data Processing | `Pandas`, `NumPy` |
| NLP & ML | `Scikit-Learn` (LDA, NMF, Vectorizers), `spaCy` (NER), `NLTK`, `VADER`, `TextBlob` |
| Statistics | `SciPy` |
| Visualization | `Matplotlib`, `Seaborn`, `WordCloud` |
| Workflow | `subprocess`, `logging`, `os` |

Project Structure

media-bias-analysis/
├── data/                    # -> Houses input/output CSVs (links, scraped, preprocessed)
├── src/                     # -> All Python source code modules
│   ├── __init__.py
│   ├── article_scraper.py   # -> Scrapes articles from URLs
│   ├── preprocessing.py     # -> Cleans and preprocesses text data
│   ├── analysis_vader.py    # -> Full analysis pipeline using VADER
│   └── analysis_textblob.py # -> Full analysis pipeline using TextBlob
├── analysis_results/        # -> All generated outputs (plots, JSONs, result CSVs)
├── logs/                    # -> Log files for each script, tracking progress and errors
├── .gitignore               # -> Specifies files/directories to ignore in version control
├── config.py                # -> Central configuration for paths, parameters, keywords, etc.
├── run_pipeline.py          # -> Master script to execute the entire pipeline
├── requirements.txt         # -> Project dependencies
└── README.md                # -> This file

Installation & Setup

Clone the repository:

git clone https://github.com/your-username/media-bias-analysis.git
cd media-bias-analysis

Create and activate a virtual environment:

# For Unix/macOS
python3 -m venv venv
source venv/bin/activate

# For Windows
python -m venv venv
.\venv\Scripts\activate

Install the required packages:
```
pip install -r requirements.txt
```

Download necessary NLP models:

# Download spaCy model for NER
python -m spacy download en_core_web_sm

# The analysis scripts will automatically download required NLTK models on first run.

Prepare Input Data:
- Add your target article URLs to data/article-links.csv. The file must contain outlet, category, and link columns.
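
Before running the pipeline, it can help to check that the input file has the required columns. The sketch below is one possible check using only the standard library; the `load_links` helper and the sample row values are hypothetical and not part of the project.

```python
import csv
import io

REQUIRED_COLUMNS = {"outlet", "category", "link"}


def load_links(handle):
    """Read rows from an open CSV file, failing early if a column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Skip rows without a link so the scraper never receives an empty URL.
    return [row for row in reader if (row.get("link") or "").strip()]


# In-memory stand-in for data/article-links.csv; the values are placeholders.
example = io.StringIO(
    "outlet,category,link\n"
    "Example Outlet,example-category,https://example.com/article-1\n"
    "Example Outlet,example-category,\n"
)
rows = load_links(example)
```

For the real file, the same function could be called as `load_links(open("data/article-links.csv", newline="", encoding="utf-8"))`, ideally inside a `with` block.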

Usage

Execute the entire pipeline from scraping to analysis with a single command from the project root directory. The master script will run each module in the correct order.

python run_pipeline.py

The script will provide console output on its progress.
Detailed logs for each step are saved in the logs/ directory.
All generated CSVs, JSON files, and plots are saved in the analysis_results/ directory.
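
For readers curious how a master script of this kind can chain modules, here is a minimal sketch using `subprocess` and `logging`, both listed in the tech stack. It is not the contents of run_pipeline.py: the step order and the `python -m src.<module>` invocation are assumptions inferred from the project structure, and `run_steps` is a hypothetical helper.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

# Assumed order, inferred from the module names under src/.
STEPS = [
    [sys.executable, "-m", "src.article_scraper"],
    [sys.executable, "-m", "src.preprocessing"],
    [sys.executable, "-m", "src.analysis_vader"],
    [sys.executable, "-m", "src.analysis_textblob"],
]


def run_steps(steps):
    """Run each command in order; stop at the first failure and return its exit code."""
    for cmd in steps:
        log.info("running: %s", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            log.error("step failed with exit code %d", result.returncode)
            return result.returncode
    return 0


# Demonstration with a trivial command instead of the real modules.
demo_status = run_steps([[sys.executable, "-c", "print('ok')"]])
```

Stopping at the first non-zero exit code keeps later steps from running on missing or partial output of an earlier one.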

Methodology Highlight: Party Sentiment Analysis

A key challenge in sentiment analysis is the neutralization effect: when sentiment is averaged over a whole article, strongly positive and strongly negative statements, often about different parties, cancel out alongside neutral background text, and the article scores as neutral. This project mitigates this by:

Sentence-Level Analysis: Identifying only the sentences that explicitly mention a political party or its key members.
Per-Article Aggregation: Calculating an average sentiment score for each party within a single article based on only those relevant sentences.
Categorization: Classifying each article's stance towards a party as 'Positive', 'Neutral', or 'Negative' based on this calculated score.
Outlet-Level Aggregation: Counting the number and proportion of articles from each outlet that fall into these three categories for each party, providing a clear and nuanced view of media slant.
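
The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the party aliases are placeholders, the sentence splitter is a naive regex, the ±0.05 cut-off is an assumed threshold (a common convention for VADER compound scores), and `toy_score` stands in for a real scorer such as VADER or TextBlob.

```python
import re
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical aliases; the real party and candidate names are assumed
# to be configured elsewhere in the project.
PARTY_ALIASES = {"Party A": ["party a"], "Party B": ["party b"]}


def split_sentences(text):
    # Naive splitter for the sketch; NLTK or spaCy would be more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def party_scores(text, score_fn, aliases=PARTY_ALIASES):
    """Steps 1-2: average score_fn over only the sentences naming each party."""
    per_party = defaultdict(list)
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for party, names in aliases.items():
            if any(re.search(rf"\b{re.escape(n)}\b", lowered) for n in names):
                per_party[party].append(score_fn(sentence))
    return {party: mean(scores) for party, scores in per_party.items()}


def label(score, threshold=0.05):
    """Step 3: map a score to a stance label (threshold is an assumption)."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


def outlet_summary(articles, score_fn):
    """Step 4: count stance labels per (outlet, party) over (outlet, text) pairs."""
    summary = defaultdict(Counter)
    for outlet, text in articles:
        for party, score in party_scores(text, score_fn).items():
            summary[(outlet, party)][label(score)] += 1
    return summary


def toy_score(text):
    # Stand-in scorer: +1 for "praised", -1 for "criticised".
    lowered = text.lower()
    return float("praised" in lowered) - float("criticised" in lowered)


article = "Voters praised Party A. Analysts criticised Party B. Turnout was high."
scores = party_scores(article, toy_score)
summary = outlet_summary([("Outlet X", article)], toy_score)
```

In this toy case the whole article scores 0.0 under `toy_score`, while the per-party scores are 1.0 for Party A and -1.0 for Party B, which is the neutralization effect the sentence-level approach is meant to avoid. Proportions per outlet follow by dividing each count by the total for that (outlet, party) pair.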

Limitations and Future Work

Scope: The analysis is currently limited to English-language media.
Omission Proxy: The omission analysis is a simplified proxy based on keywords and could be improved with more advanced event-detection models.
NER Accuracy: NER is performed on preprocessed text for simplicity, which may slightly reduce accuracy compared to running it on raw text.
Future Work:
- Develop an interactive dashboard using Dash or Streamlit to explore the results.
- Incorporate Aspect-Based Sentiment Analysis (ABSA) for even more granular insights.
- Fine-tune a transformer-based model (e.g., BERT) for domain-specific sentiment classification.
- Expand the analysis to include Hindi and other regional-language media.
