EconSense is a HackMIT 2023 project that integrates financial-news data mining, LLM-driven abstractive summarization, and sentiment classification into an end-to-end pipeline for market insight. The system scrapes live articles, extracts structured content, generates concise summaries, and quantifies sentiment polarity in economic contexts. It was designed for robustness, transparency, and extensibility, allowing downstream integration into financial dashboards or trading simulators.
The project is designed to track and analyze economic discussions and news in real time, cluster them by keywords (inflation, unemployment, GDP, etc.), and then generate human-readable insights:
- Definitions of the chosen economic term.
- Summaries of community sentiment (from comments/social input).
- Summaries of recent news coverage.
- Visualization of which economic topics are trending (bar charts of keyword frequency).
The system acts like an interactive “Economic Sense-Maker”:
- Pulls in community comments + news data.
- Groups them around economic concepts.
- Lets a user pick a keyword (like “market” or “inflation”).
- Generates a definition, summarizes views, detects sentiment, and blends in news coverage.
- Lets the user keep asking follow-up questions in a conversational loop.
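A minimal sketch of that loop, assuming the `google-genai` client used elsewhere in the pipeline; `corpus` (a list of comment/article strings), the prompts, and the flow are illustrative, not the notebook's exact code:

```python
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_KEY")  # placeholder key

def sense_maker_loop(keyword, corpus):
    # Gather all text mentioning the chosen keyword as grounding context.
    context = " ".join(t for t in corpus if keyword in t.lower())[:8000]
    prompt = (f"Define '{keyword}' in an economic context, then summarize "
              f"the sentiment of the views below.\n\n{context}")
    while True:
        reply = client.models.generate_content(
            model="gemini-1.5-flash", contents=prompt)
        print(reply.text)
        follow_up = input("Follow-up question (blank to quit): ").strip()
        if not follow_up:
            break
        # Append the follow-up so the next call keeps conversational context.
        prompt += f"\n\nFollow-up question: {follow_up}"
```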
- Reddit ingestion (PRAW): Authenticates via PRAW and pulls recent top submissions from r/economics, r/economy, and r/globalmarkets. For each submission it:
  - Records the title, score (upvotes), and upvote ratio.
  - Expands all comment trees (`replace_more(limit=0)`) and flattens the text into a comment list.
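A minimal sketch of this step, assuming valid PRAW credentials; the submission limit and field names are illustrative:

```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="econsense-prototype",
)

rows = []
for sub_name in ["economics", "economy", "globalmarkets"]:
    for submission in reddit.subreddit(sub_name).top(limit=25):
        submission.comments.replace_more(limit=0)   # expand all comment trees
        comments = [c.body for c in submission.comments.list()]  # flatten
        rows.append({
            "title": submission.title,
            "score": submission.score,
            "upvote_ratio": submission.upvote_ratio,
            "comments": comments,
        })
```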
- Keyword telemetry: Maintains an economics keyword inventory (inflation, interest rates, GDP, unemployment, CPI/PPI, housing, oil/energy, markets/crypto, etc.). Counts keyword hits in both post bodies and aggregated comments to get crude topic salience over the pull window.
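A toy version of the tally (the notebook's keyword inventory is larger; this list is abbreviated):

```python
from collections import Counter

KEYWORDS = ["inflation", "interest rates", "gdp", "unemployment",
            "cpi", "ppi", "housing", "oil", "market", "crypto"]

def keyword_counts(texts):
    """Count case-insensitive substring hits for each keyword."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for kw in KEYWORDS:
            counts[kw] += lowered.count(kw)
    return counts

# e.g. counts = keyword_counts(post_bodies + flattened_comments)
```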
- News ingestion (NewsAPI + HTML scrape): Queries NewsAPI for economy-related terms; for each result it:
  - Fetches the full article URL and extracts readable text via BeautifulSoup (paragraph harvesting, basic de-HTML).
  - (Prototype) Generates short abstractive summaries using Google Gemini 1.5 Flash through `google.genai` (prompted for concise, factual synthesis).
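A sketch of the news leg, assuming a NewsAPI key and the `google-genai` SDK; the query, truncation, and prompt wording are illustrative:

```python
import requests
from bs4 import BeautifulSoup
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_KEY")   # placeholder keys
resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "economy OR inflation", "apiKey": "YOUR_NEWSAPI_KEY"},
    timeout=10,
)

for article in resp.json().get("articles", [])[:5]:
    html = requests.get(article["url"], timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Paragraph harvesting: keep visible <p> text, drop markup.
    text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    summary = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=f"Summarize this article concisely and factually:\n{text[:8000]}",
    )
    print(article["title"], "->", summary.text)
```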
- Text normalization: Minimal preprocessing using NLTK: tokenization → lowercase → English stop-word removal. (No stemming/lemmatization in the current notebook; this is intentionally simple to keep signal from proper nouns.)
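Roughly, the normalization looks like this, assuming the NLTK `punkt` and `stopwords` data packages are available:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def normalize(text):
    """Tokenize, lowercase, and drop English stop words (no stemming)."""
    return [tok.lower() for tok in word_tokenize(text)
            if tok.isalpha() and tok.lower() not in STOP]
```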
- Vectorization + clustering: Builds a TF-IDF matrix on the aggregated Reddit text and clusters it with K-Means: `TfidfVectorizer()` with default params; `KMeans(n_clusters=3)`; cluster labels are assigned back to the rows. This gives coarse topical buckets (e.g., "inflation/rates", "labor/housing", "markets/energy") without hand-coding.
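A sketch of this step, with top-terms-per-cluster inspection added for interpretability; `docs` (one aggregated string per Reddit row) is assumed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer()              # default params, as in the notebook
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=3, random_state=0).fit(X)
labels = km.labels_                         # cluster ids, one per row

# Interpret each cluster via its highest-weight centroid terms.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:8]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```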
- Outputs / artifacts:
  - A combined tabular view (titles, URLs, platform tag, matched keyword index, optional LLM summary).
  - Basic bar charts of keyword counts (Matplotlib) to visualize which themes dominate.
  - Prototype CSV export (e.g., `Final.csv`) for downstream analysis.
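A sketch of the artifact step, assuming `counts` from the keyword tally and `rows` from the ingestion sketch; column names are illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bar chart of keyword frequency to show which themes dominate.
plt.bar(list(counts.keys()), list(counts.values()))
plt.xticks(rotation=45, ha="right")
plt.ylabel("keyword hits")
plt.tight_layout()
plt.show()

# Prototype CSV export for downstream analysis.
pd.DataFrame(rows).to_csv("Final.csv", index=False)
```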
E.g.:
...
Article 3 (Bremen): This article briefly notes a discrepancy between predicted and actual salary increases in several countries during the first half of 2025. The article is limited in scope and does not offer a strong opinion.
The news articles, unlike the community comments, present a relatively calm and analytical perspective, focusing on economic data and trends rather than speculative anxieties. They do, however, indirectly support some of the community's concerns about inflation and the potential for economic instability. The articles suggest that economic data shows ongoing inflation and offer predictions about future inflation based on existing trends.

- Top posts + full comment trees capture both headline narratives and grass-roots reactions with bounded API calls.
- TF-IDF + K-Means is fast, deterministic enough for quick triage, and easy to interpret via top terms per cluster.
- LLM summarization (Gemini 1.5 Flash) provides short, human-readable context over long articles; kept separate from clustering to avoid leakage.
- No sentiment modeling (e.g., VADER/TextBlob) or stance detection.
- K is fixed at 3; no model selection (silhouette/Davies-Bouldin) yet.
- TF-IDF uses defaults (no n-grams, no domain stoplist, no lemmatization).
- News extraction is best-effort paragraph scraping; no boilerplate removal heuristics beyond basic tag filtering.
- Interactive cells (e.g., selecting a keyword by index) and API keys for PRAW/NewsAPI/`google.genai` are required to run.
- Summaries are prototype calls; batching, caching, and rate-limit handling are not implemented.
- Add lemmatization + domain stoplist; enable bigrams for phrases (“interest rates”, “housing market”).
- Tune k with internal metrics; surface top terms per cluster (see the sketch after this list).
- Optional sentiment layer (document and comment-level).
- Robust article cleaner (readability heuristics) and duplicate detection.
- Persist unified corpus with timestamps for time-series tracking.
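A sketch of how the first two items could look: bigram-aware TF-IDF plus silhouette-based selection of k (not in the current notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)   # `docs` as in the clustering sketch

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # higher = tighter, better-separated
    if score > best_score:
        best_k, best_score = k, score
print("selected k:", best_k)
```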

