EconSense is a HackMIT 2023 project that integrates financial-news data mining, LLM-driven abstractive summarization, and sentiment classification into an end-to-end pipeline for market insight. The system scrapes live articles, extracts structured content, generates concise summaries, and quantifies sentiment polarity in economic contexts. It was designed for robustness, transparency, and extensibility, allowing downstream integration into financial dashboards or trading simulators.
The project is designed to track and analyze economic discussions and news in real time, cluster them by keywords (inflation, unemployment, GDP, etc.), and then generate human-readable insights:
- Definitions of the chosen economic term.
- Summaries of community sentiment (from comments/social input).
- Summaries of recent news coverage.
- Visualization of which economic topics are trending (bar charts of keyword frequency).
The system acts like an interactive “Economic Sense-Maker”:
- Pulls in community comments + news data.
- Groups them around economic concepts.
- Lets a user pick a keyword (like “market” or “inflation”).
- Generates a definition, summarizes views, detects sentiment, and blends in news coverage.
- Lets the user keep asking follow-up questions in a conversational loop.
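A minimal sketch of that loop, assuming the `google-genai` client used elsewhere in the pipeline; `corpus` (a list of comment/article strings), the prompts, and the flow are illustrative, not the notebook's exact code:

```python
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_KEY")  # placeholder key

def sense_maker_loop(keyword, corpus):
    # Gather all text mentioning the chosen keyword as grounding context.
    context = " ".join(t for t in corpus if keyword in t.lower())[:8000]
    prompt = (f"Define '{keyword}' in an economic context, then summarize "
              f"the sentiment of the views below.\n\n{context}")
    while True:
        reply = client.models.generate_content(
            model="gemini-1.5-flash", contents=prompt)
        print(reply.text)
        follow_up = input("Follow-up question (blank to quit): ").strip()
        if not follow_up:
            break
        # Append the follow-up so the next call keeps conversational context.
        prompt += f"\n\nFollow-up question: {follow_up}"
```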
- Reddit ingestion (PRAW): Authenticates via PRAW and pulls recent top submissions from r/economics, r/economy, and r/globalmarkets. For each submission it:
  - Records the title, score (upvotes), and upvote ratio.
  - Expands all comment trees (`replace_more(limit=0)`) and flattens the text into a comment list.
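A minimal sketch of this step, assuming valid PRAW credentials; the submission limit and field names are illustrative:

```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="econsense-prototype",
)

rows = []
for sub_name in ["economics", "economy", "globalmarkets"]:
    for submission in reddit.subreddit(sub_name).top(limit=25):
        submission.comments.replace_more(limit=0)   # expand all comment trees
        comments = [c.body for c in submission.comments.list()]  # flatten
        rows.append({
            "title": submission.title,
            "score": submission.score,
            "upvote_ratio": submission.upvote_ratio,
            "comments": comments,
        })
```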
- Keyword telemetry: Maintains an economics keyword inventory (inflation, interest rates, GDP, unemployment, CPI/PPI, housing, oil/energy, markets/crypto, etc.). Counts keyword hits in both post bodies and aggregated comments to get crude topic salience over the pull window.
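A toy version of the tally (the notebook's keyword inventory is larger; this list is abbreviated):

```python
from collections import Counter

KEYWORDS = ["inflation", "interest rates", "gdp", "unemployment",
            "cpi", "ppi", "housing", "oil", "market", "crypto"]

def keyword_counts(texts):
    """Count case-insensitive substring hits for each keyword."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for kw in KEYWORDS:
            counts[kw] += lowered.count(kw)
    return counts

# e.g. counts = keyword_counts(post_bodies + flattened_comments)
```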
- News ingestion (NewsAPI + HTML scrape): Queries NewsAPI for economy-related terms; for each result it:
  - Fetches the full article URL and extracts readable text via BeautifulSoup (paragraph harvesting, basic de-HTML).
  - (Prototype) Generates short abstractive summaries using Google Gemini 1.5 Flash through `google.genai` (prompted for concise, factual synthesis).
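A sketch of the news leg, assuming a NewsAPI key and the `google-genai` SDK; the query, truncation, and prompt wording are illustrative:

```python
import requests
from bs4 import BeautifulSoup
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_KEY")   # placeholder keys
resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "economy OR inflation", "apiKey": "YOUR_NEWSAPI_KEY"},
    timeout=10,
)

for article in resp.json().get("articles", [])[:5]:
    html = requests.get(article["url"], timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Paragraph harvesting: keep visible <p> text, drop markup.
    text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    summary = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=f"Summarize this article concisely and factually:\n{text[:8000]}",
    )
    print(article["title"], "->", summary.text)
```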
- Text normalization: Minimal preprocessing using NLTK: tokenization → lowercase → English stop-word removal. (No stemming/lemmatization in the current notebook; this is intentionally simple to keep signal from proper nouns.)
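Roughly, the normalization looks like this, assuming the NLTK `punkt` and `stopwords` data packages are available:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def normalize(text):
    """Tokenize, lowercase, and drop English stop words (no stemming)."""
    return [tok.lower() for tok in word_tokenize(text)
            if tok.isalpha() and tok.lower() not in STOP]
```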
- Vectorization + clustering: Builds a TF-IDF matrix on the aggregated Reddit text and clusters it with K-Means: `TfidfVectorizer()` with default params; `KMeans(n_clusters=3)`; cluster labels are assigned back to the rows. This gives coarse topical buckets (e.g., "inflation/rates", "labor/housing", "markets/energy") without hand-coding.
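A sketch of this step, with top-terms-per-cluster inspection added for interpretability; `docs` (one aggregated string per Reddit row) is assumed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer()              # default params, as in the notebook
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=3, random_state=0).fit(X)
labels = km.labels_                         # cluster ids, one per row

# Interpret each cluster via its highest-weight centroid terms.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:8]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```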
- Outputs / artifacts:
  - A combined tabular view (titles, URLs, platform tag, matched keyword index, optional LLM summary).
  - Basic bar charts of keyword counts (Matplotlib) to visualize which themes dominate.
  - Prototype CSV export (e.g., `Final.csv`) for downstream analysis.
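A sketch of the artifact step, assuming `counts` from the keyword tally and `rows` from the ingestion sketch; column names are illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bar chart of keyword frequency to show which themes dominate.
plt.bar(list(counts.keys()), list(counts.values()))
plt.xticks(rotation=45, ha="right")
plt.ylabel("keyword hits")
plt.tight_layout()
plt.show()

# Prototype CSV export for downstream analysis.
pd.DataFrame(rows).to_csv("Final.csv", index=False)
```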
E.g.:
...
Article 3 (Bremen): This article briefly notes a discrepancy between predicted and actual salary increases in several countries during the first half of 2025. The article is limited in scope and does not offer a strong opinion.
The news articles, unlike the community comments, present a relatively calm and analytical perspective, focusing on economic data and trends rather than speculative anxieties. They do, however, indirectly support some of the community's concerns about inflation and the potential for economic instability. The articles suggest that economic data shows ongoing inflation and offer predictions about future inflation based on existing trends.

- Top posts + full comment trees capture both headline narratives and grass-roots reactions with bounded API calls.
- TF-IDF + K-Means is fast, deterministic enough for quick triage, and easy to interpret via top terms per cluster.
- LLM summarization (Gemini 1.5 Flash) provides short, human-readable context over long articles; kept separate from clustering to avoid leakage.
- No sentiment modeling (e.g., VADER/TextBlob) or stance detection.
- K is fixed at 3; no model selection (silhouette/Davies-Bouldin) yet.
- TF-IDF uses defaults (no n-grams, no domain stoplist, no lemmatization).
- News extraction is best-effort paragraph scraping; no boilerplate removal heuristics beyond basic tag filtering.
- Interactive cells (e.g., selecting a keyword by index) and API keys for PRAW/NewsAPI/`google.genai` are required to run.
- Summaries are prototype calls; batching, caching, and rate-limit handling are not implemented.
- Add lemmatization + domain stoplist; enable bigrams for phrases (“interest rates”, “housing market”).
- Tune k with internal metrics; surface top terms per cluster (see the sketch after this list).
- Optional sentiment layer (document and comment-level).
- Robust article cleaner (readability heuristics) and duplicate detection.
- Persist unified corpus with timestamps for time-series tracking.
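A sketch of how the first two items could look: bigram-aware TF-IDF plus silhouette-based selection of k (not in the current notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)   # `docs` as in the clustering sketch

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # higher = tighter, better-separated
    if score > best_score:
        best_k, best_score = k, score
print("selected k:", best_k)
```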

