Skip to content

Latest commit

 

History

History
135 lines (110 loc) · 3.33 KB

File metadata and controls

135 lines (110 loc) · 3.33 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.0.0] - 2026-02-03

Added

  • Initial release of Financial News Scraper
  • Multi-source news aggregation from 7 major sources
  • Automated daily scraping via GitHub Actions
  • Sentiment analysis using TextBlob
  • Financial entity extraction (stocks, companies, persons)
  • SQLite database with comprehensive schema
  • Export functionality in 4 formats (JSON, CSV, XML, Parquet)
  • RESTful API with FastAPI
  • Comprehensive documentation suite
  • Automated testing framework
  • Data retention and cleanup policies

Features

  • Scraping Module

    • Async HTTP requests with aiohttp
    • RSS feed parsing with feedparser
    • Content extraction with trafilatura
    • Fallback parsing with BeautifulSoup
    • Duplicate detection via URL uniqueness
    • Error handling and retry logic
  • Database

    • SQLite with SQLAlchemy ORM
    • Indexed queries for performance
    • Automatic schema creation
    • Data retention management
    • Scraping log tracking
  • Export System

    • JSON: Full structured data
    • CSV: Flattened spreadsheet format
    • XML: Hierarchical structure
    • Parquet: Compressed columnar format
    • Daily summary statistics
  • API

    • Article retrieval with filtering
    • Full-text search
    • Statistics and aggregations
    • Export endpoints
    • Pagination support
  • Automation

    • GitHub Actions workflow
    • Daily scheduled runs (2:00 AM UTC)
    • Manual trigger capability
    • Artifact management
    • Git-based version control

Documentation

  • Installation guide
  • Usage guide with examples
  • Complete API reference
  • System architecture documentation
  • Test results and metrics
  • Contributing guidelines
  • Code of conduct

Performance

  • Scraping speed: 2.6 articles/second
  • Success rate: 99.5%
  • Export time: <5 seconds for all formats
  • Memory efficient async processing

Known Issues

  • Reuters RSS feed returns 404 (URL needs update)
  • Yahoo Finance header length errors on some articles
  • Seeking Alpha rate limiting (403 errors)

Dependencies

  • Python 3.11+
  • aiohttp 3.9.1
  • feedparser 6.0.10
  • trafilatura 1.6.0
  • beautifulsoup4 4.12.2
  • textblob 0.17.1
  • sqlalchemy 2.0.23
  • fastapi 0.104.1
  • pandas 2.1.4
  • pyarrow 14.0.1

[Unreleased]

Planned

  • Real-time scraping with WebSockets
  • Advanced ML models for entity extraction
  • Multi-language support
  • Cryptocurrency news sources
  • Social media integration
  • Data visualization dashboard
  • PostgreSQL migration
  • Caching layer
  • GraphQL API
  • Full-text search engine
  • Comprehensive test suite
  • Performance optimizations

Version History

Version Numbering

  • Major version (X.0.0): Breaking changes
  • Minor version (0.X.0): New features, backward compatible
  • Patch version (0.0.X): Bug fixes, backward compatible

Release Process

  1. Update CHANGELOG.md
  2. Update version in setup.py
  3. Create git tag
  4. Push to GitHub
  5. Create GitHub release
  6. Deploy to production

Support Policy

  • Current version: Full support
  • Previous major version: Security fixes only
  • Older versions: No support

For detailed commit history, see GitHub Commits