# AI News Summarizer Pipeline

An automated, modular, and scalable AI-powered news summarization system using Dagger. This pipeline automatically collects, summarizes, and publishes content from configurable sources, making it easy to stay updated on any niche topic.

## 🚀 Features

- **Multi-Source Scraping**: Support for both web scraping (CSS/XPath) and RSS/Atom feeds
- **AI-Powered Summarization**: Uses OpenAI GPT or Anthropic Claude for intelligent content summarization
- **Smart Deduplication**: Avoids processing duplicate content across sources
- **Multi-Platform Publishing**: Publish to Markdown files, Twitter threads, and GitHub Pages
- **Trend Detection**: Automatically identifies emerging topics and patterns
- **Dagger Integration**: Containerized pipeline with parallel processing and caching
- **Highly Configurable**: YAML-based configuration for easy customization
- **Production Ready**: Includes error handling, logging, and monitoring
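Deduplication across sources typically keys on a stable fingerprint of each article. The sketch below illustrates the idea; the function names are illustrative and not the repo's actual implementation:

```python
import hashlib

def article_fingerprint(title: str, url: str) -> str:
    """Stable fingerprint used to detect duplicate articles across sources."""
    key = f"{title.strip().lower()}|{url.strip()}"
    return hashlib.sha256(key.encode()).hexdigest()

def deduplicate(articles: list[dict], seen: set[str]) -> list[dict]:
    """Keep only articles whose fingerprint has not been seen before."""
    fresh = []
    for article in articles:
        fp = article_fingerprint(article["title"], article["url"])
        if fp not in seen:
            seen.add(fp)
            fresh.append(article)
    return fresh
```

Persisting the `seen` set (e.g. to the `cache/` directory) is what lets deduplication work across runs, not just within one.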

## 📋 Prerequisites

- Python 3.11+
- Dagger CLI installed (installation guide)
- API keys:
  - Anthropic Claude API key (or OpenAI API key)
  - GitHub token (for GitHub publishing)
  - Twitter API keys (optional, for Twitter publishing)

## 🛠️ Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ArielleTolome/ai-news-summarizer.git
   cd ai-news-summarizer
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   export ANTHROPIC_API_KEY="your-anthropic-api-key"
   # Or for OpenAI:
   # export OPENAI_API_KEY="your-openai-api-key"

   # For GitHub publishing:
   export GITHUB_TOKEN="your-github-token"

   # For Twitter publishing (optional):
   export TWITTER_API_KEYS='{"consumer_key":"...","consumer_secret":"...","access_token":"...","access_token_secret":"..."}'
   ```

4. Configure your sources in `config/config.yaml`:

   ```yaml
   niche: "AI"  # Change to your desired niche
   sources:
     rss_feeds:
       - url: "https://example.com/feed"
         name: "Example Feed"
   ```
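Since either provider key works, a startup check can fail fast when neither is exported. The helper below is a sketch (`pick_provider` is illustrative, not code from this repo):

```python
import os

def pick_provider() -> str:
    """Choose a summarization provider based on which API key is available."""
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError(
        "Set ANTHROPIC_API_KEY or OPENAI_API_KEY before running the pipeline"
    )
```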

## 🚀 Quick Start

### Run Once

```bash
python -m src.pipeline.news_pipeline
```

### Run with Preview (no publishing)

```bash
python -m src.pipeline.news_pipeline --preview
```

### Run on Schedule

```bash
python -m src.pipeline.news_pipeline --scheduled
```

### Using Docker

```bash
docker build -t ai-news-summarizer .
docker run -v $(pwd)/output:/app/output ai-news-summarizer
```

### Using Dagger

```bash
dagger run python -m src.pipeline.news_pipeline
```

πŸ“ Project Structure

```text
ai-news-summarizer/
├── src/
│   ├── scrapers/
│   │   ├── web_scraper.py      # Web scraping with CSS/XPath
│   │   └── rss_parser.py       # RSS/Atom feed parsing
│   ├── summarizers/
│   │   └── gpt_summarizer.py   # AI-powered summarization
│   ├── publishers/
│   │   ├── markdown_publisher.py
│   │   ├── twitter_publisher.py
│   │   └── github_publisher.py
│   └── pipeline/
│       └── news_pipeline.py    # Main pipeline orchestration
├── config/
│   └── config.yaml             # Configuration file
├── templates/
│   └── newsletter_template.md  # Jinja2 template
├── output/                     # Generated newsletters
├── cache/                      # Article deduplication cache
└── metrics/                    # Pipeline metrics
```

## ⚙️ Configuration

### Basic Configuration

Edit `config/config.yaml` to customize:

```yaml
niche: "AI"  # Your topic of interest

sources:
  rss_feeds:
    - url: "https://techcrunch.com/feed/"
      name: "TechCrunch"
      max_articles: 10

  web_scraping:
    - url: "https://example.com/news"
      selectors:
        articles: "article h2 a"  # CSS selector
        title: "//h1[@class='title']"  # XPath
        content: ".article-body"

summarization:
  provider: "anthropic"  # or "openai"
  model: "claude-3-opus-20240229"
  max_articles_per_run: 20

publishing:
  markdown:
    enabled: true
    output_dir: "./output"
```
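A small validation pass catches configuration mistakes before a run wastes API calls. The helper below is purely illustrative (`validate_config` does not exist in this repo) and assumes the YAML has already been parsed into a dict:

```python
def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed config dict."""
    problems = []
    if not cfg.get("niche"):
        problems.append("niche is required")
    sources = cfg.get("sources", {})
    if not (sources.get("rss_feeds") or sources.get("web_scraping")):
        problems.append("at least one source is required")
    provider = cfg.get("summarization", {}).get("provider")
    if provider not in ("anthropic", "openai"):
        problems.append("summarization.provider must be 'anthropic' or 'openai'")
    return problems
```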

### Advanced Features

#### Custom Selectors

```yaml
web_scraping:
  - url: "https://news.site.com"
    selectors:
      articles: "//article//a[@class='headline']"  # XPath
      title: "h1.article-title"  # CSS
      content: ".story-body"
      author: "span.byline"
      date: "time[datetime]"
```
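The selectors above mix CSS and XPath freely, so the scraper has to tell them apart before dispatching to the right engine. A common heuristic (a sketch, not the repo's actual dispatch logic) is that XPath expressions start with `/` or `(`:

```python
def selector_kind(selector: str) -> str:
    """Classify a selector string as 'xpath' or 'css'.

    XPath expressions begin with '/' (absolute or '//') or '(' (grouped);
    everything else is treated as a CSS selector.
    """
    return "xpath" if selector.startswith(("/", "(")) else "css"
```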

#### Multi-Language Support

```yaml
languages: ["en", "es", "fr"]  # Coming soon
```

#### Email Newsletter

```yaml
email:
  enabled: true
  provider: "sendgrid"
  api_key_env: "SENDGRID_API_KEY"
  subscriber_list_id: "your-list-id"
```

## 📊 Output Examples

### Markdown Newsletter

The pipeline generates beautiful markdown newsletters with:

- Executive summary
- Top stories with key insights
- Trend analysis
- Full article summaries
- Metadata and statistics

### Twitter Thread

Automatically creates engaging Twitter threads with:

- Newsletter highlights
- Top 3 stories
- Trending topics
- Link to full newsletter
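Fitting newsletter highlights into 280-character tweets means chunking the text. One plausible approach (a sketch, not the actual logic in `twitter_publisher.py`) packs sentences greedily and numbers the parts:

```python
TWEET_LIMIT = 280

def build_thread(sentences: list[str], limit: int = TWEET_LIMIT) -> list[str]:
    """Greedily pack sentences into tweets, then append '(i/N)' markers."""
    tweets: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit - 8:  # reserve room for the ' (i/N)' suffix
            current = candidate
        else:
            if current:
                tweets.append(current)
            current = sentence[: limit - 8]  # truncate a sentence that is itself too long
    if current:
        tweets.append(current)
    n = len(tweets)
    return [f"{t} ({i}/{n})" for i, t in enumerate(tweets, 1)]
```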

### GitHub Pages

Publishes to a GitHub repository with:

- Organized directory structure
- Auto-generated index
- Archive of all newsletters
- Individual article pages

## 🔧 Customization

### Adding a New Source Type

1. Create a new scraper in `src/scrapers/`:

   ```python
   class CustomScraper:
       async def scrape(self, config: dict) -> list[dict]:
           articles: list[dict] = []
           # Your scraping logic
           return articles
   ```

2. Update the pipeline to use your scraper.
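One way to wire a new scraper into the pipeline is a small registry keyed by source type, so the pipeline can look scrapers up from the config. This is a sketch of the pattern, not code from the repo:

```python
import asyncio

# Registry mapping a config source-type name to a scraper class.
SCRAPERS: dict[str, type] = {}

def register_scraper(source_type: str):
    """Class decorator: register a scraper under a source-type name."""
    def wrap(cls):
        SCRAPERS[source_type] = cls
        return cls
    return wrap

@register_scraper("custom")
class CustomScraper:
    async def scrape(self, config: dict) -> list[dict]:
        return []  # your scraping logic here
```

The pipeline can then instantiate `SCRAPERS[source["type"]]()` for each configured source without knowing about concrete scraper classes.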

### Custom Summarization Prompts

Modify prompts in `src/summarizers/gpt_summarizer.py`:

```python
prompt = f"""
Your custom prompt here...
Article: {article_content}
"""
```

### New Publisher

1. Create a new publisher in `src/publishers/`:

   ```python
   class CustomPublisher:
       async def publish(self, content: str) -> None:
           ...  # Your publishing logic
   ```

2. Add configuration in `config.yaml`.

## 🧪 Testing

Run tests:

```bash
pytest tests/
```

Run with coverage:

```bash
pytest --cov=src tests/
```

πŸ› Troubleshooting

### Common Issues

1. **Rate Limiting**: Adjust `rate_limit_delay` in the config
2. **API Errors**: Check your API keys and quotas
3. **Scraping Failures**: Verify selectors match the current site structure
4. **Memory Issues**: Reduce `max_articles_per_run`
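For transient rate-limit or API errors, wrapping calls in retries with exponential backoff usually helps. A minimal sketch (the helper is illustrative, not part of the repo):

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff; re-raise after the last attempt.

    `sleep` is injectable so tests can record delays instead of waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```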

### Debug Mode

Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## 📈 Monitoring

Pipeline metrics are saved to `metrics/pipeline_metrics.json`:

```json
{
  "start_time": "2024-06-04T09:00:00",
  "end_time": "2024-06-04T09:15:30",
  "articles_scraped": 45,
  "articles_summarized": 20,
  "articles_published": 20,
  "duration_seconds": 930
}
```
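The raw metrics file is easy to turn into health indicators, for instance scraping throughput and whether everything summarized was actually published. A sketch of such a post-processing step (the helper name and derived fields are illustrative):

```python
import json

def summarize_metrics(raw: str) -> dict:
    """Derive a few health indicators from a pipeline_metrics.json payload."""
    m = json.loads(raw)
    duration_min = m["duration_seconds"] / 60
    return {
        "articles_per_minute": round(m["articles_scraped"] / duration_min, 2),
        "summarize_ratio": round(m["articles_summarized"] / m["articles_scraped"], 2),
        "publish_success": m["articles_published"] == m["articles_summarized"],
    }
```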

## 🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

πŸ“ License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

- Built with Dagger for containerized pipelines
- Powered by Anthropic Claude and OpenAI
- Inspired by the need for curated, AI-powered news digests

## 🚧 Roadmap

- Web dashboard for configuration
- Slack/Discord integration
- Custom ML models for relevance scoring
- Multi-language support
- Podcast generation
- Mobile app notifications

Built with ❤️ for the AI community
