# AI News Summarizer

An automated, modular, and scalable AI-powered news summarization system built with Dagger. The pipeline automatically collects, summarizes, and publishes content from configurable sources, making it easy to stay up to date on any niche topic.

## Features
- Multi-Source Scraping: Support for both web scraping (CSS/XPath) and RSS/Atom feeds
- AI-Powered Summarization: Uses OpenAI GPT or Anthropic Claude for intelligent content summarization
- Smart Deduplication: Avoids processing duplicate content across sources
- Multi-Platform Publishing: Publish to Markdown files, Twitter threads, and GitHub Pages
- Trend Detection: Automatically identifies emerging topics and patterns
- Dagger Integration: Containerized pipeline with parallel processing and caching
- Highly Configurable: YAML-based configuration for easy customization
- Production Ready: Includes error handling, logging, and monitoring
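The README does not spell out how deduplication works; a common approach is a set of content hashes persisted under `cache/`. A minimal in-memory sketch (class and method names are hypothetical, not the project's actual API):

```python
import hashlib


class DedupCache:
    """Tracks hashes of (url, title) pairs so duplicate articles are skipped."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_new(self, article: dict) -> bool:
        key = hashlib.sha256(
            (article.get("url", "") + article.get("title", "")).lower().encode("utf-8")
        ).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A production version would persist the seen set to disk (the `cache/` directory) so deduplication survives between runs.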
## Prerequisites

- Python 3.11+
- Dagger CLI installed (installation guide)
- API keys:
  - Anthropic Claude API key (or OpenAI API key)
  - GitHub token (for GitHub publishing)
  - Twitter API keys (optional, for Twitter publishing)
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/ArielleTolome/ai-news-summarizer.git
  cd ai-news-summarizer
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:
  ```bash
  export ANTHROPIC_API_KEY="your-anthropic-api-key"
  # Or for OpenAI:
  # export OPENAI_API_KEY="your-openai-api-key"

  # For GitHub publishing:
  export GITHUB_TOKEN="your-github-token"

  # For Twitter publishing (optional):
  export TWITTER_API_KEYS='{"consumer_key":"...","consumer_secret":"...","access_token":"...","access_token_secret":"..."}'
  ```

- Configure your sources in `config/config.yaml`:
  ```yaml
  niche: "AI"  # Change to your desired niche
  sources:
    rss_feeds:
      - url: "https://example.com/feed"
        name: "Example Feed"
  ```

## Usage

Run the full pipeline:

```bash
python -m src.pipeline.news_pipeline
```

Preview the output without publishing:

```bash
python -m src.pipeline.news_pipeline --preview
```

Run in scheduled mode:

```bash
python -m src.pipeline.news_pipeline --scheduled
```

Run with Docker:

```bash
docker build -t ai-news-summarizer .
docker run -v $(pwd)/output:/app/output ai-news-summarizer
```

Run with Dagger:

```bash
dagger run python -m src.pipeline.news_pipeline
```

## Project Structure

```
ai-news-summarizer/
├── src/
│   ├── scrapers/
│   │   ├── web_scraper.py        # Web scraping with CSS/XPath
│   │   └── rss_parser.py         # RSS/Atom feed parsing
│   ├── summarizers/
│   │   └── gpt_summarizer.py     # AI-powered summarization
│   ├── publishers/
│   │   ├── markdown_publisher.py
│   │   ├── twitter_publisher.py
│   │   └── github_publisher.py
│   └── pipeline/
│       └── news_pipeline.py      # Main pipeline orchestration
├── config/
│   └── config.yaml               # Configuration file
├── templates/
│   └── newsletter_template.md    # Jinja2 template
├── output/                       # Generated newsletters
├── cache/                        # Article deduplication cache
└── metrics/                      # Pipeline metrics
```
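Given this layout, `src/pipeline/news_pipeline.py` presumably wires the stages together: scrape, deduplicate, summarize, publish. A rough sketch with stubbed stage functions (all names here are illustrative, not the project's actual API):

```python
import asyncio


async def summarize(article: dict) -> dict:
    # Stub: the real summarizer calls Claude or GPT instead of truncating.
    return {**article, "summary": article["content"][:100]}


async def run_pipeline(articles: list[dict], publishers: list) -> list[dict]:
    # Summarize all articles concurrently, mirroring the parallelism
    # the pipeline gets from Dagger.
    summaries = await asyncio.gather(*(summarize(a) for a in articles))
    for publish in publishers:
        await publish(summaries)
    return list(summaries)
```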
## Configuration

Edit `config/config.yaml` to customize the pipeline:

```yaml
niche: "AI"  # Your topic of interest
sources:
  rss_feeds:
    - url: "https://techcrunch.com/feed/"
      name: "TechCrunch"
      max_articles: 10
  web_scraping:
    - url: "https://example.com/news"
      selectors:
        articles: "article h2 a"       # CSS selector
        title: "//h1[@class='title']"  # XPath
        content: ".article-body"
summarization:
  provider: "anthropic"  # or "openai"
  model: "claude-3-opus-20240229"
  max_articles_per_run: 20
publishing:
  markdown:
    enabled: true
    output_dir: "./output"
```

Selectors can mix CSS and XPath freely:

```yaml
web_scraping:
  - url: "https://news.site.com"
    selectors:
      articles: "//article//a[@class='headline']"  # XPath
      title: "h1.article-title"                    # CSS
      content: ".story-body"
      author: "span.byline"
      date: "time[datetime]"
```

Multi-language support is planned:

```yaml
languages: ["en", "es", "fr"]  # Coming soon
```

Email publishing can be configured as well:

```yaml
email:
  enabled: true
  provider: "sendgrid"
  api_key_env: "SENDGRID_API_KEY"
  subscriber_list_id: "your-list-id"
```

## Output

### Markdown Newsletters

The pipeline generates beautiful markdown newsletters with:
- Executive summary
- Top stories with key insights
- Trend analysis
- Full article summaries
- Metadata and statistics
### Twitter Threads

Automatically creates engaging Twitter threads with:
- Newsletter highlights
- Top 3 stories
- Trending topics
- Link to full newsletter
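Thread assembly is straightforward to sketch: one headline tweet, one tweet per top story, and a closing link, each clamped to Twitter's 280-character limit (the function name and shape are hypothetical, not the project's actual API):

```python
TWEET_LIMIT = 280


def build_thread(headline: str, stories: list[str], link: str) -> list[str]:
    """Build a list of tweet texts: headline, top three stories, newsletter link."""
    tweets = [headline[:TWEET_LIMIT]]
    for i, story in enumerate(stories[:3], start=1):
        tweets.append(f"{i}. {story}"[:TWEET_LIMIT])
    tweets.append(f"Full newsletter: {link}"[:TWEET_LIMIT])
    return tweets
```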
### GitHub Pages

Publishes to a GitHub repository with:
- Organized directory structure
- Auto-generated index
- Archive of all newsletters
- Individual article pages
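GitHub's REST API creates or updates a file via `PUT /repos/{owner}/{repo}/contents/{path}` and expects the file body base64-encoded. A sketch of the payload construction only (the HTTP call itself is omitted, and the function name is illustrative):

```python
import base64


def build_contents_payload(markdown: str, message: str, branch: str = "main") -> dict:
    """Payload for GitHub's 'create or update file contents' endpoint."""
    return {
        "message": message,
        "branch": branch,
        "content": base64.b64encode(markdown.encode("utf-8")).decode("ascii"),
    }
```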
## Extending

### Adding a Custom Scraper

- Create a new scraper in `src/scrapers/`:

  ```python
  class CustomScraper:
      async def scrape(self, config):
          # Your scraping logic goes here
          return articles
  ```

- Update the pipeline to use your scraper.
### Customizing Prompts

Modify the prompts in `src/summarizers/gpt_summarizer.py`:

```python
prompt = f"""
Your custom prompt here...

Article: {article_content}
"""
```

### Adding a Custom Publisher

- Create a new publisher in `src/publishers/`:

  ```python
  class CustomPublisher:
      async def publish(self, content):
          # Your publishing logic goes here
          ...
  ```

- Add its configuration in `config.yaml`.
## Testing

Run the tests:

```bash
pytest tests/
```

Run with coverage:

```bash
pytest --cov=src tests/
```

## Troubleshooting

- Rate limiting: adjust `rate_limit_delay` in the config
- API errors: check your API keys and quotas
- Scraping failures: verify that your selectors match the current site structure
- Memory issues: reduce `max_articles_per_run`
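The `rate_limit_delay` setting presumably inserts a pause between consecutive requests; throttling of that kind usually looks something like this (function and parameter names are hypothetical):

```python
import asyncio


async def fetch_all(urls: list[str], fetch, rate_limit_delay: float = 1.0) -> list:
    """Fetch URLs one at a time, sleeping between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(await fetch(url))
        if i < len(urls) - 1:  # no need to sleep after the last request
            await asyncio.sleep(rate_limit_delay)
    return results
```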
Enable detailed logging for debugging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## Metrics

Pipeline metrics are saved to `metrics/pipeline_metrics.json`:

```json
{
  "start_time": "2024-06-04T09:00:00",
  "end_time": "2024-06-04T09:15:30",
  "articles_scraped": 45,
  "articles_summarized": 20,
  "articles_published": 20,
  "duration_seconds": 930
}
```

## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

- Built with Dagger for containerized pipelines
- Powered by Anthropic Claude and OpenAI
- Inspired by the need for curated, AI-powered news digests
## Roadmap

- Web dashboard for configuration
- Slack/Discord integration
- Custom ML models for relevance scoring
- Multi-language support
- Podcast generation
- Mobile app notifications
Built with ❤️ for the AI community