# AI News Summarizer

An automated, modular, and scalable AI-powered news summarization system built with Dagger. The pipeline automatically collects, summarizes, and publishes content from configurable sources, making it easy to stay up to date on any niche topic.

## Features
- Multi-Source Scraping: Support for both web scraping (CSS/XPath) and RSS/Atom feeds
- AI-Powered Summarization: Uses OpenAI GPT or Anthropic Claude for intelligent content summarization
- Smart Deduplication: Avoids processing duplicate content across sources
- Multi-Platform Publishing: Publish to Markdown files, Twitter threads, and GitHub Pages
- Trend Detection: Automatically identifies emerging topics and patterns
- Dagger Integration: Containerized pipeline with parallel processing and caching
- Highly Configurable: YAML-based configuration for easy customization
- Production Ready: Includes error handling, logging, and monitoring
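The README does not spell out how deduplication works; a common approach is a set of content hashes persisted under `cache/`. A minimal in-memory sketch (class and method names are hypothetical, not the project's actual API):

```python
import hashlib


class DedupCache:
    """Tracks hashes of (url, title) pairs so duplicate articles are skipped."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_new(self, article: dict) -> bool:
        key = hashlib.sha256(
            (article.get("url", "") + article.get("title", "")).lower().encode("utf-8")
        ).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A production version would persist the seen set to disk (the `cache/` directory) so deduplication survives between runs.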
## Prerequisites

- Python 3.11+
- Dagger CLI installed (installation guide)
- API keys:
  - Anthropic Claude API key (or OpenAI API key)
  - GitHub token (for GitHub publishing)
  - Twitter API keys (optional, for Twitter publishing)
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/ArielleTolome/ai-news-summarizer.git
  cd ai-news-summarizer
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:
  ```bash
  export ANTHROPIC_API_KEY="your-anthropic-api-key"
  # Or for OpenAI:
  # export OPENAI_API_KEY="your-openai-api-key"

  # For GitHub publishing:
  export GITHUB_TOKEN="your-github-token"

  # For Twitter publishing (optional):
  export TWITTER_API_KEYS='{"consumer_key":"...","consumer_secret":"...","access_token":"...","access_token_secret":"..."}'
  ```

- Configure your sources in `config/config.yaml`:
  ```yaml
  niche: "AI"  # Change to your desired niche
  sources:
    rss_feeds:
      - url: "https://example.com/feed"
        name: "Example Feed"
  ```

## Usage

Run the full pipeline:

```bash
python -m src.pipeline.news_pipeline
```

Preview the output without publishing:

```bash
python -m src.pipeline.news_pipeline --preview
```

Run in scheduled mode:

```bash
python -m src.pipeline.news_pipeline --scheduled
```

Run with Docker:

```bash
docker build -t ai-news-summarizer .
docker run -v $(pwd)/output:/app/output ai-news-summarizer
```

Run with Dagger:

```bash
dagger run python -m src.pipeline.news_pipeline
```

## Project Structure

```
ai-news-summarizer/
├── src/
│   ├── scrapers/
│   │   ├── web_scraper.py        # Web scraping with CSS/XPath
│   │   └── rss_parser.py         # RSS/Atom feed parsing
│   ├── summarizers/
│   │   └── gpt_summarizer.py     # AI-powered summarization
│   ├── publishers/
│   │   ├── markdown_publisher.py
│   │   ├── twitter_publisher.py
│   │   └── github_publisher.py
│   └── pipeline/
│       └── news_pipeline.py      # Main pipeline orchestration
├── config/
│   └── config.yaml               # Configuration file
├── templates/
│   └── newsletter_template.md    # Jinja2 template
├── output/                       # Generated newsletters
├── cache/                        # Article deduplication cache
└── metrics/                      # Pipeline metrics
```
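Given this layout, `src/pipeline/news_pipeline.py` presumably wires the stages together: scrape, deduplicate, summarize, publish. A rough sketch with stubbed stage functions (all names here are illustrative, not the project's actual API):

```python
import asyncio


async def summarize(article: dict) -> dict:
    # Stub: the real summarizer calls Claude or GPT instead of truncating.
    return {**article, "summary": article["content"][:100]}


async def run_pipeline(articles: list[dict], publishers: list) -> list[dict]:
    # Summarize all articles concurrently, mirroring the parallelism
    # the pipeline gets from Dagger.
    summaries = await asyncio.gather(*(summarize(a) for a in articles))
    for publish in publishers:
        await publish(summaries)
    return list(summaries)
```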
## Configuration

Edit `config/config.yaml` to customize the pipeline:

```yaml
niche: "AI"  # Your topic of interest
sources:
  rss_feeds:
    - url: "https://techcrunch.com/feed/"
      name: "TechCrunch"
      max_articles: 10
  web_scraping:
    - url: "https://example.com/news"
      selectors:
        articles: "article h2 a"       # CSS selector
        title: "//h1[@class='title']"  # XPath
        content: ".article-body"
summarization:
  provider: "anthropic"  # or "openai"
  model: "claude-3-opus-20240229"
  max_articles_per_run: 20
publishing:
  markdown:
    enabled: true
    output_dir: "./output"
```

Selectors can mix CSS and XPath freely:

```yaml
web_scraping:
  - url: "https://news.site.com"
    selectors:
      articles: "//article//a[@class='headline']"  # XPath
      title: "h1.article-title"                    # CSS
      content: ".story-body"
      author: "span.byline"
      date: "time[datetime]"
```

Multi-language support is planned:

```yaml
languages: ["en", "es", "fr"]  # Coming soon
```

Email publishing can be configured as well:

```yaml
email:
  enabled: true
  provider: "sendgrid"
  api_key_env: "SENDGRID_API_KEY"
  subscriber_list_id: "your-list-id"
```

## Output

### Markdown Newsletters

The pipeline generates beautiful markdown newsletters with:
- Executive summary
- Top stories with key insights
- Trend analysis
- Full article summaries
- Metadata and statistics
### Twitter Threads

Automatically creates engaging Twitter threads with:
- Newsletter highlights
- Top 3 stories
- Trending topics
- Link to full newsletter
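Thread assembly is straightforward to sketch: one headline tweet, one tweet per top story, and a closing link, each clamped to Twitter's 280-character limit (the function name and shape are hypothetical, not the project's actual API):

```python
TWEET_LIMIT = 280


def build_thread(headline: str, stories: list[str], link: str) -> list[str]:
    """Build a list of tweet texts: headline, top three stories, newsletter link."""
    tweets = [headline[:TWEET_LIMIT]]
    for i, story in enumerate(stories[:3], start=1):
        tweets.append(f"{i}. {story}"[:TWEET_LIMIT])
    tweets.append(f"Full newsletter: {link}"[:TWEET_LIMIT])
    return tweets
```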
### GitHub Pages

Publishes to a GitHub repository with:
- Organized directory structure
- Auto-generated index
- Archive of all newsletters
- Individual article pages
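GitHub's REST API creates or updates a file via `PUT /repos/{owner}/{repo}/contents/{path}` and expects the file body base64-encoded. A sketch of the payload construction only (the HTTP call itself is omitted, and the function name is illustrative):

```python
import base64


def build_contents_payload(markdown: str, message: str, branch: str = "main") -> dict:
    """Payload for GitHub's 'create or update file contents' endpoint."""
    return {
        "message": message,
        "branch": branch,
        "content": base64.b64encode(markdown.encode("utf-8")).decode("ascii"),
    }
```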
## Extending

### Adding a Custom Scraper

- Create a new scraper in `src/scrapers/`:

  ```python
  class CustomScraper:
      async def scrape(self, config):
          # Your scraping logic goes here
          return articles
  ```

- Update the pipeline to use your scraper.
### Customizing Prompts

Modify the prompts in `src/summarizers/gpt_summarizer.py`:

```python
prompt = f"""
Your custom prompt here...

Article: {article_content}
"""
```

### Adding a Custom Publisher

- Create a new publisher in `src/publishers/`:

  ```python
  class CustomPublisher:
      async def publish(self, content):
          # Your publishing logic goes here
          ...
  ```

- Add its configuration in `config.yaml`.
## Testing

Run the tests:

```bash
pytest tests/
```

Run with coverage:

```bash
pytest --cov=src tests/
```

## Troubleshooting

- Rate limiting: adjust `rate_limit_delay` in the config
- API errors: check your API keys and quotas
- Scraping failures: verify that your selectors match the current site structure
- Memory issues: reduce `max_articles_per_run`
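The `rate_limit_delay` setting presumably inserts a pause between consecutive requests; throttling of that kind usually looks something like this (function and parameter names are hypothetical):

```python
import asyncio


async def fetch_all(urls: list[str], fetch, rate_limit_delay: float = 1.0) -> list:
    """Fetch URLs one at a time, sleeping between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(await fetch(url))
        if i < len(urls) - 1:  # no need to sleep after the last request
            await asyncio.sleep(rate_limit_delay)
    return results
```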
Enable detailed logging for debugging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## Metrics

Pipeline metrics are saved to `metrics/pipeline_metrics.json`:

```json
{
  "start_time": "2024-06-04T09:00:00",
  "end_time": "2024-06-04T09:15:30",
  "articles_scraped": 45,
  "articles_summarized": 20,
  "articles_published": 20,
  "duration_seconds": 930
}
```

## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

- Built with Dagger for containerized pipelines
- Powered by Anthropic Claude and OpenAI
- Inspired by the need for curated, AI-powered news digests
## Roadmap

- Web dashboard for configuration
- Slack/Discord integration
- Custom ML models for relevance scoring
- Multi-language support
- Podcast generation
- Mobile app notifications
Built with ❤️ for the AI community