A Python web scraper that extracts Microsoft Message Center announcements from mc.merill.net. It fetches message center entries for specific dates and saves them as structured JSON with detailed metadata, summaries, and images.
- Date-based scraping: Extract all message center entries for a specific date
- Structured data extraction: Captures metadata including message ID, service, tags, platforms, and roadmap information
- Rich content parsing: Extracts both summary and detailed information sections
- Image handling: Downloads and catalogs all images with alt text
- JSON output: Saves data in structured JSON format for easy processing
- Rate limiting: Built-in throttling to be respectful to the target website
- Error handling: Robust error handling with detailed logging
This project uses Python 3.13+ and can be installed using uv (recommended) or pip.
With uv (recommended):

```bash
# Clone the repository, then install dependencies from the project root
uv sync
```

With pip:

```bash
# Install dependencies
pip install -r requirements.txt
```
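If installing with pip, requirements.txt should list the dependencies named at the end of this document; a minimal, unpinned sketch (pinning versions is left as a choice):

```text
requests
beautifulsoup4
lxml
```

The main script provides a simple command-line interface: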
```bash
# Scrape messages for a specific date
python main.py --date 2025-10-08

# Scrape messages for yesterday
python main.py --yesterday

# Scrape messages for today
python main.py --today

# Scrape messages for the last 7 days
python main.py --last-days 7

# Show help
python main.py --help
```
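Under the hood, main.py presumably resolves these flags to concrete dates before scraping; a minimal, hypothetical argparse sketch (not the project's actual code, and it assumes throttle_sec has a default):

```python
import argparse
from datetime import datetime, timedelta

from scraper import run_for_date

parser = argparse.ArgumentParser(description="Scrape Message Center entries by date")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--date", help="date to scrape, YYYY-MM-DD")
group.add_argument("--yesterday", action="store_true")
group.add_argument("--today", action="store_true")
group.add_argument("--last-days", type=int, metavar="N")
args = parser.parse_args()

midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
if args.date:
    run_for_date(datetime.strptime(args.date, "%Y-%m-%d"))
elif args.yesterday:
    run_for_date(midnight - timedelta(days=1))
elif args.today:
    run_for_date(midnight)
else:
    for offset in range(args.last_days):  # today backwards N days
        run_for_date(midnight - timedelta(days=offset))
```

You can also use the scraper programmatically: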
```python
from datetime import datetime

from scraper import run_for_date

# Scrape messages for a specific date
target_date = datetime(2025, 10, 8)
run_for_date(target_date, throttle_sec=0.5)
```
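The same entry point can drive a range of dates; a short sketch building on the call above:

```python
from datetime import datetime, timedelta

from scraper import run_for_date

# Scrape an inclusive range of dates, oldest first
start = datetime(2025, 10, 6)
for offset in range(3):  # 2025-10-06 through 2025-10-08
    run_for_date(start + timedelta(days=offset), throttle_sec=0.5)
```

The scraper generates JSON files in the mc_messages/ directory with the following structure: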
```json
[
  {
    "id": "MC1168294",
    "url": "https://mc.merill.net/message/MC1168294",
    "title": "Microsoft Teams: Apps now supported in Shared Channels",
    "service": "Microsoft Teams",
    "published": "Oct 8, 2025",
    "meta": {
      "tags": ["Admin impact", "New feature", "User impact"],
      "platforms": ["Desktop", "Mac"],
      "message_center_url": "https://admin.microsoft.com/#/MessageCenter/:/messages/MC1168294",
      "service_exact": "Microsoft Teams",
      "published_exact": "Oct 8, 2025",
      "roadmap_id": "505791",
      "roadmap_url": "https://www.microsoft.com/en-US/microsoft-365/roadmap?filters=&searchterms=505791"
    },
    "summary": {
      "text": "Brief summary text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": []
    },
    "more_information": {
      "text": "Detailed information text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description"
        }
      ]
    },
    "images": [
      {
        "src": "https://example.com/image.jpg",
        "alt": "Image description"
      }
    ]
  }
]
```

- id: Message Center ID (e.g., "MC1168294")
- url: Direct link to the message
- title: Message title
- service: Microsoft service (e.g., "Microsoft Teams", "SharePoint")
- published: Publication date as string
- tags: Array of tags/categories
- platforms: Array of supported platforms
- message_center_url: Link to Microsoft Admin Center
- service_exact: Exact service name from metadata
- published_exact: Exact publication date from metadata
- roadmap_id: Associated roadmap ID
- roadmap_url: Link to Microsoft 365 roadmap
Both summary and more_information contain:
- text: Plain text version of the content
- blocks: Structured content blocks (paragraphs, lists, tables, HTML)
- images: Array of images with src and alt text
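For example, the nested content of a single message can be read like this (a sketch; the file name is an assumption, as the naming scheme inside mc_messages/ isn't documented here):

```python
import json

# Load one scraped file (file name assumed for illustration)
with open("mc_messages/2025-10-08.json", encoding="utf-8") as f:
    messages = json.load(f)

for message in messages:
    print(message["id"], "-", message["title"])
    print("Roadmap:", message["meta"]["roadmap_url"])
    # Plain-text summary, then any structured paragraph blocks
    print(message["summary"]["text"])
    for block in message["more_information"]["blocks"]:
        if block["type"] == "paragraph":
            print(block["text"])
    # Every image with its alt text
    for image in message["images"]:
        print(image["src"], "-", image["alt"])
```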
The scraper includes a built-in delay between requests (default: 0.5 seconds) to be respectful of the target website. You can adjust this in the code:

```python
run_for_date(target_date, throttle_sec=1.0)  # 1 second delay
```

Scraped data is saved to the mc_messages/ directory by default. The directory is created automatically if it doesn't exist.
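To process everything scraped so far, the output directory can be globbed (a sketch, assuming one JSON file per scraped date):

```python
import glob
import json

for path in sorted(glob.glob("mc_messages/*.json")):
    with open(path, encoding="utf-8") as f:
        messages = json.load(f)
    print(f"{path}: {len(messages)} message(s)")
```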
```bash
# Get yesterday's messages
python main.py --yesterday

# Get messages from the last week
python main.py --last-days 7

# Get messages from October 8, 2025
python main.py --date 2025-10-08

# Get messages for multiple dates
python main.py --date 2025-10-08
python main.py --date 2025-10-07
python main.py --date 2025-10-06
```

The scraper includes comprehensive error handling:
- Network errors: Retries and graceful failure (see the sketch after this list)
- Parsing errors: Logs issues and continues with other messages
- Missing data: Handles missing fields gracefully
- Rate limiting: Built-in delays to avoid overwhelming the server
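A hedged illustration of that retry behavior; the project's actual implementation may differ, and fetch_with_retries is a hypothetical name:

```python
import logging
import time

import requests

def fetch_with_retries(url: str, retries: int = 3, throttle_sec: float = 0.5) -> str | None:
    """Fetch a URL, retrying on network errors; return None on final failure."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            time.sleep(throttle_sec * attempt)  # simple linear backoff
    return None  # graceful failure: the caller logs and moves on
```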
- requests: HTTP library for web requests
- beautifulsoup4: HTML parsing and extraction
- lxml: XML/HTML parser backend
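These pieces fit together roughly as follows; a minimal fetch-and-parse sketch (the selector is an assumption, not the project's actual parsing logic):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://mc.merill.net/message/MC1168294", timeout=30).text
soup = BeautifulSoup(html, "lxml")  # lxml as the parser backend

# Hypothetical extraction; the real scraper's selectors may differ
title = soup.find("h1")
print(title.get_text(strip=True) if title else "no title found")
```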
This project is for educational and research purposes. Please respect the terms of service of the target website and use responsibly.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This tool is not affiliated with Microsoft. Use responsibly and in accordance with the target website's terms of service.