A Python web scraper that extracts Microsoft Message Center announcements from mc.merill.net. It fetches message center entries for specific dates and saves them as structured JSON with detailed metadata, summaries, and images.
- Date-based scraping: Extract all message center entries for a specific date
- Structured data extraction: Captures metadata including message ID, service, tags, platforms, and roadmap information
- Rich content parsing: Extracts both summary and detailed information sections
- Image handling: Downloads and catalogs all images with alt text
- JSON output: Saves data in structured JSON format for easy processing
- Rate limiting: Built-in throttling to be respectful to the target website
- Error handling: Robust error handling with detailed logging
This project uses Python 3.13+ and can be installed using uv (recommended) or pip.
With uv (recommended):

```bash
# Clone the repository, then install dependencies from the project root
uv sync
```

With pip:

```bash
# Install dependencies
pip install -r requirements.txt
```
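If installing with pip, requirements.txt should list the dependencies named at the end of this document; a minimal, unpinned sketch (pinning versions is left as a choice):

```text
requests
beautifulsoup4
lxml
```

The main script provides a simple command-line interface: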
```bash
# Scrape messages for a specific date
python main.py --date 2025-10-08

# Scrape messages for yesterday
python main.py --yesterday

# Scrape messages for today
python main.py --today

# Scrape messages for the last 7 days
python main.py --last-days 7

# Show help
python main.py --help
```
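Under the hood, main.py presumably resolves these flags to concrete dates before scraping; a minimal, hypothetical argparse sketch (not the project's actual code, and it assumes throttle_sec has a default):

```python
import argparse
from datetime import datetime, timedelta

from scraper import run_for_date

parser = argparse.ArgumentParser(description="Scrape Message Center entries by date")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--date", help="date to scrape, YYYY-MM-DD")
group.add_argument("--yesterday", action="store_true")
group.add_argument("--today", action="store_true")
group.add_argument("--last-days", type=int, metavar="N")
args = parser.parse_args()

midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
if args.date:
    run_for_date(datetime.strptime(args.date, "%Y-%m-%d"))
elif args.yesterday:
    run_for_date(midnight - timedelta(days=1))
elif args.today:
    run_for_date(midnight)
else:
    for offset in range(args.last_days):  # today backwards N days
        run_for_date(midnight - timedelta(days=offset))
```

You can also use the scraper programmatically: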
```python
from datetime import datetime

from scraper import run_for_date

# Scrape messages for a specific date
target_date = datetime(2025, 10, 8)
run_for_date(target_date, throttle_sec=0.5)
```
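The same entry point can drive a range of dates; a short sketch building on the call above:

```python
from datetime import datetime, timedelta

from scraper import run_for_date

# Scrape an inclusive range of dates, oldest first
start = datetime(2025, 10, 6)
for offset in range(3):  # 2025-10-06 through 2025-10-08
    run_for_date(start + timedelta(days=offset), throttle_sec=0.5)
```

The scraper generates JSON files in the mc_messages/ directory with the following structure: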
```json
[
  {
    "id": "MC1168294",
    "url": "https://mc.merill.net/message/MC1168294",
    "title": "Microsoft Teams: Apps now supported in Shared Channels",
    "service": "Microsoft Teams",
    "published": "Oct 8, 2025",
    "meta": {
      "tags": ["Admin impact", "New feature", "User impact"],
      "platforms": ["Desktop", "Mac"],
      "message_center_url": "https://admin.microsoft.com/#/MessageCenter/:/messages/MC1168294",
      "service_exact": "Microsoft Teams",
      "published_exact": "Oct 8, 2025",
      "roadmap_id": "505791",
      "roadmap_url": "https://www.microsoft.com/en-US/microsoft-365/roadmap?filters=&searchterms=505791"
    },
    "summary": {
      "text": "Brief summary text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": []
    },
    "more_information": {
      "text": "Detailed information text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description"
        }
      ]
    },
    "images": [
      {
        "src": "https://example.com/image.jpg",
        "alt": "Image description"
      }
    ]
  }
]
```

- id: Message Center ID (e.g., "MC1168294")
- url: Direct link to the message
- title: Message title
- service: Microsoft service (e.g., "Microsoft Teams", "SharePoint")
- published: Publication date as string
- tags: Array of tags/categories
- platforms: Array of supported platforms
- message_center_url: Link to Microsoft Admin Center
- service_exact: Exact service name from metadata
- published_exact: Exact publication date from metadata
- roadmap_id: Associated roadmap ID
- roadmap_url: Link to Microsoft 365 roadmap
Both summary and more_information contain:
- text: Plain text version of the content
- blocks: Structured content blocks (paragraphs, lists, tables, HTML)
- images: Array of images with src and alt text
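For example, the nested content of a single message can be read like this (a sketch; the file name is an assumption, as the naming scheme inside mc_messages/ isn't documented here):

```python
import json

# Load one scraped file (file name assumed for illustration)
with open("mc_messages/2025-10-08.json", encoding="utf-8") as f:
    messages = json.load(f)

for message in messages:
    print(message["id"], "-", message["title"])
    print("Roadmap:", message["meta"]["roadmap_url"])
    # Plain-text summary, then any structured paragraph blocks
    print(message["summary"]["text"])
    for block in message["more_information"]["blocks"]:
        if block["type"] == "paragraph":
            print(block["text"])
    # Every image with its alt text
    for image in message["images"]:
        print(image["src"], "-", image["alt"])
```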
The scraper includes a built-in delay between requests (default: 0.5 seconds) to be respectful of the target website. You can adjust this in the code:

```python
run_for_date(target_date, throttle_sec=1.0)  # 1 second delay
```

Scraped data is saved to the mc_messages/ directory by default. The directory is created automatically if it doesn't exist.
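To process everything scraped so far, the output directory can be globbed (a sketch, assuming one JSON file per scraped date):

```python
import glob
import json

for path in sorted(glob.glob("mc_messages/*.json")):
    with open(path, encoding="utf-8") as f:
        messages = json.load(f)
    print(f"{path}: {len(messages)} message(s)")
```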
```bash
# Get yesterday's messages
python main.py --yesterday

# Get messages from the last week
python main.py --last-days 7

# Get messages from October 8, 2025
python main.py --date 2025-10-08

# Get messages for multiple dates
python main.py --date 2025-10-08
python main.py --date 2025-10-07
python main.py --date 2025-10-06
```

The scraper includes comprehensive error handling:
- Network errors: Retries and graceful failure (see the sketch after this list)
- Parsing errors: Logs issues and continues with other messages
- Missing data: Handles missing fields gracefully
- Rate limiting: Built-in delays to avoid overwhelming the server
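A hedged illustration of that retry behavior; the project's actual implementation may differ, and fetch_with_retries is a hypothetical name:

```python
import logging
import time

import requests

def fetch_with_retries(url: str, retries: int = 3, throttle_sec: float = 0.5) -> str | None:
    """Fetch a URL, retrying on network errors; return None on final failure."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            time.sleep(throttle_sec * attempt)  # simple linear backoff
    return None  # graceful failure: the caller logs and moves on
```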
- requests: HTTP library for web requests
- beautifulsoup4: HTML parsing and extraction
- lxml: XML/HTML parser backend
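These pieces fit together roughly as follows; a minimal fetch-and-parse sketch (the selector is an assumption, not the project's actual parsing logic):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://mc.merill.net/message/MC1168294", timeout=30).text
soup = BeautifulSoup(html, "lxml")  # lxml as the parser backend

# Hypothetical extraction; the real scraper's selectors may differ
title = soup.find("h1")
print(title.get_text(strip=True) if title else "no title found")
```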
This project is for educational and research purposes. Please respect the terms of service of the target website and use responsibly.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This tool is not affiliated with Microsoft. Use responsibly and in accordance with the target website's terms of service.