Python scraper for Microsoft Message Center announcements with structured JSON output

WestphalJonas/microsoft-message-center-scraper

Microsoft Message Center Scraper

A Python web scraper for extracting Microsoft Message Center announcements from mc.merill.net. This tool scrapes message center entries for specific dates and saves them as structured JSON data with detailed metadata, summaries, and images.

Features

  • Date-based scraping: Extract all message center entries for a specific date
  • Structured data extraction: Captures metadata including message ID, service, tags, platforms, and roadmap information
  • Rich content parsing: Extracts both summary and detailed information sections
  • Image handling: Downloads and catalogs all images with alt text
  • JSON output: Saves data in structured JSON format for easy processing
  • Rate limiting: Built-in throttling to be respectful to the target website
  • Error handling: Robust error handling with detailed logging

Installation

This project requires Python 3.13 or later and can be installed with uv (recommended) or pip.

Using uv (recommended)

# Install dependencies
uv sync

Using pip

# Clone the repository

# Install dependencies
pip install -r requirements.txt

Usage

Command Line Interface

The main script provides a simple command-line interface:

# Scrape messages for a specific date
python main.py --date 2025-10-08

# Scrape messages for yesterday
python main.py --yesterday

# Scrape messages for today
python main.py --today

# Scrape messages for the last 7 days
python main.py --last-days 7

# Show help
python main.py --help
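
For readers who want to see how flags like these can map to concrete dates, here is a minimal sketch. It is not the project's main.py: the function name resolve_dates, the mutually exclusive grouping, and the reading of --last-days as "the N days ending today, inclusive" are all assumptions made for illustration.

```python
import argparse
from datetime import date, datetime, timedelta


def resolve_dates(argv, today=None):
    """Map the documented flags to a list of dates, oldest first (hypothetical helper)."""
    today = today or date.today()
    parser = argparse.ArgumentParser(description="Hypothetical flag handling")
    # Assumption: exactly one of the four flags is given per run.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--date", type=lambda s: datetime.strptime(s, "%Y-%m-%d").date())
    group.add_argument("--yesterday", action="store_true")
    group.add_argument("--today", action="store_true")
    group.add_argument("--last-days", type=int, metavar="N")
    args = parser.parse_args(argv)

    if args.date:
        return [args.date]
    if args.yesterday:
        return [today - timedelta(days=1)]
    if args.today:
        return [today]
    # Assumption: "last N days" means the N days ending today, inclusive.
    return [today - timedelta(days=i) for i in range(args.last_days - 1, -1, -1)]
```

Passing today explicitly keeps the helper deterministic, which makes it easy to check against a fixed date.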

Python API

You can also use the scraper programmatically:

from scraper import run_for_date
from datetime import datetime

# Scrape messages for a specific date
target_date = datetime(2025, 10, 8)
run_for_date(target_date, throttle_sec=0.5)

Output Format

The scraper generates JSON files in the mc_messages/ directory with the following structure:

[
  {
    "id": "MC1168294",
    "url": "https://mc.merill.net/message/MC1168294",
    "title": "Microsoft Teams: Apps now supported in Shared Channels",
    "service": "Microsoft Teams",
    "published": "Oct 8, 2025",
    "meta": {
      "tags": ["Admin impact", "New feature", "User impact"],
      "platforms": ["Desktop", "Mac"],
      "message_center_url": "https://admin.microsoft.com/#/MessageCenter/:/messages/MC1168294",
      "service_exact": "Microsoft Teams",
      "published_exact": "Oct 8, 2025",
      "roadmap_id": "505791",
      "roadmap_url": "https://www.microsoft.com/en-US/microsoft-365/roadmap?filters=&searchterms=505791"
    },
    "summary": {
      "text": "Brief summary text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": []
    },
    "more_information": {
      "text": "Detailed information text...",
      "blocks": [{"type": "paragraph", "text": "..."}],
      "images": [
        {
          "src": "https://example.com/image.jpg",
          "alt": "Image description"
        }
      ]
    },
    "images": [
      {
        "src": "https://example.com/image.jpg",
        "alt": "Image description"
      }
    ]
  }
]
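
Because each output file is a JSON array of message objects, it can be consumed with the standard library alone. The sketch below shows one way to load a file and slice it by service or tag. The helper names are hypothetical, and the file naming inside mc_messages/ is not documented here, so the path is left to the caller.

```python
import json
from collections import Counter
from pathlib import Path


def load_messages(path):
    """Read one output file and return its list of message dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def count_by_service(messages):
    """Count messages per value of the top-level "service" field."""
    return Counter(m.get("service", "unknown") for m in messages)


def with_tag(messages, tag):
    """Keep messages whose meta.tags list contains the given tag."""
    return [m for m in messages if tag in m.get("meta", {}).get("tags", [])]
```

For example, with_tag(messages, "Admin impact") would keep only the entries tagged as affecting administrators.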

Data Structure

Message Fields

  • id: Message Center ID (e.g., "MC1168294")
  • url: Direct link to the message
  • title: Message title
  • service: Microsoft service (e.g., "Microsoft Teams", "SharePoint")
  • published: Publication date as string
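
Since published is stored as a string, sorting or filtering by date needs a conversion step. A minimal sketch follows, assuming every value has the "Oct 8, 2025" shape shown in the sample output. The parse_published helper is hypothetical, and it uses a fixed month table rather than strptime so the result does not depend on the system locale.

```python
from datetime import date

# English month abbreviations as they appear in the sample output ("Oct 8, 2025").
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_published(text):
    """Convert a string such as "Oct 8, 2025" to a datetime.date."""
    month, day, year = text.replace(",", " ").split()
    # Only the first three letters are used, so a full month name also works.
    return date(int(year), MONTHS[month[:3]], int(day))
```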

Metadata (meta)

  • tags: Array of tags/categories
  • platforms: Array of supported platforms
  • message_center_url: Link to Microsoft Admin Center
  • service_exact: Exact service name from metadata
  • published_exact: Exact publication date from metadata
  • roadmap_id: Associated roadmap ID
  • roadmap_url: Link to Microsoft 365 roadmap

Content Sections

Both summary and more_information contain:

  • text: Plain text version of the content
  • blocks: Structured content blocks (paragraphs, lists, tables, HTML)
  • images: Array of images with src and alt text
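
The two content sections share one shape, so the same helpers can walk either of them. The sketch below is illustrative only: the function names are hypothetical, and "paragraph" is the only block type whose name appears in the sample output, so other block types are skipped rather than guessed at.

```python
def section_images(message):
    """Return unique image dicts from summary and more_information, in order."""
    seen, result = set(), []
    for key in ("summary", "more_information"):
        for image in message.get(key, {}).get("images", []):
            src = image.get("src")
            if src not in seen:  # skip an image already listed by an earlier section
                seen.add(src)
                result.append(image)
    return result


def paragraphs(section):
    """Return the text of every block whose type is "paragraph"."""
    return [
        block.get("text", "")
        for block in section.get("blocks", [])
        if block.get("type") == "paragraph"
    ]
```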

Configuration

Rate Limiting

The scraper includes a built-in delay between requests (default: 0.5 seconds) to be respectful to the target website. You can adjust this in the code:

run_for_date(target_date, throttle_sec=1.0)  # 1 second delay
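
To illustrate what a fixed delay between requests amounts to, here is a minimal sketch of the general technique. It is not taken from the project's source: the throttled generator and its sleep parameter are hypothetical, and only the throttle_sec name and its 0.5 second default come from the text above.

```python
import time


def throttled(items, throttle_sec=0.5, sleep=time.sleep):
    """Yield items one by one, pausing throttle_sec between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(throttle_sec)  # no pause before the first item
        yield item
```

Wrapping a list of URLs in throttled(...) before fetching them spaces the requests out. The injectable sleep argument exists only so the behaviour can be checked without actually waiting.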

Output Directory

Scraped data is saved to the mc_messages/ directory by default. The directory is created automatically if it doesn't exist.

Examples

Scrape Recent Messages

# Get yesterday's messages
python main.py --yesterday

# Get messages from the last week
python main.py --last-days 7

Scrape Specific Date

# Get messages from October 8, 2025
python main.py --date 2025-10-08

Batch Processing

# Get messages for multiple dates
python main.py --date 2025-10-08
python main.py --date 2025-10-07
python main.py --date 2025-10-06
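
The same batch can be driven from Python by looping over a date range. The sketch below assumes only the documented call shape run_for_date(target_date, throttle_sec=...). The each_day and run_batch helpers are hypothetical, and the scraping function is passed in as an argument so the loop itself stays independent of the project code.

```python
from datetime import datetime, timedelta


def each_day(start, end):
    """Yield one datetime per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def run_batch(start, end, run, throttle_sec=0.5):
    """Call run(day, throttle_sec=...) for every day in the range."""
    for day in each_day(start, end):
        run(day, throttle_sec=throttle_sec)


# Usage with the documented API (requires the project's scraper module):
# from scraper import run_for_date
# run_batch(datetime(2025, 10, 6), datetime(2025, 10, 8), run_for_date)
```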

Error Handling

The scraper includes comprehensive error handling:

  • Network errors: Retries and graceful failure
  • Parsing errors: Logs issues and continues with other messages
  • Missing data: Handles missing fields gracefully
  • Rate limiting: Built-in delays to avoid overwhelming the server
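
The list above does not say which retry strategy the scraper uses. As one common way to combine retries with graceful failure, here is a sketch using exponential backoff. The with_retries helper, its parameters, and the doubling delay are assumptions for illustration, not the project's implementation.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on an exception wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
```

A caller that wants to continue with other messages can wrap the call in its own try/except, log the failure, and move on to the next item.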

Dependencies

  • requests: HTTP library for web requests
  • beautifulsoup4: HTML parsing and extraction
  • lxml: XML/HTML parser backend

License

This project is for educational and research purposes. Please respect the terms of service of the target website and use responsibly.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Disclaimer

This tool is not affiliated with Microsoft. Use responsibly and in accordance with the target website's terms of service.
