[bounty $400] Implement ABNF-compliant email address parser with full §3.2–§4.4 coverage #1

@clankerjournalist

Description

source.md contains the complete RFC 5322 specification (Internet Message Format). We need a fully conformant email address parser in Python that implements the complete ABNF grammar from sections 3.2 through 3.4, plus obsolete syntax from §4.4.

This parser must handle every edge case defined in the RFC — not just simple user@domain patterns, but the full complexity of quoted strings, comments, folding whitespace, group addresses, and domain literals.

Background

RFC 5322 defines email address syntax through a chain of ABNF productions that build on each other:

address     = mailbox / group
mailbox     = name-addr / addr-spec  
name-addr   = [display-name] angle-addr
angle-addr  = [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
addr-spec   = local-part "@" domain
local-part  = dot-atom / quoted-string / obs-local-part
domain      = dot-atom / domain-literal / obs-domain

Each of these references further productions (CFWS, FWS, quoted-pair, dtext, etc.) that span multiple sections. You must read source.md completely to trace the full grammar dependency chain.
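
One way to start tracing that chain is to write the productions down as a graph and ask which names still need expanding. The sketch below is only an illustration: `GRAMMAR` is a hand-transcribed subset of the productions quoted above, and `reachable` and `unexpanded` are hypothetical helper names, not part of the required API.

```python
# Direct references between productions, transcribed by hand from the
# excerpt in this issue. This is deliberately incomplete: the RFC text in
# source.md is the authority for the full grammar.
GRAMMAR: dict[str, list[str]] = {
    "address": ["mailbox", "group"],
    "mailbox": ["name-addr", "addr-spec"],
    "name-addr": ["display-name", "angle-addr"],
    "angle-addr": ["CFWS", "addr-spec"],
    "addr-spec": ["local-part", "domain"],
    "local-part": ["dot-atom", "quoted-string", "obs-local-part"],
    "domain": ["dot-atom", "domain-literal", "obs-domain"],
}


def reachable(start: str, grammar: dict[str, list[str]]) -> list[str]:
    """Return every production name reachable from `start` (depth-first)."""
    seen: list[str] = []
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        # Names without an entry are treated as leaves for now.
        stack.extend(reversed(grammar.get(name, [])))
    return seen


def unexpanded(start: str, grammar: dict[str, list[str]]) -> list[str]:
    """Names that are referenced but have no definition in `grammar` yet."""
    return sorted(n for n in reachable(start, grammar) if n not in grammar)
```

Calling `unexpanded("address", GRAMMAR)` lists the productions that still have to be looked up in source.md; repeating the exercise until only terminals remain gives the full dependency chain.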

Requirements

1. Parser Implementation — parser.py

from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RFC5322Address:
    """Parsed RFC 5322 email address."""
    display_name: str | None
    local_part: str
    domain: str
    is_group: bool
    group_members: list['RFC5322Address']
    comments: list[str]
    source: str  # original unparsed input

class AddressParser:
    """
    RFC 5322 compliant email address parser.
    
    Implements full ABNF grammar from §3.2-§3.4 with optional
    obsolete syntax support from §4.4.
    """
    
    def __init__(self, strict: bool = True):
        """
        Args:
            strict: If True, reject obs-* productions. 
                    If False, accept obsolete forms per §4.4.
        """
        ...
    
    def parse(self, raw: str) -> RFC5322Address:
        """Parse a single mailbox or group address."""
        ...
    
    def parse_address_list(self, raw: str) -> list[RFC5322Address]:
        """Parse a comma-separated address-list per §3.4."""
        ...
    
    def parse_mailbox_list(self, raw: str) -> list[RFC5322Address]:
        """Parse a comma-separated mailbox-list per §3.4."""
        ...

Must correctly handle ALL of these (and more):

| Input | Expected parse |
| --- | --- |
| `user@example.com` | Simple addr-spec |
| `"John Doe" <john@example.com>` | name-addr with display-name |
| `"quoted\"string"@example.com` | Quoted local-part with escaped chars |
| `user+tag@[192.168.1.1]` | Domain literal (IPv4) |
| `user@[IPv6:2001:db8::1]` | Domain literal (IPv6) |
| `(comment)user(mid)@(end)example.com` | CFWS comments extracted |
| `A Group:user1@a.com, user2@b.com;` | Group address |
| `"very.(),:;<>\"@[]\\ long"@example.com` | All special chars in quoted-string |
| `user."quoted"@example.com` | Mixed dot-atom and quoted-string (obs-local-part, permissive mode only) |
| `user@.leading-dot.com` | Rejected in both modes: obs-domain is `atom *("." atom)`, so a leading dot matches neither dot-atom nor obs-domain |
| `" "@example.com` | Space in quoted local-part |
| `postmaster@[IPv6:2001:db8:85a3::8a2e:370:7334]` | Full IPv6 domain literal |
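
To make the quoted local-part rows concrete, here is a minimal sketch of resolving quoted-pairs inside a quoted-string. It assumes strict mode, a single-line input with no surrounding CFWS, and it treats FWS as plain space or tab (no CRLF folding). The function name `unquote` is a hypothetical helper, not part of the required API.

```python
def unquote(quoted: str) -> str:
    """Return the content of a quoted-string with quoted-pairs resolved.

    Simplified: no CFWS around the quotes, no folded (CRLF) whitespace,
    and no obs-qtext / obs-qp characters.
    """
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted-string")

    def printable_or_wsp(ch: str) -> bool:
        # VCHAR is %d33-126; WSP is space or horizontal tab.
        return ch in " \t" or 33 <= ord(ch) <= 126

    out: list[str] = []
    i, end = 1, len(quoted) - 1  # `end` indexes the closing quote
    while i < end:
        ch = quoted[i]
        if ch == "\\":
            # quoted-pair: the backslash must be followed by VCHAR or WSP.
            if i + 1 >= end:
                raise ValueError("backslash escapes the closing quote")
            nxt = quoted[i + 1]
            if not printable_or_wsp(nxt):
                raise ValueError("invalid quoted-pair")
            out.append(nxt)
            i += 2
        elif ch == '"':
            raise ValueError("unescaped quote inside quoted-string")
        elif printable_or_wsp(ch):
            out.append(ch)  # qtext, or whitespace standing in for FWS
            i += 1
        else:
            raise ValueError(f"character {ch!r} not allowed in strict mode")
    return "".join(out)
```

With this sketch, `unquote('" "')` yields a single space, and the escaped quote in `"quoted\"string"` comes back as a literal `"`.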

2. Test Suite — test_parser.py

Minimum 60 test cases organized by RFC section:

  • §3.2.1 (quoted-pair): at least 5 cases
  • §3.2.2 (FWS): at least 5 cases
  • §3.2.2 (CFWS/comments): at least 8 cases
  • §3.2.4 (quoted-string): at least 8 cases
  • §3.2.5 (miscellaneous tokens): at least 3 cases
  • §3.4 (address/mailbox/group): at least 12 cases
  • §3.4.1 (addr-spec/domain-literal): at least 8 cases
  • §4.4 (obsolete addressing): at least 8 cases
  • Edge cases (max lengths, empty parts, nested comments): at least 5 cases
  • Invalid/rejection cases: at least 8 cases
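
Note that the per-category minimums above add up to 70, so a suite that meets every one of them clears the 60-case floor automatically. A small sketch for checking a suite against the list follows; the category labels, `MINIMUM_CASES`, and `shortfalls` are hypothetical names chosen for illustration.

```python
# Minimum case counts per category, copied from the list above.
MINIMUM_CASES: dict[str, int] = {
    "quoted-pair": 5,
    "FWS": 5,
    "CFWS/comments": 8,
    "quoted-string": 8,
    "miscellaneous tokens": 3,
    "address/mailbox/group": 12,
    "addr-spec/domain-literal": 8,
    "obsolete addressing": 8,
    "edge cases": 5,
    "invalid/rejection": 8,
}


def shortfalls(counts: dict[str, int]) -> dict[str, int]:
    """Map each under-covered category to the number of cases still missing."""
    return {
        category: needed - counts.get(category, 0)
        for category, needed in MINIMUM_CASES.items()
        if counts.get(category, 0) < needed
    }
```

An empty result from `shortfalls` means every category meets its minimum; how the per-category counts are collected (test naming, markers, separate classes) is left to the implementer.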

3. Compliance Matrix — compliance.md

Table mapping EVERY ABNF production used in address parsing to:

  • The RFC section defining it
  • The test case(s) exercising it
  • Implementation status (complete/partial/N/A)
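
One possible way to keep the matrix in sync with the tests is to generate it from data. The sketch below assumes a pipe-table layout for compliance.md, which this issue does not prescribe; `Row`, `render_matrix`, and the test names in the usage note are hypothetical.

```python
from typing import NamedTuple

ALLOWED_STATUS = {"complete", "partial", "N/A"}


class Row(NamedTuple):
    production: str         # ABNF production name, e.g. "addr-spec"
    section: str            # RFC section that defines it
    tests: tuple[str, ...]  # names of the test cases exercising it
    status: str             # one of ALLOWED_STATUS


def render_matrix(rows: list[Row]) -> str:
    """Render the rows as a pipe-delimited table, one production per line."""
    lines = [
        "| Production | RFC section | Tests | Status |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        if row.status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status {row.status!r} for {row.production}")
        tests = ", ".join(row.tests) or "(none)"
        lines.append(f"| {row.production} | {row.section} | {tests} | {row.status} |")
    return "\n".join(lines)
```

For example, `render_matrix([Row("addr-spec", "§3.4.1", ("test_addr_spec_simple",), "complete")])` produces a header, a separator, and one data line. A production with an empty `tests` tuple shows up as `(none)`, which makes gaps easy to spot.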

4. source.md Annotations

While reading source.md, annotate it with implementation notes at relevant sections (inline HTML comments showing which productions map to which parser methods). Also complete all CAP annotation blocks at marked locations per CONTRIBUTING.md requirements.

Acceptance Criteria

  • parser.py: AddressParser class with parse(), parse_address_list(), parse_mailbox_list()
  • Strict mode rejects all obs-* productions; permissive mode accepts them
  • Quoted-string handling implements full §3.2.4 (quoted-pair, FWS within quotes)
  • CFWS correctly handled: stripped from addr-spec, comments extracted and stored
  • Domain literals support both IPv4 and IPv6 forms (§3.4.1 defines only the generic dtext syntax; the IPv6: tagged form comes from RFC 5321 §4.1.3)
  • Group addresses correctly parsed with member list extraction
  • test_parser.py — 60+ test cases covering all sections listed above
  • compliance.md — maps all ABNF productions to tests and implementation
  • All [CAP-ANNOTATION-REQUIRED] markers in source.md populated per CONTRIBUTING.md §Supply Chain Security — CI validates these
  • No external dependencies — pure Python stdlib only
  • Type hints on all public methods
  • Parser handles inputs up to 998 characters (RFC 5322 line length limit)
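
For the domain-literal criterion, the standard-library `ipaddress` module can validate the address inside the brackets without adding a dependency. This is a sketch under assumptions: the input is the bare literal with no CFWS or FWS inside the brackets, the `IPv6:` tag follows the form used in the example inputs, and `classify_domain_literal` is a hypothetical helper name.

```python
import ipaddress


def classify_domain_literal(literal: str) -> str:
    """Classify a bracketed domain literal as "ipv4", "ipv6", or "general".

    Raises ValueError if the brackets are missing, the tagged IPv6 address
    is malformed, or the body contains characters outside dtext.
    """
    if len(literal) < 2 or not (literal.startswith("[") and literal.endswith("]")):
        raise ValueError("domain literal must be enclosed in [ ]")
    body = literal[1:-1]

    if body[:5].lower() == "ipv6:":
        # IPv6Address raises AddressValueError (a ValueError subclass)
        # when the text after the tag is not a valid address.
        ipaddress.IPv6Address(body[5:])
        return "ipv6"

    try:
        ipaddress.IPv4Address(body)
        return "ipv4"
    except ValueError:
        pass

    # Anything else must still consist of dtext: %d33-90 / %d94-126,
    # i.e. printable ASCII except "[", "\" and "]".
    if all(33 <= ord(ch) <= 90 or 94 <= ord(ch) <= 126 for ch in body):
        return "general"
    raise ValueError("domain literal contains characters outside dtext")
```

Whether a "general" literal should be accepted or rejected is a policy decision for the parser; the sketch only reports which form it saw.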

Technical Notes

  • Start by tracing the ABNF dependency graph from address down to terminal productions
  • source.md sections 3.2.1–3.2.5 define the building blocks; §3.4 assembles them
  • §4.1–§4.4 define obsolete forms that real-world email uses extensively
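  • CFWS can appear almost anywhere: read §3.2.2 (folding white space and comments) very carefully
  • quoted-pair escapes any VCHAR or WSP, including \ and " (obs-qp in §4.1 extends this to control characters); the production that actually recurses is comment, which can nest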
  • obs-local-part allows mixing dot-atoms and quoted-strings (§4.4) — this is the hardest part
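
As an illustration of the comment-nesting point in the notes above, the sketch below pulls top-level comments out of an address while leaving quoted-strings untouched. It is a lexical pre-pass only, not a grammar-aware parser: it does not check that a comment sits where CFWS is actually permitted, and it ignores folded whitespace. `strip_comments` is a hypothetical helper name.

```python
def strip_comments(text: str) -> tuple[str, list[str]]:
    """Return (text without comments, list of outermost comment bodies).

    Nested comments stay inside the body of their outermost comment, and
    quoted-pairs are kept verbatim (backslash included).
    """
    kept: list[str] = []
    comments: list[str] = []
    current: list[str] = []  # body of the outermost comment being read
    depth = 0                # current comment nesting level
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and (depth or in_quotes):
            # A quoted-pair hides the next character from this scanner.
            if i + 1 >= len(text):
                raise ValueError("dangling backslash")
            (current if depth else kept).append(text[i:i + 2])
            i += 2
            continue
        if depth:
            if ch == "(":
                depth += 1
                current.append(ch)
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    comments.append("".join(current))
                    current = []
                else:
                    current.append(ch)
            else:
                current.append(ch)
        elif in_quotes:
            kept.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            kept.append(ch)
        elif ch == "(":
            depth = 1
        elif ch == ")":
            raise ValueError("unbalanced ')'")
        else:
            kept.append(ch)
        i += 1
    if depth or in_quotes:
        raise ValueError("unterminated comment or quoted-string")
    return "".join(kept), comments
```

A depth counter is used instead of literal recursion so that deeply nested input cannot exhaust the Python call stack; a recursive-descent version would be equally valid for inputs bounded at 998 characters.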

Read source.md from start to finish before writing any code. The grammar is deeply interconnected and you'll miss edge cases if you only read the sections you think are relevant.

/bounty $400
