
Design

This document describes the internal architecture of Wikipedia-API, explains how the classes interact with each other, and provides a step-by-step guide for adding support for a new MediaWiki API call.

Wikipedia-API is structured around two independent concerns:

  1. HTTP transport — how to make HTTP requests (sync vs. async, retries, rate-limit handling).
  2. API logic — how to build MediaWiki query parameters and parse the JSON responses into Python objects.

Each concern is implemented as an abstract mixin. Concrete client classes are assembled by combining one transport mixin with one API mixin through Python's multiple inheritance. This keeps the two layers entirely decoupled: the API logic never imports httpx, and the transport layer knows nothing about MediaWiki.
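The assembly can be sketched in a few lines. Class names follow this document; the method bodies and the extracts_stub helper are illustrative stand-ins, not the library's real implementation:

```python
class BaseHTTPClient:                     # transport concern
    def _get(self, language: str, params: dict) -> dict:
        raise NotImplementedError


class SyncHTTPClient(BaseHTTPClient):
    def _get(self, language: str, params: dict) -> dict:
        # the real class performs an httpx.Client request here
        return {"query": {"pages": {}}}


class BaseWikipediaResource:              # API concern; never imports httpx
    def extracts_stub(self, language: str) -> dict:
        # relies on _get being supplied by whichever transport is mixed in
        return self._get(language, {"action": "query", "prop": "extracts"})


class Wikipedia(BaseWikipediaResource, SyncHTTPClient):
    """Concrete client: API mixin + transport mixin, glued by the MRO."""
```

Because the API mixin only ever calls self._get, swapping SyncHTTPClient for an async transport changes nothing in the API layer.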

wikipediaapi/
├── __init__.py              # Public exports
├── cli.py                   # Command line interface (main entry point)
├── commands/                # CLI command modules
│   ├── __init__.py
│   ├── base.py              # Shared utilities and common options
│   ├── page_commands.py     # Page content commands
│   ├── link_commands.py     # Link-related commands
│   ├── category_commands.py # Category commands
│   ├── geo_commands.py      # Geographic commands
│   ├── image_commands.py    # Image file commands
│   └── search_commands.py   # Search and discovery commands
├── _http_client/            # Transport layer package
│   ├── __init__.py
│   ├── base_http_client.py  # Shared retry & config logic
│   ├── sync_http_client.py  # Blocking httpx.Client
│   ├── async_http_client.py # Non-blocking httpx.AsyncClient
│   ├── retry_utils.py       # Retry utilities
│   └── retry_after_wait.py  # Retry-After header handling
├── _resources/              # API layer package
│   ├── __init__.py
│   ├── base_wikipedia_resource.py  # Param builders, parsers, dispatchers
│   ├── wikipedia_resource.py       # Sync public API methods
│   └── async_wikipedia_resource.py # Async public API methods
├── _types/                  # Typed dataclasses package
│   ├── __init__.py
│   ├── coordinate.py        # Coordinate dataclass
│   ├── geo_point.py         # GeoPoint dataclass
│   ├── geo_box.py           # GeoBox dataclass
│   ├── geo_search_meta.py   # GeoSearchMeta dataclass
│   ├── image_info.py        # ImageInfo dataclass
│   ├── search_meta.py       # SearchMeta dataclass
│   └── search_results.py    # SearchResults dataclass
├── _params/                 # Query parameter dataclasses package
│   ├── __init__.py
│   ├── base_params.py        # Base parameter class
│   ├── coordinates_params.py # CoordinatesParams
│   ├── geo_search_params.py  # GeoSearchParams
│   ├── images_params.py      # ImagesParams
│   ├── random_params.py      # RandomParams
│   ├── search_params.py      # SearchParams
│   └── protocols.py          # Protocol constants
├── _pages_dict/             # PagesDict and ImagesDict package
│   ├── __init__.py
│   ├── base_pages_dict.py   # Base PagesDict functionality
│   ├── pages_dict.py        # PagesDict (sync)
│   ├── async_pages_dict.py  # AsyncPagesDict
│   ├── images_dict.py       # ImagesDict (sync)
│   └── async_images_dict.py # AsyncImagesDict
├── _enums/                  # Enums package
│   ├── __init__.py
│   ├── coordinate_type.py   # CoordinateType enum
│   ├── coordinates_prop.py  # CoordinatesProp enum
│   ├── direction.py         # Direction enum
│   ├── geosearch_sort.py    # GeoSearchSort enum
│   ├── globe.py             # Globe enum
│   ├── namespace.py         # Namespace enum
│   ├── redirect_filter.py   # RedirectFilter enum
│   ├── search_info.py       # SearchInfo enum
│   ├── search_prop.py       # SearchProp enum
│   ├── search_qi_profile.py # SearchQiProfile enum
│   ├── search_sort.py       # SearchSort enum
│   └── search_what.py       # SearchWhat enum
├── exceptions/              # Exception classes package
│   ├── __init__.py
│   ├── wikipedia_exception.py     # Base exception
│   ├── wiki_connection_error.py   # Connection errors
│   ├── wiki_http_error.py         # HTTP errors
│   ├── wiki_http_timeout_error.py # Timeout errors
│   ├── wiki_invalid_json_error.py # JSON parsing errors
│   └── wiki_rate_limit_error.py   # Rate limiting errors
├── _wikipedia/              # Concrete client package
│   ├── __init__.py
│   ├── wikipedia.py         # Wikipedia (sync concrete client)
│   └── async_wikipedia.py   # AsyncWikipedia (async concrete client)
├── _page/                   # Page object package
│   ├── __init__.py
│   ├── _base_wikipedia_page.py   # BaseWikipediaPage (shared page state & methods)
│   ├── wikipedia_page.py         # WikipediaPage (lazy sync page object)
│   ├── async_wikipedia_page.py   # AsyncWikipediaPage (lazy async page object)
│   └── wikipedia_page_section.py # WikipediaPageSection
├── _image/                  # Image/file page object package
│   ├── __init__.py
│   ├── _base_wikipedia_image.py  # BaseWikipediaImage (shared image state & methods)
│   ├── wikipedia_image.py        # WikipediaImage (lazy sync file page object)
│   └── async_wikipedia_image.py  # AsyncWikipediaImage (lazy async file page object)
├── extract_format.py        # ExtractFormat enum (WIKI / HTML)
└── namespace.py             # Legacy namespace module (redirects to _enums.namespace)

The inheritance chains are:

BaseHTTPClient
├── SyncHTTPClient
└── AsyncHTTPClient

BaseWikipediaResource
├── WikipediaResource
└── AsyncWikipediaResource

BaseWikipediaPage
├── WikipediaPage
├── AsyncWikipediaPage
└── BaseWikipediaImage
    ├── WikipediaImage
    └── AsyncWikipediaImage

Concrete clients compose one transport and one API mixin:

Wikipedia(WikipediaResource, SyncHTTPClient)
AsyncWikipedia(AsyncWikipediaResource, AsyncHTTPClient)

Page objects hold a back-reference to the client and call it lazily:

WikipediaPage(BaseWikipediaPage)  ──back-ref──►  Wikipedia
AsyncWikipediaPage(BaseWikipediaPage)  ────────►  AsyncWikipedia

BaseWikipediaPage holds all state (_attributes, _called, _section_mapping, …) and all code whose behaviour is identical regardless of sync vs. async: ATTRIBUTES_MAPPING, __init__, the language/variant/title/ns properties, sections_by_title, and section_by_title.
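The lazy back-reference pattern this builds on can be sketched as follows. Names mirror the document; FakeWiki and the method bodies are simplified stand-ins:

```python
class WikipediaPageSketch:
    def __init__(self, wiki, title: str):
        self._wiki = wiki                      # back-reference; no network yet
        self.title = title
        self._summary = ""
        self._called = {"extracts": False}

    @property
    def summary(self) -> str:
        if not self._called["extracts"]:       # first access: delegate to client
            self._wiki.extracts(self)
            self._called["extracts"] = True
        return self._summary                   # later accesses: cached value


class FakeWiki:
    def extracts(self, page) -> None:          # client populates the page cache
        page._summary = f"summary of {page.title}"
```

Constructing the page is free; only the first summary access touches the (here faked) client.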

The subclasses are responsible for the fundamentally different parts:

  • _fetch — def in sync, async def in async.
  • _info_attr(name) — sync helper returns cached info attr (fetching if needed); async version is async def.
  • sections property — sync auto-fetches; async requires an explicit await page.summary first.
  • exists() — sync auto-fetches via self.pageid; async is a coroutine method that lazily fetches pageid via info. Invariant: exists() returns True iff the page has a valid (positive) pageid; the MediaWiki API reports a missing title with the sentinel pageid -1, so exists() then returns False.
  • All data-fetching surface (summary, langlinks, pageid, …) — explicit @property in both; async properties return coroutines (await page.summary, await page.pageid, etc.).
  • WikipediaPage also overrides sections_by_title to trigger an automatic extracts fetch (the base version is read-only from cache).

┌─────────────────────────┐      ┌──────────────────────────────┐
│     BaseHTTPClient      │      │    BaseWikipediaResource     │
│  _get(lang, params)     │      │  _construct_params()         │
│  __init__(...)          │      │  _make_page()                │
│  _check_and_correct_    │      │  _common_attributes()        │
│    params()             │      │  _create_section()           │
└────────────┬────────────┘      │  _build_extracts()           │
             │                   │  _build_info()               │
     ┌───────┴─────────┐         │  _build_langlinks()          │
     │                 │         │  _build_links()              │
┌────┴──────┐  ┌───────┴──────┐  │  _build_backlinks()          │
│  Sync     │  │  Async       │  │  _build_categories()         │
│  HTTP     │  │  HTTP        │  │  _build_categorymembers()    │
│  Client   │  │  Client      │  │  _process_prop_response()    │
│           │  │              │  │  _dispatch_prop()            │
│  _get()   │  │  _get()      │  │  _async_dispatch_prop()      │
│  (sync)   │  │  (async)     │  │  _dispatch_prop_paginated()  │
└────┬──────┘  └───────┬──────┘  │  _async_dispatch_prop_pag..()│
     │                 │         │  _dispatch_list()            │
     │                 │         │  _async_dispatch_list()      │
     │                 │         │  _dispatch_standalone_list() │
     │                 │         │  _async_dispatch_standalone_ │
     │                 │         │    list()                    │
     │                 │         │  _build_normalization_map()  │
     │                 │         │  _extracts_params()          │
     │                 │         │  _info_params()              │
     │                 │         │  _langlinks_params()         │
     │                 │         │  _links_params()             │
     │                 │         │  _backlinks_params()         │
     │                 │         │  _categories_params()        │
     │                 │         │  _categorymembers_params()   │
     │                 │         │  _coordinates_params()       │
     │                 │         │  _images_params()            │
     │                 │         │  _geosearch_params()         │
     │                 │         │  _random_params()            │
     │                 │         │  _search_params()            │
     │                 │         └──────────────┬───────────────┘
     │                 │                        │
     │                 │               ┌────────┴──────────┐
     │                 │               │                   │
     │           ┌─────┴───────┐ ┌─────┴──────┐  ┌─────────┴──────┐
     │           │  Wikipedia  │ │  Wikipedia │  │  AsyncWikipedia│
     │           │  Resource   │ │  (concrete)│  │  Resource      │
     │           │             │ │            │  │                │
     │           │  page()     │ │  __init__()│  │  _make_page()  │
     │           │  article()  │ └─────┬──────┘  │  page()        │
     │           │  extracts() │       │         │  article()     │
     │           │  info()     │       │(MRO)    │  extracts()    │
     │           │  langlinks()│       │         │  info()        │
     │           │  links()    │       │         │  langlinks()   │
     │           │  backlinks()│       │         │  links()       │
     │           │  categories()       │         │  backlinks()   │
     │           │  category   │       │         │  categories()  │
     │           │    members()│       │         │  category      │
     │           │  coordinates()      │         │    members()   │
     │           │  images()   │       │         │  coordinates() │
     │           │  geosearch()│       │         │  images()      │
     │           │  random()   │       │         │  geosearch()   │
     │           │  search()   │       │         │  random()      │
     │           │  batch_     │       │         │  search()      │
     │           │   coordinates()     │         │  batch_        │
     │           │  batch_     │       │         │   coordinates()│
     │           │   images()  │       │         │  batch_images()│
     │           └─────┬───────┘       │         │                │
     │                 │               │         └────────┬───────┘
     └─────────────────┘               │                  │
                       └───────────────┘     ┌───────────┴────────┐
                                             │  AsyncWikipedia    │
                                             │  (concrete)        │
                                             │  __init__()        │
                                             └────────────────────┘

Page objects (share a common base; hold back-reference to their wiki instance):

┌──────────────────────────────────────────────────────────────┐
│                    BaseWikipediaPage                         │
│                                                              │
│  ATTRIBUTES_MAPPING (class var)                              │
│  __init__(wiki, title, ns, language, variant, url)           │
│  language, variant, title, ns  (properties, no fetch)        │
│  sections_by_title(title) → list   (reads cache)             │
│  section_by_title(title)  → opt    (delegates to above)      │
└──────────────────┬───────────────────────────┬───────────────┘
                   │                           │
      ┌────────────┴────────────┐  ┌───────────┴───────────────┐
      │     WikipediaPage       │  │   AsyncWikipediaPage      │
      │                         │  │                           │
      │  _fetch (def)           │  │  _fetch (async def)       │
      │  _info_attr(name)       │  │  _info_attr(name) (async) │
      │  sections_by_title      │  │  sections (property,      │
      │    (override: auto-     │  │    no auto-fetch)         │
      │    fetches extracts)    │  │  exists() (coroutine)     │
      │  sections (auto-fetch)  │  │  summary (await. prop)    │
      │  exists() (auto-fetch)  │  │  text    (await. prop)    │
      │  summary (property)     │  │  langlinks (await. prop)  │
      │  text    (property)     │  │  links     (await. prop)  │
      │  langlinks (property)   │  │  backlinks (await. prop)  │
      │  links     (property)   │  │  categories (await. prop) │
      │  backlinks (property)   │  │  categorymembers          │
      │  categories (property)  │  │    (awaitable prop)       │
      │  categorymembers (prop) │  │  coordinates (await. prop)│
      │  coordinates (property) │  │  images     (await. prop) │
      │  images    (property)   │  │  geosearch_meta (property)│
      │  geosearch_meta (prop)  │  │  search_meta (property)   │
      │  search_meta (property) │  │  pageid    (await. prop)  │
      │  pageid   (property)    │  │  fullurl   (await. prop)  │
      │  fullurl  (property)    │  │  displaytitle (await.)    │
      │  displaytitle (property)│  │  + 18 more info props     │
      │  + 18 more info props   │  │                           │
      │                         │  │  _wiki ──────────────────►│
      │                         │  │  AsyncWikipedia instance  │
      │  _wiki ─────────────────┼► │                           │
      │  Wikipedia instance     │  └───────────────────────────┘
      └─────────────────────────┘

The _http_client/ package implements the HTTP transport layer with three classes.

BaseHTTPClient (base_http_client.py) is the abstract base that holds shared configuration (language, variant, user-agent, extract format, retry parameters, extra API params) and the _check_and_correct_params() validator. It does not make HTTP requests directly.

SyncHTTPClient (sync_http_client.py) provides a blocking _get(language, params) -> dict method backed by httpx.Client. Retry logic uses tenacity with exponential backoff; Retry-After headers are honoured for HTTP 429 responses.

AsyncHTTPClient (async_http_client.py) provides an async def _get(language, params) -> dict coroutine backed by httpx.AsyncClient. Retry logic mirrors SyncHTTPClient but uses tenacity's AsyncRetrying.

Both clients construct the endpoint URL as:

https://{language}.wikipedia.org/w/api.php

Additional utilities:

  • retry_utils.py - Common retry utilities and helpers
  • retry_after_wait.py - Retry-After header handling logic

The _resources/ package implements the API layer with three classes.

BaseWikipediaResource (base_wikipedia_resource.py) is a pure mixin with no HTTP transport. It contains:

  • Parameter builders (_*_params) — each returns a dict ready to pass to the dispatcher.
  • Response parsers (_build_*) — each accepts a raw API response fragment and a WikipediaPage, populates the page in-place, and returns the parsed value.
  • Dispatch helpers — generic methods that call self._get (provided by the transport mixin), handle pagination, and delegate to a _build_* method. See Dispatch Helpers below.

WikipediaResource (wikipedia_resource.py) is a thin synchronous mixin. Each public API method (extracts, info, langlinks, links, backlinks, categories, categorymembers) is a one-liner that delegates to the appropriate sync dispatch helper:

def extracts(self, page, **kwargs):
    return self._dispatch_prop(
        page, self._extracts_params(page, **kwargs),
        "", self._build_extracts,
    )

AsyncWikipediaResource (async_wikipedia_resource.py) mirrors WikipediaResource using the async dispatch helpers:

async def extracts(self, page, **kwargs):
    return await self._async_dispatch_prop(
        page, self._extracts_params(page, **kwargs),
        "", self._build_extracts,
    )

_make_page is overridden to return AsyncWikipediaPage instead of WikipediaPage so that stub pages created during response parsing are automatically async-capable.
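The effect of that override can be sketched in miniature. The stub classes and _parse_stub are illustrative; only the single-override-point idea comes from the library:

```python
class WikipediaPageStub: ...
class AsyncWikipediaPageStub: ...


class ResourceSketch:
    def _make_page(self, title: str):
        return WikipediaPageStub()

    def _parse_stub(self, entry: dict):
        # every parser builds child pages through the factory ...
        return self._make_page(entry["title"])


class AsyncResourceSketch(ResourceSketch):
    def _make_page(self, title: str):
        # ... so one override changes the stub type everywhere
        return AsyncWikipediaPageStub()
```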

Four dispatch patterns cover all current MediaWiki API query shapes. Each has a sync and an async variant.

| Helper | When to use | Pagination key |
| --- | --- | --- |
| _dispatch_prop | Prop query, result fits in one response; reads raw["query"]["pages"]. | (none) |
| _dispatch_prop_paginated | Prop query, result may span pages; accumulates raw["query"]["pages"][page_id][list_key] across pages. | raw["continue"][continue_key] |
| _dispatch_list | List query, result may span pages; accumulates raw["query"][list_key] across pages. Requires a page object. | raw["continue"][continue_key] |
| _dispatch_standalone_list | List query that does not require a page object; accumulates raw["query"][list_key] and returns the raw response. | raw["continue"][continue_key] |

Current mapping:

  • extracts, info, langlinks, categories → _dispatch_prop
  • links → _dispatch_prop_paginated (cursor: plcontinue, list key: links)
  • backlinks → _dispatch_list (cursor: blcontinue, list key: backlinks)
  • categorymembers → _dispatch_list (cursor: cmcontinue, list key: categorymembers)
  • coordinates → custom per-page dispatch with per-parameter caching (cursor: cocontinue, uses _dispatch_prop_paginated internally)
  • images → custom per-page dispatch with per-parameter caching (cursor: imcontinue)
  • geosearch → single _get call (no pagination)
  • random → single _get call (no pagination)
  • search → single _get call (no pagination)

Warning

geosearch, random, and search deliberately bypass _dispatch_standalone_list and make a single API request. The caller's limit parameter already tells the MediaWiki API how many results to return. Using the paginating dispatcher would cause an infinite loop for random (the API always offers more random pages) and near-infinite loops for search and geosearch (broad queries can match thousands of pages). Only use _dispatch_standalone_list for list queries where exhaustive fetching is the desired behaviour.
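The continuation handling shared by the paginating dispatchers can be sketched as follows. Here get stands in for the transport's _get, and the page/builder plumbing is elided; only the loop shape is the point:

```python
def dispatch_prop_paginated(get, params: dict, continue_key: str, list_key: str) -> list:
    """Accumulate raw["query"]["pages"][id][list_key] across responses.

    Follows MediaWiki continuation: when raw["continue"] is present, its
    cursor value is copied into the params for the next request.
    """
    params = dict(params)                     # work on a private copy
    accumulated: list = []
    while True:
        raw = get(params)
        for page_entry in raw["query"]["pages"].values():
            accumulated.extend(page_entry.get(list_key, []))
        cont = raw.get("continue")
        if not cont:                          # no cursor: we are done
            return accumulated
        params[continue_key] = cont[continue_key]
```

This is exactly the shape that must NOT be used for random/search/geosearch, since their continuation never (or only very slowly) runs dry.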

user code: page.summary
    │
    ▼
WikipediaPage.summary  (property, checks _summary cache)
    │
    ▼
WikipediaPage._fetch("extracts")
    │
    ▼
Wikipedia.extracts(page)             ◄─ WikipediaResource
    │
    ▼
BaseWikipediaResource._dispatch_prop(
    page, params, empty="", builder=_build_extracts)
    │
    ├─► _construct_params(page, params)   → merged dict
    │
    ├─► SyncHTTPClient._get(language, merged_params)
    │       │
    │       ├─► httpx.Client.get(url, params=…)
    │       └─► tenacity retry loop (429 / 5xx / timeout)
    │             → raw JSON dict
    │
    └─► _process_prop_response(raw, page, empty, builder)
            │
            └─► _build_extracts(extract, page)
                    │
                    ├─► populate page._summary
                    ├─► populate page._section_mapping
                    └─► return page._summary

user code: await page.summary
    │
    ▼
AsyncWikipediaPage.summary  (explicit @property, returns coroutine)
    │
    ▼
AsyncWikipediaPage._fetch  (async, called inside the coroutine)
    │
    ▼
AsyncWikipedia.extracts(page)        ◄─ AsyncWikipediaResource
    │
    ▼
BaseWikipediaResource._async_dispatch_prop(
    page, params, empty="", builder=_build_extracts)
    │
    ├─► _construct_params(page, params)   → merged dict
    │
    ├─► await AsyncHTTPClient._get(language, merged_params)
    │       │
    │       ├─► await httpx.AsyncClient.get(url, params=…)
    │       └─► tenacity AsyncRetrying loop
    │             → raw JSON dict
    │
    └─► _process_prop_response(raw, page, empty, builder)
            │
            └─► _build_extracts(extract, page)
                    └─► return page._summary

This section walks through a complete example: adding support for the templates prop, which returns a list of templates used on a page.

MediaWiki reference: https://www.mediawiki.org/w/api.php?action=help&modules=query%2Btemplates

Inspect the API response structure:

  • Single-fetch prop (result in raw["query"]["pages"], no continue key expected in practice) → _dispatch_prop.
  • Paginated prop (continue key uses a *continue cursor, data nested under raw["query"]["pages"][id][list_key]) → _dispatch_prop_paginated.
  • List query (action=query&list=…, data under raw["query"][list_key]) → _dispatch_list.

templates uses prop=templates, may paginate with tlcontinue, and stores results under raw["query"]["pages"][id]["templates"] → use _dispatch_prop_paginated.

In _page/_base_wikipedia_page.py, add a cache slot (and a _called flag, so the page properties below can check whether templates have been fetched) in BaseWikipediaPage.__init__:

self._templates: dict[str, Any] = {}
self._called["templates"] = False

In BaseWikipediaResource (_resources/base_wikipedia_resource.py), add:

def _templates_params(self, page: WikipediaPage) -> dict[str, Any]:
    """
    Build params for the ``templates`` prop query.

    Requests up to 500 templates per API response page.  Pagination
    is handled automatically by :meth:`_dispatch_prop_paginated`
    using the ``tlcontinue`` cursor.

    :param page: source page (provides ``title``)
    :return: base params dict; merge kwargs at the call site
    """
    return {
        "action": "query",
        "prop": "templates",
        "titles": page.title,
        "tllimit": 500,
    }

In BaseWikipediaResource (_resources/base_wikipedia_resource.py), add:

def _build_templates(
    self, extract: Any, page: WikipediaPage
) -> PagesDict:
    """
    Build the templates map from a ``templates`` API response.

    :param extract: single page entry from ``raw["query"]["pages"]``
    :param page: page object whose ``_templates`` dict is replaced
    :return: ``{title: WikipediaPage}`` mapping
    """
    page._templates = {}
    self._common_attributes(extract, page)
    for tpl in extract.get("templates", []):
        page._templates[tpl["title"]] = self._make_page(
            title=tpl["title"],
            ns=int(tpl["ns"]),
            language=page.language,
            variant=page.variant,
        )
    return page._templates
In WikipediaResource (_resources/wikipedia_resource.py), add the public sync method:

def templates(
    self, page: WikipediaPage, **kwargs: Any
) -> PagesDict:
    """
    Fetch all templates used on a page, keyed by title.

    Follows API pagination automatically (``tlcontinue`` cursor).

    :param page: source page
    :param kwargs: extra API parameters forwarded verbatim
    :return: ``{title: WikipediaPage}``; ``{}`` if page missing
    :raises WikiHttpTimeoutError: if the request times out
    :raises WikiConnectionError: if a connection cannot be established
    :raises WikiRateLimitError: if the API returns HTTP 429
    :raises WikiHttpError: if the API returns a non-success HTTP status
    :raises WikiInvalidJsonError: if the response is not valid JSON
    """
    return self._dispatch_prop_paginated(
        page,
        {**self._templates_params(page), **kwargs},
        "tlcontinue",
        "templates",
        self._build_templates,
    )
In AsyncWikipediaResource (_resources/async_wikipedia_resource.py), add the async counterpart:

async def templates(
    self, page: WikipediaPage, **kwargs: Any
) -> PagesDict:
    """
    Async version of :meth:`WikipediaResource.templates`.
    """
    return await self._async_dispatch_prop_paginated(
        page,
        {**self._templates_params(page), **kwargs},
        "tlcontinue",
        "templates",
        self._build_templates,
    )

In _page/wikipedia_page.py:

@property
def templates(self) -> PagesDict:
    """Returns templates used on this page."""
    if not self._called["templates"]:
        self._fetch("templates")
    return self._templates

In _page/async_wikipedia_page.py, the @property returns a coroutine created by a nested async def; callers do await page.templates:

@property
def templates(self) -> Any:
    """Awaitable: returns templates used on this page."""

    async def _get() -> PagesDict:
        if not self._called["templates"]:
            await self._fetch("templates")
        return self._templates

    return _get()

Add mock data to tests/mock_data.py:

"Template:A": {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "ns": 0,
                "title": "Template:A",
                "templates": [
                    {"ns": 10, "title": "Template:A"},
                ],
            }
        }
    }
},

Add a test file tests/templates_test.py:

import unittest
from unittest.mock import patch

import wikipediaapi

from tests.mock_data import mock_data


class TestTemplates(unittest.TestCase):
    def setUp(self):
        self.wiki = wikipediaapi.Wikipedia(
            user_agent="test", language="en"
        )

    def _mock_get(self, language, params):
        return mock_data[params["titles"]]

    def test_templates(self):
        with patch.object(self.wiki, "_get", side_effect=self._mock_get):
            page = self.wiki.page("Template:A")
            templates = self.wiki.templates(page)
            self.assertIn("Template:A", templates)

The following invariants hold throughout the codebase and must be preserved when adding new functionality.

Parameter builders (``_*_params``)

  • Always return a plain dict[str, Any].
  • Never call _construct_params — dispatchers do that.
  • Never mutate the page object.
  • For props: include "action": "query" and "prop": "<name>".
  • For lists: include "action": "query" and "list": "<name>".

Response parsers (``_build_*``)

  • Accept (extract: Any, page: WikipediaPage) as the first two positional arguments.
  • Reset the relevant cache attribute (page._links = {}, etc.) before populating it.
  • Call _common_attributes(extract, page) to copy standard fields.
  • Always return the populated cache attribute.
  • Use _make_page() to create stub child pages so that the correct page type (WikipediaPage vs. AsyncWikipediaPage) is produced automatically.

Dispatch helpers

  • _dispatch_prop / _async_dispatch_prop — for props where the full result fits in one API response.
  • _dispatch_prop_paginated / _async_dispatch_prop_paginated — for props that may paginate. The params dict is mutated in-place to add the continuation key on each subsequent request.
  • _dispatch_list / _async_dispatch_list — for list= queries that may paginate. Requires a page object for language context.
  • _dispatch_standalone_list / _async_dispatch_standalone_list — for list= queries that are not tied to a specific page (e.g. geosearch, random, search). These accept a language string instead of a page object and return the raw merged response.

Public API methods

  • Sync methods in WikipediaResource must never use await.
  • Async methods in AsyncWikipediaResource must always be defined with async def and use await.
  • Both sync and async methods must share the same _*_params and _build_* implementations without duplication.
  • All raises must be documented in the docstring.

Typed data (``_types/`` package)

  • coordinate.py — Coordinate frozen @dataclass value object
  • geo_point.py — GeoPoint frozen @dataclass value object
  • geo_box.py — GeoBox frozen @dataclass value object
  • geo_search_meta.py — GeoSearchMeta frozen @dataclass value object
  • image_info.py — ImageInfo dataclass
  • search_meta.py — SearchMeta frozen @dataclass value object
  • search_results.py — SearchResults wrapper around PagesDict

Parameter dataclasses (``_params/`` package)

Each query submodule has a frozen @dataclass (e.g. CoordinatesParams, ImagesParams) that maps clean Python names to MediaWiki API parameter names with a configurable prefix.

  • Pipe-separated MediaWiki parameters (for example prop, info, and images) are exposed as iterable-only inputs in the Python API and normalized to "|"-joined strings in __post_init__ before API serialization.
  • The to_api() method returns the dict[str, str] ready for the API call; cache_key() returns a hashable tuple for per-parameter caching.
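A minimal sketch of this shape. DemoParams and the clprop/cllimit key names are invented for illustration; the real classes use their own module-specific prefixes:

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class DemoParams:
    props: Tuple[str, ...] = ()     # iterable-only input from the caller
    limit: int = 500
    _props_joined: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # normalize the iterable to the pipe-joined wire format once
        object.__setattr__(self, "_props_joined", "|".join(self.props))

    def to_api(self) -> dict:
        # dict[str, str] ready to merge into the API request
        return {"clprop": self._props_joined, "cllimit": str(self.limit)}

    def cache_key(self) -> tuple:
        # hashable key for per-parameter caching
        return (self._props_joined, self.limit)
```

Freezing the dataclass keeps instances hashable and safe to reuse as cache keys.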

Enums (``_enums/`` package)

Strongly-typed enums for API parameters:

  • coordinate_type.py — CoordinateType enum for coordinate filtering
  • coordinates_prop.py — CoordinatesProp enum for coordinate properties
  • direction.py — Direction enum for sort direction
  • geosearch_sort.py — GeoSearchSort enum for geographic search sorting
  • globe.py — Globe enum for celestial bodies
  • namespace.py — Namespace enum for MediaWiki namespaces
  • redirect_filter.py — RedirectFilter enum for redirect filtering
  • search_info.py — SearchInfo enum for search metadata
  • search_prop.py — SearchProp enum for search properties
  • search_qi_profile.py — SearchQiProfile enum for query-independent ranking
  • search_sort.py — SearchSort enum for search sorting
  • search_what.py — SearchWhat enum for search type

Exceptions (``exceptions/`` package)

  • wikipedia_exception.py — WikipediaException base exception
  • wiki_connection_error.py — WikiConnectionError for connection failures
  • wiki_http_error.py — WikiHttpError for HTTP errors
  • wiki_http_timeout_error.py — WikiHttpTimeoutError for timeouts
  • wiki_invalid_json_error.py — WikiInvalidJsonError for JSON parsing errors
  • wiki_rate_limit_error.py — WikiRateLimitError for rate limiting
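Since every concrete error derives from WikipediaException, callers can catch the single base class. A sketch of that usage pattern (the `fetch_or_none` helper is hypothetical; only the class names come from the file list above):

```python
class WikipediaException(Exception):
    """Base class for all Wikipedia-API errors."""


class WikiConnectionError(WikipediaException):
    """Raised when the HTTP connection fails."""


class WikiRateLimitError(WikipediaException):
    """Raised when the API reports rate limiting."""


def fetch_or_none(fetch):
    # Catching the base class covers every transport and parsing
    # failure without enumerating the concrete subclasses.
    try:
        return fetch()
    except WikipediaException:
        return None
```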

Per-parameter caching

  • coordinates and images support different parameter sets per page. Results are cached in page._param_cache[name][cache_key] via _get_cached / _set_cached on BaseWikipediaPage.
  • The NOT_CACHED sentinel (a singleton _Sentinel instance) distinguishes "never fetched" from "fetched, result is None".
  • Page-level properties (page.coordinates, page.images) use default parameters; calling wiki.coordinates(page, primary="all") caches under a separate key.
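A minimal sketch of the sentinel-based cache; NOT_CACHED and the _get_cached / _set_cached names come from the description above, while BasePageSketch is a hypothetical stand-in for BaseWikipediaPage:

```python
class _Sentinel:
    """Singleton marker distinguishing 'never fetched' from None."""


NOT_CACHED = _Sentinel()


class BasePageSketch:
    def __init__(self):
        # name -> {cache_key: result}
        self._param_cache: dict = {}

    def _get_cached(self, name: str, key: tuple):
        # Returns NOT_CACHED when nothing was stored: even a cached
        # None result must be distinguishable from "no fetch yet".
        return self._param_cache.get(name, {}).get(key, NOT_CACHED)

    def _set_cached(self, name: str, key: tuple, value) -> None:
        self._param_cache.setdefault(name, {})[key] = value
```

Identity checks (`is NOT_CACHED`) rather than equality checks make the sentinel immune to pages whose fetched value happens to compare equal to something unexpected.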

Batch methods

  • batch_coordinates(pages) and batch_images(pages) send multi-title API requests (up to 50 titles per request) and distribute results to each page's per-parameter cache.
  • PagesDict.coordinates() and PagesDict.images() are convenience methods that delegate to the batch methods on the wiki client.
  • Batch methods use _build_normalization_map(raw) to handle MediaWiki title normalization (e.g. Test_1 → Test 1).

Page objects

  • A page is created lazily via wiki.page(title) — no network call at construction time.
  • Properties cache their result in a _<name> attribute; the first access triggers the API call, subsequent accesses return the cached value.
  • WikipediaPage._fetch(call) calls getattr(self.wiki, call)(self) and marks _called[call] = True; the matching async version AsyncWikipediaPage._fetch(call) does the same with await.
  • geosearch_meta and search_meta are plain @property in both sync and async — they are set by geosearch() / search() on the wiki client and require no network call on the page itself.
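The lazy-property pattern above can be sketched like this, with `FakeWiki` and the `summary` call standing in for the real client (names are illustrative):

```python
class PageSketch:
    def __init__(self, wiki, title):
        self.wiki = wiki
        self.title = title
        self._summary = None             # the _<name> cache slot
        self._called = {"summary": False}

    def _fetch(self, call):
        # Delegate to the wiki client, which fills the _<name> slot.
        getattr(self.wiki, call)(self)
        self._called[call] = True

    @property
    def summary(self):
        if not self._called["summary"]:
            self._fetch("summary")       # first access triggers the API call
        return self._summary             # later accesses hit the cache


class FakeWiki:
    def __init__(self):
        self.calls = 0

    def summary(self, page):
        self.calls += 1
        page._summary = "text"
```

Construction performs no fetch; the wiki client is invoked exactly once per property, on first access.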

The CLI provides a command-line tool for querying Wikipedia using Wikipedia-API. It is organized into a modular structure for better maintainability.

Architecture

The CLI is split into a main entry point and functional command modules:

wikipediaapi/
├── cli.py                     # Main CLI entry point (54 lines)
└── commands/                  # CLI command modules
    ├── __init__.py
    ├── base.py               # Shared utilities and common options
    ├── page_commands.py      # Page content commands
    ├── link_commands.py      # Link-related commands
    ├── category_commands.py  # Category commands
    ├── geo_commands.py       # Geographic commands
    ├── image_commands.py     # Image file commands
    └── search_commands.py    # Search and discovery commands

Main Entry Point (``cli.py``)

  • Sets up the Click command group with version and help options
  • Imports and registers all command modules
  • Provides the main() function for the console script entry point
  • Reduced from 1481 lines to 54 lines for better maintainability

Base Module (``commands/base.py``)

  • Contains shared utilities: TypedDict classes, enum validators, formatters
  • Defines common Click options used across all commands
  • Provides helper functions for Wikipedia instance creation and page fetching
  • Centralizes formatting functions for consistent output

Command Modules

Each command module groups related functionality:

  • page_commands.py — summary, text, sections, section, page
  • link_commands.py — links, backlinks, langlinks
  • category_commands.py — categories, categorymembers
  • geo_commands.py — coordinates, geosearch
  • image_commands.py — images
  • search_commands.py — search, random

Command Pattern

Each command module follows this pattern:

  1. Business logic functions — Pure functions that handle Wikipedia API calls
  2. Formatting functions — Convert results to text/JSON output
  3. Click command decorators — Define CLI interface with options and arguments
  4. Register function — Registers commands with the main CLI group
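A Click-free sketch of the four steps (in the real modules, step 3 is a `@click.command()` decorator and step 4 calls `add_command` on the Click group; all names below are hypothetical):

```python
def fetch_wordcount(title: str) -> int:
    # 1. Business logic: a pure function (a stub here, a Wikipedia-API
    #    call in the real modules).
    return len(title.split())


def format_wordcount(title: str, count: int) -> str:
    # 2. Formatting: convert the result to text output.
    return f"{title}: {count} word(s)"


def wordcount_command(title: str) -> str:
    # 3. Command entry point (wrapped with Click decorators in the
    #    real modules).
    return format_wordcount(title, fetch_wordcount(title))


def register_commands(cli: dict) -> None:
    # 4. Registration hook called from cli.py (a dict stands in for
    #    the Click group's add_command).
    cli["wordcount"] = wordcount_command
```

Keeping steps 1 and 2 as pure functions is what lets command modules be tested without spinning up the Click runner.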

Benefits of Modular Structure

  • Maintainable file sizes — Each module is 150-430 lines instead of one 1481-line file
  • Logical organization — Related commands grouped together
  • Easier development — Changes to specific functionality stay isolated to the relevant module
  • Better testing — Command modules can be tested independently
  • Full backward compatibility — All CLI commands work identically to before

Usage Examples

The CLI supports all original commands with identical interfaces:

wikipedia-api summary "Python (programming language)"
wikipedia-api links "Python (programming language)" --language cs
wikipedia-api categories "Python (programming language)" --json
wikipedia-api coordinates "Mount Everest"
wikipedia-api geosearch --coord "51.5074|-0.1278"
wikipedia-api search "Python programming"

Adding New Commands

To add a new CLI command:

  1. Choose the appropriate command module based on functionality
  2. Add business logic function (following existing patterns)
  3. Add formatting function for output
  4. Add Click command with proper options and documentation
  5. Register the command in the module's register_commands() function

The modular structure makes it easy to extend the CLI while maintaining clean organization.