chore: release v0.1.2#4

Open
MagicalTux wants to merge 1 commit into master from release-plz-2026-05-10T06-03-13Z

MagicalTux commented May 10, 2026

🤖 New release

  • oxideav-pdf: 0.1.1 -> 0.1.2
Changelog

0.1.2 - 2026-05-12

Other

  • PDF /Sig annotation writer (ISO 32000-1 §12.7.4.5 + §12.8.1 + RFC 5652 §5 + §5.4 + §11.2)
  • reading-order layout pass over Tagged PDF StructTreeRoot (ISO 32000-1 §14.6 + §14.7 + §14.8)
  • simple-font /Encoding /Differences resolver wired into text extraction (ISO 32000-1 §9.6.6.1 + §D.2 + AGL v2.0)
  • linearization param dict + hierarchy validator + PDF/A signals
  • annotations beyond Link (Text/FreeText/Stamp/markup/geometry/Widget) + XMP packet field extraction (DC/XMP/PDF/PDF-A)
  • PDF outline (bookmarks) tree + Link annotations
  • CMS KARI X448 ECDH (RFC 7748 §5 + RFC 8410 §3 + RFC 8418 §2.1+§2.2)
  • JPEG passthrough on /DCTDecode Image XObjects (ISO 32000-1 §7.4.8 + §8.9)
  • PDF text extraction (ISO 32000-1 §9 + §9.10)
  • PDF /Sig annotation reader (ISO 32000-1 §12.7.4.5 + §12.8.1)

Added

  • Round 30: PDF /Sig annotation writer (ISO 32000-1 §12.7.4.5 +
    §12.8.1 + RFC 5652 §5 + §5.4 + §11.2). Symmetric encoder side of
    the round-21 reader + round-20 verifier: given an
    [oxideav_scene::Scene] + a [Signer] + a signer-cert chain, the
    new writer emits a signed PDF whose AcroForm contains a /FT /Sig
    terminal field whose /V points at a signature dictionary
    (/Type /Sig /Filter /Adobe.PPKLite /SubFilter /adbe.pkcs7.detached)
    carrying valid /ByteRange placeholders + a hex-encoded CMS
    SignedData ContentInfo blob. The classic "ByteRange-placeholder
    fill-in" pattern of §12.8.1.1 is implemented end-to-end:

    Step 1 — the base PDF is rendered via the existing
    write_pdf_from_scene.

    Step 2 — an incremental-update revision (§7.5.6) is appended that
    overrides the Catalog with /AcroForm <ref>, plus an AcroForm
    dict (/Fields [<sig-field-ref>] /SigFlags 3), a Sig form
    field (/FT /Sig /T (Signature1)), and a Sig dictionary with
    fixed-width /ByteRange (4 × 10-digit slots) +
    /Contents <0…0> (8192 hex chars = 4096 raw bytes — enough
    for any RSA-2048 / ECDSA-P256 SHA-256 SignedData with a single
    signer + cert).

    Step 3 — /ByteRange is patched in place with the actual offsets
    (the four integers themselves are inside the signed range, so they
    reach their final value BEFORE the hash is computed).

    Step 4 — the bytes named by /ByteRange are SHA-256-hashed, the
    hash is wrapped into a CAdES-BES-style signedAttrs SET
    (contentType 1.2.840.113549.1.9.3 = id-data, messageDigest
    1.2.840.113549.1.9.4 = SHA-256(signed) per RFC 5652 §11.2), the
    SET is canonical-re-tagged from [0] IMPLICIT to the universal
    SET tag per §5.4 and hashed, and the resulting digest is signed by
    the Signer.

    Step 5 — the signature is wrapped in a CMS SignedData
    ContentInfo (version=1, single SignerInfo,
    IssuerAndSerialNumber slot, full cert chain in the
    SET-of-CertificateChoices field, detached eContent),
    hex-encoded, and overwritten into the /Contents placeholder
    (length-preserving — the bytes between < and > are the
    EXCLUDED range under /ByteRange, so this write does not
    invalidate the hash computed in step 4).
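
The /ByteRange arithmetic in steps 2 to 5 can be sketched in isolation. This is a minimal illustration, not the crate's code: `ContentsGap`, `byte_range`, `format_byte_range` and `signed_bytes` are hypothetical names, and zero-padding the 10-digit slots is an assumption (the changelog only says the slots are fixed-width).

```rust
/// Offsets of the `<` and `>` that delimit the hex /Contents placeholder.
/// Hypothetical helper type, not part of oxideav-pdf.
pub struct ContentsGap {
    pub lt: usize, // offset of '<'
    pub gt: usize, // offset of '>'
}

/// Compute the four /ByteRange integers so that range 1 ends on `<`
/// and range 2 starts on `>`, as the changelog describes.
pub fn byte_range(file_len: usize, gap: &ContentsGap) -> [usize; 4] {
    assert!(gap.lt < gap.gt && gap.gt < file_len);
    [0, gap.lt + 1, gap.gt, file_len - gap.gt]
}

/// Render the array into four fixed-width 10-digit slots so that patching
/// never changes the file length. Zero-padding is an assumption here.
pub fn format_byte_range(r: [usize; 4]) -> String {
    format!("[{:010} {:010} {:010} {:010}]", r[0], r[1], r[2], r[3])
}

/// Concatenate the two signed ranges, i.e. the bytes that get hashed.
pub fn signed_bytes(file: &[u8], r: [usize; 4]) -> Vec<u8> {
    let mut out = Vec::with_capacity(r[1] + r[3]);
    out.extend_from_slice(&file[r[0]..r[0] + r[1]]);
    out.extend_from_slice(&file[r[2]..r[2] + r[3]]);
    out
}
```

Because the hex digits between `<` and `>` never enter `signed_bytes`, overwriting them with the final CMS blob leaves the hashed input unchanged.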

    New public surface under oxideav_pdf::sig:

    • pub trait Signer { fn algorithm() -> SigningAlgorithm; fn sign(&self, tbs_hash: &[u8]) -> Result<Vec<u8>, PdfError>; }
      — abstract signing primitive; user plugs in whatever crypto
      stack they want (ring, hardware token, HSM, ...). The trait
      receives a SHA-2 digest and returns wire-form signature octets
      (PKCS#1 v1.5 padded big-endian for RSA, DER-encoded
      Ecdsa-Sig-Value for ECDSA).
    • SigningAlgorithm { RsaPkcs1v15Sha256, EcdsaP256Sha256 }
      enum of the two algorithm slots round 30 ships; the writer
      picks the right CMS digestAlgorithm (SHA-256) +
      signatureAlgorithm (rsaEncryption / ecdsa-with-SHA256) OIDs
      based on the implementor's choice.
    • RsaPkcs1v15Sha256Signer / EcdsaP256Sha256Signer
      reference Signer impls that wrap the in-crate rsa / p256
      deps (no new crypto deps added for the writer).
    • SignerIdentity { issuer_der, serial, cert_chain }
      decoupled identity bundle; from_signer_cert_der(der) is the
      convenience constructor for the typical single-cert
      self-signed deployment.
    • SigWriter::new(scene, signer, identity).sign() -> Vec<u8>
      the builder.
    • sign_pdf_from_scene(scene, signer, identity) -> Vec<u8>
      one-shot convenience wrapper.
    • pkcs7_wrap_signed_data(algorithm, issuer_der, serial, cert_chain, signed_attrs_body, signature_bytes) -> Vec<u8>
      standalone CMS DER builder; useful when stitching a signed PDF
      together at a lower level than SigWriter.
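
A sketch of how a custom signer might plug into the trait, under stated assumptions: the enum, error type and trait below are local stand-ins that mirror the shapes quoted above (the real definitions in oxideav_pdf::sig may differ), and `ExternalRsaSigner` is a hypothetical adapter whose closure marks where a call into an HSM or token would go. No real cryptography is performed.

```rust
/// Local stand-ins that mirror the shapes quoted in the changelog;
/// the real definitions live in oxideav_pdf::sig and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SigningAlgorithm {
    RsaPkcs1v15Sha256,
    EcdsaP256Sha256,
}

#[derive(Debug)]
pub struct PdfError(pub String);

pub trait Signer {
    fn algorithm() -> SigningAlgorithm;
    fn sign(&self, tbs_hash: &[u8]) -> Result<Vec<u8>, PdfError>;
}

/// Hypothetical adapter: forwards the digest to any closure, which is
/// where a call into an HSM or hardware token would go.
pub struct ExternalRsaSigner<F: Fn(&[u8]) -> Vec<u8>> {
    pub backend: F,
}

impl<F: Fn(&[u8]) -> Vec<u8>> Signer for ExternalRsaSigner<F> {
    fn algorithm() -> SigningAlgorithm {
        SigningAlgorithm::RsaPkcs1v15Sha256
    }

    fn sign(&self, tbs_hash: &[u8]) -> Result<Vec<u8>, PdfError> {
        // This algorithm slot is SHA-256, so anything but a 32-byte
        // digest is rejected before reaching the backend.
        if tbs_hash.len() != 32 {
            return Err(PdfError(format!(
                "expected 32-byte digest, got {}",
                tbs_hash.len()
            )));
        }
        Ok((self.backend)(tbs_hash))
    }
}
```

Keeping the trait down to "digest in, signature octets out" is what lets the key material stay outside the process.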

    Six integration tests under tests/sig_writer_round30.rs cover:

    • RSA-PKCS#1 v1.5 + SHA-256 round-trip (writer → round-21
      reader → round-20 verify_signature end-to-end).
    • ECDSA-P256 + SHA-256 round-trip.
    • /ByteRange placeholder filled correctly (start = 0, second
      range starts at the > after a fixed 8192-byte-wide
      /Contents gap, two ranges cover everything but the gap, last
      byte of range 1 is <, first byte of range 2 is >).
    • Tamper-detection (flipping a body byte fails the
      messageDigest cross-check per RFC 5652 §11.2).
    • qpdf --check accepts the RSA-signed PDF.
    • qpdf --check accepts the ECDSA-signed PDF.

    Provenance: ISO 32000-1 §12.7.4.5 + §12.8.1 + §7.5.6 (incremental
    updates) + RFC 5652 §5 + §5.4 + §11.1 (contentType attribute) +
    §11.2 (messageDigest attribute) + RFC 5754 §2 (SHA-256 with NULL
    params in CMS) + RFC 5753 §2.1 (ECDSA Ecdsa-Sig-Value SEQUENCE).
    No third-party PDF / CMS source consulted.

  • Round 29: Reading-order layout pass over Tagged PDF StructTreeRoot
    (ISO 32000-1 §14.6 + §14.7 + §14.8). New
    oxideav_pdf::reader::layout::read_in_logical_order(reader) — and
    the convenience DocumentReader::read_in_logical_order() — walks
    the catalog's /StructTreeRoot /K tree and emits text runs in
    author-intended reading order rather than the painter's raster
    order. For a 2-column document, raster extraction interleaves
    column 1's first row, column 2's first row, column 1's second row,
    …; the round-29 pass walks [Sect_col1, Sect_col2] and emits all
    of column 1 before any of column 2. The walker handles every leaf
    shape ISO 32000-1 §14.7.4.4 defines:

    • Bare-integer MCID kids resolve against the inheritable /Pg
      field on the nearest ancestor.
    • <</Type /MCR /Pg p /MCID m>> marked-content references override
      the inherited /Pg, supporting Tagged tables whose rows draw
      from multiple pages.
    • <</Type /OBJR …>> object references (annotations, not content)
      are skipped — they carry no text.
    • Nested /StructElem kids (Sect inside Div inside …) recurse;
      indirect refs are followed with a 64-deep cycle guard.

    Documents without a /StructTreeRoot (or with a malformed or empty
    tree) fall back to the existing raster-order extraction, with
    LayoutMode::Raster set on the return value so callers can branch.
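
The leaf handling above can be illustrated with a small depth-first walk. This is a sketch over a hypothetical in-memory model (`Kid`, `walk` and `MAX_DEPTH` are illustrative names, not the crate's types); it assumes indirect references have already been resolved and uses a plain depth limit as the guard.

```rust
/// Hypothetical in-memory model of a structure element's /K entries.
pub enum Kid {
    /// Bare integer MCID; the page comes from the inherited /Pg.
    Mcid(u32),
    /// Marked-content reference; its own /Pg overrides the inherited one.
    Mcr { page: Option<u32>, mcid: u32 },
    /// Object reference (e.g. an annotation); carries no text.
    Objr,
    /// Nested /StructElem with an optional /Pg of its own.
    Elem { page: Option<u32>, kids: Vec<Kid> },
}

const MAX_DEPTH: usize = 64;

/// Depth-first walk that yields (page, mcid) pairs in logical order.
/// `inherited` is the /Pg of the nearest ancestor that set one.
pub fn walk(kids: &[Kid], inherited: Option<u32>, depth: usize, out: &mut Vec<(u32, u32)>) {
    if depth > MAX_DEPTH {
        return; // depth guard against cyclic or pathological trees
    }
    for kid in kids {
        match kid {
            Kid::Mcid(m) => {
                if let Some(p) = inherited {
                    out.push((p, *m));
                }
            }
            Kid::Mcr { page, mcid } => {
                if let Some(p) = page.or(inherited) {
                    out.push((p, *mcid));
                }
            }
            Kid::Objr => {}
            Kid::Elem { page, kids } => walk(kids, page.or(inherited), depth + 1, out),
        }
    }
}
```

Walking two sibling sections in order emits every MCID of the first before any of the second, which is the column-by-column behaviour described for the 2-column case.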

    The pass piggybacks on a round-29 addition to the round-22 text
    walker: the new extract_text_marked(reader) (and matching
    DocumentReader::marked_text_extraction()) emits every text run
    alongside the marked-content /MCID it was painted under (ISO
    32000-1 §14.6 — BDC / BMC / EMC operators). The walker
    recognises BDC / BMC / EMC / MP / DP keywords and parses
    the /MCID slot out of inline <</MCID n>> property dicts at the
    top level. New public surfaces under oxideav_pdf::reader:

    • MarkedTextRun { run, mcid, page_obj_num, page_index }
    • PdfMarkedTextExtraction { runs }
    • LayoutMode { Tagged, Raster }
    • ReadingOrderText { mode, runs } (with flat_text())
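
To show how MCID-tagged runs and a logical order combine into flat text, here is a simplified sketch. `Run` is a hypothetical stand-in for a marked text run (the crate's MarkedTextRun has more fields), and joining with a single space is an assumption made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a marked text run.
pub struct Run {
    pub page: u32,
    pub mcid: Option<u32>,
    pub text: String,
}

/// Join runs in the order given by `logical`, a list of (page, mcid) pairs
/// from a structure-tree walk. Runs painted outside any MCID are dropped
/// in this sketch.
pub fn flat_text(runs: &[Run], logical: &[(u32, u32)]) -> String {
    // Group runs by (page, mcid), preserving paint order inside a group.
    let mut by_key: HashMap<(u32, u32), Vec<&str>> = HashMap::new();
    for r in runs {
        if let Some(m) = r.mcid {
            by_key.entry((r.page, m)).or_default().push(r.text.as_str());
        }
    }
    // Emit the groups in logical order.
    let mut parts: Vec<&str> = Vec::new();
    for key in logical {
        if let Some(texts) = by_key.get(key) {
            parts.extend(texts.iter().copied());
        }
    }
    parts.join(" ")
}
```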

    Seven fixtures under tests/reading_order_round29.rs cover:
    two-column tagged-PDF logical reordering vs. raster baseline,
    non-tagged fallback, cross-page MCRs (/MCR /Pg ... /MCID ...),
    marked-text MCID accounting, and nested /Sect > /P > MCID
    recursion. No external library was consulted.

  • Round 28: Simple-font /Encoding /Differences resolver wired into
    text extraction (ISO 32000-1 §9.6.6.1 + §D.2 + Adobe Glyph List v2.0
    public document). When a simple Type1 / TrueType / Type3 font carries
    an encoding dictionary (not just a name) the reader now overlays the
    /Differences array onto the /BaseEncoding map before mapping bytes
    back to Unicode. Three new public surfaces under
    oxideav_pdf::reader::encoding:

    • parse_encoding_differences(arr) -> EncodingDifferences walks the
      flat [N name1 name2 … M nameK …] form per §9.6.6.1 — numeric
      tokens reset the running code, names land at consecutive slots,
      unknown tokens are tolerated. Honours Object::Integer AND
      Object::Real numeric forms.
    • apply_encoding_differences(&base, &diffs) -> EncodingMap overlays
      one parsed array on top of any of the six named BaseEncoding
      variants (WinAnsi / MacRoman / MacExpert / Standard /
      Symbol / ZapfDingbats). Unknown glyph names leave the slot
      empty so the decoder emits U+FFFD as a marker (matching what
      pdftotext --raw does for un-resolvable glyphs).
    • EncodingMap::from_base(BaseEncoding) ships a 256-entry table per
      Annex D.2 / D.4 / D.5 / D.6 plus the Adobe Type 1 Standard
      encoding. Multi-character glyph expansions (/fi → "fi", /fl →
      "fl") are accommodated; the table slot is a short String rather
      than a single char.
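
The running-code rule for /Differences can be sketched as follows. `Tok`, `parse_differences` and `overlay` are hypothetical names over a simplified token model, not the crate's Object type or API; the table here holds glyph names rather than decoded Unicode strings.

```rust
/// Hypothetical token model for the elements of a /Differences array.
pub enum Tok {
    Int(i64),
    Real(f64),
    Name(String),
    Other,
}

/// Walk the flat `[N name1 name2 ... M nameK ...]` form: a number resets
/// the running code, each name lands on the next consecutive code.
pub fn parse_differences(arr: &[Tok]) -> Vec<(u8, String)> {
    let mut out = Vec::new();
    let mut code: Option<i64> = None;
    for tok in arr {
        match tok {
            Tok::Int(n) => code = Some(*n),
            Tok::Real(r) => code = Some(*r as i64),
            Tok::Name(name) => {
                // Names before the first number have no code and are ignored.
                if let Some(c) = code {
                    if (0..=255).contains(&c) {
                        out.push((c as u8, name.clone()));
                    }
                    code = Some(c.saturating_add(1));
                }
            }
            Tok::Other => {} // tolerated and skipped
        }
    }
    out
}

/// Overlay parsed differences onto a 256-slot base table of glyph names.
pub fn overlay(base: &mut [Option<String>; 256], diffs: &[(u8, String)]) {
    for (code, name) in diffs {
        base[*code as usize] = Some(name.clone());
    }
}
```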

    The Adobe Glyph List subset shipped with the resolver covers the
    PostScript Latin character set, common Greek letters, smart-quote /
    dash / fraction set, math operators, arrows, and the /fi and /fl
    ligatures — about 320 glyph names. Extension to the full ~4280-line
    AGL is round-29+. Glyph list staged under
    docs/document/pdf/agl/subset.txt and the README there cites the AGL
    v2.0 public-document source. Seven new fixtures under
    tests/encoding_differences_round28.rs cover smart-quote overrides,
    Greek glyph remap, /fi / /fl ligature expansion, multi-segment
    arrays with running-code resets, unknown-glyph replacement-char
    fallback, empty /Differences, and /MacRomanEncoding base
    encoding. Three of them feed the fixture PDF to a system pdftotext
    binary when available and assert the extracted text contains the
    expected substring.

  • Round 27: Linearization Parameter Dictionary reader + Object
    Hierarchy validator + PDF/A conformance detection beyond XMP
    (ISO 32000-1 §F.2 + §7.7.2 + §7.7.3 / ISO 19005-1..4 §6.x).
    Three new reader-side surfaces:

    • parse_linearization_dict(bytes) -> Result<Option<LinearizationParams>>
      and DocumentReader::linearization() parse the
      /Linearized 1 /L /H [off len] /O /E /N /T first-object dictionary
      that every Fast-Web-View PDF emits at its head (§F.3.3: entirely
      within the first 1024 bytes). Round 9's writer-side emission now
      has its reader-side complement.
      LinearizationParams::verify(&bytes) cross-checks /L against the
      actual file length and bounds-checks /T, /E, /H. The parser
      returns Ok(None) for plain (non-linearized) files so callers can
      branch on the Option. Hint-table decoding (Annex F.4) is round 28+.
    • verify_pdf_hierarchy(reader) -> Result<HierarchyReport> (and
      DocumentReader::verify_hierarchy()) walks Catalog → Pages → Page
      and collects every spec divergence as a HierarchyIssue with
      IssueSeverity::Error or Warning: Catalog /Type + /Pages
      presence (§7.7.2 Table 28), /Pages node /Type / /Kids /
      /Count (§7.7.3.2 Table 29), /Page leaf /Parent back-reference
      + /MediaBox presence (§7.7.3.3 Table 30), cycle detection with
      a 32-hop depth guard. It never aborts the walk: every issue is
      surfaced at once so a downstream tool can call report.is_valid()
      or filter by severity.
    • read_pdf_pdfa_signals(reader) -> Result<PdfACatalogSignals> (and
      DocumentReader::pdfa_signals() + ::pdfa_conformance()) surface
      the structural PDF/A signals from the catalog independently of the
      XMP pdfaid:part claim: /MarkInfo /Marked|UserProperties|Suspects,
      /StructTreeRoot presence, /Lang, /OutputIntents count, and
      /Metadata presence. PdfAConformance::from_signals_and_xmp
      cross-verifies the XMP-declared part + conformance against the
      structural prerequisites ISO 19005-1 §6.2.2 / §6.7 / §6.8 require:
      an A-level claim missing /MarkInfo /Marked true or /StructTreeRoot
      flags claim_inconsistent = true with a free-form diagnostic.

    Tested end-to-end with +33 tests (15 integration in tests/round27.rs
    + 10 unit in src/reader/linearize.rs + 4 unit in
    src/reader/hierarchy.rs + 7 unit in src/reader/pdfa.rs).
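
The kind of cross-check that LinearizationParams::verify is described as doing can be sketched like this. `LinParams` and `verify` below are hypothetical stand-ins; the exact bounds the crate enforces are not stated in the changelog, so the comparisons are assumptions. Like the hierarchy validator, the sketch collects all findings rather than stopping at the first.

```rust
/// Hypothetical mirror of the fields named in the changelog.
pub struct LinParams {
    pub l: u64,        // /L: total file length
    pub h: (u64, u64), // /H: primary hint stream offset and length
    pub e: u64,        // /E: end of the first page
    pub t: u64,        // /T: offset related to the main xref table
}

/// Collect every inconsistency instead of stopping at the first one.
pub fn verify(p: &LinParams, file_len: u64) -> Vec<String> {
    let mut issues: Vec<String> = Vec::new();
    if p.l != file_len {
        issues.push(format!("/L is {} but file is {} bytes", p.l, file_len));
    }
    if p.t >= file_len {
        issues.push("/T out of bounds".to_string());
    }
    if p.e > file_len {
        issues.push("/E out of bounds".to_string());
    }
    // checked_add guards against offset + length overflowing u64.
    match p.h.0.checked_add(p.h.1) {
        Some(end) if end <= file_len => {}
        _ => issues.push("/H range out of bounds".to_string()),
    }
    issues
}
```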
  • Round 26: Annotations beyond Link + XMP packet field extraction
    (ISO 32000-1 §12.5.6 Tables 169..209 + §14.3.2 / Adobe XMP Spec
    2012 / ISO 16684-1 / ISO 19005-1..3 §6.x). New reader entry
    DocumentReader::annotations() (free function: read_pdf_annotations)
    walks every page's /Annots array and surfaces every entry as a
    PdfAnnotation. Per-subtype payload covers /Text (§12.5.6.4 Table
    172 — /Open, /Name icon, /State, /StateModel), /FreeText
    (§12.5.6.6 Table 174 — /DA, /Q quadding, /RC, /IT intent),
    /Stamp (§12.5.6.13 Table 184 — icon name), the four text-markup
    variants /Highlight / /Underline / /Squiggly / /StrikeOut
    (§12.5.6.10 Table 179 — /QuadPoints), /Square + /Circle
    (§12.5.6.8 Table 177 — /IC, /RD), /Link (re-uses round-25's
    go-to / URI dispatch), and /Widget (§12.5.6.19 Table 188 + §12.7.4
    Table 220 — /FT, /T, /V). Unknown subtypes surface as
    AnnotationKind::Other { subtype }. Common Table 164 fields
    (/Rect, /Contents, /NM, /M, /F, /C, /Border) are
    decoded for every subtype.

    New DocumentReader::xmp_packet() (and XmpPacket::parse(bytes) for
    callers with the raw bytes already in hand) parses the document-level
    XMP packet surfaced by round 19 into a structured view of the most-used
    Dublin Core (dc:title through rdf:Alt / dc:creator through
    rdf:Seq / dc:subject through rdf:Bag / dc:rights / dc:format),
    XMP Basic (xmp:CreateDate / xmp:ModifyDate / xmp:MetadataDate
    / xmp:CreatorTool), PDF schema (pdf:Producer / pdf:Keywords /
    pdf:PDFVersion / pdf:Trapped), and PDF/A identification schema
    (pdfaid:part / pdfaid:conformance) fields. Element-body and
    attribute forms both recognised; the standard five XML entities
    (&amp; / &lt; / &gt; / &quot; / &apos;) plus numeric
    character references decode. XmpPacket::is_pdf_a() and
    pdf_a_conformance() collapse the pair into a 1B-style designator
    for PDF/A conformance detection. Tested end-to-end with +36 tests
    (19 integration in tests/annotations_round26.rs covering every
    subtype dispatch, common-field decode, page-without-annots baseline,
    unified-reader round-trip of the writer's Link annotations, XMP
    Dublin Core / XMP Basic / PDF / PDF/A identification, attribute-form
    XMP, XML-entity decode, and absent-XMP None; +6 unit tests in
    src/reader/annotation.rs and +11 unit tests in src/reader/xmp.rs).
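
The entity handling mentioned for XMP values can be illustrated with a small standalone decoder. `decode_entities` is a hypothetical helper written for this note, not the crate's function; passing unrecognised sequences through unchanged is an assumption about the desired leniency.

```rust
/// Decode the five predefined XML entities plus decimal and hexadecimal
/// character references. Unrecognised sequences are copied through as-is.
pub fn decode_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..]; // `rest` now starts with '&'
        let decoded = rest.find(';').and_then(|semi| {
            let body = &rest[1..semi];
            let ch = match body {
                "amp" => Some('&'),
                "lt" => Some('<'),
                "gt" => Some('>'),
                "quot" => Some('"'),
                "apos" => Some('\''),
                _ => {
                    let code = if let Some(hex) = body.strip_prefix("#x") {
                        u32::from_str_radix(hex, 16).ok()
                    } else if let Some(dec) = body.strip_prefix('#') {
                        dec.parse::<u32>().ok()
                    } else {
                        None
                    };
                    code.and_then(char::from_u32)
                }
            };
            ch.map(|c| (c, semi + 1))
        });
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &rest[consumed..];
            }
            None => {
                // Not a recognised reference: keep the '&' literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```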

  • Round 25: Document outline (bookmarks) + Link annotations
    (ISO 32000-1 §12.3.3 Tables 152+153 + §12.5.6.5 Table 173 + §12.3.2
    Table 151 destinations). New writer entry points
    write_pdf_from_scene_with_outlines + …_with_outlines_and_links
    attach a /Outlines tree to the catalog and per-page
    /Annots [/Subtype /Link] arrays without disturbing the existing
    single-/multi-page entry points. New reader functions
    read_pdf_outline + read_pdf_links walk the bookmark tree (the
    doubly-linked /First//Last//Next//Prev shape collapses back into a
    parent-owned children Vec) and the per-page link list. Destinations
    cover all eight Table 151 forms (Xyz / Fit / FitH / FitV / FitR /
    FitB / FitBH / FitBV) with null retain-current semantics on the
    optional numerics. Link targets cover both internal
    /Dest <explicit-array> go-to and external /A << /S /URI /URI (...) >>
    action forms. Outline /Count honours the open / closed sign per
    Table 153 (open ⇒ +visible_descendants; closed ⇒
    -|hidden_descendants|), and the reader's OutlineNode::is_open() /
    descendant_count() helpers expose the same convention to callers.

    Tested end-to-end with +19 tests (16 integration in
    tests/outline_round25.rs covering three-bookmark catalog, nested
    open/closed chapters, every dest variant, Unicode title, URI + go-to
    link, multi-page link grouping, out-of-range writer rejection,
    combined outline+link round-trip, and empty-input baseline; +13 unit
    tests across src/outline.rs + src/reader/outline.rs +
    src/reader/link.rs).
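
The /Count sign convention can be made concrete with a short sketch. `Item`, `visible_when_open` and `outline_count` are hypothetical names over a minimal tree, not the crate's OutlineNode; returning None for childless items reflects the assumption that /Count is omitted there.

```rust
/// Hypothetical outline node: just the open flag and the children.
pub struct Item {
    pub open: bool,
    pub children: Vec<Item>,
}

/// Descendants that are visible once `item` itself is opened: every direct
/// child, plus the visible descendants of each child that is open.
fn visible_when_open(item: &Item) -> i64 {
    item.children
        .iter()
        .map(|c| 1 + if c.open { visible_when_open(c) } else { 0 })
        .sum()
}

/// /Count per the sign convention: positive when open, negative when
/// closed, None (entry omitted) when there are no children.
pub fn outline_count(item: &Item) -> Option<i64> {
    if item.children.is_empty() {
        return None;
    }
    let n = visible_when_open(item);
    Some(if item.open { n } else { -n })
}
```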
  • Round 24: CMS KARI X448 ECDH (RFC 7748 §5 + RFC 8410 §3 + RFC 8418
    §2.1 + §2.2). New KariCurve::X448 joins the existing P-256/P-384/
    P-521/X25519 dispatch — id-X448 (OID 1.3.101.111), 56-byte raw
    u-coordinate keys, 224-bit security level. Default KDF binding is
    X9.63-SHA-512 (security-strength match); HKDF SHA-256/384/512 are
    also valid via the new KariRecipient::x448_hkdf_* constructors.
    Reader (unwrap_kari / read_pdf_to_scene_with_certificate) and
    writer (write_pdf_from_scene_pubsec_kari) both handle X448 KARI
    envelopes through the existing entry points. RFC 7748 §6.2
    Alice/Bob test vector cross-checked byte-for-byte. Backed by the
    pure-Rust x448 (RustCrypto / ed448-goldilocks) crate.

  • Round 23: JPEG passthrough on /Filter /DCTDecode Image XObjects
    (ISO 32000-1 §7.4.8 + §8.9). New DocumentReader::image_xobjects()
    walks every page's /Resources /XObject subdict and surfaces every
    Image XObject whose final filter is /DCTDecode. The returned
    PdfImageXObject carries the unmodified JPEG bytes (ready for any
    JPEG decoder), the /Width / /Height, the /ColorSpace
    (DeviceRGB / DeviceCMYK / DeviceGray / Indexed / Other),
    and the /BitsPerComponent. Wrapping /ASCII85Decode /
    /ASCIIHexDecode / /FlateDecode filters preceding /DCTDecode are
    unwrapped before the JPEG payload is returned. Cross-checked against
    pdfimages -all (poppler-utils) as a black-box validator — extracted
    bytes are byte-identical to both the source JPEG and pdfimages's
    dump.
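
Unwrapping an outer text filter before handing JPEG bytes on can be sketched as below. All three helpers are hypothetical and only /ASCIIHexDecode is shown; the whitespace test uses Rust's ASCII definition, which is slightly narrower than the PDF white-space set (it omits NUL), so treat this as an approximation.

```rust
/// Decode an /ASCIIHexDecode stream: whitespace is skipped, `>` ends the
/// data, and an odd trailing digit is treated as if followed by `0`.
/// Returns None on a byte that is neither hex, whitespace, nor `>`.
pub fn ascii_hex_decode(data: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut hi: Option<u8> = None;
    for &b in data {
        if b == b'>' {
            break;
        }
        if b.is_ascii_whitespace() {
            continue;
        }
        let v = (b as char).to_digit(16)? as u8;
        match hi.take() {
            None => hi = Some(v),
            Some(h) => out.push((h << 4) | v),
        }
    }
    if let Some(h) = hi {
        out.push(h << 4);
    }
    Some(out)
}

/// True when the last entry of the /Filter chain is /DCTDecode.
pub fn final_filter_is_dct(filters: &[&str]) -> bool {
    filters.last().map_or(false, |f| *f == "DCTDecode")
}

/// A JPEG stream starts with the SOI marker FF D8.
pub fn looks_like_jpeg(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xFF, 0xD8])
}
```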

  • Round 22: text extraction. DocumentReader::text_extraction() walks
    every page's content stream and emits TextRuns (text + position +
    font name + font size) for Tj / TJ / ' / " operators. Maps
    encoded glyphs back to Unicode through embedded /ToUnicode CMaps
    (bfchar / bfrange per ISO 32000-1 §9.10.3), Identity-H Type 0
    CIDs, WinAnsiEncoding, and MacRomanEncoding (Annex D.2). Cross-checked
    against pdftotext (poppler) as a black-box validator.
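
The bfchar / bfrange lookup can be sketched with a linear search. `BfRange` and `to_unicode` are hypothetical names; only the simple `<lo> <hi> <dst>` form of bfrange is modelled, codes are limited to 16 bits, and each destination is a single Unicode scalar, all of which are simplifications.

```rust
/// One `bfrange` entry of the simple form `<lo> <hi> <dst>`; the array
/// form of bfrange is left out of this sketch.
pub struct BfRange {
    pub lo: u16,
    pub hi: u16,
    pub dst: u32, // Unicode scalar value mapped to `lo`
}

/// Map a character code to Unicode via bfchar entries first, then
/// bfrange entries; None when nothing matches.
pub fn to_unicode(code: u16, bfchar: &[(u16, char)], bfrange: &[BfRange]) -> Option<char> {
    if let Some(&(_, c)) = bfchar.iter().find(|(k, _)| *k == code) {
        return Some(c);
    }
    bfrange
        .iter()
        .find(|r| r.lo <= code && code <= r.hi)
        .and_then(|r| r.dst.checked_add(u32::from(code - r.lo)))
        .and_then(char::from_u32)
}
```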


This PR was generated with release-plz.

@MagicalTux MagicalTux force-pushed the release-plz-2026-05-10T06-03-13Z branch 8 times, most recently from 4104ffc to 33dd3c0 Compare May 11, 2026 23:51
@MagicalTux MagicalTux force-pushed the release-plz-2026-05-10T06-03-13Z branch from 33dd3c0 to b5282ac Compare May 12, 2026 07:37