OxideAV/oxideav-pdf

oxideav-pdf

Pure-Rust PDF writer + reader for the oxideav framework. The writer emits PDF 1.4 vector documents from VectorFrame / Scene inputs (paths stay paths, fills stay fills); the reader walks bytes back into a Scene, with optional decryption for password-protected files. Zero C dependencies.

Part of the oxideav framework — a pure-Rust media stack. Codec, container, and filter crates are implemented from the spec (no C codec libraries linked or wrapped, no *-sys crates).

What round 1 supports

  • Paths: MoveTo (m), LineTo (l), CubicCurveTo (c), QuadCurveTo (lifted to cubic via the 2/3 * (control - endpoint) trick), ArcTo (flattened to cubic per SVG 1.1 Appendix F.6.5), Close (h).
  • Fills: Paint::Solid (DeviceRGB sc), Paint::LinearGradient (axial pattern shading, Pattern Type 2 + Function Type 2), Paint::RadialGradient (radial shading, Function Type 3).
  • Strokes: width (w), cap (J), join (j), miter limit (M), dash pattern (d).
  • Transforms: every Group::transform emits one cm operator.
  • Groups: q ... Q save/restore brackets around children. Group opacity becomes an ExtGState resource referenced via /GSx gs.
  • Clip paths: emitted before the children's content stream as W n (or W* n for even-odd fill rule).
  • Fill rules: NonZero (f / B) vs. EvenOdd (f* / B*).
  • Embedded raster: ImageRef whose underlying VideoFrame is RGBA8 lands as a FlateDecode Image XObject and is painted with Do.
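The quadratic-to-cubic lift listed under Paths is standard Bézier degree elevation: each cubic control point sits two thirds of the way from an endpoint towards the quadratic control point. A minimal sketch follows; the `Pt` type and `quad_to_cubic` name are illustrative assumptions, not the crate's API.

```rust
/// Stand-in 2-D point (hypothetical; not the oxideav-core `Point`).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pt {
    x: f64,
    y: f64,
}

/// Degree-elevate the quadratic (p0, ctrl, p2) and return the two cubic
/// control points. The endpoints p0 and p2 are reused unchanged.
fn quad_to_cubic(p0: Pt, ctrl: Pt, p2: Pt) -> (Pt, Pt) {
    // Move 2/3 of the way from an endpoint towards the quadratic control point.
    let toward_ctrl = |end: Pt| Pt {
        x: end.x + 2.0 / 3.0 * (ctrl.x - end.x),
        y: end.y + 2.0 / 3.0 * (ctrl.y - end.y),
    };
    (toward_ctrl(p0), toward_ctrl(p2))
}

fn main() {
    let (c1, c2) = quad_to_cubic(
        Pt { x: 0.0, y: 0.0 },
        Pt { x: 3.0, y: 6.0 },
        Pt { x: 6.0, y: 0.0 },
    );
    println!("c1 = {c1:?}, c2 = {c2:?}");
}
```

The resulting cubic traces exactly the same curve as the quadratic, so a single `c` operator can stand in for it.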

Encryption decode (full Standard handler)

The reader handles password-protected PDFs under the standard security handler across the full revision range ISO 32000 defines:

  • R=2 — RC4-40 (V=1, Length=40).
  • R=3 — RC4-128 (V=2, Length=128).
  • R=4 — AES-128 CBC or RC4-128, picked from the crypt-filter CFM (AESV2 vs V2).
  • R=5 — AES-256 CBC, V=5, CFM=AESV3. Adobe extension level 3 (PDF 1.7); plain SHA-256 password derivation with validation + key salts.
  • R=6 — AES-256 CBC, V=5, CFM=AESV3. ISO 32000-2:2020 (PDF 2.0); iterated SHA-256/384/512 hash chain (Algorithm 2.B) plus /Perms block validation (Algorithm 13).

Both user and owner passwords authenticate (Algorithms 6 + 7 for R≤4; Algorithms 11 + 12 for R≥5); the default empty user password is tried first so PDFs encrypted "just for permission flags" open with no caller intervention. Strings and stream payloads are decrypted via per-object keys (Algorithm 1) for R≤4 and via the file key directly (no per-object derivation) for R≥5.

let pdf = std::fs::read("locked.pdf")?;
// Default API tries the empty user password.
match oxideav_pdf::read_pdf_to_scene(&pdf) {
    Ok(scene) => println!("opened: {} pages", scene.pages.unwrap().len()),
    Err(_)    => {
        // Password-protected — supply one.
        let scene = oxideav_pdf::read_pdf_to_scene_with_password(&pdf, b"hunter2")?;
    }
}
# Ok::<(), Box<dyn std::error::Error>>(())

Per-stream crypt-filter overrides land in a follow-up round.
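For R≤4, the per-object key derivation (Algorithm 1) hashes the file key together with the object and generation numbers. The sketch below only assembles the MD5 input and the truncation length, since the standard library has no MD5; `object_key_input` is a hypothetical helper, not part of oxideav-pdf.

```rust
/// Build the byte string that Algorithm 1 feeds to MD5, plus the length the
/// digest is truncated to. `aes` selects the AESV2 variant, which appends
/// the fixed "sAlT" suffix.
fn object_key_input(
    file_key: &[u8],
    obj_num: u32,
    generation: u16,
    aes: bool,
) -> (Vec<u8>, usize) {
    let mut input = file_key.to_vec();
    // Low-order 3 bytes of the object number, low-order byte first.
    input.extend_from_slice(&obj_num.to_le_bytes()[..3]);
    // Low-order 2 bytes of the generation number, low-order byte first.
    input.extend_from_slice(&generation.to_le_bytes());
    if aes {
        input.extend_from_slice(b"sAlT");
    }
    // The object key is the first min(n + 5, 16) bytes of MD5(input).
    let key_len = (file_key.len() + 5).min(16);
    (input, key_len)
}

fn main() {
    let (input, key_len) = object_key_input(&[0xAA; 5], 0x010203, 1, false);
    println!("hash input: {input:02X?}, object key length: {key_len}");
}
```

Feeding `input` to any MD5 implementation and keeping the first `key_len` bytes of the digest yields the RC4 or AES-128 object key.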

Public-key encryption (decode + encode)

The reader and writer both handle public-key-encrypted PDFs under the adbe.pkcs7.s3 / s4 / s5 SubFilters of the public-key security handler (ISO 32000-1 §7.6.4 + ISO 32000-2 §7.6.5):

  • adbe.pkcs7.s3 — RC4-40, V=1, SHA-1 file-key derivation.
  • adbe.pkcs7.s4 — RC4-128, V=2, SHA-1.
  • adbe.pkcs7.s5, V=4 — RC4-128 or AES-128 CBC via CFM (V2 / AESV2).
  • adbe.pkcs7.s5, V=5 — AES-256 CBC, CFM=AESV3, SHA-256.

The trailer's /Recipients array (or /CF /<StmF> /Recipients for s5) carries one CMS EnvelopedData (RFC 5652 §6.1) per access-permission set; each envelope's KeyTransRecipientInfo SET wraps the content-encryption key with RSAES-PKCS1-v1_5 to a recipient's RSA public key. The reader matches by either IssuerAndSerialNumber (CMS v0) or SubjectKeyIdentifier (CMS v2 — RFC 5280 §4.2.1.2 method 1 SHA-1 of the SPKI BIT STRING contents), RSA-decrypts the wrapped CEK, decrypts the envelope contents (RC4 / AES-128 / AES-256 CBC), then derives the file encryption key per §7.6.4.3 / §7.6.5.3.

use oxideav_pdf::{read_pdf_to_scene_with_certificate, PubSecCredential};

let cert_der    = std::fs::read("user.cert.der")?;
let pkcs8_der   = std::fs::read("user.key.pkcs8.der")?;
let credential  = PubSecCredential::from_der(&cert_der, &pkcs8_der)?;
let scene = read_pdf_to_scene_with_certificate(&pdf_bytes, &credential)?;
# Ok::<(), Box<dyn std::error::Error>>(())

Round 11 lands the symmetric encoder side: the writer emits public-key-encrypted PDFs that round-trip through the reader.

use oxideav_pdf::{
    write_pdf_from_scene_pubsec_encrypted, PubSecEncoderConfig, PubSecRecipient,
};

// One recipient — IssuerAndSerial form.
let recipient = PubSecRecipient::from_issuer_and_serial(
    issuer_der,           // recipient cert's `issuer` SEQUENCE bytes
    serial_bytes,         // recipient cert's serial INTEGER body
    rsa_public_key,
);
let cfg = PubSecEncoderConfig::pkcs7_s5_v5_aes256(vec![recipient]);
let pdf = write_pdf_from_scene_pubsec_encrypted(&scene, &cfg)?;
# Ok::<(), oxideav_pdf::PdfError>(())

PubSecRecipient also exposes from_subject_key_identifier(ski, key) for the CMS v2 form.

Round 12 adds per-crypt-filter recipient lists: write_pdf_from_scene_pubsec_multi_cf + PubSecMultiCfConfig + PubSecCfGroup emit a doc with multiple permission sets (each its own envelope), and open_with_certificate_with_permissions surfaces the matched recipient's permission mask. Round 12 lands the CMS KARI decoder (RFC 5652 §6.2.2) — KeyAgree (ECDH/DH) recipients parse structurally.

Round 14 closes the unwrap: P-256 ECDH + RFC 5753 §7.1.2 X9.63-SHA-256 KDF + RFC 3394 AES Key Wrap (128/192/256-bit) for the dhSinglePass-stdDH-sha256kdf-scheme KEA OID.

Round 15 extends the curve set: P-384 (dhSinglePass-stdDH-sha384kdf-scheme, X9.63-SHA-384) and X25519 (RFC 8418 §2.1, secg-scheme …sha256kdf + id-X25519) join P-256 — pass PubSecCredential::from_parsed_ec(cert, KariCurve::P384, scalar) (or P256 / X25519) and the KARI envelope opens through the same read_pdf_to_scene_with_certificate entry point as KTRI. Round 15 also lands the writer-side KARI encode: write_pdf_from_scene_pubsec_kari(scene, &PubSecKariConfig) mirrors the round-11 KTRI writer — each KariRecipient { curve, … } becomes one CMS KARI envelope with AES-256-WRAP.

Round 16 lands P-521 (dhSinglePass-stdDH-sha512kdf-scheme, X9.63-SHA-512) + RFC 8418 §2.2 HKDF binding for X25519 (dhSinglePass-stdDH-hkdf-sha256/384/512-scheme, smime-alg 19/20/21).

Round 24 closes the RFC 8418 curve set with X448 (RFC 7748 §5 / RFC 8410 §3 — id-X448 1.3.101.111, 56-byte raw u-coordinate, 224-bit security level): pass KariCurve::X448 and the same writer + reader entry points handle it. Default KDF is X9.63-SHA-512 (security-strength match); HKDF SHA-256/384/512 are also valid via the KariRecipient::x448_hkdf_* constructors. Cross-checked against the RFC 7748 §6.2 Alice/Bob shared-secret vector byte-for-byte.

Round 17 closes the long-term-cert originator gap: when a KARI envelope's OriginatorIdentifierOrKey is IssuerAndSerial or SubjectKeyIdentifier rather than the in-band OriginatorPublicKey, the recipient resolves the originator cert through a TrustStore — pass it via read_pdf_to_scene_with_certificate_and_trust_store(pdf, &cred, &store). Round 17 also adds read-only decode for legacy RC2-CBC (RFC 2268 + RFC 3217) and DES-EDE3-CBC (3DES, RFC 3370 §5.2) envelope content algorithms so PDF 2.0-deprecated archives still open; no encode-side support — the writer always uses AES.

Round 18 surfaces previously-discarded CMS metadata: the envelope's OriginatorInfo (RFC 5652 §10.2.1 — certs[] / crls[]) is now exposed via EnvelopedData::originator_info(), and the RecipientKeyIdentifier's OPTIONAL date (GeneralizedTime) + other (OtherKeyAttribute) fields are captured by the parser. New TrustStore::find_with_temporal_validity(ski, instant) uses the RKID date to pick the cert generation that was active when the envelope was authored — useful for long-lived archives where multiple cert generations exist for the same SKI. The Certificate parser now also extracts the validity window (notBefore / notAfter), normalising UTCTime to GeneralizedTime per RFC 5280 §4.1.2.5.1's 1950..2049 pivot for direct byte-comparison.

Round 19 ships two orthogonal additions:

  • Document-level XMP /Metadata stream end-to-end (ISO 32000-1 §14.3.2 + Adobe XMP Spec 2012): writer entry write_pdf_from_scene_with_xmp(scene, xmp_bytes) attaches the raw XMP RDF/XML packet to the catalog as a /Type /Metadata /Subtype /XML stream (no /Filter); reader accessor DocumentReader::xmp_metadata() returns Some(bytes) for documents that carry one.
  • CMS SignedData parser scaffolding (RFC 5652 §5 — PKCS#7): pubsec::signed_data::parse_signed_data decodes id-signedData blobs into typed SignedData { digest_algorithms, encap_content, certs, crls, signer_infos } + SignerInfo (sid, digest / signature OIDs, signed / unsigned attribute lists with raw-DER values, raw signature octets).
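The UTCTime-to-GeneralizedTime normalisation mentioned in the round-18 notes follows the RFC 5280 §4.1.2.5.1 pivot: two-digit years 50..99 belong to the 1900s and 00..49 to the 2000s. A minimal sketch, assuming the input is the mandatory `YYMMDDHHMMSSZ` form; `utc_to_generalized` is a hypothetical name, not the crate's function.

```rust
/// Expand an X.509 UTCTime ("YYMMDDHHMMSSZ") into GeneralizedTime
/// ("YYYYMMDDHHMMSSZ"). Returns None for any other shape.
fn utc_to_generalized(utc: &str) -> Option<String> {
    let b = utc.as_bytes();
    let well_formed =
        b.len() == 13 && b[12] == b'Z' && b[..12].iter().all(|c| c.is_ascii_digit());
    if !well_formed {
        return None;
    }
    let yy = (b[0] - b'0') * 10 + (b[1] - b'0');
    // RFC 5280 pivot: YY >= 50 means 19YY, otherwise 20YY.
    let century = if yy >= 50 { "19" } else { "20" };
    Some(format!("{century}{utc}"))
}

fn main() {
    for t in ["500101000000Z", "491231235959Z"] {
        println!("{t} -> {:?}", utc_to_generalized(t));
    }
}
```

Once both bounds are in the 15-character form, plain byte-wise comparison orders them chronologically, which is what makes the normalisation useful for validity-window checks.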

Round 20 closes the round-19 verification deferral. New pubsec::verify::verify_signature(signer, certs, content) resolves the signer's certificate from a pool by IssuerAndSerial or SubjectKeyIdentifier, hashes the canonical (universal-SET-tag) re-encoding of signedAttrs per digestAlgorithm, and verifies the hash against signature per signatureAlgorithm (RFC 5652 §5.4 + §11.2). Hash side: SHA-1 / SHA-256 / SHA-384 / SHA-512. Signature side: RSA-PKCS#1 v1.5 (the rsaEncryption + four sha*WithRSA OIDs all map here), RSA-PSS (id-RSASSA-PSS), and ECDSA on P-256 / P-384 / P-521 (curve dispatch by the cert SPKI's named-curve OID per RFC 5480 §2.1.1.1). When signedAttrs is present, the verifier also cross-checks the messageDigest attribute against the eContent hash (RFC 5652 §11.2) — so a tampered eContent fails even when the outer signature still verifies. Detached signatures (PAdES — eContent absent) feed the document bytes through AttachedContent::External(&[u8]). Round-20 also extends x509::Certificate to capture spki_algorithm_oid + spki_algorithm_params so the verifier can route ECDSA on the named-curve OID without re-parsing the certificate.

Round 21 closes the reader half of the round-20 follow-up list: PDF /Sig annotation reader (ISO 32000-1 §12.7.4.5 + §12.8.1). DocumentReader::signatures() walks the catalog → /AcroForm /Fields tree (honouring /FT inheritance through non-terminal /Kids parents per §12.7.3.1) and surfaces one [PdfSignature] per /V signature dictionary it can parse. Each value carries the [a, b, c, d] /ByteRange, the hex-decoded /Contents blob, the /SubFilter (adbe.pkcs7.detached / ETSI.CAdES.detached etc.), the optional metadata fields (/Name, /Reason, /Location, /ContactInfo, /M), and — for the CMS-detached SubFilters — the parsed [pubsec::signed_data::SignedData]. PdfSignature::signed_message(pdf) concatenates the two /ByteRange-named slices into the byte string the signing tool hashed; pass it as AttachedContent::External(...) to the existing [pubsec::verify::verify_signature] for a full end-to-end verify.

use oxideav_pdf::reader::DocumentReader;
use oxideav_pdf::pubsec::verify::{verify_signature, AttachedContent};
use oxideav_pdf::pubsec::x509::parse_certificate;

let mut r = DocumentReader::open(&pdf_bytes)?;
for sig in r.signatures()? {
    if !sig.is_cms_detached() { continue; }
    let signed = sig.signed_message(&pdf_bytes)?;
    let sd = sig.signed_data.as_ref().expect("CMS-detached parsed");
    let certs: Vec<_> = sd.certs.iter()
        .filter_map(|der| parse_certificate(der).ok())
        .collect();
    let ok = verify_signature(
        &sd.signer_infos[0],
        &certs,
        AttachedContent::External(&signed),
    )?;
    println!("signature verifies: {ok}");
}
# Ok::<(), oxideav_pdf::PdfError>(())

The reader is tolerant of unsigned slots (a Sig form field whose /V is absent — common for "approval line still pending" templates), of non-terminal parent fields without their own /V, and of malformed /Contents blobs (the dict surfaces but signed_data is None).
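As a rough illustration of what the /ByteRange concatenation amounts to, the sketch below joins the two slices named by `[a, b, c, d]` (offset, length, offset, length) with bounds checks. The free function `signed_message` here is a stand-in written for this example; the crate's `PdfSignature::signed_message` may differ in signature and error handling.

```rust
/// Concatenate pdf[a..a+b] and pdf[c..c+d], i.e. everything the signer
/// hashed. Returns None if either range falls outside the file.
fn signed_message(pdf: &[u8], byte_range: [usize; 4]) -> Option<Vec<u8>> {
    let [a, b, c, d] = byte_range;
    let first = pdf.get(a..a.checked_add(b)?)?;
    let second = pdf.get(c..c.checked_add(d)?)?;
    let mut out = Vec::with_capacity(first.len() + second.len());
    out.extend_from_slice(first);
    out.extend_from_slice(second);
    Some(out)
}

fn main() {
    // "<sig>" plays the role of the /Contents hex string that is skipped.
    let pdf = b"AAAA<sig>BBBB";
    let msg = signed_message(pdf, [0, 4, 9, 4]).expect("ranges in bounds");
    println!("{}", String::from_utf8_lossy(&msg));
}
```

The gap between the two ranges is the /Contents value itself, which is why it can be filled in after hashing.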

Round 30 closes the symmetric writer half: the new oxideav_pdf::sig module emits signed PDFs with valid /ByteRange + PKCS#7 / CMS SignedData /Contents blobs (ISO 32000-1 §12.7.4.5 + §12.8.1 + §7.5.6 + RFC 5652 §5 + §5.4 + §11.2). The classic "ByteRange-placeholder fill-in" pattern is implemented end-to-end — build PDF with a fixed-width /ByteRange [?? ?? ?? ??] + a /Contents <0…0> placeholder (8192 hex chars = 4096 raw bytes, enough for any RSA-2048 / ECDSA-P256 SHA-256 SignedData with a single signer + cert), patch /ByteRange with the computed offsets, hash the bytes spanned by /ByteRange, wrap into a CAdES-BES-style CMS SignedData with signedAttrs = { contentType, messageDigest } per RFC 5652 §11.1+§11.2, hex-encode, overwrite the placeholder. A [Signer] trait decouples the crypto: bring your own ring / rsa / p256 / HSM impl, or use the reference [RsaPkcs1v15Sha256Signer] / [EcdsaP256Sha256Signer] that wrap the in-crate deps.

use oxideav_pdf::{sign_pdf_from_scene, RsaPkcs1v15Sha256Signer, SignerIdentity};

let private_key = rsa::RsaPrivateKey::new(&mut rsa::rand_core::OsRng, 2048)?;
let signer = RsaPkcs1v15Sha256Signer::new(private_key);
let identity = SignerIdentity::from_signer_cert_der(cert_der)?;
let signed_pdf = sign_pdf_from_scene(&scene, &signer, identity)?;
# Ok::<(), Box<dyn std::error::Error>>(())

Round-30 ships RSA-PKCS#1 v1.5 + SHA-256 and ECDSA-P256 + SHA-256. RSA-PSS, ECDSA on P-384 / P-521, and Ed25519 plug in through the same [Signer] trait without touching the writer surface. The output is accepted by qpdf --check and verifies end-to-end against the round-27 PKCS#7 verify dispatch.
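The last step of the placeholder fill-in pattern (hex-encode, then overwrite the fixed-width /Contents slot) can be sketched as below. Padding the unused tail with `0` characters is a common convention and an assumption here, as is the `fill_placeholder` name; the crate's writer may handle this differently.

```rust
use std::fmt::Write;

/// Width of the /Contents placeholder in hex characters (4096 raw bytes).
const PLACEHOLDER_HEX_LEN: usize = 8192;

/// Hex-encode a DER SignedData blob and pad it with '0' so it exactly
/// replaces the placeholder. Returns None if the blob does not fit.
fn fill_placeholder(signed_data_der: &[u8]) -> Option<String> {
    if signed_data_der.len() * 2 > PLACEHOLDER_HEX_LEN {
        return None;
    }
    let mut hex = String::with_capacity(PLACEHOLDER_HEX_LEN);
    for byte in signed_data_der {
        write!(hex, "{byte:02X}").ok()?;
    }
    // Keep the slot width constant so no byte offset in the file moves.
    while hex.len() < PLACEHOLDER_HEX_LEN {
        hex.push('0');
    }
    Some(hex)
}

fn main() {
    let hex = fill_placeholder(&[0x30, 0x82, 0x0A]).expect("fits");
    println!("{} chars, starts with {}", hex.len(), &hex[..6]);
}
```

Because the replacement has the same length as the placeholder, the /ByteRange offsets computed earlier stay valid after the overwrite.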

Encryption encode (writer side)

The writer emits password-protected PDFs across the same revision range the reader handles. [oxideav_pdf::write_pdf_from_scene_encrypted] takes a [Scene] and an [encrypt::EncryptionConfig] and produces bytes that round-trip through read_pdf_to_scene_with_password:

use oxideav_pdf::encrypt::EncryptionConfig;

let cfg = EncryptionConfig::aes_256_r6(b"hunter2", b"FILE-ID-16-BYTES");
let pdf = oxideav_pdf::write_pdf_from_scene_encrypted(&scene, &cfg)?;
# Ok::<(), oxideav_pdf::PdfError>(())

Writer-side coverage matches the reader: R=2 (RC4-40), R=3 (RC4-128), R=4 (AES-128 / RC4 via CFM), R=5 (Adobe ext L3), R=6 (ISO 2.0). /O, /U, /OE, /UE, and /Perms come from the canonical algorithms (3, 4, 5 for V≤4; 8, 9, 10 for V=5); per-object key derivation is Algorithm 1 (V≤4) or the file key directly (V=5).

Cross-reference streams

Both reader and writer support the binary cross-reference stream form introduced in PDF 1.5 (ISO 32000-1 §7.5.8): a /Type /XRef stream object whose body packs each entry into /W [w1 w2 w3] big-endian fields, Flate-compressed with /Predictor 12 (PNG-Up). The classical xref-keyword form (PDF 1.0..1.4) is also accepted on input and remains the writer's default; opt into the stream form via [oxideav_pdf::write_pdf_from_scene_xref_stream].
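To make the /W packing and the PNG-Up predictor concrete, here is a small sketch that packs xref entries into big-endian fields and then applies the Up filter row by row (the Flate step is omitted, since the standard library has no deflate). All function names are illustrative assumptions rather than the crate's internals.

```rust
/// Append the low `width` bytes of `value`, most significant byte first.
fn pack_field(value: u64, width: u32, out: &mut Vec<u8>) {
    for i in (0..width).rev() {
        // checked_shr yields None for shifts of 64 bits or more; those bytes are zero.
        out.push(value.checked_shr(8 * i).unwrap_or(0) as u8);
    }
}

/// Pack one xref entry (type, field 2, field 3) using the /W widths.
fn pack_entry(fields: [u64; 3], w: [u32; 3]) -> Vec<u8> {
    let mut row = Vec::new();
    for (value, width) in fields.iter().zip(w.iter()) {
        pack_field(*value, *width, &mut row);
    }
    row
}

/// PNG "Up" filter: each row becomes a filter-type byte (2) followed by the
/// byte-wise difference from the row above (all zeros above the first row).
/// Assumes every row has the same length.
fn png_up(rows: &[Vec<u8>]) -> Vec<u8> {
    let row_len = rows.first().map_or(0, |r| r.len());
    let mut prev = vec![0u8; row_len];
    let mut out = Vec::new();
    for row in rows {
        out.push(2);
        for (cur, above) in row.iter().zip(prev.iter()) {
            out.push(cur.wrapping_sub(*above));
        }
        prev = row.clone();
    }
    out
}

fn main() {
    let w = [1, 2, 1]; // /W [1 2 1]
    // Type 1 = in-use object: field 2 is the byte offset, field 3 the generation.
    let rows = vec![pack_entry([1, 0x0102, 0], w), pack_entry([1, 0x0110, 0], w)];
    println!("{:02X?}", png_up(&rows));
}
```

Consecutive xref rows tend to differ in only a byte or two, so the Up filter turns most of the table into zeros before compression.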

Object streams

Both reader and writer support PDF 1.5+ object streams (/Type /ObjStm, ISO 32000-1 §7.5.7). The reader resolves Compressed xref entries by fetching the containing object stream, parsing its (obj_num offset) header, and returning the body bytes from the matching slot. The writer packs every compressible indirect object (every dict that isn't a stream and isn't the Catalog) into one ObjStm container — opt in via [oxideav_pdf::write_pdf_from_scene_object_stream]. Stream objects (content streams, image XObjects, the xref stream itself) cannot be compressed per §7.5.7 and remain at their own byte offsets.
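The object-stream lookup described above can be sketched as follows, working on the already-decoded stream payload. `objstm_slot` is a hypothetical helper; `n` and `first` stand for the stream dictionary's /N and /First values, and the lookup here is by object number.

```rust
/// Return the body bytes of object `wanted` from a decoded /ObjStm payload.
fn objstm_slot(decoded: &[u8], n: usize, first: usize, wanted: usize) -> Option<&[u8]> {
    // The header is ASCII: N pairs of "obj_num offset" before byte `first`.
    let header = std::str::from_utf8(decoded.get(..first)?).ok()?;
    let nums = header
        .split_ascii_whitespace()
        .take(n.checked_mul(2)?)
        .map(|tok| tok.parse::<usize>().ok())
        .collect::<Option<Vec<usize>>>()?;
    if nums.len() != n * 2 {
        return None; // header shorter than /N promised
    }
    let pairs: Vec<(usize, usize)> = nums.chunks_exact(2).map(|p| (p[0], p[1])).collect();
    let idx = pairs.iter().position(|&(num, _)| num == wanted)?;
    // Offsets in the header are relative to /First.
    let start = first.checked_add(pairs[idx].1)?;
    // A slot ends where the next one starts, or at the end of the stream.
    let end = match pairs.get(idx + 1) {
        Some(&(_, off)) => first.checked_add(off)?,
        None => decoded.len(),
    };
    decoded.get(start..end)
}

fn main() {
    let decoded = b"10 0 11 8 <</A 1>><</B 2>>";
    let body = objstm_slot(decoded, 2, 10, 11).expect("object 11 present");
    println!("{}", String::from_utf8_lossy(body));
}
```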

Incremental updates

[oxideav_pdf::write_pdf_incremental_update] appends new revisions to a previously-written PDF per ISO 32000-1 §7.5.6 — the new revision's body is appended verbatim, followed by a new xref subsection that lists only the changed slots, plus a trailer carrying /Prev <prev_xref_off> pointing back at the original revision. The reader follows the /Prev chain and merges entries: the newest revision wins on overlap.

let original = oxideav_pdf::write_pdf_from_scene(&scene_v1)?;
// ... time passes; user adds two pages ...
let updated = oxideav_pdf::write_pdf_incremental_update(&original, &new_pages)?;
// `updated` starts with `original` byte-for-byte, then appends.
# Ok::<(), oxideav_pdf::PdfError>(())
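The "newest revision wins" merge along the /Prev chain can be pictured with a plain map per revision. This is a simplified sketch: the `Xref` alias and `merge_revisions` are assumptions for illustration, and a real xref table also tracks generations, free entries, and compressed entries.

```rust
use std::collections::HashMap;

/// One revision's xref section: object number -> byte offset.
type Xref = HashMap<u32, u64>;

/// Merge revisions ordered newest first (the order a reader meets them when
/// it starts at `startxref` and follows /Prev). The first entry seen for an
/// object number is kept, so older revisions never override newer ones.
fn merge_revisions(newest_first: &[Xref]) -> Xref {
    let mut merged = Xref::new();
    for revision in newest_first {
        for (&obj, &offset) in revision {
            merged.entry(obj).or_insert(offset);
        }
    }
    merged
}

fn main() {
    let original: Xref = [(1, 15), (2, 80), (3, 140)].into_iter().collect();
    // The update rewrites object 2 and adds object 4.
    let update: Xref = [(2, 900), (4, 960)].into_iter().collect();
    let merged = merge_revisions(&[update, original]);
    println!("object 2 now lives at offset {}", merged[&2]);
}
```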

Per-stream /Crypt /Identity opt-out

ISO 32000-1 §7.6.5 lets a single stream opt out of per-object encryption by listing /Crypt as its first /Filter with /DecodeParms /Name /Identity (or no /Name — the default per §7.4.10 Table 24). The writer leaves such streams untouched while encrypting the rest of the file; the reader applies the same rule on input. The classic consumer is XMP metadata streams that need to remain searchable in encrypted PDFs.

Linearization (Fast Web View)

Round 9 emits Linearized PDF per ISO 32000-1 §7.5.6 + Annex F. [write_pdf_from_scene_linearized] produces a PDF whose first 1024 bytes carry a complete linearization parameter dictionary (/Linearized 1 + /L + /H + /O + /E + /N + /T); the on-wire layout follows F.3.1 (header → lin-dict → first-page xref → catalog → hint stream → first-page section → remaining pages → main xref). startxref at EOF points at the first-page xref; the first-page trailer's /Prev points at the main xref. The output is also a valid plain PDF — readers ignoring /Linearized walk the same Catalog + Pages tree + page content.

The hint stream emits the page offset table (F.4.1) with full per-page entries (round 13: items 1, 2, 6, 7 — object count, page length, content stream offset relative to page start, content stream length) at fixed 32-bit width, plus minimal shared-object (F.4.2), thumbnail (F.4.3), and outline (F.4.4) header sections. Entry counts for the latter three are zero so no per-shared-object / per-thumbnail / per-outline bytes are generated. The hint dict carries /S, /T, /O offsets into the decoded hint stream so a reader walking the optional tables sees a fully-formed (if empty) layout. Extended generic (F.4.5) and embedded-file-stream (F.4.6) tables are still deferred — we generate no interactive forms / structure trees / embedded files.

Text extraction (round 22)

[DocumentReader::text_extraction] walks every page's content stream and emits one [TextRun] per Tj / TJ / ' / " operator, with the text-matrix origin and Tf font + size resolved per ISO 32000-1 §9.4.4. Encoded glyphs are mapped back to Unicode through the font's /ToUnicode CMap when present (parsing the bfchar / bfrange blocks defined in §9.10.3 + Adobe Tech Note #5014); for Identity-H Type 0 fonts without /ToUnicode the walker falls back to interpreting each 2-byte CID as a BMP code point. Simple fonts honour /Encoding /WinAnsiEncoding and /Encoding /MacRomanEncoding (Annex D.2), with a Latin-1 fallback for everything else.

use oxideav_pdf::reader::DocumentReader;

let pdf = std::fs::read("invoice.pdf")?;
let mut reader = DocumentReader::open(&pdf)?;
let extraction = reader.text_extraction()?;
for run in &extraction.runs {
    println!("@({:.0},{:.0}) {}/{}: {}",
        run.position.0, run.position.1,
        run.font_name, run.font_size, run.text);
}
println!("flat: {}", extraction.flat_text());
# Ok::<(), Box<dyn std::error::Error>>(())

Runs come out in stream order — the rendering order the page would have laid down. Reading-order reconstruction (column / paragraph segmentation) is a future-round followup; round 22 gives the raw runs plus matrix positions so a downstream layout pass can do its own segmentation.
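The Identity-H fallback described above (each 2-byte CID read as a BMP code point) is easy to show in isolation. In this sketch, lone surrogates map to U+FFFD and a trailing odd byte is dropped; both choices are assumptions, as is the `identity_h_fallback` name.

```rust
/// Interpret a string operand as big-endian 2-byte codes and map each one
/// directly to the Unicode scalar with the same value.
fn identity_h_fallback(bytes: &[u8]) -> String {
    bytes
        .chunks_exact(2)
        .map(|pair| u32::from(u16::from_be_bytes([pair[0], pair[1]])))
        // Surrogate values (D800..DFFF) are not scalars; mark them instead.
        .map(|cp| char::from_u32(cp).unwrap_or('\u{FFFD}'))
        .collect()
}

fn main() {
    println!("{}", identity_h_fallback(&[0x00, 0x48, 0x00, 0x69, 0x20, 0xAC]));
}
```

This only produces sensible text when the font's CIDs happen to equal Unicode values, which is why a /ToUnicode CMap takes priority when one exists.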

JPEG passthrough on Image XObjects (round 23)

[DocumentReader::image_xobjects] walks every page's /Resources /XObject subdict and surfaces every Image XObject whose final filter is /DCTDecode (ISO 32000-1 §7.4.8). The returned [PdfImageXObject] carries the unmodified JPEG bytes — the exact JPEG-1 / JFIF stream a JPEG decoder needs — plus the dictionary's /Width, /Height, /ColorSpace (mapped to the [ColorSpace] tag: DeviceRGB / DeviceCMYK / DeviceGray / Indexed / Other), and /BitsPerComponent. Wrapping /ASCII85Decode / /ASCIIHexDecode / /FlateDecode filters preceding /DCTDecode are unwrapped before the JPEG payload is returned, so callers always get a self-contained JPEG stream (the standard pdfimages -all shape).

use oxideav_pdf::reader::DocumentReader;

let pdf = std::fs::read("photos.pdf")?;
let mut reader = DocumentReader::open(&pdf)?;
for (id, image) in reader.image_xobjects()? {
    let path = format!("xobj-{}.jpg", id.number);
    std::fs::write(&path, &image.data)?;
    println!("{} ({}x{} {:?}, {} bpc)", path,
        image.width, image.height, image.color_space,
        image.bits_per_component);
}
# Ok::<(), Box<dyn std::error::Error>>(())

The same XObject referenced from multiple pages is returned once (deduplicated by ObjectId). Image XObjects with non-DCTDecode filters (FlateDecode-only raster XObjects, JBIG2Decode, JPXDecode, CCITTFaxDecode) are silently skipped — the round-23 walker is JPEG-only. Cross-checked against pdfimages -all (poppler-utils): the extracted streams are byte-identical.

Annotations beyond Link + XMP packet fields (round 26)

[DocumentReader::annotations] walks every page's /Annots array and surfaces every entry as a [PdfAnnotation] (ISO 32000-1 §12.5.6 Tables 169..209). Per-subtype payload covers /Text (sticky notes — /Open, /Name icon, /State, /StateModel), /FreeText (/DA, /Q quadding, /RC, /IT intent), /Stamp (icon name), the four text-markup variants /Highlight / /Underline / /Squiggly / /StrikeOut (/QuadPoints), /Square + /Circle (/IC, /RD), /Link (re-uses the round-25 go-to / URI decoder), and /Widget (/FT, /T, /V). Unknown subtypes (Movie, Sound, 3D, RichMedia, …) surface as AnnotationKind::Other { subtype }. Common Table 164 fields (/Rect, /Contents, /NM, /M, /F, /C, /Border) are decoded for every subtype.

use oxideav_pdf::{reader::DocumentReader, AnnotationKind};

let mut r = DocumentReader::open(&pdf_bytes)?;
for a in r.annotations()? {
    println!("page {} {:?}: {}", a.source_page_index, a.rect,
        a.contents.as_deref().unwrap_or(""));
    if let AnnotationKind::Stamp { icon } = &a.kind {
        println!("  stamp icon: {icon}");
    }
}
# Ok::<(), oxideav_pdf::PdfError>(())

[DocumentReader::xmp_packet] parses the document-level XMP packet that round 19 surfaces into a structured [XmpPacket] (ISO 32000-1 §14.3.2 + Adobe XMP Spec 2012 / ISO 16684-1 / ISO 19005-1..3 §6.x). It covers the most-used Dublin Core (dc:title via rdf:Alt, dc:creator via rdf:Seq, dc:subject via rdf:Bag, dc:rights, dc:format), XMP Basic (xmp:CreateDate / xmp:ModifyDate / xmp:MetadataDate / xmp:CreatorTool), PDF schema (pdf:Producer / pdf:Keywords / pdf:PDFVersion / pdf:Trapped), and PDF/A identification (pdfaid:part / pdfaid:conformance) fields. Element and attribute forms are both recognised; XML entities (&amp; / &lt; / &gt; / &quot; / &apos;) and numeric character references are decoded. XmpPacket::is_pdf_a() + pdf_a_conformance() collapse the pdfaid:part / pdfaid:conformance pair into a 1B-style PDF/A conformance designator.

let mut r = oxideav_pdf::reader::DocumentReader::open(&pdf_bytes)?;
if let Some(p) = r.xmp_packet()? {
    println!("title:    {:?}", p.dc_title);
    println!("creator:  {:?}", p.dc_creator);
    println!("producer: {:?}", p.pdf_producer);
    if p.is_pdf_a() {
        println!("PDF/A conformance: {:?}", p.pdf_a_conformance());
    }
}
# Ok::<(), oxideav_pdf::PdfError>(())
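The entity handling mentioned for XMP values (the five predefined XML entities plus decimal and hexadecimal character references) can be sketched as a single left-to-right pass. `decode_entities` is a hypothetical helper written for this example, and leaving unrecognised references untouched is an assumption about the desired behaviour.

```rust
/// Decode &amp; &lt; &gt; &quot; &apos; and numeric references (&#65; / &#x41;).
/// Anything that does not parse as a reference is copied through unchanged.
fn decode_entities(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut rest = s;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..]; // `rest` now starts at the '&'
        let decoded = rest.find(';').and_then(|semi| {
            let name = &rest[1..semi];
            let ch = match name {
                "amp" => Some('&'),
                "lt" => Some('<'),
                "gt" => Some('>'),
                "quot" => Some('"'),
                "apos" => Some('\''),
                _ => name.strip_prefix('#').and_then(|num| {
                    let cp = match num.strip_prefix('x') {
                        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                        None => num.parse::<u32>().ok()?,
                    };
                    char::from_u32(cp)
                }),
            };
            ch.map(|c| (c, semi + 1))
        });
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &rest[consumed..];
            }
            None => {
                // Not a reference: keep the '&' literally and move on.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    println!("{}", decode_entities("Tom &amp; Jerry &lt;3 &#65;&#x42;"));
}
```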

Simple-font /Encoding /Differences resolver (round 28)

Simple Type 1 / TrueType / Type 3 fonts may carry their /Encoding as a dictionary that overlays a /Differences array on top of a named /BaseEncoding (ISO 32000-1 §9.6.6.1). The reader resolves this properly: the array's flat [N name1 name2 … M nameK …] form is parsed (numeric tokens reset the running code; names land at consecutive slots), and each glyph name maps to its Unicode scalar through the Adobe Glyph List (subset staged under docs/document/pdf/agl/subset.txt, ~320 glyph names). The resolver plugs into the [DocumentReader::text_extraction] path so a /Differences-using font decodes correctly to Unicode.

use oxideav_pdf::reader::{
    apply_encoding_differences, parse_encoding_differences, BaseEncoding,
    EncodingMap,
};
// Imagine an inline encoding dict resolved from a PDF font:
//   /Encoding << /BaseEncoding /WinAnsiEncoding
//                /Differences [24 /breve /caron /circumflex] >>
let diffs = parse_encoding_differences(&diffs_array)?;
let base  = EncodingMap::from_base(BaseEncoding::WinAnsi);
let map   = apply_encoding_differences(&base, &diffs);
assert_eq!(map.decode(&[0x18]), "\u{02D8}"); // breve
# Ok::<(), oxideav_pdf::PdfError>(())

Unknown glyph names emit U+FFFD as a marker (matching what pdftotext --raw does for un-resolvable glyphs). Multi-character glyph expansions (/fi → "fi", /fl → "fl") are accommodated. Six base encodings are recognised: WinAnsi / MacRoman / MacExpert / Standard / Symbol / ZapfDingbats. Full AGL coverage (CJK, Cyrillic, Devanagari) is round-29+.
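The running-code rule for the flat /Differences array (a number resets the code, each following name takes the next slot) is shown in isolation below. The `Tok` enum and `differences` function are illustrative assumptions and stop short of the glyph-name-to-Unicode step.

```rust
use std::collections::BTreeMap;

/// Simplified token of a /Differences array: an integer code or a glyph name.
enum Tok<'a> {
    Code(u8),
    Name(&'a str),
}

/// Walk the flat array and return code -> glyph name overrides.
fn differences(tokens: &[Tok<'_>]) -> BTreeMap<u8, String> {
    let mut overrides = BTreeMap::new();
    let mut code: Option<u8> = None;
    for tok in tokens {
        match tok {
            // A numeric token resets the running code.
            Tok::Code(n) => code = Some(*n),
            // A name lands at the running code, which then advances by one.
            Tok::Name(name) => {
                if let Some(c) = code {
                    overrides.insert(c, name.to_string());
                    code = c.checked_add(1); // stop after code 255
                }
            }
        }
    }
    overrides
}

fn main() {
    // /Differences [24 /breve /caron /circumflex 39 /quotesingle]
    let toks = [
        Tok::Code(24),
        Tok::Name("breve"),
        Tok::Name("caron"),
        Tok::Name("circumflex"),
        Tok::Code(39),
        Tok::Name("quotesingle"),
    ];
    for (code, name) in differences(&toks) {
        println!("{code} -> /{name}");
    }
}
```

Names that appear before any numeric token have no slot to land in and are ignored in this sketch.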

Reading-order layout pass (round 29)

[DocumentReader::read_in_logical_order] walks the catalog's /StructTreeRoot /K tree and emits text runs in author-intended reading order rather than the painter's raster order (ISO 32000-1 §14.6 + §14.7 + §14.8 — Tagged PDF). For a 2-column document, naive raster extraction interleaves column 1's first row, column 2's first row, column 1's second row, …; the round-29 pass walks [Sect_col1, Sect_col2] and emits all of column 1 before any of column 2. The walker handles every leaf shape ISO 32000-1 §14.7.4.4 defines: bare-integer MCID kids (resolve against the ancestor's inheritable /Pg), <</Type /MCR /Pg p /MCID m>> marked-content references with their own /Pg overrides (cross-page tables), <</Type /OBJR …>> object references (skipped — they reference annotations, not text), and nested /StructElem kids which recurse with a 64-deep cycle guard.

use oxideav_pdf::reader::{DocumentReader, LayoutMode};

let mut r = DocumentReader::open(&pdf_bytes)?;
let result = r.read_in_logical_order()?;
match result.mode {
    LayoutMode::Tagged => println!("logical reading order:"),
    LayoutMode::Raster => println!("raster fallback (no /StructTreeRoot):"),
}
for run in &result.runs {
    println!("  {}", run.text);
}
# Ok::<(), oxideav_pdf::PdfError>(())

Documents without a /StructTreeRoot (or with a malformed / empty tree) fall back to the existing raster-order extraction with LayoutMode::Raster set on the return so callers can branch. The pass also exposes extract_text_marked(reader) which emits every text run alongside the marked-content /MCID it was painted under (for callers that want to assemble a custom logical order outside the StructTreeRoot — e.g. PDF/UA accessibility audits).
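The difference between painter order and structure-tree order in the two-column case can be illustrated with a toy tree. The `Kid` enum, `collect_mcids`, and the way the 64-level guard is applied are simplifications assumed for this sketch; page resolution via /Pg is left out.

```rust
/// Simplified structure-tree kid: a marked-content ID, a nested structure
/// element, or an object reference (ignored for text).
enum Kid {
    Mcid(u32),
    Elem(Vec<Kid>),
    ObjRef,
}

/// Recursion limit guarding against cyclic or absurdly deep trees.
const MAX_DEPTH: usize = 64;

/// Depth-first walk that records MCIDs in logical (tree) order.
fn collect_mcids(kids: &[Kid], depth: usize, out: &mut Vec<u32>) {
    if depth >= MAX_DEPTH {
        return;
    }
    for kid in kids {
        match kid {
            Kid::Mcid(id) => out.push(*id),
            Kid::Elem(children) => collect_mcids(children, depth + 1, out),
            Kid::ObjRef => {} // annotations carry no page text
        }
    }
}

fn main() {
    // Painter order is MCID 0, 1, 2, 3: the two columns interleaved row by row.
    let tree = vec![
        Kid::Elem(vec![Kid::Mcid(0), Kid::Mcid(2)]),              // column 1
        Kid::Elem(vec![Kid::ObjRef, Kid::Mcid(1), Kid::Mcid(3)]), // column 2
    ];
    let mut order = Vec::new();
    collect_mcids(&tree, 0, &mut order);
    println!("logical order: {order:?}");
}
```

Text runs tagged with these MCIDs can then be emitted in the collected order instead of the order they were painted.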

Deferred

  • Text emission — writer-side BT … Tj … ET for Node::Text using Type 0 fonts with a CIDFont built via oxideav-ttf/oxideav-otf. The reader-side extraction surface landed in round 22 (see above).
  • Writer-side JPEG passthrough on ImageRef (DCTDecode XObject) — needs core IR support for "raw codec bytes" alongside the decoded VideoFrame so the writer can emit /Filter /DCTDecode instead of re-encoding every JPEG to FlateDecoded raw RGBA. The reader-side surface landed in round 23 (see above).
  • Extended generic hint tables (F.4.5) and embedded-file-stream hint tables (F.4.6) for linearized output — we generate no interactive forms / structure trees / embedded files, so the per-table content would be empty anyway.
  • Ed25519 / Ed448 signature dispatch in pubsec::verify — round 20 covers RSA-PKCS#1 v1.5 / RSA-PSS / ECDSA on P-256 / P-384 / P-521; EdDSA needs an ed25519-dalek (or ed448-goldilocks) dep.
  • Transparency groups beyond a per-Group /ca+/CA opacity.

Usage

[dependencies]
oxideav-core = "0.1"
oxideav-pdf  = "0.0"

use oxideav_core::{
    FillRule, Group, Node, Paint, Path, PathNode, Point, Rgba, VectorFrame,
};
use oxideav_core::TimeBase;

let mut p = Path::new();
p.move_to(Point::new(10.0, 10.0))
    .line_to(Point::new(110.0, 10.0))
    .line_to(Point::new(110.0, 60.0))
    .line_to(Point::new(10.0, 60.0))
    .close();

let frame = VectorFrame {
    width: 200.0,
    height: 100.0,
    view_box: None,
    root: Group {
        children: vec![Node::Path(PathNode {
            path: p,
            fill: Some(Paint::Solid(Rgba::opaque(0xFF, 0x80, 0x00))),
            stroke: None,
            fill_rule: FillRule::NonZero,
        })],
        ..Group::default()
    },
    pts: None,
    time_base: TimeBase::new(1, 1),
};

let pdf = oxideav_pdf::write_pdf(&frame).expect("vector → PDF");
std::fs::write("out.pdf", pdf).unwrap();
# Ok::<(), Box<dyn std::error::Error>>(())

License

MIT — see LICENSE.
