Fork of: alphanome-ai/sec-parser
Scope: US-listed entities, English-only filings, 2015+ (inline CSS era)
Current support: 10-K, 10-Q. Expansion planned for 8-K, DEF 14A, S-1.
scotch-parser is the structural parsing module in a larger SEC filing analysis pipeline. It takes raw SEC filing HTML and produces a fully addressed, hierarchically structured document — every element (paragraph, title, table, image) identified by type, assigned a stable segment ID, and placed in a navigable tree.
The parser serves two downstream consumers:
- NLP Pipeline — Extracts every sentence from a filing with provenance (segment address). Downstream layers label each sentence for signal (`is_boilerplate`, `has_signal: low|medium|high`) and tag it with business drivers (revenue, concentration risk, KPI metrics, etc.) against a predetermined ontology. An LLM constructs evidence cards that synthesize fact-spans and surface mechanical insights (accounting changes, timing shifts, mix effects). The parser's job is to deliver clean, addressed, structurally segmented text — not to perform the labeling or synthesis.
- Reflowed Reader — A modern SEC filing reading experience with proper typography, scrollable tables, intelligent TOC navigation, and Intersection Observer-driven active-section tracking. The parser provides the structural backbone: the semantic tree powers the TOC, element types drive rendering components, and section boundaries enable scroll-position mapping.
A separate numeric pipeline (edgartools) handles XBRL fact extraction and financial statement/KPI processing. scotch-parser does not duplicate that work.
| # | Goal | Description | Primary Consumer |
|---|---|---|---|
| G1 | Segment-Addressed Text Extraction | Every element gets a stable, hierarchical segment_id (e.g., part2item7:revenue_recognition:text_3). All text is extractable with its structural address for full provenance. | NLP Pipeline |
| G2 | Smart TOC & Navigation | A serializable Table of Contents data structure with section IDs, hierarchy levels, and section boundaries. Powers a floating pill UI with Intersection Observer (idle/scroll/hover states, section-jump, progress-within-section). | Reflowed Reader |
| G3 | Complete Table Inventory | Identify and classify every table by purpose: financial statement, footnote, layout, TOC, schedule, exhibit. For non-XBRL tables, capture surrounding prose as context. | Both |
| G4 | Prose-Table Relationship | For each table, determine whether surrounding prose is substantive context (keep) or a stub reference like "refer to the table below" (discard prose, present table only). | Reflowed Reader |
| G5 | Diff-Ready Output | Normalized, segment-addressed text and table output suitable for text-diffing (prose changes across periods) and number-diffing (financial figure changes across periods). | NLP Pipeline |
| G6 | Sentence-Level Segmentation | Split each text element into individual sentences with inherited segment IDs (e.g., part2item7:revenue_recognition:text_3:s2). Filing-aware boundary detection handles SEC-specific abbreviations ("Inc.", "No.", "vs.", note references). | NLP Pipeline |
| G7 | Cross-Reference Linking | Detect intra-document references ("see Note 5", "as described in Item 7") and produce link targets using segment IDs. | Reflowed Reader |
| G8 | Multi-Filing-Type Support | Extend parsing beyond 10-K/10-Q to 8-K, DEF 14A, and S-1 filing types with appropriate section definitions and pipeline adjustments. | Both |
| G9 | Rendering Hint Metadata | Expose typography and layout hints (scrollable table thresholds, bold/italic/uppercase as rendering directives, table row counts) so the reader can apply proper typography rather than raw inline CSS. | Reflowed Reader |
| Responsibility | Detail |
|---|---|
| HTML structural parsing | Parse SEC filing HTML into typed semantic elements (titles, text, tables, images, sections). |
| Section detection | Identify Part/Item boundaries for all supported filing types. |
| Hierarchical tree construction | Build a navigable parent-child tree from the flat element list using composable nesting rules. |
| Stable element addressing | Assign deterministic segment_id to every element for provenance and cross-period alignment. |
| TOC data structure | Produce a serializable TOC with section IDs, hierarchy, and boundary data. |
| Sentence segmentation | Split text elements into sentences with inherited segment IDs, using filing-aware boundary detection. |
| Table classification | Classify tables by purpose (financial statement, footnote, layout, TOC, etc.). |
| Prose-table relationship | Determine whether prose adjacent to a table is substantive or a stub reference. |
| Text normalization | Collapse whitespace, strip page artifacts, normalize quotes/dashes for diff-ready output. |
| Cross-reference detection | Identify "see Note X" / "as described in Item Y" patterns and produce link targets. |
| Rendering hints | Expose text style properties, table metrics, and layout metadata for frontend rendering. |
| Boilerplate flagging (structural) | Use statistical methods (repeated text across filing, identical to prior period) to flag likely boilerplate at the element level. This is cheap and deterministic — no LLM needed. |
| HTML pre-processing | Normalize messy SEC HTML (non-breaking spaces, nested empty tags, <font> tags, inconsistent <br> vs <p>) before parsing. |
| Responsibility | Owner | Detail |
|---|---|---|
| Ontology definition | NLP Pipeline | The taxonomy of business drivers (revenue, concentration risk, KPI metrics, etc.) is defined and maintained outside the parser. |
| Sentence labeling | NLP Pipeline | is_boilerplate, has_signal, and business driver tags are applied by downstream LLM/ML layers using the parser's addressed output. |
| Evidence card construction | NLP Pipeline | Synthesizing fact-spans into cards with mechanical insights is an LLM task consuming parser output. |
| Insight generation | NLP Pipeline | Surfacing accounting, timing, and mix insights from labeled sentences is downstream analysis. |
| XBRL fact extraction | edgartools | Tagged financial facts, line items, and dimensional data are handled by edgartools. |
| Financial statement parsing | edgartools | Structured extraction of income statements, balance sheets, cash flows into dataframes. |
| KPI / metric extraction | edgartools | Numeric processing of financial metrics from XBRL-tagged data. |
| Frontend rendering | Reflowed Reader | The reader app consumes the parser's tree, TOC, and rendering hints but owns all UI/UX. |
| LLM inference | NLP Pipeline | All model calls for classification, summarization, or synthesis happen outside the parser. |
The parser produces structure, addresses, and metadata. It does not produce judgments, labels, or analysis. The one exception is structural boilerplate detection (repeated identical text), which is statistical and deterministic.
| Filing Type | Effort | Value | Rationale |
|---|---|---|---|
| 8-K (Current Reports) | Moderate | High | Event-driven disclosures (CEO departures, restatements, acquisitions, covenant breaches). Highest signal density per page. Defined item structure (Items 1.01–9.01) with two-decimal format. Variable item sets per filing. Needs FilingSectionsIn8K and TopSectionManagerFor8K with relaxed matching. Most pipeline steps work as-is. |
| DEF 14A (Proxy Statements) | High | High | Executive compensation tables, shareholder proposals, governance disclosures. Rich NLP content but complex structure — doesn't follow rigid Part/Item format. Nested exhibits. Requires significant new section detection heuristics. |
| S-1 / S-3 (Registration Statements) | Moderate | Medium | Similar Part/Item structure to 10-K with additional sections (Use of Proceeds, Dilution, etc.). Valuable for IPO analysis. Straightforward to add after 8-K. |
| 13-F (Institutional Holdings) | Low | Low | Essentially one big table of holdings. The interesting parsing is already handled by edgartools. Deprioritize unless cover page narrative extraction is needed. |
Each new filing type requires:
- A `FilingSectionsInXX` definition in `top_section_title_types.py`
- A `TopSectionManagerForXX` subclass with appropriate regex patterns
- An `EdgarXXParser` subclass in `core.py` with its step configuration
- Section-specific test fixtures and accuracy benchmarks
Recommendation: Integrate spaCy with a custom sentencizer component tuned for SEC filings.
Why spaCy over naive splitting:
- SEC filings contain dense abbreviations that break naive regex splitting: "Inc.", "No.", "vs.", "Reg.", "approx.", "Corp.", "Ltd.", "Jr.", "Sr."
- Note references ("see Note 5.") are not sentence ends in context
- Legal citations ("Section 13(a) of the Exchange Act") contain periods that aren't boundaries
- Table stub sentences end with colons, not periods
Implementation approach:
- Add spaCy as an optional dependency (sentence segmentation is opt-in)
- Create a `SentenceSegmenter` processing step that runs after `TextElementMerger`
- Use spaCy's `en_core_web_sm` model with a custom rule-based `sentencizer` pipeline component that adds SEC-specific abbreviation exceptions
- Each sentence gets the parent element's `segment_id` plus `:s{index}` suffix
- Produce `SentenceElement` objects (new type) as children of the parent `TextElement`
- Fallback: if spaCy is not installed, provide a regex-based splitter with known limitations
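The regex fallback could look like this minimal sketch. The function and constant names are hypothetical, and the abbreviation set is drawn from the list above; a real implementation would need a larger exception list and handling for note references:

```python
import re

# Hypothetical fallback splitter (not part of the current codebase).
SEC_ABBREVIATIONS = {
    "Inc.", "No.", "vs.", "Reg.", "approx.", "Corp.", "Ltd.", "Jr.", "Sr.",
}

# Candidate boundary: sentence-ending punctuation, whitespace, then an
# uppercase letter. Zero-width lookarounds keep the punctuation attached.
_CANDIDATE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text: str) -> list[str]:
    """Naive splitter with SEC-specific abbreviation exceptions."""
    out = []
    for part in _CANDIDATE.split(text):
        prev_token = out[-1].rsplit(None, 1)[-1] if out else ""
        # Re-join when the previous fragment ends in a known abbreviation,
        # since that period was not a real sentence boundary.
        if out and prev_token in SEC_ABBREVIATIONS:
            out[-1] = out[-1] + " " + part
        else:
            out.append(part)
    return out
```

This intentionally trades recall for simplicity; the known limitation is exactly the class of cases (legal citations, colon-terminated table stubs) that motivates the spaCy path.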
Alternative considered: pySBD (Python Sentence Boundary Disambiguation). Lighter than spaCy, rule-based, handles abbreviations well. Could be the better choice if we want to avoid the spaCy dependency weight. Worth prototyping both.
Problem: SEC filings have notoriously messy HTML — non-breaking spaces (`&nbsp;`) everywhere, nested empty `<span>` tags, `<font>` tags with inline styles, inconsistent `<br>` vs `<p>` usage, invisible zero-width characters.
Implementation: Add an HtmlNormalizer that runs before HtmlTagParser:
- Collapse multiple `&nbsp;` into single spaces
- Strip empty wrapper elements (`<span></span>`, `<font></font>`, `<div></div>`) that add no styling
- Normalize `<br/><br/>` sequences to `<p>` boundaries
- Strip zero-width characters and other invisible Unicode
- Preserve all attributes and styles on elements that carry them
Problem: Boilerplate sentences (risk factor disclaimers, safe harbor statements, standard legal text) appear identically across filings and across periods. Detecting these statistically is cheap and accurate — no LLM needed.
Implementation approach (two layers):
- Cross-filer frequency: Hash each sentence (after normalization). Maintain a frequency table across a corpus. Sentences appearing in >N filings from different CIKs are boilerplate. This runs offline to build the hash table; the parser just looks up each sentence.
- Same-filer period-over-period: Compare sentences between the current filing and the prior period for the same filer. Identical sentences are "unchanged boilerplate." Changed sentences are higher-signal.
Parser responsibility: The parser provides normalized, addressed sentences. Boilerplate detection is a lookup table operation on that output. The parser could ship with a pre-built boilerplate hash table as a data asset, or this can live entirely downstream.
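The two layers reduce to set lookups over normalized sentence hashes. A minimal sketch, assuming hypothetical names (`BoilerplateIndex`, `sentence_hash`) and a table built offline:

```python
import hashlib

def normalize(sentence: str) -> str:
    # Lowercase and collapse whitespace so formatting noise doesn't defeat hashing.
    return " ".join(sentence.lower().split())

def sentence_hash(sentence: str) -> str:
    return hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()

class BoilerplateIndex:
    def __init__(self, corpus_hashes: set[str], prior_period_hashes: set[str]):
        self.corpus_hashes = corpus_hashes              # cross-filer layer
        self.prior_period_hashes = prior_period_hashes  # same-filer layer

    def classify(self, sentence: str) -> str:
        h = sentence_hash(sentence)
        if h in self.corpus_hashes:
            return "boilerplate"            # appears across many CIKs
        if h in self.prior_period_hashes:
            return "unchanged_boilerplate"  # identical to prior period
        return "signal_candidate"           # changed or novel text
```

Whether the index ships with the parser or lives downstream, the parser-side contract is only the normalized sentence text.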
SEC filings contain extensive footnotes to financial statements (Note 1 through Note 20+). These are structurally distinct — subsections within Item 8, each with a numbered title and content.
Implementation:
- Add a `FootnoteClassifier` step that recognizes "Note N" or "NOTE N" patterns within financial statement sections
- Produce `FootnoteElement` (new type) with a `.note_number` attribute
- Improves TOC (footnotes listed as collapsible children of Item 8)
- High-value for NLP pipeline (footnotes contain dense, high-signal content about accounting policies, contingencies, and segment data)
Filings end with exhibits (certifications, consent letters, press releases). Most are boilerplate but some (Exhibit 99.1 in 8-Ks — earnings press releases) contain material content.
Implementation:
- Add an `ExhibitClassifier` step that detects exhibit boundaries ("EXHIBIT X.X", "Exhibit Index")
- Classify as material vs. boilerplate based on exhibit number conventions
- Allows the reader to collapse/hide boilerplate exhibits while surfacing material ones
Problem: Re-parsing the same filing during dev iteration or ontology updates is wasteful.
Implementation: Content-hash-based cache. The tree is deterministic given the same HTML input and pipeline configuration. Store serialized SemanticTree keyed by hash(html_content + pipeline_config_hash). Avoid re-parsing on cache hit.
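A minimal sketch of that cache, assuming hypothetical names (`ParseCache`, `get_or_parse`) and pickle serialization standing in for whatever the real `SemanticTree` serialization ends up being:

```python
import hashlib
import json
import pathlib
import pickle

class ParseCache:
    """Content-hash disk cache: same HTML + same config → same tree."""

    def __init__(self, cache_dir: str = ".parse_cache"):
        self.dir = pathlib.Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def _key(self, html: str, pipeline_config: dict) -> str:
        # sort_keys makes the config hash order-independent.
        config_blob = json.dumps(pipeline_config, sort_keys=True)
        return hashlib.sha256((html + config_blob).encode("utf-8")).hexdigest()

    def get_or_parse(self, html: str, pipeline_config: dict, parse_fn):
        path = self.dir / f"{self._key(html, pipeline_config)}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # cache hit: skip parsing
        tree = parse_fn(html)
        path.write_bytes(pickle.dumps(tree))
        return tree
```

The key point is that the pipeline configuration participates in the hash, so changing step order or thresholds correctly invalidates old entries.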
The library follows a Pipeline + Strategy pattern:
Raw HTML
→ HtmlTagParser (BeautifulSoup4 wrapper)
→ Flat list of HtmlTag objects
→ 14-step sequential processing pipeline
→ Each step classifies/reclassifies elements
→ Flat list of typed AbstractSemanticElement
→ TreeBuilder (stack-based, composable nesting rules)
→ SemanticTree of TreeNode objects
Key architectural properties:
- Each element starts as `NotYetClassifiedElement` and is progressively reclassified
- Steps are composable: injectable, replaceable, orderable via `get_steps` callable
- `_NUM_ITERATIONS` enables multi-pass behavior per step (used for statistical steps)
- `types_to_process` / `types_to_exclude` filters scope each step precisely
- Errors are non-fatal: caught per-element as `ErrorWhileProcessingElement`
- `CompositeSemanticElement` handles mixed-content containers (prose + table in one tag)
- `HtmlTag` caching layer makes repeated access O(1) over immutable HTML
Dependencies: BeautifulSoup4, lxml, pandas, loguru, cssutils, xxhash, frozendict.
AbstractSemanticElement
├── NotYetClassifiedElement
├── ErrorWhileProcessingElement
├── IrrelevantElement
│ ├── EmptyElement
│ ├── PageNumberElement
│ ├── PageHeaderElement
│ └── IntroductorySectionElement
├── TextElement (with DictTextContentMixin)
├── SupplementaryText (with DictTextContentMixin)
├── ImageElement
├── HighlightedTextElement (carries TextStyle)
├── TitleElement (AbstractLevelElement, with level: int)
├── TopSectionStartMarker (AbstractLevelElement, with section_type: TopSectionInFiling)
│ └── TopSectionTitle (also has DictTextContentMixin)
├── TableElement
│ └── TableOfContentsElement
└── CompositeSemanticElement (container with inner_elements tuple)
Every element exposes: .text, .html_tag, .get_source_code(), .get_summary(), .to_dict(), .processing_log, .contains_words(), .create_from_element().
| # | Step Class | What It Does | Quality | Goal Relevance |
|---|---|---|---|---|
| 1 | `IndividualSemanticElementExtractor` | Splits mixed-content containers into `CompositeSemanticElement` | Good | G3, G4 |
| 2 | `ImageClassifier` | Detects `<img>` tags → `ImageElement` | Fine | — |
| 3 | `EmptyElementClassifier` | Tags empty elements as `EmptyElement` (filtered from output) | Fine | G1 |
| 4 | `TableClassifier` | Tags `<table>` with >1 row as `TableElement` | Functional, needs enhancement | G3 |
| 5 | `TableOfContentsClassifier` | Reclassifies `TableElement` to `TableOfContentsElement` if "page" found in cells | Too simplistic | G2, G3 |
| 6 | `TopSectionManagerFor10Q/10K` | Two-pass regex match for "Part [I-IV]" and "Item [N]" | Core value-add, some brittleness | G1, G2 |
| 7 | `IntroductorySectionElementClassifier` | Marks everything before "Part 1" as `IntroductorySectionElement` | Fine | G1 |
| 8 | `TextClassifier` | Catch-all: remaining elements with text become `TextElement` | Fine | G1 |
| 9 | `HighlightedTextClassifier` | Analyzes inline CSS → `HighlightedTextElement` with `TextStyle` | Good for 2015+ | G2, G9 |
| 10 | `SupplementaryTextClassifier` | Reclassifies parenthetical text, italic notes, "accompanying notes" refs | Too narrow | G4 |
| 11 | `PageHeaderClassifier` | Statistical: short text appearing 5+ times → `PageHeaderElement` | Clever, effective | G1 |
| 12 | `PageNumberClassifier` | Statistical: digit-bearing short text appearing 5+ times → `PageNumberElement` | Clever, effective | G1 |
| 13 | `TitleClassifier` | Converts `HighlightedTextElement` → `TitleElement` with hierarchy level | Fragile — core concern | G1, G2 |
| 14 | `TextElementMerger` | Merges adjacent `TextElement` separated by `IrrelevantElement` gaps | Important for XBRL text | G1, G5 |
Stack-based algorithm with three default nesting rules:
- `AlwaysNestAsParentRule(TopSectionStartMarker)` — Part/Item sections parent everything
- `AlwaysNestAsParentRule(TitleElement, exclude_children={TopSectionStartMarker})` — Titles parent non-title content
- `NestSameTypeDependingOnLevelRule()` — Same-type elements nest by level
Strengths: Composable rule system, depth-first iteration, ASCII rendering for debugging. Gaps: No section boundary data, no TOC serialization, no query/filter APIs, destructive stack popping.
TopSectionInFiling dataclass: identifier, title, order, level.
- 10-Q: 12 sections (Part I Items 1–4, Part II Items 1–6, plus Part headings)
- 10-K: 27 sections (Part I–IV, Items 1–16 including 1c Cybersecurity)
- Detection: `TableClassifier` + `TableCheck`
- Metrics: `get_approx_table_metrics()` → `ApproxTableMetrics` (rows, numbers)
- Markdown: `TableToMarkdown.convert()` — unmerges colspan, parses via `pd.read_html`
- DataFrame: `TableParser.parse_as_df()` — naive, parses only first table
- TOC detection: `check_table_contains_text_page()` — exact match only
- Summary: `TableElement.get_summary()` — has f-string bug (see Flaws)
TextStyle frozen dataclass: is_all_uppercase, bold_with_font_weight, italic, centered, underline. All derived from inline CSS with 80% threshold. Adequate for 2015+ filings. Thresholds not configurable.
| Function / Class | File | Status |
|---|---|---|
| `TreeBuilder.build()` | `tree_builder.py` | Needs enhancement: no section boundary data, no TOC output, stack-popping bug |
| `TopSectionManager._process_element()` | `top_section_manager.py` | Needs enhancement: regex brittleness, formatting variant intolerance, dead code |
| `TopSectionTitle` + `TopSectionStartMarker` | `top_section_title.py`, `top_section_start_marker.py` | Usable as-is. `.section_type.identifier` is a natural segment address root |
| `TitleClassifier._process_element()` | `title_classifier.py` | Needs fix: order-of-first-occurrence level assignment is fragile |
| `TableClassifier._process_element()` | `table_classifier.py` | Needs enhancement: no table type classification |
| `IndividualSemanticElementExtractor` | `individual_semantic_element_extractor.py` | Good as-is for splitting mixed containers |
| `AbstractSemanticElement.text` / `.get_source_code()` | `abstract_semantic_element.py` | Usable. Needs normalized text output for diffing |
| `SemanticTree.nodes` | `semantic_tree.py` | Usable as-is |
| `CompositeSemanticElement.unwrap_elements()` | `composite_semantic_element.py` | Usable as-is |
| Function / Class | File | Status |
|---|---|---|
| `HighlightedTextClassifier._process_element()` | `highlighted_text_classifier.py` | Good for 2015+ filings |
| `TextStyle.from_style_and_text()` | `highlighted_text_element.py` | Good. Make thresholds configurable |
| `PageHeaderClassifier._process_element()` | `page_header_classifier.py` | Good. Statistical approach is robust |
| `PageNumberClassifier._process_element()` | `page_number_classifier.py` | Good. Same statistical approach |
| `TextElementMerger._process_elements()` | `text_element_merger.py` | Important for XBRL-split text fragments |
| `SupplementaryTextClassifier._process_element()` | `supplementary_text_classifier.py` | Too narrow for general stub detection |
| `HtmlTag` (entire class) | `html_tag.py` | Good. Needs byte/character offset exposure |
| Function / Class | File | Status |
|---|---|---|
| `ImageClassifier` | `image_classifier.py` | Fine as-is |
| `TableElement.table_to_markdown()` | `table_element.py`, `table_to_markdown.py` | Usable for diff-friendly table output |
| `TableOfContentsClassifier` | `table_of_contents_classifier.py` | Too simplistic to rely on |
| Processing log system | `processing_log.py` | Fine as-is |
| Nesting rules | `nesting_rules.py` | Good composable design |
| Function / Class | File | Why Low Value |
|---|---|---|
| `TableParser.parse_as_df()` | `table_parser.py` | Parses only first table, naive headers, fragile $/% merging |
| `get_approx_table_metrics()` | `approx_table_metrics.py` | Rough row/number counts only; insufficient for table classification |
| `check_table_contains_text_page()` | `table_check_data_cell.py` | Exact-match only, misses common variants |
| `ParsingOptions` | `types.py` | Single field (`html_integrity_checks`). Severely underutilized |
TitleClassifier assigns level 0 to the first unique TextStyle encountered, level 1 to the second, etc. If a styled disclaimer appears before the first real heading, all levels shift. PageHeaderClassifier and SupplementaryTextClassifier catch some cases but edge cases remain (bold legal notices, centered company names, styled exhibit references).
Impact: Incorrect TOC nesting (G2), unreliable segment hierarchy (G1).
Fix: Two-pass with style scoring. Pass 1: collect all unique styles. Score by visual weight: (bold * 3) + (uppercase * 2) + (centered * 2) + (underline * 1) - (italic * 1). Pass 2: assign levels by score rank.
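The proposed scoring can be sketched directly. `TextStyle` below mirrors the library's frozen dataclass fields described later in this document; the function names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextStyle:
    bold: bool = False
    uppercase: bool = False
    centered: bool = False
    underline: bool = False
    italic: bool = False

def visual_weight(s: TextStyle) -> int:
    # The scoring formula from the fix description above.
    return (s.bold * 3) + (s.uppercase * 2) + (s.centered * 2) + (s.underline * 1) - (s.italic * 1)

def assign_levels(styles: set[TextStyle]) -> dict[TextStyle, int]:
    """Pass 2: heaviest style becomes level 0, next heaviest level 1, ..."""
    ranked = sorted(styles, key=visual_weight, reverse=True)
    return {style: level for level, style in enumerate(ranked)}
```

Unlike order-of-first-occurrence, this is stable: a styled disclaimer appearing before the first real heading no longer shifts every level, because levels depend only on the set of styles, not their order of appearance.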
Elements have no id, index, or content-hash identifier. Identified only by Python object identity. to_dict() produces html_hash via xxhash on HtmlTag but it's not propagated as a stable element ID.
Impact: No segment addresses (G1), no TOC anchors (G2), no cross-period alignment (G5).
Fix: Post-tree-building step assigning hierarchical IDs. TopSectionTitle uses .section_type.identifier. TitleElement derives from parent + slugified title. Content elements use parent ID + type + index.
TableClassifier produces a single TableElement type. No distinction between financial statements, footnotes, layout tables, exhibit lists, or schedules.
Impact: G3 requires classified inventory. G4 depends on table type for prose-stub logic.
Fix: TableTypeClassifier step. Heuristics: $ markers and numeric columns = financial statement; smaller tables in footnote sections = footnote; single-column or text-heavy = layout.
No step examines the relationship between a TextElement and adjacent TableElement. SupplementaryTextClassifier catches only a few specific patterns.
Impact: G4 cannot distinguish substantive prose from stub references.
Fix: TableContextClassifier step. Classify preceding text as stub-reference, substantive-context, or standalone.
TreeNode has no start/end index, character offset, or document-position information.
Impact: G2's Intersection Observer needs scroll-to-section mapping. Without positional data the frontend must re-walk the DOM.
Fix: Track start_index, end_index, char_offset during tree construction. Expose tree.get_section_map().
The while stack loop in TreeBuilder._find_parent_node() pops elements that don't match as parents. If element N can't find a parent, it destroys context that element N+1 might need.
Impact: After standalone tables or images, accumulated title context is lost. Subsequent elements become root nodes. Produces broken trees for G2 and incorrect addresses for G1.
Fix: Non-destructive backward search. Only trim the stack after finding a match or determining root status.
| ID | Flaw | Location | Impact |
|---|---|---|---|
| m1 | Missing f-string in `TableElement.get_summary()` — returns literal `"{len(self.text)}"` instead of the count | `table_element.py:23` | Bug |
| m2 | Dead code: module-level docstrings with "ChatGPT improved version" | `top_section_manager.py:347-379` | Clutter |
| m3 | All thresholds hard-coded, none exposed via `ParsingOptions` | Multiple files | Inflexibility |
| m4 | `match_part()` roman numeral map limited to I–IV | `top_section_manager.py` | Blocks G8 for some filing types |
| m5 | Page header/number detection requires 5+ occurrences; fails on short filings | `page_header_classifier.py`, `page_number_classifier.py` | Edge case for amendments |
| m6 | `render_()` has quadratic prefix growth for deeply nested trees | `render_.py:98` | Performance |
| m7 | Single-pass pipeline cannot reclassify elements once typed | Pipeline design | Table captions misclassified as titles |
| m8 | `TableOfContentsClassifier` only matches exact {"page", "page no.", "page number"} | `table_check_data_cell.py` | Missed TOC tables |
| m9 | `CompositeSemanticElement.unwrap_elements()` loses container context | `composite_semantic_element.py` | Affects G4 prose-table sibling detection |
These are blocking improvements required before meaningful downstream integration.
Add a post-tree-building step that assigns a deterministic segment_id to every element.
Format: {section_identifier}:{title_slug}:{element_type}_{index}
Examples: part2item7:revenue_recognition:text_3, part1item1:financial_statements:table_1
For TopSectionTitle, use .section_type.identifier. For TitleElement, slugify title text. For content elements, parent ID + type + index.
Enables: G1, G2, G5. This is the provenance anchor for the entire NLP pipeline.
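A sketch of the assignment rules above. `Node` is a hypothetical stand-in for `TreeNode`, and the field names are illustrative; the counter key is (parent, type) so indices restart per parent and per element type:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "top_section" | "title" | "text" | "table" ...
    text: str = ""
    section_identifier: str = ""   # set for top_section nodes, e.g. "part2item7"
    children: list = field(default_factory=list)
    segment_id: str = ""

def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")

def assign_segment_ids(node: Node, parent_id: str = "", counters=None) -> None:
    counters = {} if counters is None else counters
    if node.kind == "top_section":
        node.segment_id = node.section_identifier          # section root
    elif node.kind == "title":
        node.segment_id = f"{parent_id}:{slugify(node.text)}"
    else:
        key = (parent_id, node.kind)                       # per-parent, per-type index
        counters[key] = counters.get(key, 0) + 1
        node.segment_id = f"{parent_id}:{node.kind}_{counters[key]}"
    for child in node.children:
        assign_segment_ids(child, node.segment_id, counters)
```

Because the IDs derive only from structure and title text, re-parsing the same filing (or the next period's filing with the same headings) yields the same addresses, which is what makes cross-period alignment possible.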
Convert TitleClassifier to two-pass (_NUM_ITERATIONS = 2). Pass 1: collect all unique TextStyle instances. Score by visual weight. Pass 2: assign levels by score rank.
Scoring: (bold * 3) + (uppercase * 2) + (centered * 2) + (underline * 1) - (italic * 1).
Enables: G1, G2 — correct TOC hierarchy and segment nesting.
Add the f prefix to the fallback return string in TableElement.get_summary(). Trivial.
Track start_index and end_index (position in flat element list) on each TreeNode during TreeBuilder.build(). Compute cumulative character offsets.
Expose SemanticTree.get_section_boundaries() → list[(segment_id, start_idx, end_idx, char_start, char_end)].
Enables: G2 — Intersection Observer scroll-to-section mapping, progress-within-section display.
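The boundary computation is a prefix-sum over the flat element list. In this sketch `texts` holds each element's text in document order and `section_starts` maps flat index to the segment_id of the section beginning there; both names are illustrative:

```python
def section_boundaries(texts, section_starts):
    """Return [(segment_id, start_idx, end_idx, char_start, char_end)]."""
    # Cumulative character offsets: offsets[i] is where element i begins.
    offsets, pos = [], 0
    for t in texts:
        offsets.append(pos)
        pos += len(t)
    offsets.append(pos)  # sentinel: total character count

    starts = sorted(section_starts)
    out = []
    for i, start in enumerate(starts):
        # A section runs until the next section starts (or the document ends).
        end = starts[i + 1] if i + 1 < len(starts) else len(texts)
        out.append((section_starts[start], start, end - 1,
                    offsets[start], offsets[end]))
    return out
```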
Add SemanticTree.to_toc() method producing a serializable TOC.
Each entry: segment_id, title, level, element_type, parent_id, child_count, has_tables, content_length.
Enables: G2 — the floating pill UI's expandable TOC list needs exactly this data.
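A sketch of the serialization, flattening the tree into the entry shape listed above. `Node` here is a hypothetical stand-in for `TreeNode` with only the fields the TOC needs:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    segment_id: str
    title: str
    level: int
    element_type: str
    children: list = field(default_factory=list)
    has_tables: bool = False
    content_length: int = 0

def to_toc(root: Node, parent_id=None, out=None) -> list[dict]:
    """Depth-first flatten into JSON-serializable TOC entries."""
    out = [] if out is None else out
    out.append({
        "segment_id": root.segment_id, "title": root.title,
        "level": root.level, "element_type": root.element_type,
        "parent_id": parent_id, "child_count": len(root.children),
        "has_tables": root.has_tables, "content_length": root.content_length,
    })
    for child in root.children:
        to_toc(child, root.segment_id, out)
    return out
```

A flat list with `parent_id` links (rather than a nested structure) keeps the frontend free to render either a flat pill list or an expandable tree.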
Replace the destructive stack.pop() loop in _find_parent_node() with a backward index scan. Only trim after finding a match or determining root status.
Enables: G1, G2 — fewer orphaned elements, more robust tree building.
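The difference between the destructive and non-destructive search can be shown in a few lines. This sketch uses `(level, node)` stack entries for illustration; the real stack holds tree nodes:

```python
def find_parent(stack, element_level):
    """Return (parent_or_None, trimmed_stack) without destroying context."""
    for i in range(len(stack) - 1, -1, -1):        # backward scan, no pops
        level, node = stack[i]
        if level < element_level:                  # found the enclosing parent
            return node, stack[: i + 1]            # trim only past the match
    return None, []                                # root element: safe to reset
```

Because nothing is popped during the scan, an element that turns out to be a root (or a misclassified stray) no longer wipes out the title context that its siblings still need.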
Add `HtmlNormalizer` before `HtmlTagParser`: collapse `&nbsp;` runs, strip empty wrappers, normalize `<br/>` sequences, strip zero-width characters.
Enables: All goals — cleaner input improves all downstream classification accuracy.
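A minimal string-level sketch of those transformations. The real step would operate on the parsed tag tree rather than raw strings, and `normalize_html` is a hypothetical name:

```python
import re

ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"  # ZWSP, ZWNJ, ZWJ, BOM

def normalize_html(html: str) -> str:
    html = html.replace("&nbsp;", " ").replace("\u00a0", " ")  # non-breaking spaces
    html = re.sub(f"[{ZERO_WIDTH}]", "", html)                 # invisible Unicode
    html = re.sub(r"<(span|font|div)>\s*</\1>", "", html)      # empty wrappers
    html = re.sub(r"(?:<br\s*/?>\s*){2,}", "</p><p>", html)    # <br><br> → paragraph break
    return re.sub(r"  +", " ", html)                           # collapse space runs
```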
Add FilingSectionsIn8K (Items 1.01–9.01), TopSectionManagerFor8K (two-decimal item format), Edgar8KParser.
Enables: G8 — highest-signal filing type after 10-K/10-Q.
Integrate spaCy (or pySBD as lighter alternative) with SEC-specific abbreviation rules. Produce SentenceElement objects as children of TextElement, each inheriting the parent's segment_id with :s{index} suffix.
Enables: G6 — sentence-level granularity for the NLP pipeline.
- Build a corpus of known-tricky filings (Berkshire Hathaway, Tesla, SPACs, bank holding companies)
- Manually annotate expected section boundaries for accuracy scoring
- Add cross-period consistency tests (same company across 3–4 years — structural diffs are either real changes or parser errors)
- Sentence-level accuracy tests for SEC-specific edge cases (abbreviations, note references, legal citations)
- Snapshot tests for regression detection
Enables: Quantitative accuracy tracking, confidence in parser output quality.
These improve quality and enable secondary goals but don't block initial integration.
Add TableTypeClassifier step. Classify each TableElement as: FINANCIAL_STATEMENT, FOOTNOTE, LAYOUT, TOC, SCHEDULE, OTHER.
Heuristics: $ markers + numeric columns → financial; smaller tables in footnote sections → footnote; single-column or text-heavy → layout.
Enables: G3.
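A sketch of those heuristics over raw cell text. The function name and thresholds are hypothetical, and real rules would need tuning against a labeled table corpus:

```python
import re

def classify_table(rows, in_footnote_section=False):
    """rows: list of rows, each a list of cell strings."""
    cells = [c for row in rows for c in row]
    text = " ".join(cells)
    # Cells made only of digits, $, %, parens, commas, etc. count as numeric.
    numeric = sum(bool(re.fullmatch(r"[$()%,.\d\s-]+", c)) for c in cells if c.strip())
    if "$" in text and numeric >= len(cells) // 2:
        return "FOOTNOTE" if in_footnote_section and len(rows) <= 10 else "FINANCIAL_STATEMENT"
    if all(len(row) == 1 for row in rows):
        return "LAYOUT"                       # single-column: layout scaffolding
    if re.search(r"\bpage\b", text, re.I):
        return "TOC"
    return "OTHER"
```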
Add TableContextClassifier step. For each TableElement, classify preceding text as:
- Stub: length < 120 chars AND matches "following table", "table below", "as follows", etc.
- Substantive: preceding text > 120 chars with analytical content
- Standalone: no preceding text, or preceding element is another table/title
Enables: G4.
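The three-way test above reduces to a short function. This sketch uses the 120-character threshold and patterns from the list; the names are illustrative:

```python
import re

STUB_PATTERNS = re.compile(
    r"(following table|table below|as follows|see (the )?table)", re.IGNORECASE
)

def classify_preceding_text(text, prev_is_table_or_title=False):
    if not text or prev_is_table_or_title:
        return "standalone"    # no prose, or preceded by another table/title
    if len(text) < 120 and STUB_PATTERNS.search(text):
        return "stub"          # discard prose, present table only
    return "substantive"       # keep as context
```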
Broaden patterns: "Refer to [Note/table/section]...", "The following [table/chart]...", single-sentence paragraphs ending with colons, parenthetical references like "(see table X)".
Enables: G4.
Add SemanticTree.to_diff_text() — one line per element: {segment_id}\t{element_type}\t{normalized_text}. Normalize: collapse whitespace, strip page artifacts, normalize quotes/dashes. Tables use table_to_markdown() output.
Enables: G5.
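A sketch of the normalization and one-line-per-element serialization. Here each element is a `(segment_id, element_type, text)` triple for illustration:

```python
import re

def normalize_text(text: str) -> str:
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # curly quotes
    text = text.replace("\u2013", "-").replace("\u2014", "-")  # en/em dashes
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

def to_diff_text(elements) -> str:
    # One tab-separated line per element keeps standard line diff tools usable.
    return "\n".join(
        f"{segment_id}\t{element_type}\t{normalize_text(text)}"
        for segment_id, element_type, text in elements
    )
```

Because each line carries its segment_id, a plain `diff` of two periods' outputs already localizes every change to a section address.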
Regex-based detector for "see Note N", "as described in Item N", "refer to Part N" patterns. Produce link targets using segment_id of the referenced section.
Enables: G7.
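A sketch of the detector. The regex covers the patterns listed above, and the target-key scheme (`note5`, `item7a`) is a hypothetical convention that a resolution step would map onto real segment_ids:

```python
import re

XREF = re.compile(
    r"(?:see|refer to|as described in)\s+(Note|Item|Part)\s+(\d+[A-Za-z]?)",
    re.IGNORECASE,
)

def find_cross_references(text, segment_id):
    """Yield (source_segment_id, matched_text, target_key) triples."""
    for m in XREF.finditer(text):
        target_key = f"{m.group(1).lower()}{m.group(2).lower()}"  # e.g. "note5"
        yield (segment_id, m.group(0), target_key)
```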
FootnoteClassifier step recognizing "Note N" / "NOTE N" patterns within financial statement sections. Produce FootnoteElement with .note_number. Improves TOC and provides high-signal NLP content.
Enables: G2, G6.
Expand ParsingOptions to expose all hardcoded thresholds (style %, bold weight, page header occurrences, table row minimum). Pass through pipeline so each step reads from config.
Enables: Tuning for different filer styles without modifying internals.
Expose on each element: is_scrollable_table (row count > threshold), text style as rendering directive (not just classification signal), table dimensions. Allow the reader to consume structural hints directly.
Enables: G9.
These are worth investigating but may not justify the effort, or may be better solved outside the parser.
Ship a pre-built hash table of known boilerplate sentences (built from a large corpus). Parser looks up each normalized sentence and flags matches. Cheap, deterministic, high-accuracy.
Trade-off: Requires corpus construction and maintenance. May be better as a downstream data asset than a parser feature. Alternatively, the parser provides the normalized sentences and a separate boilerplate service does the lookup.
Complex structure without rigid Part/Item format. Requires new section detection heuristics (compensation tables, proposal numbering, governance sections). High value but high effort.
Similar to 10-K with additional sections. Moderate effort. Lower priority than 8-K and DEF 14A.
ExhibitClassifier step detecting "EXHIBIT X.X" boundaries. Classify as material vs. boilerplate by exhibit number convention (99.1 = material, 31.x/32.x = boilerplate certifications).
Case-insensitive matching, support for "Pg.", "PAGE", whitespace tolerance. Second heuristic: tables with 5+ rows where one column has section-title text and another has numbers → likely TOC.
Content-hash-based disk cache for SemanticTree. Keyed by hash(html_content + pipeline_config_hash). Avoids re-parsing during dev iteration.
Remove module-level algorithm descriptions at end of top_section_manager.py. Low priority but trivial.
Address m7: TitleClassifier promotes HighlightedTextElement to TitleElement even when the styled text is a table caption. Add a post-classification check: if a TitleElement immediately precedes a TableElement and is short, reclassify as TableCaptionElement.
Before committing to spaCy for sentence segmentation, prototype both approaches against a set of SEC-specific test cases. pySBD is lighter (pure Python, no model download) and may be sufficient for the rule-based abbreviation handling we need. spaCy brings more power (NER, dependency parsing) that could be useful for cross-reference detection but adds ~100MB of model weight.
| ID | Improvement | Category | Effort | Primary Goals |
|---|---|---|---|---|
| C1 | Stable hierarchical element IDs | Core | Medium | G1, G2, G5 |
| C2 | Fix title level assignment scoring | Core | Medium | G1, G2 |
| C3 | Fix f-string bug | Core | Trivial | Bug fix |
| C4 | Section boundary data on tree nodes | Core | Medium | G2 |
| C5 | TOC data structure generation | Core | Medium | G2 |
| C6 | Non-destructive tree builder stack | Core | Low | G1, G2 |
| C7 | HTML pre-processing | Core | Medium | All |
| C8 | 8-K filing type support | Core | Medium | G8 |
| C9 | Sentence boundary detection | Core | Medium | G6 |
| C10 | Expanded test suite & accuracy benchmarks | Core | High | All |
| G2-1 | Table type classification | Good to Have | Medium | G3 |
| G2-2 | Table-prose context classifier | Good to Have | Medium | G4 |
| G2-3 | Expand supplementary text rules | Good to Have | Low | G4 |
| G2-4 | Diff-ready serialization | Good to Have | Medium | G5 |
| G2-5 | Cross-reference detection | Good to Have | Medium | G7 |
| G2-6 | Footnote detection | Good to Have | Medium | G2, G6 |
| G2-7 | Configurable thresholds | Good to Have | Low | All |
| G2-8 | Rendering hint metadata | Good to Have | Low | G9 |
| T1 | Structural boilerplate hash table | To Consider | High | G6 |
| T2 | DEF 14A filing support | To Consider | High | G8 |
| T3 | S-1/S-3 filing support | To Consider | Medium | G8 |
| T4 | Exhibit boundary detection | To Consider | Medium | G2, G9 |
| T5 | Improve TOC classifier | To Consider | Low | G2 |
| T6 | Parsed output caching | To Consider | Medium | Performance |
| T7 | Clean up dead code | To Consider | Trivial | Maintenance |
| T8 | Table caption disambiguation | To Consider | Low | G2, G3 |
| T9 | pySBD vs. spaCy evaluation | To Consider | Low | G6 |
| Component | File |
|---|---|
| Public API | sec_parser/__init__.py |
| Parser entry point | sec_parser/processing_engine/core.py |
| HTML tag wrapper | sec_parser/processing_engine/html_tag.py |
| HTML tag parser | sec_parser/processing_engine/html_tag_parser.py |
| Parsing options | sec_parser/processing_engine/types.py |
| Processing step base | sec_parser/processing_steps/abstract_classes/abstract_elementwise_processing_step.py |
| Batch step base | sec_parser/processing_steps/abstract_classes/abstract_element_batch_processing_step.py |
| Element extractor | sec_parser/processing_steps/individual_semantic_element_extractor/individual_semantic_element_extractor.py |
| Table check | sec_parser/processing_steps/individual_semantic_element_extractor/single_element_checks/table_check.py |
| XBRL check | sec_parser/processing_steps/individual_semantic_element_extractor/single_element_checks/xbrl_tag_check.py |
| Image classifier | sec_parser/processing_steps/image_classifier.py |
| Empty classifier | sec_parser/processing_steps/empty_element_classifier.py |
| Table classifier | sec_parser/processing_steps/table_classifier.py |
| TOC classifier | sec_parser/processing_steps/table_of_contents_classifier.py |
| Section manager | sec_parser/processing_steps/top_section_manager.py |
| Intro classifier | sec_parser/processing_steps/introductory_section_classifier.py |
| Text classifier | sec_parser/processing_steps/text_classifier.py |
| Highlighted text | sec_parser/processing_steps/highlighted_text_classifier.py |
| Supplementary text | sec_parser/processing_steps/supplementary_text_classifier.py |
| Page header | sec_parser/processing_steps/page_header_classifier.py |
| Page number | sec_parser/processing_steps/page_number_classifier.py |
| Title classifier | sec_parser/processing_steps/title_classifier.py |
| Text merger | sec_parser/processing_steps/text_element_merger.py |
| Base element | sec_parser/semantic_elements/abstract_semantic_element.py |
| Element types | sec_parser/semantic_elements/semantic_elements.py |
| Highlighted element | sec_parser/semantic_elements/highlighted_text_element.py |
| Title element | sec_parser/semantic_elements/title_element.py |
| Top section title | sec_parser/semantic_elements/top_section_title.py |
| Top section marker | sec_parser/semantic_elements/top_section_start_marker.py |
| Section definitions | sec_parser/semantic_elements/top_section_title_types.py |
| Table element | sec_parser/semantic_elements/table_element/table_element.py |
| Table parser | sec_parser/semantic_elements/table_element/table_parser.py |
| Table to markdown | sec_parser/utils/bs4_/table_to_markdown.py |
| TOC check | sec_parser/utils/bs4_/table_check_data_cell.py |
| Table metrics | sec_parser/utils/bs4_/approx_table_metrics.py |
| Composite element | sec_parser/semantic_elements/composite_semantic_element.py |
| Tree builder | sec_parser/semantic_tree/tree_builder.py |
| Nesting rules | sec_parser/semantic_tree/nesting_rules.py |
| Semantic tree | sec_parser/semantic_tree/semantic_tree.py |
| Tree node | sec_parser/semantic_tree/tree_node.py |
| Render | sec_parser/semantic_tree/render_.py |