CannabisCOA.Parser is a C#/.NET parser for cannabis Certificates of Analysis (COAs). It converts messy lab PDF output into structured lab, product, batch, cannabinoid, terpene, compliance, warning, and audit data.
The project is currently focused on Nevada COA layouts and the Flower v1 audit workflow. It handles real-world PDF extraction problems such as flattened tables, inconsistent headers, lab-specific layout drift, product-type edge cases, amended reports, side-by-side table bleed, missing lab headers, embedded quote characters, malformed table headers, and partial/single-panel reports.
Latest confirmed local validation:
Test suite: 307/307 passing
Current batch folder: G:\COA_BatchTests\combined-current\
Latest stress batch size: 3,333 reports
Latest stress batch result: 3,333 / 3,333 Flower rows
Unknown product types: 0
False Topical rows: 0
False Edible rows: 0
Missing core fields: 0
Latest confirmed 3,333-row batch audit:
Total rows: 3,333
Flower rows: 3,333
Unknown rows: 0
Topical rows: 0
Edible rows: 0
FullComplianceCoa rows: 3,293
SinglePanelTest rows: 39
PartialPanelReport rows: 1
Rows with no missing core fields: 3,333 / 3,333
Current clean rate: 100.0%
This is the current Flower v1 milestone: every report in the 3,333-file stress batch is classified as Flower, every row has required core audit fields, and partial/single-panel reports are separated from full compliance COAs instead of being treated as parser failures.
CannabisCOA.Parser now successfully processes a 3,333-report Nevada Flower stress batch with:
- ✅ 3,333 / 3,333 rows classified as Flower
- ✅ 3,333 / 3,333 rows with no missing core fields
- ✅ 0 Unknown product-type rows
- ✅ 0 false Topical rows in the Flower batch
- ✅ 0 false Edible rows in the Flower batch
- ✅ 39 single-panel reports correctly classified as `SinglePanelTest`
- ✅ 1 dual/partial-panel report correctly classified as `PartialPanelReport`
- ✅ 8 labs represented in the stress batch
- ✅ 307 / 307 tests passing
This moves the project from sample parsing into real batch audit readiness for Flower v1.
Flower: 3,333
Unknown: 0
Topical: 0
Edible: 0
FullComplianceCoa: 3,293
SinglePanelTest: 39
PartialPanelReport: 1
SinglePanelTest and PartialPanelReport rows are legitimate lab reports, but they are not full Flower compliance COAs. They are intentionally classified separately so they do not appear as cannabinoid parser failures or false missing-core-field failures.
No missing fields: 3,333
Missing fields: 0
Digipath: 644
Kaycha Labs: 628
G3 Labs: 502
MA Analytics: 502
374 Labs: 303
NV Cann Labs: 297
Ace Analytical Laboratory: 267
RSR Analytical Laboratories: 190
Digipath: 644 / 644 clean 100.0%
Kaycha Labs: 628 / 628 clean 100.0%
G3 Labs: 502 / 502 clean 100.0%
MA Analytics: 502 / 502 clean 100.0%
374 Labs: 303 / 303 clean 100.0%
NV Cann Labs: 297 / 297 clean 100.0%
Ace Analytical Laboratory: 267 / 267 clean 100.0%
RSR Analytical Laboratories: 190 / 190 clean 100.0%
Warning-quality review is now underway. The latest confirmed audit before the newest G3 terpene parser fix showed:
No warning: 2,499
TERPENE_TOTAL_MISMATCH: 465
AMENDED_COA: 258
AMENDED_COA|TERPENE_TOTAL_MISMATCH: 59
SINGLE_PANEL_TEST: 39
TERPENE_BREAKDOWN_MISSING: 7
TOTAL_THC_HIGH: 2
TOTAL_TERPENES_HIGH|TERPENE_TOTAL_MISMATCH: 2
TOTAL_TERPENES_HIGH|TERPENE_BREAKDOWN_MISSING: 1
PARTIAL_PANEL_REPORT: 1
Interpretation:
- `AMENDED_COA` is compliance metadata, not a parser failure.
- `SINGLE_PANEL_TEST` and `PARTIAL_PANEL_REPORT` are expected document-classification signals.
- `TERPENE_TOTAL_MISMATCH` is now the main warning-quality review board.
- A narrow G3-only terpene parser fix has been added and validated by tests. The next batch rerun should quantify the warning reduction.
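Because warnings are pipe-delimited, a combined cell such as `AMENDED_COA|TERPENE_TOTAL_MISMATCH` counts toward both codes. The sketch below is a minimal illustration of that per-code tally using the row counts listed above. The `WarningTally` class and its members are hypothetical names for this example, not part of the parser.

```csharp
using System;
using System.Collections.Generic;

public static class WarningTally
{
    // Row counts per Warnings cell, copied from the audit listing above.
    // An empty cell stands for rows with no warning.
    public static readonly (string Cell, int Rows)[] AuditGroups =
    {
        ("", 2499),
        ("TERPENE_TOTAL_MISMATCH", 465),
        ("AMENDED_COA", 258),
        ("AMENDED_COA|TERPENE_TOTAL_MISMATCH", 59),
        ("SINGLE_PANEL_TEST", 39),
        ("TERPENE_BREAKDOWN_MISSING", 7),
        ("TOTAL_THC_HIGH", 2),
        ("TOTAL_TERPENES_HIGH|TERPENE_TOTAL_MISMATCH", 2),
        ("TOTAL_TERPENES_HIGH|TERPENE_BREAKDOWN_MISSING", 1),
        ("PARTIAL_PANEL_REPORT", 1),
    };

    // Splits each pipe-delimited cell and adds its row count to every code it contains.
    public static Dictionary<string, int> CountByCode(
        IEnumerable<(string Cell, int Rows)> groups)
    {
        var totals = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var (cell, rows) in groups)
        {
            foreach (var code in cell.Split('|', StringSplitOptions.RemoveEmptyEntries))
            {
                totals.TryGetValue(code, out var current); // current is 0 when the code is new
                totals[code] = current + rows;
            }
        }
        return totals;
    }
}
```

With these counts, `WarningTally.CountByCode(WarningTally.AuditGroups)` reports `TERPENE_TOTAL_MISMATCH` on 526 rows (465 + 59 + 2) and `AMENDED_COA` on 317 rows (258 + 59).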
The Flower v1 sprint focused on turning real PDF chaos into a stable audit pipeline.
Major completed wins:
- ✅ Added flat batch CSV audit output
- ✅ Added Flower v1 audit fields
- ✅ Added `DocumentClassification`
- ✅ Added `IsFullComplianceCoa`
- ✅ Added `SinglePanelTest` and `PartialPanelReport` handling
- ✅ Cleaned Flower product-type classification to 3,333 / 3,333 in the stress batch
- ✅ Cleaned Flower core metadata to 3,333 / 3,333 in the stress batch
- ✅ Added lab-specific ProductName and BatchId extraction across the main Nevada Flower lab set
- ✅ Added Digipath compact sample-header ProductName extraction
- ✅ Added Digipath plant-material variants such as `Popcorn Buds`, `Shake & Duff`, and false-Topical protection for strain names like `Ice Cream Cake`
- ✅ Added Digipath single-panel / partial-panel classification
- ✅ Added Digipath collapsed/malformed cannabinoid table parsing
- ✅ Added 374 Labs `Popcorn Buds`, `Trim`, `Ground Flower`, `Bulk Flower`, and `Bulk, Flower` handling
- ✅ Added Ace `Popcorn Buds` and `Trim` handling
- ✅ Added G3 `Popcorn Buds`, `Light Deprivation`, and `Trim` handling
- ✅ Added G3 expanded terpene parsing for positive rows such as β-Myrcene, α-Pinene, β-Pinene, Linalool, Terpinolene, and β-Ocimene
- ✅ Added MA Analytics `Trim`, `Popcorn Buds`, and BatchId fallback handling
- ✅ Added RSR `Trim` and `Bulk Flower` handling
- ✅ Added NV Cann Labs footer-marker lab detection using `nvcann.com` / Schuster Street markers
- ✅ Added NV Cann Labs plant-material detection for `Popcorn Buds` and `Trim`
- ✅ Added Kaycha raw plant handling for `Flower - Cured`, `Trim`, `Shake`, `Popcorn Buds`, and `Other - Not Listed`
- ✅ Added CSV escaping coverage for embedded quote values like `8" Bagel`
- ✅ Preserved conservative generic parsing by keeping layout fixes lab-specific
Current lab adapters / lab coverage include:
- ✅ 374 Labs
- ✅ Ace Analytical Laboratory
- ✅ Digipath
- ✅ G3 Labs
- ✅ Kaycha Labs
- ✅ MA Analytics
- ✅ NV Cann Labs
- ✅ RSR Analytical Laboratories
The current production-grade workflow is Flower v1.
For Nevada audit purposes, Flower v1 includes usable cannabis / plant-material variants such as:
- Flower
- Flower - Cured
- Flower Cured
- Popcorn Buds
- Small Buds
- Shake
- Shake & Duff
- Trim
- Ground Flower
- Bulk Flower
- Raw Plant / Plant Material layouts where the test matrix clearly matches Flower or usable cannabis
Current coverage matrix:
| Lab | Flower / Plant Material | Pre-Roll | Edible | Vape | Concentrate | Tincture | Topical |
|---|---|---|---|---|---|---|---|
| 374 Labs | ✅ | Partial | ❌ | Partial | Partial | ❌ | ❌ |
| Ace Analytical Laboratory | ✅ | Partial | ❌ | Partial | ❌ | ❌ | ❌ |
| Digipath | ✅ | Partial | ❌ | Partial | Partial | ❌ | ❌ |
| G3 Labs | ✅ | Partial | ❌ | ❌ | ❌ | ❌ | ❌ |
| Kaycha Labs | ✅ | Partial | ❌ | Partial | Partial | ❌ | ❌ |
| MA Analytics | ✅ | Partial | ❌ | ❌ | Partial | ❌ | ❌ |
| NV Cann Labs | ✅ | ❌ | ❌ | Partial | ❌ | ❌ | ❌ |
| RSR Analytical Laboratories | ✅ | Partial | ❌ | ❌ | ❌ | ❌ | ❌ |
Legend:
- ✅ = fixture-backed and/or strongly batch-validated
- Partial = observed support exists, but more fixtures are needed
- ❌ = not yet validated
The parser currently normalizes:
- Lab name
- Product type
- Product name
- Batch ID
- Harvest date
- Test date
- Package date
- Amended COA status
- Document classification
- Full compliance COA flag
- Major cannabinoid values:
  - THC
  - THCA
  - CBD
  - CBDA
  - Total THC
  - Total CBD
- Total terpenes
- Individual terpene breakdowns where available
- Source text for key cannabinoid values
- Confidence values
- Parser / validation warnings
The parser separates product type from document completeness.
Example:
ProductType = Flower
DocumentClassification = SinglePanelTest
IsFullComplianceCoa = false
Warnings = SINGLE_PANEL_TEST
This distinction matters because some lab reports describe a Flower product but are not full compliance COAs.
`FullComplianceCoa`: used for normal Flower v1 COAs with the expected compliance-style panel set.
`SinglePanelTest`: used for one-panel reports, such as pesticide-only or heavy-metals-only reports. These files may describe Flower products, but they should not be treated as complete Flower COAs or as cannabinoid parser failures.
`PartialPanelReport`: used for reports with more than one panel that still fall short of a full compliance COA, such as a mycotoxins + pesticides report. These reports are real lab documents, but they are intentionally separated from full compliance COAs.
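The three classifications can be illustrated with a small sketch based on which panels a report contains. This is not the parser's actual logic. The `PanelClassifier` class and the panel names in `AssumedFullPanelSet` are hypothetical placeholders chosen for the example, and the real required panel set may differ.

```csharp
using System;
using System.Collections.Generic;

public enum DocumentClassification
{
    FullComplianceCoa,
    PartialPanelReport,
    SinglePanelTest
}

public static class PanelClassifier
{
    // Assumption: illustrative panel names only, not the parser's real compliance matrix.
    public static readonly HashSet<string> AssumedFullPanelSet =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "Cannabinoids", "Terpenes", "Pesticides",
            "HeavyMetals", "Mycotoxins", "Microbials"
        };

    public static DocumentClassification Classify(IEnumerable<string> panelsFound)
    {
        var found = new HashSet<string>(panelsFound, StringComparer.OrdinalIgnoreCase);
        if (found.Count == 0)
        {
            throw new ArgumentException("At least one panel is required.", nameof(panelsFound));
        }

        // Every assumed compliance panel is present: treat as a full COA.
        if (found.IsSupersetOf(AssumedFullPanelSet))
        {
            return DocumentClassification.FullComplianceCoa;
        }

        // Otherwise the report is incomplete: one panel or several.
        return found.Count == 1
            ? DocumentClassification.SinglePanelTest
            : DocumentClassification.PartialPanelReport;
    }
}
```

Under these assumptions, `Classify(new[] { "Pesticides" })` yields `SinglePanelTest` and `Classify(new[] { "Mycotoxins", "Pesticides" })` yields `PartialPanelReport`, which matches the examples above. The product type stays Flower in both cases because it is determined separately.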
The parser distinguishes between:
- `ProductType`
- Compliance-style document behavior
- `DocumentClassification`
- Parser warnings
This matters because Nevada plant-material products can include:
- Flower
- Trim
- Shake
- Popcorn Buds
- Non-infused pre-roll material
- Other usable cannabis / raw plant variants
Core rule:
Explicit COA text determines product type.
Test panel structure helps validate the compliance matrix.
DocumentClassification determines whether the report is a full COA, a partial-panel report, or a single-panel report.
Parse a single COA file:
```powershell
dotnet run --project src\CannabisCOA.Parser.Cli -- --file "G:\path\to\coa.pdf"
```
Useful when creating fixtures or debugging PDF extraction:
```powershell
dotnet run --project src\CannabisCOA.Parser.Cli -- --file "G:\path\to\coa.pdf" --dump-text
```
Parse a batch folder to JSONL:
```powershell
dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --out "G:\COA_BatchTests\parsed.jsonl"
```
Current Flower v1 audit workflow:
```powershell
dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --csv "G:\COA_BatchTests\combined-current\batch-audit.csv"
```
Expected console behavior:
```
Processed <n> files → CSV <path>
```
The CLI supports flat CSV export for Excel, Power BI, and batch QA review.
CSV behavior:
- One row per parsed report
- Dates are written as `yyyy-MM-dd`
- Missing/null values are blank
- Decimal values use invariant culture
- Warnings are pipe-delimited in a single cell
- Embedded commas, quotes, and line breaks are escaped correctly
- Embedded quote values such as `8" Bagel` are CSV-escaped correctly
- Terpene breakdown columns are intentionally excluded from the v1 summary CSV to keep it flat and stable
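The CSV rules above can be sketched as a few small formatting helpers. This is a minimal illustration following common RFC 4180 conventions, not the CLI's actual writer. The `CsvCell` class and its method names are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CsvCell
{
    private static readonly char[] SpecialChars = { ',', '"', '\r', '\n' };

    // Quotes a value only when needed and doubles any embedded quote characters.
    public static string Escape(string? value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return string.Empty; // missing/null values are written as blank cells
        }

        if (value.IndexOfAny(SpecialChars) < 0)
        {
            return value;
        }

        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Dates use a fixed yyyy-MM-dd pattern regardless of machine culture.
    public static string FormatDate(DateTime? date) =>
        date?.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) ?? string.Empty;

    // Decimals use invariant culture so the separator is always a period.
    public static string FormatDecimal(decimal? number) =>
        number?.ToString(CultureInfo.InvariantCulture) ?? string.Empty;

    // Multiple warnings share one cell, separated by pipes.
    public static string JoinWarnings(IEnumerable<string> warnings) =>
        Escape(string.Join("|", warnings));
}
```

For example, `CsvCell.Escape("8\" Bagel")` produces `"8"" Bagel"`, which Excel and Power BI read back as the original `8" Bagel`.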
Current CSV audit columns include:
SourceFile
AuditProfile
IsFlowerV1Candidate
MapperSchemaVersion
DocumentClassification
IsFullComplianceCoa
LabName
ProductType
ProductName
BatchId
HarvestDate
TestDate
PackageDate
IsAmended
OverallStatus
MissingCoreFields
CannabinoidCount
TerpeneCount
TotalTHC
TotalCBD
TotalTerpenes
THC
THCA
CBD
CBDA
THCSourceText
THCConfidence
THCASourceText
THCAConfidence
CBDSourceText
CBDConfidence
CBDASourceText
CBDAConfidence
Warnings
Run the full test suite locally:
```powershell
dotnet test
```
Current latest confirmed result:
307/307 passing
The parser is developed with real fixture-backed regression tests wherever possible.
Fixture location:
tests/CannabisCOA.Parser.Core.Tests/Fixtures/Labs/
Fixture strategy:
- Add a narrow fixture or raw text snippet for the exact failing layout.
- Add one focused regression test.
- Make the smallest lab-specific production fix.
- Avoid broad generic parser changes unless the issue is truly generic.
- Preserve existing lab/product behavior.
- Run the full local test suite.
- Regenerate batch audit CSV.
- Use the audit board to choose the next target.
This project intentionally favors lab-specific parsing over aggressive generic guessing.
Core rules:
- ✅ Prefer real fixtures over assumptions
- ✅ Fix one lab/product/layout at a time
- ✅ Keep generic parsers conservative
- ✅ Avoid broad refactors during parser hardening
- ✅ Do not loosen validators to hide parser misses
- ✅ Preserve source precision where the COA provides it
- ✅ Treat side-by-side PDF extraction as a first-class problem
- ✅ Use lab-specific parsing when layout identity is known
- ✅ Keep source text traceability for key parsed values
- ✅ Separate parser failures from partial/single-panel source documents
- ✅ Treat source-document reality differently from parser failure
Generated build outputs should not be committed.
Common generated folders:
bin/
obj/
TestResults/
Local scratch output should also stay out of Git:
unknown-labs.txt
.tmp-build/
If generated files appear in Git status, clean or restore them before committing.
Example:
```powershell
git status
git diff --stat
```
Completed / current milestone:
- Stable Flower parsing baseline across main Nevada labs
- Flower v1 CSV audit export
- JSONL batch output
- Lab-specific ProductName and BatchId extraction
- Kaycha raw plant layout handling
- Kaycha `Shake`, `Trim`, `Popcorn Buds`, and `Other - Not Listed` handling
- 374 Labs `Popcorn Buds`, `Trim`, `Ground Flower`, `Bulk Flower`, and `Bulk, Flower` handling
- Ace `Popcorn Buds` and `Trim` handling
- G3 `Popcorn Buds`, `Trim`, and `Light Deprivation` handling
- G3 expanded terpene breakdown parsing
- MA Analytics `Trim`, `Popcorn Buds`, and BatchId fallback handling
- RSR `Trim` and `Bulk Flower` handling
- NV Cann Labs footer-marker lab detection
- NV Cann Labs `Popcorn Buds` and `Trim` handling
- Digipath compact sample-header ProductName extraction
- Digipath plant-material descriptor handling
- Digipath false-Topical prevention for strain-name collisions
- Digipath collapsed/malformed cannabinoid table parsing
- Digipath single-panel and partial-panel report classification
- CSV quote escaping coverage
- `DocumentClassification` and `IsFullComplianceCoa`
- Batch audit loop validated against 3,333 real COA/report files
- 3,333 / 3,333 Flower rows in the latest stress batch
- 3,333 / 3,333 rows with no missing core fields
Next practical targets:
- Rerun the 3,333-file audit after the G3 terpene parser fix to quantify warning reduction
- Review remaining `TERPENE_TOTAL_MISMATCH` warnings, especially RSR and any remaining G3 patterns
- Review `TERPENE_BREAKDOWN_MISSING` warnings
- Review `TOTAL_THC_HIGH` and `TOTAL_TERPENES_HIGH` warnings
- Add normalized cannabinoid CSV export
- Add normalized terpene CSV export
- Add safety/compliance panel extraction
- Add database-ready output/loading path
- Add automated support matrix generation from fixture coverage
- Run larger folder-level stress tests across the sorted COA archive
The Flower v1 metadata/classification board is clean on the 3,333-report stress batch.
Recommended next target:
Terpene warning review
Why:
MissingCoreFields is now 0.
ProductType false positives are now 0.
The remaining quality signals are warning-based, especially TERPENE_TOTAL_MISMATCH and TERPENE_BREAKDOWN_MISSING.
Recommended next workflow:
- Rerun the current 3,333-file stress audit after the G3 terpene parser fix.
- Group remaining warning rows by lab.
- Inspect representative `TERPENE_TOTAL_MISMATCH` rows.
- Determine whether the mismatch is a parser issue, source rounding issue, unit conversion issue, or intentional COA display behavior.
- Add fixture-backed warning behavior where needed.
- Keep warning fixes separate from core metadata fixes.
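When inspecting mismatch rows, a quick triage helper can separate rounding-level differences from likely unit mix-ups before anyone opens the PDF. The sketch below is one possible approach, not the parser's validator. The `TerpeneTriage` class, the `MismatchHint` enum, and the tolerance parameter are all hypothetical. The factor-of-ten check relies only on the fact that 1% by weight equals 10 mg/g.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum MismatchHint
{
    WithinTolerance,
    PossibleUnitConversion,
    NeedsReview
}

public static class TerpeneTriage
{
    // Compares the summed terpene breakdown with the reported total.
    // The tolerance is an assumed, caller-supplied rounding allowance.
    public static MismatchHint Triage(
        IEnumerable<decimal> breakdown, decimal reportedTotal, decimal tolerance)
    {
        decimal sum = breakdown.Sum();

        if (Math.Abs(sum - reportedTotal) <= tolerance)
        {
            return MismatchHint.WithinTolerance;
        }

        // 1% by weight equals 10 mg/g, so a tenfold gap hints at mixed units.
        decimal scaledTolerance = tolerance * 10m;
        bool totalIsTenfold = Math.Abs(sum * 10m - reportedTotal) <= scaledTolerance;
        bool sumIsTenfold = Math.Abs(sum - reportedTotal * 10m) <= scaledTolerance;
        if (totalIsTenfold || sumIsTenfold)
        {
            return MismatchHint.PossibleUnitConversion;
        }

        // Anything else needs a human look at the source COA and the parsed rows.
        return MismatchHint.NeedsReview;
    }
}
```

A row that lands in `NeedsReview` is the kind that deserves a narrow fixture. Rows in the other two buckets may turn out to be source rounding or COA display behavior.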
```powershell
# Run full tests locally
dotnet test
# Parse current batch to CSV audit
dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --csv "G:\COA_BatchTests\combined-current\batch-audit.csv"
# Review git changes
git status
git diff --stat
```
CannabisCOA.Parser has moved from a Flower-focused parser into a real-world Nevada COA audit engine.
Current state:
307/307 tests passing
3,333-row real batch stress test
3,333 / 3,333 Flower rows
3,333 / 3,333 rows with no missing core fields
3,293 FullComplianceCoa rows
39 SinglePanelTest rows correctly classified
1 PartialPanelReport row correctly classified
CSV audit output
JSONL output
fixture-backed lab-specific parser hardening
The parser now supports a practical analyst workflow:
Parse real COA folders
Export clean audit CSV
Identify warning patterns by lab/layout
Patch narrow parser behavior
Validate with fixtures
Repeat
This project is now ready for deeper warning-quality review, normalized cannabinoid/terpene exports, compliance panel extraction, database-ready outputs, API/service integration planning, and larger folder-level stress testing across the sorted COA archive.