LJrobinson/CannabisCOA.Parser

CannabisCOA.Parser

CannabisCOA.Parser is a C#/.NET parser for cannabis Certificates of Analysis (COAs). It converts messy lab PDF output into structured lab, product, batch, cannabinoid, terpene, compliance, warning, and audit data.

The project is currently focused on Nevada COA layouts and the Flower v1 audit workflow. It handles real-world PDF extraction problems such as flattened tables, inconsistent headers, lab-specific layout drift, product-type edge cases, amended reports, side-by-side table bleed, missing lab headers, embedded quote characters, malformed table headers, and partial/single-panel reports.


🚦 Current Status

Latest confirmed local validation:

Test suite:                 307/307 passing
Current batch folder:       G:\COA_BatchTests\combined-current\
Latest stress batch size:   3,333 reports
Latest stress batch result: 3,333 / 3,333 Flower rows
Unknown product types:      0
False Topical rows:         0
False Edible rows:          0
Missing core fields:        0

Latest confirmed 3,333-row batch audit:

Total rows:                    3,333
Flower rows:                   3,333
Unknown rows:                      0
Topical rows:                      0
Edible rows:                       0

FullComplianceCoa rows:        3,293
SinglePanelTest rows:             39
PartialPanelReport rows:           1

Rows with no missing core fields: 3,333 / 3,333
Current clean rate:              100.0%

This is the current Flower v1 milestone: every report in the 3,333-file stress batch is classified as Flower, every row has required core audit fields, and partial/single-panel reports are separated from full compliance COAs instead of being treated as parser failures.


🔥 Major Milestone

CannabisCOA.Parser now successfully processes a 3,333-report Nevada Flower stress batch with:

  • ✅ 3,333 / 3,333 rows classified as Flower
  • ✅ 3,333 / 3,333 rows with no missing core fields
  • ✅ 0 Unknown product-type rows
  • ✅ 0 false Topical rows in the Flower batch
  • ✅ 0 false Edible rows in the Flower batch
  • ✅ 39 single-panel reports correctly classified as SinglePanelTest
  • ✅ 1 dual/partial-panel report correctly classified as PartialPanelReport
  • ✅ 8 labs represented in the stress batch
  • ✅ 307 / 307 tests passing

This moves the project from sample parsing into real batch audit readiness for Flower v1.


🧪 Latest 3,333-Report Stress Test

Product Type Distribution

Flower:   3,333
Unknown:      0
Topical:      0
Edible:       0

Document Classification

FullComplianceCoa:    3,293
SinglePanelTest:         39
PartialPanelReport:       1

SinglePanelTest and PartialPanelReport rows are legitimate lab reports, but they are not full Flower compliance COAs. They are intentionally classified separately so they do not appear as cannabinoid parser failures or false missing-core-field failures.

Missing Core Field Summary

No missing fields: 3,333
Missing fields:       0

Lab Counts in the Stress Batch

Digipath:                       644
Kaycha Labs:                    628
G3 Labs:                        502
MA Analytics:                   502
374 Labs:                       303
NV Cann Labs:                   297
Ace Analytical Laboratory:      267
RSR Analytical Laboratories:    190

Clean Rate by Lab

Digipath:                       644 / 644 clean   100.0%
Kaycha Labs:                    628 / 628 clean   100.0%
G3 Labs:                        502 / 502 clean   100.0%
MA Analytics:                   502 / 502 clean   100.0%
374 Labs:                       303 / 303 clean   100.0%
NV Cann Labs:                   297 / 297 clean   100.0%
Ace Analytical Laboratory:      267 / 267 clean   100.0%
RSR Analytical Laboratories:    190 / 190 clean   100.0%

Current Warning Board

Warning-quality review is now underway. The latest confirmed audit before the newest G3 terpene parser fix showed:

No warning:                                      2,499
TERPENE_TOTAL_MISMATCH:                            465
AMENDED_COA:                                       258
AMENDED_COA|TERPENE_TOTAL_MISMATCH:                 59
SINGLE_PANEL_TEST:                                  39
TERPENE_BREAKDOWN_MISSING:                           7
TOTAL_THC_HIGH:                                      2
TOTAL_TERPENES_HIGH|TERPENE_TOTAL_MISMATCH:          2
TOTAL_TERPENES_HIGH|TERPENE_BREAKDOWN_MISSING:       1
PARTIAL_PANEL_REPORT:                                1

Interpretation:

  • AMENDED_COA is compliance metadata, not a parser failure.
  • SINGLE_PANEL_TEST and PARTIAL_PANEL_REPORT are expected document-classification signals.
  • TERPENE_TOTAL_MISMATCH is now the main target of warning-quality review.
  • A narrow G3-only terpene parser fix has been added and validated by tests. The next batch rerun should quantify the warning reduction.
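
Because combined warnings share one pipe-delimited cell, the per-code totals are larger than the single-warning rows suggest. The sketch below is in Python rather than the project's C#, purely as an analyst-side illustration. The `board` dictionary restates the counts listed above, and `tally_codes` is a hypothetical helper, not part of the parser.

```python
from collections import Counter

# Row counts copied from the warning board above; the key is the
# pipe-delimited Warnings value ("" means no warning).
board = {
    "": 2499,
    "TERPENE_TOTAL_MISMATCH": 465,
    "AMENDED_COA": 258,
    "AMENDED_COA|TERPENE_TOTAL_MISMATCH": 59,
    "SINGLE_PANEL_TEST": 39,
    "TERPENE_BREAKDOWN_MISSING": 7,
    "TOTAL_THC_HIGH": 2,
    "TOTAL_TERPENES_HIGH|TERPENE_TOTAL_MISMATCH": 2,
    "TOTAL_TERPENES_HIGH|TERPENE_BREAKDOWN_MISSING": 1,
    "PARTIAL_PANEL_REPORT": 1,
}

def tally_codes(rows_by_warning):
    """Count how many rows carry each individual warning code."""
    totals = Counter()
    for combined, rows in rows_by_warning.items():
        # Skip the empty string produced by rows without warnings.
        for code in filter(None, combined.split("|")):
            totals[code] += rows
    return totals

totals = tally_codes(board)
print(sum(board.values()))               # 3333 rows in total
print(totals["TERPENE_TOTAL_MISMATCH"])  # 465 + 59 + 2 = 526
print(totals["AMENDED_COA"])             # 258 + 59 = 317
```

Counted this way, 526 rows carry TERPENE_TOTAL_MISMATCH in the pre-fix audit, which is the baseline the next rerun would be compared against.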

🎯 Flower v1 Sprint Summary

The Flower v1 sprint focused on turning real PDF chaos into a stable audit pipeline.

Major completed wins:

  • ✅ Added flat batch CSV audit output
  • ✅ Added Flower v1 audit fields
  • ✅ Added DocumentClassification
  • ✅ Added IsFullComplianceCoa
  • ✅ Added SinglePanelTest and PartialPanelReport handling
  • ✅ Cleaned Flower product-type classification to 3,333 / 3,333 in the stress batch
  • ✅ Cleaned Flower core metadata to 3,333 / 3,333 in the stress batch
  • ✅ Added lab-specific ProductName and BatchId extraction across the main Nevada Flower lab set
  • ✅ Added Digipath compact sample-header ProductName extraction
  • ✅ Added Digipath plant-material variants such as Popcorn Buds and Shake & Duff, plus false-Topical protection for strain names like Ice Cream Cake
  • ✅ Added Digipath single-panel / partial-panel classification
  • ✅ Added Digipath collapsed/malformed cannabinoid table parsing
  • ✅ Added 374 Labs Popcorn Buds, Trim, Ground Flower, Bulk Flower, and "Bulk, Flower" handling
  • ✅ Added Ace Popcorn Buds and Trim handling
  • ✅ Added G3 Popcorn Buds, Light Deprivation, and Trim handling
  • ✅ Added G3 expanded terpene parsing for positive rows such as β-Myrcene, α-Pinene, β-Pinene, Linalool, Terpinolene, and β-Ocimene
  • ✅ Added MA Analytics Trim, Popcorn Buds, and BatchId fallback handling
  • ✅ Added RSR Trim and Bulk Flower handling
  • ✅ Added NV Cann Labs footer-marker lab detection using nvcann.com / Schuster Street markers
  • ✅ Added NV Cann Labs plant-material detection for Popcorn Buds and Trim
  • ✅ Added Kaycha raw plant handling for Flower - Cured, Trim, Shake, Popcorn Buds, and Other - Not Listed
  • ✅ Added CSV escaping coverage for embedded quote values like 8" Bagel
  • ✅ Preserved conservative generic parsing by keeping layout fixes lab-specific

🧬 Supported Labs

Current lab adapters / lab coverage include:

  • ✅ 374 Labs
  • ✅ Ace Analytical Laboratory
  • ✅ Digipath
  • ✅ G3 Labs
  • ✅ Kaycha Labs
  • ✅ MA Analytics
  • ✅ NV Cann Labs
  • ✅ RSR Analytical Laboratories

🌿 Product Coverage

The current production-grade workflow is Flower v1.

For Nevada audit purposes, Flower v1 includes usable cannabis / plant-material variants such as:

  • Flower
  • Flower - Cured
  • Flower Cured
  • Popcorn Buds
  • Small Buds
  • Shake
  • Shake & Duff
  • Trim
  • Ground Flower
  • Bulk Flower
  • Raw Plant / Plant Material layouts where the test matrix clearly matches Flower or usable cannabis

Current coverage matrix:

Lab                           Flower / Plant Material   Pre-Roll   Edible   Vape      Concentrate   Tincture   Topical
374 Labs                      ✅                         Partial    —        Partial   Partial       —          —
Ace Analytical Laboratory     ✅                         Partial    —        Partial   —             —          —
Digipath                      ✅                         Partial    —        Partial   Partial       —          —
G3 Labs                       ✅                         Partial    —        —         —             —          —
Kaycha Labs                   ✅                         Partial    ✅        Partial   Partial       —          —
MA Analytics                  ✅                         Partial    —        —         Partial       —          —
NV Cann Labs                  ✅                         ✅          —        Partial   ✅             —          —
RSR Analytical Laboratories   ✅                         Partial    —        —         —             —          —

Legend:

  • ✅ = fixture-backed and/or strongly batch-validated
  • Partial = observed support exists, but more fixtures are needed
  • — = not yet validated

📦 What the Parser Extracts

The parser currently normalizes:

  • Lab name
  • Product type
  • Product name
  • Batch ID
  • Harvest date
  • Test date
  • Package date
  • Amended COA status
  • Document classification
  • Full compliance COA flag
  • Major cannabinoid values:
    • THC
    • THCA
    • CBD
    • CBDA
    • Total THC
    • Total CBD
  • Total terpenes
  • Individual terpene breakdowns where available
  • Source text for key cannabinoid values
  • Confidence values
  • Parser / validation warnings

🧾 Document Classification

The parser separates product type from document completeness.

Example:

ProductType = Flower
DocumentClassification = SinglePanelTest
IsFullComplianceCoa = false
Warnings = SINGLE_PANEL_TEST

This distinction matters because some lab reports describe a Flower product but are not full compliance COAs.

FullComplianceCoa

Used for normal Flower v1 COAs with the expected compliance-style panel set.

SinglePanelTest

Used for one-panel reports, such as pesticide-only or heavy-metals-only reports.

These files may describe Flower products, but they should not be treated as complete Flower COAs or as cannabinoid parser failures.

PartialPanelReport

Used for reports with more than one panel but still not a full compliance COA, such as a mycotoxins + pesticides report.

These reports are real lab documents, but they are intentionally separated from full compliance COAs.
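
The three classes can be pictured as a function of which test panels a report contains. The sketch below is in Python for brevity (the parser itself is C#) and rests on assumptions: the panel names, the `REQUIRED_PANELS` set, and the `classify_document` helper are all hypothetical, since this README does not spell out which panels make up the expected compliance-style set.

```python
# Hypothetical stand-in for "the expected compliance-style panel set";
# the list the parser actually uses is an assumption here.
REQUIRED_PANELS = frozenset({
    "cannabinoids", "terpenes", "pesticides",
    "heavy_metals", "microbials", "mycotoxins",
})

def classify_document(panels_found):
    """Map the set of detected panels to a DocumentClassification name."""
    panels = set(panels_found)
    if not panels:
        raise ValueError("no panels detected")
    if REQUIRED_PANELS <= panels:
        return "FullComplianceCoa"
    if len(panels) == 1:
        return "SinglePanelTest"
    return "PartialPanelReport"

print(classify_document({"pesticides"}))                # SinglePanelTest
print(classify_document({"mycotoxins", "pesticides"}))  # PartialPanelReport
print(classify_document(REQUIRED_PANELS))               # FullComplianceCoa
```

Keeping this decision separate from ProductType is what lets a pesticide-only Flower report stay a Flower row without counting as a cannabinoid parser failure.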


🧠 Nevada / METRC Methodology

The parser distinguishes between:

ProductType
Compliance-style document behavior
DocumentClassification
Parser warnings

This matters because Nevada plant-material products can include:

  • Flower
  • Trim
  • Shake
  • Popcorn Buds
  • Non-infused pre-roll material
  • Other usable cannabis / raw plant variants

Core rule:

Explicit COA text determines product type.
Test panel structure helps validate the compliance matrix.
DocumentClassification determines whether the report is a full COA, a partial-panel report, or a single-panel report.

🖥️ CLI Usage

Parse a Single PDF

dotnet run --project src\CannabisCOA.Parser.Cli -- --file "G:\path\to\coa.pdf"

Dump Extracted Text From a PDF

Useful when creating fixtures or debugging PDF extraction:

dotnet run --project src\CannabisCOA.Parser.Cli -- --file "G:\path\to\coa.pdf" --dump-text

Batch Parse COAs to JSONL

dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --out "G:\COA_BatchTests\parsed.jsonl"

Batch Parse COAs to CSV Audit

Current Flower v1 audit workflow:

dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --csv "G:\COA_BatchTests\combined-current\batch-audit.csv"

Expected console behavior:

Processed <n> files → CSV <path>

📊 CSV Audit Export

The CLI supports flat CSV export for Excel, Power BI, and batch QA review.

CSV behavior:

  • One row per parsed report
  • Dates are written as yyyy-MM-dd
  • Missing/null values are blank
  • Decimal values use invariant culture
  • Warnings are pipe-delimited in a single cell
  • Embedded commas, quotes, and line breaks are escaped correctly
  • Embedded quote values such as 8" Bagel are CSV-escaped correctly
  • Terpene breakdown columns are intentionally excluded from the v1 summary CSV to keep it flat and stable
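
The conventions above can be illustrated with a short sketch. It uses Python's standard `csv` module rather than the project's C# writer, so it shows the expected output shape, not the actual implementation. The `write_audit_row` helper, the four-column layout, and the sample values (including the date) are hypothetical.

```python
import csv
import io
from datetime import date
from decimal import Decimal

def write_audit_row(out, product_name, test_date, total_thc, warnings):
    """Write one flat CSV row following the conventions listed above."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([
        product_name,                                # embedded quotes get doubled
        test_date.isoformat() if test_date else "",  # yyyy-MM-dd, blank if missing
        "" if total_thc is None else str(total_thc), # always '.' as the separator
        "|".join(warnings),                          # all warnings in one cell
    ])

buffer = io.StringIO()
write_audit_row(
    buffer,
    '8" Bagel',
    date(2024, 1, 5),
    Decimal("24.50"),
    ["AMENDED_COA", "TERPENE_TOTAL_MISMATCH"],
)
print(buffer.getvalue(), end="")
# "8"" Bagel",2024-01-05,24.50,AMENDED_COA|TERPENE_TOTAL_MISMATCH
```

Reading the line back with `csv.reader` returns the original `8" Bagel` value, which is the round-trip property the escaping tests are meant to protect.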

Current CSV audit columns include:

SourceFile
AuditProfile
IsFlowerV1Candidate
MapperSchemaVersion
DocumentClassification
IsFullComplianceCoa
LabName
ProductType
ProductName
BatchId
HarvestDate
TestDate
PackageDate
IsAmended
OverallStatus
MissingCoreFields
CannabinoidCount
TerpeneCount
TotalTHC
TotalCBD
TotalTerpenes
THC
THCA
CBD
CBDA
THCSourceText
THCConfidence
THCASourceText
THCAConfidence
CBDSourceText
CBDConfidence
CBDASourceText
CBDAConfidence
Warnings
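
As an analyst-side illustration, the per-lab clean rate can be recomputed from these columns. This is a minimal Python sketch, not part of the C# project. It assumes a blank MissingCoreFields cell means nothing is missing, and the three sample rows are made up and contain only the two columns used here.

```python
import csv
import io
from collections import defaultdict

def clean_rate_by_lab(csv_file):
    """Return {lab: (clean_rows, total_rows)} from a batch audit CSV."""
    counts = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(csv_file):
        lab = row["LabName"]
        counts[lab][1] += 1
        # Assumption: a blank MissingCoreFields cell means nothing is missing.
        if not row["MissingCoreFields"].strip():
            counts[lab][0] += 1
    return {lab: tuple(pair) for lab, pair in counts.items()}

# Made-up sample input; a real run would open batch-audit.csv instead.
sample = io.StringIO(
    "LabName,MissingCoreFields\n"
    "Digipath,\n"
    "Digipath,BatchId\n"
    "G3 Labs,\n"
)
rates = clean_rate_by_lab(sample)
for lab, (clean, total) in rates.items():
    print(f"{lab}: {clean} / {total} clean ({clean / total:.1%})")
```

For a real audit file, passing `open(path, newline="")` in place of `sample` lets the `csv` module handle quoted line breaks correctly.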

🧪 Testing

Run the full test suite locally:

dotnet test

Latest confirmed result:

307/307 passing

🧩 Fixture Strategy

The parser is developed with real fixture-backed regression tests wherever possible.

Fixture location:

tests/CannabisCOA.Parser.Core.Tests/Fixtures/Labs/

Fixture strategy:

  1. Add a narrow fixture or raw text snippet for the exact failing layout.
  2. Add one focused regression test.
  3. Make the smallest lab-specific production fix.
  4. Avoid broad generic parser changes unless the issue is truly generic.
  5. Preserve existing lab/product behavior.
  6. Run the full local test suite.
  7. Regenerate batch audit CSV.
  8. Use the audit board to choose the next target.

🧱 Development Principles

This project intentionally favors lab-specific parsing over aggressive generic guessing.

Core rules:

  • ✅ Prefer real fixtures over assumptions
  • ✅ Fix one lab/product/layout at a time
  • ✅ Keep generic parsers conservative
  • ✅ Avoid broad refactors during parser hardening
  • ✅ Do not loosen validators to hide parser misses
  • ✅ Preserve source precision where the COA provides it
  • ✅ Treat side-by-side PDF extraction as a first-class problem
  • ✅ Use lab-specific parsing when layout identity is known
  • ✅ Keep source text traceability for key parsed values
  • ✅ Separate parser failures from partial/single-panel source documents
  • ✅ Treat source-document reality differently from parser failure

🧹 Repository Hygiene

Generated build outputs should not be committed.

Common generated folders:

bin/
obj/
TestResults/

Local scratch output should also stay out of Git:

unknown-labs.txt
.tmp-build/

If generated files appear in Git status, clean or restore them before committing.

Example:

git status
git diff --stat

🧭 Current Roadmap

Completed / current milestone:

  • Stable Flower parsing baseline across main Nevada labs
  • Flower v1 CSV audit export
  • JSONL batch output
  • Lab-specific ProductName and BatchId extraction
  • Kaycha raw plant layout handling
  • Kaycha Shake, Trim, Popcorn Buds, and Other - Not Listed handling
  • 374 Labs Popcorn Buds, Trim, Ground Flower, Bulk Flower, and "Bulk, Flower" handling
  • Ace Popcorn Buds and Trim handling
  • G3 Popcorn Buds, Trim, and Light Deprivation handling
  • G3 expanded terpene breakdown parsing
  • MA Analytics Trim, Popcorn Buds, and BatchId fallback handling
  • RSR Trim and Bulk Flower handling
  • NV Cann Labs footer-marker lab detection
  • NV Cann Labs Popcorn Buds and Trim handling
  • Digipath compact sample-header ProductName extraction
  • Digipath plant-material descriptor handling
  • Digipath false-Topical prevention for strain-name collisions
  • Digipath collapsed/malformed cannabinoid table parsing
  • Digipath single-panel and partial-panel report classification
  • CSV quote escaping coverage
  • DocumentClassification and IsFullComplianceCoa
  • Batch audit loop validated against 3,333 real COA/report files
  • 3,333 / 3,333 Flower rows in the latest stress batch
  • 3,333 / 3,333 rows with no missing core fields

Next practical targets:

  • Rerun the 3,333-file audit after the G3 terpene parser fix to quantify warning reduction
  • Review remaining TERPENE_TOTAL_MISMATCH warnings, especially RSR and any remaining G3 patterns
  • Review TERPENE_BREAKDOWN_MISSING warnings
  • Review TOTAL_THC_HIGH and TOTAL_TERPENES_HIGH warnings
  • Add normalized cannabinoid CSV export
  • Add normalized terpene CSV export
  • Add safety/compliance panel extraction
  • Add database-ready output/loading path
  • Add automated support matrix generation from fixture coverage
  • Run larger folder-level stress tests across the sorted COA archive

🔥 Recommended Next Engineering Step

The Flower v1 metadata/classification board is clean on the 3,333-report stress batch.

Recommended next target:

Terpene warning review

Why:

MissingCoreFields is now 0.
ProductType false positives are now 0.
The remaining quality signals are warning-based, especially TERPENE_TOTAL_MISMATCH and TERPENE_BREAKDOWN_MISSING.

Recommended next workflow:

  1. Rerun the current 3,333-file stress audit after the G3 terpene parser fix.
  2. Group remaining warning rows by lab.
  3. Inspect representative TERPENE_TOTAL_MISMATCH rows.
  4. Determine whether the mismatch is a parser issue, source rounding issue, unit conversion issue, or intentional COA display behavior.
  5. Add fixture-backed warning behavior where needed.
  6. Keep warning fixes separate from core metadata fixes.
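
Step 4 can be partly mechanized before any manual inspection. The following Python sketch (not the project's C#) is one possible triage heuristic. The `triage_terpene_mismatch` helper, the 0.01 tolerance, and the category labels are assumptions, since the parser's actual TERPENE_TOTAL_MISMATCH rule is not described in this README.

```python
from decimal import Decimal

def triage_terpene_mismatch(reported_total, terpene_values, tolerance=Decimal("0.01")):
    """Suggest a likely cause for a mismatch; inputs are Decimal values."""
    if not terpene_values:
        return "breakdown missing"
    summed = sum(terpene_values, Decimal("0"))
    if abs(reported_total - summed) <= tolerance:
        return "source rounding"
    # 1 % by weight equals 10 mg/g, so a factor-of-ten gap hints at mixed units.
    scaled_tolerance = tolerance * 10
    if (abs(reported_total * 10 - summed) <= scaled_tolerance
            or abs(reported_total - summed * 10) <= scaled_tolerance):
        return "possible unit conversion"
    return "needs manual review"

D = Decimal
print(triage_terpene_mismatch(D("2.15"), [D("1.20"), D("0.60"), D("0.349")]))  # source rounding
print(triage_terpene_mismatch(D("2.15"), [D("12.0"), D("6.0"), D("3.5")]))     # possible unit conversion
print(triage_terpene_mismatch(D("2.15"), [D("1.20")]))                         # needs manual review
```

Rows that land in "needs manual review" are the ones worth opening against the source PDF, because they may be either a parser miss or intentional COA display behavior.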

🧾 Example Workflow

# Run full tests locally
dotnet test

# Parse current batch to CSV audit
dotnet run --project src\CannabisCOA.Parser.Cli -- --batch "G:\COA_BatchTests\combined-current" --csv "G:\COA_BatchTests\combined-current\batch-audit.csv"

# Review git changes
git status
git diff --stat

📌 Project Summary

CannabisCOA.Parser has moved from a Flower-focused parser into a real-world Nevada COA audit engine.

Current state:

307/307 tests passing
3,333-row real batch stress test
3,333 / 3,333 Flower rows
3,333 / 3,333 rows with no missing core fields
3,293 FullComplianceCoa rows
39 SinglePanelTest rows correctly classified
1 PartialPanelReport row correctly classified
CSV audit output
JSONL output
fixture-backed lab-specific parser hardening

The parser now supports a practical analyst workflow:

Parse real COA folders
Export clean audit CSV
Identify warning patterns by lab/layout
Patch narrow parser behavior
Validate with fixtures
Repeat

This project is now ready for deeper warning-quality review, normalized cannabinoid/terpene exports, compliance panel extraction, database-ready outputs, API/service integration planning, and larger folder-level stress testing across the sorted COA archive.
