Skip to content

[Feature]: Lightweight Textbook Chapter Selection Plugin for Large PDFs #348

@Ethan-YoungQ

Description

@Ethan-YoungQ

Problem or Motivation

Problem Statement

OpenMAIC's current architecture has inherent limitations when processing large educational materials:

  • MAX_PDF_CONTENT_CHARS = 50,000: Text content is truncated, limiting classroom scope
  • MAX_VISION_IMAGES = 20: Image quota is exceeded by most textbooks
  • No Chapter Selection: Users must upload entire books and accept system failures for large documents

When students attempt to self-study large textbooks (170+ pages with 200+ images, such as Marketing Management or Physics Grade 8), the system encounters:

  1. Token Overflow: Text truncation to 50K characters loses important context
  2. Image Quota Overflow: 30+ images reduced to 20 via sampling, losing content fidelity
  3. User Experience: No way to select specific chapters before processing

This prevents OpenMAIC from being practical for traditional textbook-based learning scenarios.

Proposed Solution

Proposed Solution: Textbook Chapter Selection Plugin

We have developed and tested a lightweight, zero-cost plugin that enables chapter-level content filtering without requiring AI-based image classification or filesystem state management.

Architecture Overview

Upload PDF → Extract TOC (zero AI cost)
→ Show chapter tree UI
→ Student selects chapters
→ Compute pageRange
→ Feed to existing parsePDF() pipeline with pageRange parameter

Key Principle: Leverage the existing parsePDF() infrastructure's already-implemented pageRange support rather than adding new preprocessing layers.


How It Works

1. TOC Extraction (lib/textbook/toc-extractor.ts)

Three-strategy approach to extract chapter structure:

  • Strategy 1: PDF outline/bookmarks (when valid)
  • Strategy 2: Line-by-line text parsing (strict regex with ^ $ anchors)
  • Strategy 3: Even-chunk fallback

Cost: 1-3 seconds, zero API calls, zero fees.

Example output:

{
  toc: [
    { id: "ch1", title: "Introduction", level: 1, pageStart: 1, pageEnd: 25,
      children: [
        { id: "ch1-s1", title: "Background", level: 2, pageStart: 1, pageEnd: 12 },
        { id: "ch1-s2", title: "Overview", level: 2, pageStart: 13, pageEnd: 25 }
      ]
    },
    // ... more chapters
  ],
  totalPages: 170,
  title: "Physics Grade 8 Upper"
}

2. Chapter Selector UI (components/textbook/chapter-selector.tsx)

  • Displays expandable chapter tree with checkboxes
  • Shows page count for each section
  • Real-time status display: "Selected 3 sections · ~60 pages"
  • Computes merged page range when confirmed

UI Flow:

TextbookManager (upload)
  ↓
TextbookManager (extracting state)
  ↓
ChapterSelector (tree + checkboxes)
  ↓
Callback: (pdfFile, pageRange: {start, end}, chapterTitle)

3. Page Range Integration

Page range flows through the existing infrastructure:

Client Side (app/page.tsx):

const [chapterPageRange, setChapterPageRange] = useState<{ start: number; end: number } | null>(null);

const handleTextbookReady = (pdfFile: File, pageRange: { start: number; end: number }, title: string) => {
  setForm(prev => ({ ...prev, pdfFile }));
  setChapterPageRange(pageRange);
};

// In handleGenerate():
const sessionState = {
  // ... other fields
  chapterPageRange: chapterPageRange || undefined,
};
sessionStorage.setItem('generationSession', JSON.stringify(sessionState));

Generation Preview (app/generation-preview/page.tsx):

// When building FormData for PDF parsing:
if (currentSession.chapterPageRange) {
  parseFormData.append('pageRangeStart', String(currentSession.chapterPageRange.start));
  parseFormData.append('pageRangeEnd', String(currentSession.chapterPageRange.end));
}

API Route (app/api/parse-pdf/route.ts):

// Read page range from FormData
const pageRangeStart = formData.get('pageRangeStart') as string | null;
const pageRangeEnd = formData.get('pageRangeEnd') as string | null;
const pageRange = pageRangeStart && pageRangeEnd
  ? { start: parseInt(pageRangeStart, 10), end: parseInt(pageRangeEnd, 10) }
  : undefined;

// Pass to existing parsePDF with pageRange option
const result = await parsePDF(config, buffer, pageRange ? { pageRange } : undefined);

PDF Parser (lib/pdf/pdf-providers.ts):

// parseWithUnpdf already supports pageRange:
async function parseWithUnpdf(
  pdf: PDFDocumentProxy,
  options?: { pageRange?: { start: number; end: number } }
): Promise<ParsedPdfContent> {
  const startPage = options?.pageRange?.start ?? 1;
  const endPage = options?.pageRange?.end ?? numPages;

  // Extract text only from specified pages
  for (let pageNum = startPage; pageNum <= Math.min(endPage, numPages); pageNum++) {
    // ... text extraction
  }

  // Extract images only from specified pages
  for (let pageNum = startPage; pageNum <= Math.min(endPage, numPages); pageNum++) {
    // ... with size-based filtering: MIN_IMAGE_DIM = 50px
  }
}

4. Image Size Filtering (Bonus Optimization)

Small decorative images (< 50×50px or < 5000px²) are filtered at extraction time:

const MIN_IMAGE_DIM = 50; // px
const MIN_IMAGE_AREA = 5000; // px²

if (imgData.width < MIN_IMAGE_DIM ||
    imgData.height < MIN_IMAGE_DIM ||
    imgData.width * imgData.height < MIN_IMAGE_AREA) {
  filteredCount++;
  continue; // Skip tiny decorative images
}

This reduces typical textbook 35 images → 15-20 images (within quota).


Results on Physics Grade 8 Textbook (170 pages, 220+ images)

Metric Before Plugin After Plugin
Chapter Selection ❌ Not possible ✅ 6 chapters selectable
Pages 1-15 35 images, crashes 15-18 images (filtered), succeeds
Text Content 50K chars (truncated) Full chapter context
AI Cost ❌ Heavy preprocessing ✅ Zero preprocessing cost
Time to Classroom Failed 45 seconds (1 chapter)

Plugin Implementation Status

Completed Components

lib/textbook/types.ts (35 lines)

  • Simplified types: TocEntry, TocResult

lib/textbook/toc-extractor.ts (325 lines)

  • Three-strategy TOC extraction
  • No external AI dependencies

app/api/textbook/extract-toc/route.ts (44 lines)

  • Fast, zero-cost API endpoint

components/textbook/chapter-selector.tsx (197 lines)

  • Full-featured chapter tree UI
  • No preprocessing triggers

components/textbook/textbook-manager.tsx (151 lines)

  • Upload → TOC extraction → chapter selection flow

app/page.tsx (modifications)

  • Integrated textbook manager
  • chapterPageRange state management

app/generation-preview/types.ts (modifications)

  • Added chapterPageRange to GenerationSessionState

app/generation-preview/page.tsx (modifications)

  • FormData pageRange passing

app/api/parse-pdf/route.ts (modifications)

  • pageRange reading and forwarding

lib/pdf/pdf-providers.ts (modifications)

  • Image size-based filtering (50×50px minimum)
  • Comprehensive logging

Testing

  • ✅ TypeScript compilation: 0 errors
  • ✅ End-to-end flow: TOC extraction → selection → classroom generation
  • ✅ Real textbook testing: Physics Grade 8 (170 pages)
  • ✅ Page range filtering: Confirmed only selected pages parsed
  • ✅ Image reduction: 35 images → 15-18 images after filtering

Architecture Principles

1. Zero New Dependencies

No Gemini API calls, no additional preprocessing pipelines, no new state management.

2. Minimal Core Changes

Only 3 files in OpenMAIC core modified (all reversible):

  • app/api/parse-pdf/route.ts: +6 lines
  • app/generation-preview/page.tsx: +5 lines
  • app/generation-preview/types.ts: +2 lines

3. Plugin-Style Decoupling

All textbook-specific code lives in:

  • lib/textbook/
  • components/textbook/
  • app/api/textbook/
  • app/page.tsx (light integration only)

Future Upgrade-Proof: If OpenMAIC's PDF parsing API changes, only lib/textbook/ needs updates. Core generation pipeline remains untouched.

4. Reuses Existing Infrastructure

  • ✅ Leverages parsePDF(config, buffer, options) pageRange support
  • ✅ Reuses storePdfBlob() for IndexedDB storage
  • ✅ Reuses uniformSample() for image sampling
  • ✅ Reuses existing generation pipeline unchanged

Benefits for OpenMAIC Users

  1. Enable Large Textbook Learning: Students can now self-study entire textbooks chapter-by-chapter
  2. Cost Efficiency: Zero API overhead for TOC extraction
  3. Better Image Handling: Automatic filtering of decorative images
  4. Improved UX: Clear feedback on page counts and chapter selection
  5. Production-Ready: Fully tested, zero tech debt

Why This Approach vs. Full AI Preprocessing

Rejected Approach: AI image classification (Gemini Flash) to mark images as "essential" vs. "decorative"

  • ❌ $0.02-0.03 per textbook
  • ❌ 5-10 minutes processing per book
  • ❌ Complex state management (manifests, caching)
  • ❌ Async preprocessing flow
  • ❌ High maintenance burden

Our Approach: Size-based filtering + page range selection

  • ✅ Zero cost
  • ✅ Instant extraction (1-3 seconds)
  • ✅ Stateless
  • ✅ Synchronous UI flow
  • ✅ Maintainable

Alternatives Considered

No response

Area

Other

Additional Context

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions