Lightweight PDF text extractor with a C++ core (PoDoFo 1.x), a Node.js native addon, and an Electron GUI.
- PDFX — PoDoFo 1.x Text Extractor
- Core: C++ static library pdfx that extracts text from PDFs via PoDoFo 1.x.
- CLI: pdfx_cli wraps the core for command-line extraction (txt or json output).
- Node addon: pdfx.node exposes the extractor to Node via N-API (node-addon-api).
- GUI: Minimal Electron app that lets you open a PDF, extract pages (ranges), search, copy, and save results.
cpp/                              # C++ core + CLI (CMake)
  include/PdfTextExtractor.hpp
  src/PdfTextExtractor.cpp
  src/cli_main.cpp
  CMakeLists.txt
cmake/FindOrFetchPoDoFo.cmake     # Finds system PoDoFo or vendors from source
node/
  addon/                          # Node native addon (cmake-js)
    binding.cpp
    CMakeLists.txt
    package.json
    test-addon.sh
  gui/                            # Electron app (main + preload + renderer)
    main.js
    preload.cjs
    renderer/index.html
    package-lock.json
tests/
  some.pdf                        # Minimal one-page PDF
  some_pdf.py                     # Script that generates a minimal PDF
LICENSE
README.md                         # (this file supersedes the minimal stub)
- A C++17 compiler and CMake ≥ 3.21.
- PoDoFo 1.x:
  - Prefer a system package (found via find_package(PoDoFo CONFIG QUIET)).
  - If not found, the build can vendor PoDoFo from GitHub (see Configuration).
- For the Node addon:
  - Node.js with headers (handled by cmake-js).
  - npm to install dev dependencies listed in node/addon/package.json.
- For the GUI:
  - Node.js and npm.
  - electron (already present in node/gui/package-lock.json, see Run: Electron GUI).
# from repo root
cmake -S cpp -B build/cpp -DCMAKE_BUILD_TYPE=Release
cmake --build build/cpp --target pdfx pdfx_cli -j
Artifacts:
- Static library: build/cpp/libpdfx.*
- CLI executable: build/cpp/pdfx_cli[.exe]
Install (optional):
cmake --install build/cpp --prefix /your/prefix
Usage (from cpp/src/cli_main.cpp):
Usage: pdfx_cli -i input.pdf [-o out.txt] [--pages 1-3,5] [--format txt|json]
Examples:
# Extract all pages to stdout as text
build/cpp/pdfx_cli -i tests/some.pdf
# Extract page 1 and 3..5 to a file
build/cpp/pdfx_cli -i tests/some.pdf --pages 1,3-5 -o extracted.txt
# JSON output
build/cpp/pdfx_cli -i tests/some.pdf --format json
Verification steps:
- Expect non-empty text output for tests/some.pdf.
- With --format json, output has a pages array; each item contains "index" and "text".
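The JSON shape described above can be sketched as follows. This is a minimal illustration, not the code in cli_main.cpp: the function names jsonEscape and pagesToJson are hypothetical, and the zero-based "index" values, the field order, and the compact formatting are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Escapes a string for use inside a JSON string literal.
std::string jsonEscape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters become \u00XX escapes.
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out.push_back(static_cast<char>(c));
                }
        }
    }
    return out;
}

// Hypothetical serializer: emits {"pages":[{"index":N,"text":"..."}]}.
// Assumption: "index" is the zero-based position of the page in the input.
std::string pagesToJson(const std::vector<std::string>& pages) {
    std::string out = "{\"pages\":[";
    for (std::size_t i = 0; i < pages.size(); ++i) {
        if (i > 0) out += ',';
        out += "{\"index\":" + std::to_string(i) +
               ",\"text\":\"" + jsonEscape(pages[i]) + "\"}";
    }
    out += "]}";
    return out;
}
```

Passing the result of extractAll() to pagesToJson would yield one object per page, which matches the verification step above as long as the real CLI uses the same field names.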
# from repo root
cd node/addon
npm i
npm run build # uses cmake-js; produces build/Release/pdfx.node
# (optional) print the full path to the built artifact
npm run print:artifact
Test the addon binary:
# quick export inspection
./test-addon.sh --build
# => prints exported methods: [ 'extractAll', 'extractPages' ]
// replace with the actual path the build printed for pdfx.node
const addon = require('./node/addon/build/Release/pdfx.node');
(async () => {
const pages = addon.extractAll('tests/some.pdf');
console.log('page count:', pages.length);
console.log('page1:', pages[0]);
const some = addon.extractPages('tests/some.pdf', [0]); // zero-based indices
console.log('only page 1:', some[0]);
})();
Verification steps:
- extractAll() returns an array of strings (one per page).
- extractPages(path, [0,2]) returns only the selected pages in order.
- Invalid page indices throw (binding forwards C++ exceptions).
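The last verification step implies a bounds check on the C++ side before any page is read. A minimal sketch of such a check is shown here; validatePageIndices is a hypothetical helper, and the exception type (std::out_of_range) and message text are assumptions rather than what PdfTextExtractor.cpp actually throws.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical check: rejects any zero-based index outside [0, pageCount).
// Throws on the first offending index; valid input returns normally.
void validatePageIndices(const std::vector<int>& indices, int pageCount) {
    assert(pageCount >= 0);  // a page count can never be negative
    for (int idx : indices) {
        if (idx < 0 || idx >= pageCount) {
            throw std::out_of_range("page index " + std::to_string(idx) +
                                    " is outside [0, " + std::to_string(pageCount) + ")");
        }
    }
}
```

An exception thrown this way is the kind of error the binding would forward to JavaScript, so a call such as extractPages(path, [99]) on a one-page PDF would surface as a thrown JS Error.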
The GUI consists of:
- node/gui/main.js (Electron main, ESM)
- node/gui/preload.cjs (context-isolated preload, CJS)
- node/gui/renderer/index.html
It expects the native addon at one of:
- node/addon/build/Release/pdfx.node (repo dev build), or
- <app resources>/native/pdfx.node (if you package later)
Minimal run (dev)
Note: node/gui currently lacks a package.json. Create one as shown below, then install and run Electron.
{
"name": "pdfx-gui",
"version": "0.1.0",
"type": "module",
"main": "main.js",
"private": true,
"scripts": {
"start": "electron ."
},
"devDependencies": {
"electron": "^31.3.0"
}
}
Then:
# build the native addon first so the GUI can load it
(cd node/addon && npm i && npm run build)
# run the GUI
cd node/gui
npm i
npm run start
Keyboard shortcuts (from the renderer UI):
- Open PDF: Ctrl/⌘ + O
- Extract: Ctrl/⌘ + E
- Export .txt: Ctrl/⌘ + S
GUI features visible in renderer/index.html:
- Page range parsing (e.g. 1-3, 6) → zero-based indices internally.
- Live search with regex highlighting.
- Copy per page / copy all / clear output.
- Save extracted text to .txt.
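The page range conversion in the first bullet (1-based input such as 1-3, 6 becoming zero-based indices) can be sketched in C++ as below. The renderer implements this in JavaScript, so this is only an illustration of the idea; parsePageSpec and toPositiveInt are hypothetical names, and the error handling (rejecting descending ranges, keeping duplicates) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Parses a token made only of digits into a page number >= 1.
int toPositiveInt(const std::string& token) {
    const bool allDigits = !token.empty() &&
        std::all_of(token.begin(), token.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!allDigits) throw std::invalid_argument("not a page number: '" + token + "'");
    const int value = std::stoi(token);  // throws std::out_of_range on huge input
    if (value < 1) throw std::invalid_argument("page numbers start at 1");
    return value;
}

// Hypothetical helper: turns a 1-based spec such as "1-3, 6" into
// zero-based indices {0, 1, 2, 5}. Duplicates are kept as written.
std::vector<int> parsePageSpec(const std::string& spec) {
    // Drop whitespace so "1-3, 6" and "1-3,6" parse identically.
    std::string compact;
    for (unsigned char c : spec) {
        if (!std::isspace(c)) compact.push_back(static_cast<char>(c));
    }

    std::vector<int> indices;
    std::stringstream parts(compact);
    std::string part;
    while (std::getline(parts, part, ',')) {
        if (part.empty()) continue;  // tolerate stray commas
        const std::string::size_type dash = part.find('-');
        const int first = toPositiveInt(part.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : toPositiveInt(part.substr(dash + 1));
        if (last < first) throw std::invalid_argument("descending range: " + part);
        for (int page = first; page <= last; ++page) indices.push_back(page - 1);
    }
    return indices;
}
```

The resulting vector has the same form as the pageIndices argument of extractPages, so a spec of 1,3-5 would select indices 0, 2, 3 and 4.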
Header: cpp/include/PdfTextExtractor.hpp
struct ExtractOptions {
bool preserve_layout = false; // currently unused by implementation
};
class PdfTextExtractor {
public:
std::vector<std::string> extractAll(const std::string& pdfPath,
const ExtractOptions& opts = {});
std::vector<std::string> extractPages(const std::string& pdfPath,
const std::vector<int>& pageIndices,
const ExtractOptions& opts = {});
private:
std::string extractOnePage(PoDoFo::PdfMemDocument& doc, int pageIndex,
const ExtractOptions& opts);
static std::string toUtf8(const std::string& s);
};
Implementation notes (from PdfTextExtractor.cpp):
- Loads via PoDoFo::PdfMemDocument.
- For each page, PdfPage::ExtractTextTo(std::vector<PdfTextEntry>&) is used.
- Page text is built by concatenating e.Text entries with '\n'.
- Current toUtf8 is a pass-through (no re-encoding).
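The concatenation step in the notes above can be illustrated without linking PoDoFo. In this sketch TextEntry is a stand-in for PoDoFo::PdfTextEntry (only its Text member is modelled) and joinEntries is a hypothetical name. Whether the real implementation puts '\n' between entries or after every entry is not stated, so the "between entries, no trailing newline" behaviour here is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for PoDoFo::PdfTextEntry; only the Text member matters here.
struct TextEntry {
    std::string Text;
};

// Joins entry texts with '\n' between consecutive entries.
// Assumption: no newline is appended after the last entry.
std::string joinEntries(const std::vector<TextEntry>& entries) {
    std::string page;
    bool first = true;
    for (const TextEntry& e : entries) {
        if (!first) page += '\n';
        page += e.Text;
        first = false;
    }
    return page;
}
```

With entries "Hello" and "World" this produces a two-line string, which is what one element of the vector returned by extractAll() would look like for a page containing those two text runs.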
File: node/addon/binding.cpp
// const addon = require('./build/Release/pdfx.node')
addon.extractAll(path: string) // -> string[] per page
addon.extractPages(path: string, pages: number[]) // -> string[] for requested pages
- Throws a JS TypeError on invalid arguments.
- Forwards C++ exceptions to JS as Error("pdfx native error: ...").
File: node/gui/preload.cjs
window.pdfx = {
selectPdf(): Promise<string|null>, // showOpenDialog; returns path or null
extractAll(filePath: string): Promise<string[]>, // IPC -> addon
extractPages(filePath: string, pages: number[]): Promise<string[]>,
saveText(defaultName: string, text: string): Promise<string|null>, // showSaveDialog
onError(cb: (msg: string) => void): () => void // subscribe to startup load errors
}
- The main process (node/gui/main.js) resolves the addon from: ../addon/build/Release/pdfx.node (dev), or <resources>/native/pdfx.node (packaged).
- If loading fails, the renderer receives an error via pdfx:onError.
- Ready-made PDF: tests/some.pdf (one page; Latin-1 text operators, quotes, TJ array, etc.).
- Generate a minimal PDF:
python tests/some_pdf.py # writes ./some.pdf in the current working directory
Handled by cmake/FindOrFetchPoDoFo.cmake:
- Try system PoDoFo first (find_package(PoDoFo CONFIG QUIET)).
- If not found, vendor from source:
  - Git repo: https://github.com/podofo/podofo.git
  - Tag: PDFX_PODOFO_TAG (default: "1.0.0")
Options:
- -DPDFX_VENDOR_DEPS=ON: force vendored build.
- -DPDFX_PODOFO_TAG=<tag-or-branch>: pick a different PoDoFo ref when vendoring.
The module sets:
- PDFX_PoDoFo_TARGET: target to link against.
- PDFX_PoDoFo_INCLUDE_DIRS: include directories passed to the pdfx target.
MIT — see LICENSE.
