skierpage/catdoc

 
 

catdoc version 0.97

catdoc is a program which reads MS-Office Word .doc files and prints their content as readable ASCII text to stdout. It can also produce correct escape sequences if some UNICODE characters have to be represented specially in your typesetting system such as (La)TeX.

The catdoc package also includes

  • catppt, which reads MS PowerPoint .ppt files and prints their content.
  • xls2csv, which reads MS Excel .xls files and prints their content as rows of comma-separated values.
  • wordview, which displays catdoc output in a window.

See INSTALL for information about compiling and installing the catdoc programs on Linux and Mac OS. Several Linux distributions build the catdoc package, though as of December 2025 most build the old version 0.95.

The KDE project's "baloo" file indexing and search framework uses these programs (via the KFileMetadata library) to index the text of old MS-Office files.

catdoc features runtime configuration, proper charset handling, user-definable output formats and support for Word97 files, which contain UNICODE internally.

version 0.97

This release of the catdoc programs addresses numerous vulnerabilities, described below. To do so, it updates the autoconf/automake tooling to make it easier to build with Address Sanitizer, and adds an automake test harness to check for memory errors. The steps to build from source have changed slightly; see INSTALL.

vbwagner's upstream came back to life after a 9-year absence with a couple of fixes in November 2025; this fork incorporates them.

End of DOS support

The patched source code no longer compiles in Borland Turbo C; v0.96 is the last release of the catdoc programs that builds and runs in 16-bit DOS. If anyone cares about DOS support, get in touch!

File format specifications

Microsoft publishes the file format specifications.

Limitations

Since 0.93.0, catdoc parses the OLE structure and extracts the WordDocument stream, but it does not parse the internal structure of that stream.

This rough approach inevitably results in some garbage in the output, especially near the end of the file and if the file contains embedded OLE objects such as pictures or equations.
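To illustrate what "OLE structure" refers to here: Office 97-2003 files are OLE2 compound files, which begin with a fixed 8-byte signature. The following is a minimal sketch, not catdoc's own code, of checking for that signature before handing a file to an extractor. The function names `looks_like_ole` and `file_looks_like_ole` are hypothetical and chosen for this example only.

```python
# The 8-byte signature at the start of an OLE2 compound file
# (the container used by Office 97-2003 .doc, .xls and .ppt files).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(header: bytes) -> bool:
    """Return True if the given leading bytes start with the OLE2 signature."""
    return header[:len(OLE_MAGIC)] == OLE_MAGIC


def file_looks_like_ole(path: str) -> bool:
    """Read only the first 8 bytes of a file and check the signature."""
    with open(path, "rb") as f:
        return looks_like_ole(f.read(len(OLE_MAGIC)))
```

A matching signature only says that the file is an OLE2 container; it does not say which streams it holds or whether they are well formed.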

ALTERNATIVES discusses alternate ways to extract text from Office 97-2003 files, or convert them to other formats.

Vulnerabilities

The catdoc programs are unsafe C code that parse old files. Unexpected or garbled file content will cause them to crash, and running them on a specially crafted file may allow an attacker to interfere with the operation of your computer. Version 0.97 fixes several memory access errors and Common Vulnerabilities and Exposures reported against various forks and distribution packages of catdoc over the years, but there may be more.

This release of the catdoc programs incorporates the Debian patches for the vulnerabilities CVE-2024-54028, CVE-2024-52035, and CVE-2024-48877 identified and addressed by the Cisco Talos team. See NEWS and the commit history (search history for "CVE") for other fixes made. Some were detected by Address Sanitizer tools; see tests/asan_failures for more details.
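Because a malformed file may crash or hang an extractor, a caller that processes untrusted files might run it as a child process with a timeout and treat any abnormal exit as a failure. The sketch below shows one way to do that. It is an illustration only: the helper name `run_extractor`, the 30-second default, and decoding the output as UTF-8 are all assumptions, and a timeout does not by itself contain an exploit.

```python
import subprocess


def run_extractor(command, timeout_seconds=30.0):
    """Run an extractor command and return its stdout as text, or None on failure.

    Failure covers: the program is missing, it exceeds the timeout,
    or it exits with a non-zero status (including death by a signal).
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            timeout=timeout_seconds,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `run_extractor(["catdoc", "report.doc"])` returns the extracted text, or `None` if catdoc is not installed, crashes, or runs too long; `report.doc` is a placeholder file name.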

Documentation, bugs, more information

Catdoc is distributed under the GNU General Public License version 2 or later; see COPYING.

The catdoc programs are documented in their UNIX-style manual pages. For those who don't have the man command (such as MS-DOS users), plain text and PostScript versions of the man pages are in the doc directory.

Your bug reports and suggestions are welcome, as are code contributions; TODO.md is an incomplete list of things to work on. In particular, if you have old MS-Office files from which the catdoc text extraction programs do not produce correct output, please file an issue and attach a small test file.

See the CREDITS file and git log for contributors. Special thanks to Victor Wagner <vitus@45.free.net> for working on this project and managing releases for over a decade.

About

My goal is to incorporate Debian patches and other patches and cleanup, so this can be a basis for a new release for packagers such as Fedora.
