Theodoros Emmanouilidis

Notes & Thoughts

Convert PDF Document To Text From Command Line

September 15

pdftotext is a nifty command-line utility that converts a PDF document to plain text. Most Linux distributions include pdftotext as part of the poppler-utils package. Installation on Ubuntu is very easy using apt. Just type

sudo apt-get install poppler-utils

and pdftotext is installed.

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
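
As a brief illustration of the options above, here is a small sketch that converts only a range of pages while keeping the physical layout. The function name extract_pages and the file names are made up for this example; only -f, -l and -layout come from the usage text.

```shell
# Hypothetical helper: convert pages FIRST..LAST of a PDF, keeping the layout.
# Usage: extract_pages <pdf> <first> <last> <txt>
extract_pages() {
    pdftotext -f "$2" -l "$3" -layout "$1" "$4"
}

# Example call (file names are placeholders):
#   extract_pages report.pdf 2 5 report-pages-2-5.txt
```

Dropping -layout from the call gives the default reading-order output, which is often easier to search but loses column alignment.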

The simplest way to use it is by typing

pdftotext file-to-convert.pdf

and the utility will create a text file with the same base name (file-to-convert.txt) in the directory where file-to-convert.pdf resides.
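
The optional second argument in the usage line names the output file explicitly. As a small sketch, passing - in that position sends the text to standard output instead, which is handy for piping. The helper name pdf_word_count and the file names are invented for this example.

```shell
# Hypothetical helper: count the words in a PDF without creating a .txt file.
# "-" as the text-file argument makes pdftotext write to standard output.
pdf_word_count() {
    pdftotext "$1" - | wc -w
}

# Example calls (file names are placeholders):
#   pdftotext file-to-convert.pdf notes.txt    # explicit output name
#   pdf_word_count file-to-convert.pdf
```
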
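Wildcards (*) cannot be used to convert multiple files at once. For example,

pdftotext *.pdf

does not work as intended, because pdftotext accepts only one PDF file per invocation and treats a second argument as the name of the output text file. Instead, a loop can be used for batch conversions: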

for f in *.pdf
do
    pdftotext "$f"
done
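
A slightly more defensive variant of the same loop could look like the sketch below. It assumes you want the text files collected in a separate directory; the function name convert_all and its two arguments are hypothetical and not part of pdftotext.

```shell
# Hypothetical helper: convert every PDF in SRC_DIR, writing text files to OUT_DIR.
# Usage: convert_all <src-dir> <out-dir>
convert_all() {
    mkdir -p "$2"
    for f in "$1"/*.pdf
    do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        pdftotext "$f" "$2/$(basename "$f" .pdf).txt"
    done
}

# Example call (directory names are placeholders):
#   convert_all . ./text
```

Quoting "$f" as in the original loop keeps file names with spaces intact, and the existence check avoids an error message when the directory holds no PDF files.
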
posted under Tip Of The Day