Convert PDF Document To Text From Command Line
September 15
pdftotext is a nifty command line utility that can be used to convert a PDF document to text. Most Linux distributions include pdftotext as part of the poppler-utils package. Installation in Ubuntu is very easy using apt. Just type
    sudo apt-get install poppler-utils
and pdftotext is installed.
    Usage: pdftotext [options] <PDF-file> [<text-file>]
      -f <int>       : first page to convert
      -l <int>       : last page to convert
      -r <fp>        : resolution, in DPI (default is 72)
      -x <int>       : x-coordinate of the crop area top left corner
      -y <int>       : y-coordinate of the crop area top left corner
      -W <int>       : width of crop area in pixels (default is 0)
      -H <int>       : height of crop area in pixels (default is 0)
      -layout        : maintain original physical layout
      -raw           : keep strings in content stream order
      -htmlmeta      : generate a simple HTML file, including the meta information
      -enc <string>  : output text encoding name
      -listenc       : list available encodings
      -eol <string>  : output end-of-line convention (unix, dos, or mac)
      -nopgbrk       : don't insert page breaks between pages
      -opw <string>  : owner password (for encrypted files)
      -upw <string>  : user password (for encrypted files)
      -q             : don't print any messages or errors
      -v             : print copyright and version info
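As a rough illustration of how a few of these options combine, the sketch below wraps a page-range conversion in a small shell function. The function name convert_page_range and its argument order are made up for this example; only the -f, -l and -layout flags come from the usage text above.

```shell
# Hypothetical helper (not part of poppler-utils): convert a range of pages
# while keeping the original physical layout.
#   $1 = input PDF, $2 = output text file, $3 = first page, $4 = last page
convert_page_range() {
    pdftotext -f "$3" -l "$4" -layout "$1" "$2"
}
```

For instance, convert_page_range report.pdf intro.txt 1 3 would write pages 1 to 3 of a (hypothetical) report.pdf to intro.txt.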
The simplest way to use it is by typing
    pdftotext file-to-convert.pdf
and the utility will create a text file with the same base name (file-to-convert.txt) in the same directory as file-to-convert.pdf.
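If the text should land somewhere else, the optional second argument from the usage line ([<text-file>]) can name the output file explicitly. The following is a minimal sketch under that assumption; the helper name pdf_to_dir and the idea of passing an output directory are illustrative, not something pdftotext provides.

```shell
# Hypothetical helper: write the converted text into a separate directory.
#   $1 = path to the PDF, $2 = output directory (created if it does not exist)
pdf_to_dir() {
    mkdir -p "$2" || return 1
    # Strip the directory part and the .pdf suffix to get the base name.
    name=$(basename "$1" .pdf)
    pdftotext "$1" "$2/$name.txt"
}
```

Calling pdf_to_dir file-to-convert.pdf text-output would then produce text-output/file-to-convert.txt, where text-output is just a placeholder directory name.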
Wildcards (*) cannot be used to convert multiple files in one call, for example:

    pdftotext *.pdf

because pdftotext expects only one PDF file name, and a second argument is taken as the name of the output text file. Instead, a loop can be used for batch conversions:
    for f in *.pdf
    do
        pdftotext "$f"
    done
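A slightly more defensive variant of the same loop is sketched below. It skips the case where no PDF matches (in most shells the unmatched pattern is otherwise passed through literally) and reports files that fail to convert. The function name convert_all_pdfs and the optional directory argument are assumptions made for this example.

```shell
# Hypothetical helper: convert every *.pdf in a directory, one call per file.
#   $1 = directory to scan (defaults to the current directory)
convert_all_pdfs() {
    dir=${1:-.}
    failed=0
    for f in "$dir"/*.pdf; do
        # If nothing matched, "$f" is the literal pattern and not a file: skip it.
        [ -f "$f" ] || continue
        pdftotext "$f" || { echo "failed: $f" >&2; failed=1; }
    done
    # Non-zero exit status if at least one conversion failed.
    return "$failed"
}
```

Quoting "$f" keeps file names with spaces intact, and the return status makes the function usable in scripts that need to know whether every file converted.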