Theodoros Emmanouilidis » Blog Archive

Convert PDF Document To Text From Command Line

September 15

pdftotext is a nifty command-line utility that converts a PDF document to plain text. Most Linux distributions ship pdftotext as part of the poppler-utils package. On Ubuntu, installation is easy with apt. Just type

sudo apt-get install poppler-utils

and pdftotext is installed.
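To confirm that the tool is actually on the PATH before using it in a script, a small check like the following can help. This is only a sketch: the function name require_cmd is made up for this example, and the package hint it prints assumes an apt-based system such as Ubuntu.

```shell
# Hypothetical helper (not part of poppler-utils): succeed only if the
# command named in $1 is on the PATH; otherwise print an install hint
# for the package named in $2 and return a non-zero status.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "$1 not found; try: sudo apt-get install $2" >&2
        return 1
    }
}
```

Calling require_cmd pdftotext poppler-utils at the top of a script lets it stop early with a readable message instead of failing halfway through a batch.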

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
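As a rough illustration of how a few of these options combine, the sketch below wraps -f, -l and -layout in a small shell function. The function name pdf_pages and its argument order are assumptions made for this example; only the pdftotext options themselves come from the usage text above.

```shell
# Hypothetical helper: extract a page range from a PDF, keeping the layout.
# Usage: pdf_pages FIRST LAST INPUT.pdf [OUTPUT.txt]
pdf_pages() {
    first=$1
    last=$2
    input=$3
    # Default output name: the input name with .pdf replaced by .txt
    output=${4:-${input%.pdf}.txt}
    pdftotext -f "$first" -l "$last" -layout "$input" "$output"
}
```

For example, pdf_pages 2 5 report.pdf would write pages 2 to 5 of report.pdf to report.txt. Passing - as the output name sends the text to standard output instead of a file, which is handy for piping into grep or less.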

The simplest way to use it is by typing

pdftotext file-to-convert.pdf

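and the utility will create a text file with the same base name (file-to-convert.txt) in the directory where file-to-convert.pdf resides.

A wildcard (*) cannot be used to convert multiple files at once, for example:

pdftotext *.pdf

because pdftotext expects a single PDF file name, optionally followed by the name of the output text file. If the pattern matches exactly two files, the second one is treated as the output text file and overwritten; with more matches, pdftotext just prints its usage message. Instead, a loop can be used for batch conversions like: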

for f in *.pdf
do
    pdftotext "$f"
done
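The loop above can be made a little more defensive. The sketch below is one possible variant, not the only way to do it: the function name convert_pdfs is made up for this example, and it assumes that skipping a directory with no PDF files and reporting failed conversions on standard error is the desired behaviour.

```shell
# Hypothetical helper: convert every PDF in a directory (default: current one).
convert_pdfs() {
    dir=${1:-.}
    for f in "$dir"/*.pdf
    do
        # With no matches the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # Report a failed conversion, then carry on with the next file.
        pdftotext "$f" || echo "failed: $f" >&2
    done
}
```

Running convert_pdfs ~/Documents would then create one .txt file next to each PDF in that directory. Quoting "$f" keeps file names with spaces intact.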

