Adventures Converting Large PDF Files Into Text
PDFs Are NOT COOL
Yesterday, I discovered that programmatically searching for text in PDF files is more complicated than you would imagine. A friend of mine who works at a non-profit came to me with a problem she thought could be automated. Her organization investigates the accounting practices of public institutions and, as a result, works with lots of old government files. This particular task involved searching a large set of PDF files created between 1995 and the present day. She needed a way to search each state's financial records, year by year, and record how many times certain search terms occurred across all of the PDFs. To do this, she was opening each file, hitting "ctrl + f" for the search term, and then manually recording each count in a spreadsheet. To me, this sounded like a microcosm of hell on earth. Writing a script for this seemed pretty straightforward. Then I started to learn about PDF files. They are tricky, to say the least.
Here's a bit more info about the problem. Each PDF is a minimum of 5 megs, there are over 450 of them, and the directory structure looked like this:
```
.
├── Alabama
│   ├── 2005CAFR.pdf
│   ├── 2006CAFR.pdf
│   ├── CAFR.Ala.2011.pdf
│   ├── cafr.2007.pdf
│   ├── cafr.2008.pdf
│   ├── cafr.2009.pdf
│   ├── cafr.2010.pdf
│   ├── cafr.2013.ala.pdf
│   └── cafr.ala.2012\ (1).pdf
├── Alaska
│   ├── 05cafr.pdf
│   ├── 06cafr.pdf
│   ├── 07cafr.pdf
│   ├── 08cafr.pdf
│   ├── 09cafr.pdf
├── Arizona
│   ├── 2008_CAFR_RFS_0.pdf
│   ├── 2010_CAFR-031511_0.pdf
│   ├── CAFR2005_3.pdf
```
My initial assumption was that there would be a library to easily search through PDFs. And there is! It's called pdfgrep, a version of grep that works with PDF files (for example, `pdfgrep -c "pension" 2005CAFR.pdf` prints a count of matching lines). PDFs are a complete nuisance to parse because their entire internal file schema is made for presentation, not for structure. You have to jump through all sorts of hoops to parse them. Plus it is hella slow. I would have loved to use pdfgrep, but it would have been too much work for my friend to set up all the dependencies it needs. Instead, I decided to extract the text from each PDF using a nifty Python library called PDFMiner, then search for the terms in the newly minted text files. This way I could compile the Python to an .exe and everything would be fine, just fine.
I thought it would be cool if you could specify the base directory for the documents and the search term(s), and get back a CSV file that could be imported into Excel.
Lemme lay down some info about the internal structure of PDFs first.
A Lil Bit Bout PDFs
I figured the PDF format was pretty old, and it is. Over 20 years old, actually. What surprised me was that PDF version 1.0, from 1993, is still compatible with all modern PDF readers. Either they designed the original spec extremely well (haha), or there is a lot of work done to maintain backwards compatibility. The data in a PDF may be preserved, but there is no method to the madness that goes on inside that file. They don't use anything internally that resembles markup; instead it's some sort of subset of PostScript. Weird, eh? I guess Adobe originally code-named the project 'Camelot'. Perhaps they were hoping to be solid as a castle made of stone.
This level of backwards compatibility, plus PDFs being an Adobe product, makes for some good times…
Back To The Problem
This is how I thought about solving this PDF dilemma. A PDF Plan of Progress you could say.
- Recursively find all of the PDFs from the base directory
- Extract the text from the PDFs
- Perform a search on each text file and return the occurrence count
- Combine all that data into an Excel-friendly CSV
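Squinting a bit, the whole plan fits in one small driver function. This is just a sketch: `run_search` and its `extract_text` argument are hypothetical names (the real extraction step, covered below, is passed in as a function), not part of the final script.

```python
import csv
import os

def run_search(basedir, searchterm, csvpath, extract_text):
    """Sketch of the four steps above. extract_text is assumed to be
    a function that takes a PDF path and returns its text, e.g. a
    PDFMiner-based extractor."""
    rows = [['State Name', 'File Name', 'Search Term', 'Term Count']]
    for dirpath, dirnames, files in os.walk(basedir):       # 1. find the PDFs
        for name in files:
            if name.lower().endswith('.pdf'):
                text = extract_text(os.path.join(dirpath, name))  # 2. extract
                count = text.count(searchterm)                    # 3. count
                rows.append([os.path.basename(dirpath), name,
                             searchterm, count])
    with open(csvpath, 'w') as csv_file:                    # 4. write the CSV
        csv.writer(csv_file).writerows(rows)
```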
I usually have no clue what I am doing when tackling a new problem, so Google is a big crutch. For recursively finding files, I started with the Python module glob. But it did not suffice (the Python 2 version can't recurse into subdirectories), so I moved to good old `os.walk`.
```python
import os
import ntpath

def recursively_find_pdfs(path):
    """
    Recursively find all PDF files starting from a path.
    Returns a dictionary keyed by each PDF's full path.
    """
    pdflist = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.lower().endswith('.pdf')]

    pdfdict = {}
    for pdf in pdflist:
        filename = ntpath.basename(pdf)
        dirname = ntpath.basename(ntpath.dirname(pdf))
        pdfdict[os.path.abspath(pdf)] = {'dirname': dirname,
                                         'filename': filename}
    return pdfdict
```
We now know where all the PDFs are, so let's extract the obfuscated text from within. By "obfuscated text" I mean any text in the PDF that is meant for humans to read. So no weird PDF jargon/styling malarkey. This takes a long time. Like a really long time.
```python
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    """Extract the human-readable text from a PDF file."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
```
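Since extraction is the slow part, it's worth writing each PDF's text to a sibling `.txt` file and skipping files that were already converted, so an interrupted run can pick up where it left off. This is a sketch: `extract_new_pdfs` is a hypothetical helper, and `extract_text` stands in for a PDFMiner-style extractor like the one above.

```python
import os

def extract_new_pdfs(pdf_paths, extract_text):
    """Write each PDF's text to a sibling .txt file, skipping PDFs
    that were already converted on a previous run.
    Returns the list of .txt paths."""
    txt_paths = []
    for pdf in pdf_paths:
        txt = os.path.splitext(pdf)[0] + '.txt'
        if not os.path.exists(txt):  # not converted yet
            with open(txt, 'w') as f:
                f.write(extract_text(pdf))
        txt_paths.append(txt)
    return txt_paths
```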
Finally, after waiting for a week or two, you can search the extracted text. For each file, count the number of times the search term is found and write that data to a CSV.
```python
def count_string_occurance(string, filepath):
    """Count how many times a string occurs in a text file."""
    f = open(filepath)
    contents = f.read()
    f.close()
    return contents.count(string)
```
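One detail worth noting: `str.count` is case-sensitive, while the "ctrl + f" search my friend was doing is typically case-insensitive, so the two counts can disagree. A lowercasing variant closes that gap (`count_string_ci` is a hypothetical helper, not part of the script above):

```python
def count_string_ci(string, filepath):
    """Count occurrences of string in a text file, ignoring case."""
    with open(filepath) as f:
        contents = f.read()
    return contents.lower().count(string.lower())
```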
```python
import csv

def csv_writer(pdfdict, csvpath, searchterm):
    """Write the search results to a CSV file."""
    with open(csvpath, "wb") as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerow(['State Name', 'File Name', 'Search Term', 'Term Count'])
        for fullpath, splitpath in pdfdict.iteritems():
            writer.writerow([splitpath['dirname'], splitpath['filename'],
                             searchterm, splitpath['count']])
```
That about does it. This seems like something that happens a lot at accounting firms doing audits, so I am going to work on making this a bit more robust.