Adventures Converting Large PDF Files Into Text


Yesterday, I discovered that programmatically searching for text in PDF files is more complicated than you would imagine. A friend of mine who works at a non-profit came to me with a problem she thought could be automated. Her organization investigates the accounting practices of public institutions and works with lots of old government files as a result. This particular task involved searching a large number of PDF files created between 1995 and the present day. She needed a way to search through each state's financial records, year by year, and record the number of times certain search terms occurred in all of the PDFs. To do this, she was opening each file, “ctrl + f”-ing for the search term, and then manually recording each count in a spreadsheet. To me, this sounded like a microcosm of hell on earth. Writing a script for this seemed pretty straightforward. Then I started to learn about PDF files. They are tricky, to say the least.