Convert PDF to Excel

Sven · August 25, 2018, 3:31pm

Dear All,

Recently I have become involved, as a hobby, with a restaurant at Lake Chiem in the south of Germany.

One of the things I am looking at is everything around finance (and social media, and advertising, and … ).
In the finance arena I am developing a template for costing dishes. We regularly receive PDF documents from local suppliers of fish, meat, vegetables, etc. I have spoken to each of them, but they are unwilling to provide these tables in an editable format, as they want to make sure that the tables are not modified and then referred to.

Therefore I am looking for a solution to take a PDF (formatted as a table with borders, etc.) and convert it into an Excel-readable file.
I have found several online solutions, but I cannot use them (I have tested some, and they do not really work). I have also tested e.g. PdfElement 6: a 3-page table resulted in 15,000+ lines of chaotic data.

I would appreciate any idea, workflow, automation, program, … whatever makes my life easier.

Thanks a lot!

Cheers

Sven

RosemaryOrchard · August 25, 2018, 4:06pm

Usually with these things I find they’ve done something weird to the PDF and there’s a lot of extra gobbledygook in there. For these I convert the PDF to an image and back to PDF and then OCR it. I use PDFPen Pro, but any OCR software should do. Then you will likely be able to extract the data more easily.

Sven · August 25, 2018, 4:49pm

Thanks!
Yes, there are a lot of weird things happening … really weird

I like your approach, but why wouldn’t you OCR the PDF from the beginning, rather than first converting it to an image?

DrJJWMac · August 25, 2018, 9:19pm

Things will be easier for you once you get a standard PDF template from your vendors. The PDF -> (something else) conversion process can then be standardized to remove the “weird things”.

So, can you invert the problem? How open are your vendors to working with you toward a solution where you give them a template (e.g. a standardized PDF form) for them to complete and return? Does this approach work across enough of your vendors? Do you have a way to incentivize your vendors to make such a change?

–
JJW

Ben_Lincoln · August 25, 2018, 10:05pm

I would suggest taking it a step further than @RosemaryOrchard: once you have the OCR’ed document, use something like Keyboard Maestro with some regular expressions to extract the data and insert it into whatever you need.
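As a rough illustration of the regular-expression step, here is a minimal Python sketch. It is not Keyboard Maestro, and the line layout it expects (item name, quantity with unit, price with a decimal comma) is purely an assumption; the names `LINE_PATTERN`, `parse_line`, and `parse_text` are hypothetical, and the pattern would need adapting to each supplier's actual list.

```python
import re

# Assumed (hypothetical) line layout: "<item>  <qty> <unit>  <price> €"
# e.g. "Trout fillet  2,5 kg  18,90 €"
LINE_PATTERN = re.compile(
    r"^(?P<item>.+?)\s+"
    r"(?P<qty>\d+(?:[.,]\d+)?)\s*(?P<unit>kg|g|l|pcs)\s+"
    r"(?P<price>\d+[.,]\d{2})\s*(?:€|EUR)?$"
)


def parse_line(line):
    """Return a dict for a line matching the assumed layout, else None."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "item": match.group("item"),
        # Convert decimal commas to dots before parsing the numbers.
        "qty": float(match.group("qty").replace(",", ".")),
        "unit": match.group("unit"),
        "price": float(match.group("price").replace(",", ".")),
    }


def parse_text(text):
    """Parse every line of OCR'ed text, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows
```

Lines that do not fit the pattern (headers, footers, OCR noise) are skipped, so it is worth comparing the number of parsed rows against the original table.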

Also, it might help your argument for getting them on board with a better system if you shatter their idea that a PDF is an uneditable source-of-truth document.

Whenever you do system integration, no matter the scale, it’s important to get everyone on board and understanding. These guys may sell fish or whatever, but they have to provide IT services, even if it is just this PDF. They need to put on their IT hats and be professional about delivering IT services.

ChrisUpchurch · August 26, 2018, 2:24am

I believe Rose is talking about a situation where a PDF already has a text layer (probably one generated along with the PDF, rather than via OCR) with bad data in it. The point of converting to an image and back is to get rid of that text layer so the OCR software can start fresh.

RosemaryOrchard · August 26, 2018, 11:30am

This is exactly what I meant, it’s a quick way of dropping everything it has and starting over.
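For anyone who prefers a scripted version of "drop the text layer and OCR from scratch", one possible sketch uses the open-source command-line tool OCRmyPDF, which is not mentioned in this thread and is swapped in here in place of PDFPen Pro. Its `--force-ocr` option rasterizes each page and runs OCR again. The function names, the file names, and the default language code `deu` (German) are assumptions for illustration.

```python
import shutil
import subprocess


def build_reocr_command(src, dst, language="deu"):
    """Build the OCRmyPDF command line (hypothetical helper).

    --force-ocr rasterizes every page and OCRs it again, which discards
    whatever text layer the PDF already had.
    """
    return ["ocrmypdf", "--force-ocr", "-l", language, src, dst]


def reocr(src, dst, language="deu"):
    """Run OCRmyPDF if it is installed; raise a clear error otherwise."""
    cmd = build_reocr_command(src, dst, language)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(cmd, check=True)
```

Calling `reocr("supplier.pdf", "supplier_ocr.pdf")` would write a new PDF with a fresh text layer, assuming OCRmyPDF and the matching Tesseract language data are installed.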

Sven · August 26, 2018, 4:13pm

Thanks for the help! I have downloaded the trial version of PDFPen Pro, but unfortunately I cannot test the Excel export quality due to the trial’s limitations. Considering the price of the software (139,99€), I want to make sure that it does its job. Do you have any experience with the quality of the Excel export?

Thanks!

oldblueday · August 26, 2018, 4:20pm

I’ve recently had to do the same thing. Occasionally, text that appears like it’s on one line (and reads that way) ends up being 2 columns. Just weird stuff that made me frustratedly ask, WHY!!?!

The PDF → image → OCR can get rid of some of this stuff. Unfortunately, I’ve never found any of the OCR programs to do a decent job. So I ran a spell check to find some errors.

Oddly, Google Chrome’s PDF reader had the least weirdness with it. I could ⌘-A the text in there right into BBEdit with pretty good success.

Then lots of regex to add tabs into the text so that I can cut and paste into Excel. Learned RegEx just for this.
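A minimal sketch of that kind of cleanup in Python rather than BBEdit, assuming the copied text separates columns with runs of two or more spaces (an assumption that will not hold for every PDF). The function name `columns_to_tabs` is hypothetical.

```python
import re


def columns_to_tabs(text, min_spaces=2):
    """Replace runs of at least `min_spaces` spaces with a single tab.

    Single spaces inside a cell (e.g. "Pike perch") are left alone, so
    the result can be pasted into Excel as tab-separated columns.
    """
    gap = re.compile(r" {%d,}" % min_spaces)
    converted = []
    for line in text.splitlines():
        converted.append(gap.sub("\t", line.strip()))
    return "\n".join(converted)
```

If a supplier's layout only has single spaces between columns, this heuristic will not find the column boundaries and a more specific pattern is needed.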

Or hire some cheap labor to do it for you (college student, virtual assistant, etc) the manual way.

Good luck.

(Also check out smallpdf.com - it may work for you, they have a PDF → Excel export).

Quahog · August 26, 2018, 6:39pm

PDFPen Pro can export directly to CSV format. I’ve found it reliable.

Just select the table you want data from, create a new document (File > New From Selection) and a new file with only the data of interest is created. (This eliminates any extraneous stuff you don’t want; if you want everything in the PDF to be sent to Excel, you can skip this step.) Then, just export to Excel (File > Export…)

Then you can toss the file you created from selection.

It would be nice if PDFPen could export from the selection, so that the intermediate file need not be created, but it takes only a second to do.
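If the exported CSV still contains stray whitespace or empty rows, a small post-processing step can tidy it before it goes into Excel. This is only a sketch using Python's standard `csv` module; the function name `clean_csv` is hypothetical, and it assumes the export is comma-delimited, which has not been verified against PDFPen Pro's actual output.

```python
import csv
import io


def clean_csv(raw_text):
    """Strip whitespace from every cell and drop rows that are entirely empty."""
    reader = csv.reader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned
```

The returned list of rows can be written back out with `csv.writer` or inspected directly.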

JimH · September 2, 2018, 10:01pm

Hi Sven,
I just signed up for this forum today and this is my first post. I hope I can help you out with your problem. It sounds similar to one I had.
I needed to convert a one-page picture of a table that I received in PDF format into an Excel sheet. I have PDFPen Pro, but that did not do it for me. I ended up buying “PDF Converter with OCR” from the Mac App Store. The developer is Enolsoft. Here is their website.

https://www.enolsoft.com/pdf-converter-with-ocr-for-mac.html

The software is currently $24.00. Not cheap, but it saved me a ton of time. The software did a very good job of getting the data into the proper cells of the Excel file. You will still have to use the Excel spell checker, but I found the program’s OCR did a good job with poor images and small text.

This might be just the tool you need.
Jim

Sven · September 3, 2018, 5:37am

Thank you - will try it asap and let you know of the outcome!

lemonzestlife · April 17, 2020, 4:40am

I am trying to export a group of PDFs from Vision Appraisal to Excel. Most people in the US would know what these forms look like. They have a picture of the property and a diagram of the plot, which, of course, I don’t want in the Excel file. I’ve already OCR’d the documents using PDFPen Pro, but when I try to export them, I get only two fields of data. Does anyone know how to do this? It’s so tedious to do it manually.

bowline · April 17, 2020, 12:39pm

If there’s a lot of work, it is not confidential, and scanning (and cleaning up the OCR) is too time-consuming, I’d throw $30 at someone somewhere in the world to do it for me via fiverr.com. Search under “data entry”.