PDF automation - splitting two page scans

tonycapp · August 30, 2018, 10:54pm

Years ago, I scanned a bunch of magazines for the sake of preservation. At that time, I wasn’t thinking about OCRing the text. Unfortunately, I scanned two pages at a time, which I believe will wreak havoc during OCR.

I’m looking for suggestions on whether it’s possible to automate (partially or entirely) splitting a two-page scan into two individual pages.

Below is an example of a two-page scan.
[Image: two_pages]

dfay · August 31, 2018, 12:02am

I use an Alfred workflow for this but the post below also has a link to the underlying script:

tonycapp · August 31, 2018, 12:06am

Nice, will try this. I wrote my own PDF joiner/splitter, using PyPDF or a similar package, to separate standard (single-page) PDF pages into individual files. I didn’t know I could also split double scans, though. I’m going to be busy the next few days trying this out.

Thank you!

dfay · August 31, 2018, 12:06am

And having tried it, you definitely want to split before you OCR.

tonycapp · August 31, 2018, 12:08am

Yes, of course. This is why I posted my question.

I just wish PDFpenPro was easier to automate/script. Unfortunately, you can’t call it from the command line to do the OCR. I couldn’t get David/Katie’s PDFpen automation script working, but I will tackle that issue again once I get the double pages split and back into a PDF.

dfay · August 31, 2018, 1:15am

I use OCRKit which has quite good scripting support.

http://ocrkit.com/help/

jahala · August 31, 2018, 8:05pm

Thanks for this link. I have been looking for something like OCRKit for a while. It does just what I want.

Cowpi · September 10, 2018, 10:15pm

If you have PDFpenPro, you can use the selection tool and then either crop the page or the whole document.

To do it right, you need to make two copies of the original, one for left/odd pages and another for right/even pages.

Keep the left/odds by cropping out the right/even pages.
Repeat for right/evens by cropping out the left/odd pages.
For lots of pages, I use PDFGenius (Mac App Store) to merge and interweave the left/odds with right/evens into the proper order.
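
For anyone who would rather script it, the same two-copy, crop-and-interleave approach can be sketched in Python. This is only a rough sketch under several assumptions: it relies on the third-party pypdf package, it assumes every page is an unrotated two-page spread with the gutter exactly at the horizontal midpoint, and the function names (half_boxes, interleave, split_spreads) are made up for illustration. Nothing here is the script linked earlier in the thread.

```python
def half_boxes(left, bottom, right, top):
    """Split a page box into (left_half, right_half), each as (l, b, r, t)."""
    mid = (left + right) / 2  # assumed gutter position: the exact midpoint
    return (left, bottom, mid, top), (mid, bottom, right, top)


def interleave(lefts, rights):
    """Merge two equal-length sequences as L1, R1, L2, R2, ..."""
    merged = []
    for left_item, right_item in zip(lefts, rights):
        merged.extend([left_item, right_item])
    return merged


def split_spreads(src_path, dst_path):
    """Write dst_path with each spread in src_path split into two pages."""
    # Third-party dependency (assumption): pip install pypdf
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    # Two independent readers stand in for the "two copies of the original":
    # one copy is cropped to the left halves, the other to the right halves.
    left_reader = PdfReader(src_path)
    right_reader = PdfReader(src_path)
    lefts = list(left_reader.pages)
    rights = list(right_reader.pages)

    for left_page, right_page in zip(lefts, rights):
        box = left_page.mediabox  # read the full spread size before cropping
        left_box, right_box = half_boxes(
            float(box.left), float(box.bottom), float(box.right), float(box.top)
        )
        for page, half in ((left_page, left_box), (right_page, right_box)):
            page.mediabox = RectangleObject(half)
            page.cropbox = RectangleObject(half)

    writer = PdfWriter()
    for page in interleave(lefts, rights):
        writer.add_page(page)
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Calling split_spreads("magazine.pdf", "magazine_split.pdf") (both file names are placeholders) should produce a PDF with twice as many pages, ready for OCR. Cropping this way only narrows the visible box of each page; the full scan image is still stored behind both halves, so the file size will not shrink. If the gutter is off-center in your scans, the midpoint in half_boxes is the one value to adjust.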