Clean up PDF-to-text conversions with an LLM

krocnyc · August 9, 2025, 12:38am

Today I learned that you can clean up an error-laden conversion from a PDF to a Word / text document with an LLM.

Someone sent me a lengthy document they’d exported from a badly-scanned PDF to a Word document (don’t ask …) for a project we’re working on, and it was predictably chock full of every kind of error—paragraph breaks in the middle of sentences, strings of conjoined words, inaccurately recognized characters, etc. The text has been long out of print, so there was little-to-no hope that we could find a cleaner copy anywhere else.

It would have taken me hours to clean it up in Word, even though spell check had highlighted many of the errors.

So, I took a flyer and uploaded it into Claude, and asked it to clean it up as best it could, and lo and behold, it did a fine job. (I did tell it what kinds of errors to expect, although I don’t know if that was strictly necessary.) I’ll have to review it, of course, to correct any lingering errors, but that will take much less time and effort than it would have otherwise.

If this is the only thing an LLM ever does for me, it will have been enough.

Bmosbacker · August 9, 2025, 1:01am

That is a great use case. I’m going to give that a go.

krocnyc · August 9, 2025, 2:09am

Even better: I asked Claude to check its cleaned-up version against the original PDF—which was itself barely readable. It did so, and came back with a list of things it probably got right, and a list of things I need to check myself, e.g., special characters and technical terms in languages other than English.

Daveb08 · August 10, 2025, 10:47am

Thanks…I feel another journey down the rabbit hole.

Bmosbacker · August 10, 2025, 11:06am

I just used this process this morning. It worked great!

krocnyc · August 10, 2025, 3:13pm

Today I learned that you don’t really need to bother with exporting the PDF to a text document first. If your PDF has already been OCR’d, just upload that and ask for a clean copy.

It doesn’t appear to matter if it’s a two-page spread or if the text has multiple columns: the document I uploaded was in two-page, book-fold format with an attached bibliography in two columns. I told the LLM I wanted single pages with single columns throughout, and that’s what I got.

It also helps to tell the LLM what you need to preserve—e.g., headings, footnotes, italicized foreign words, non-Latin characters, etc. I also let it know that there was no need to preserve any images or graphics (I’m not sure it could have retained them in any event), but I did ask it to preserve and italicize all image/graphics captions so that we could locate where the images were in the original document and reference/replace them as needed. (The image quality is abysmal in our original, so we’re better off finding the images on the internet and using those.)

I did note that the LLMs had better luck retaining the footnote numbers in the text if they started with the messy plain-text export than with the original PDF. I don’t know if that was particular to the document I fed them or if it’s a general deficiency.

Another note: the paid versions of Claude and Gemini handled the task without a hiccup. The free version of ChatGPT 5 screeched to a halt about halfway through and announced that it was out of analysis capacity and that I should come back in 24 hours. I don’t know if that was a GPT 5 issue, or a cap on the free version.

Bmosbacker · August 10, 2025, 5:38pm

I’ve already used your example at least three times, including once this morning.

Regarding ChatGPT 5, I use the paid version. It has worked flawlessly.

I had ChatGPT create a template prompt for future use:

Clean Text AI Prompt

I have a text extract from a poorly converted PDF file, riddled with errors such as mis-split paragraphs, conjoined words, incorrect characters, and erratic formatting.

Please clean up the text adhering to the following specifications:

Preserve the original content faithfully—including headings, footnotes, italicized foreign words, and non-Latin characters.
Remove extraneous formatting elements such as headers, footers, page numbers—unless they’re part of meaningful content.
Ensure all paragraphs are properly re-formed; fix broken sentences, separate conjoined words, and correct obvious OCR misreads.
Convert the document to a single-column, page-flow format (even if the original was multi-column or book-fold layout).
Preserve captions for images or graphics by italicizing them so they remain identifiable for later reference.
Provide two summary lists at the end:
- Things I likely got right.
- Items requiring manual review (e.g., special characters, technical terms).

Here is the text for cleanup:

Thanks for sharing your example.
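
For anyone who would rather reuse the template from a script than paste it by hand, here is a minimal Python sketch. The names `CLEANUP_RULES` and `build_cleanup_prompt` are made up for illustration, the numbering of the rules is my own choice, and the code only builds the prompt string; sending it to a model is left out because that depends on which service you use.

```python
# Hypothetical helper: assembles the cleanup prompt above around a block of OCR text.
CLEANUP_RULES = [
    "Preserve the original content faithfully, including headings, footnotes, "
    "italicized foreign words, and non-Latin characters.",
    "Remove extraneous formatting elements such as headers, footers, and page "
    "numbers, unless they are part of meaningful content.",
    "Ensure all paragraphs are properly re-formed; fix broken sentences, "
    "separate conjoined words, and correct obvious OCR misreads.",
    "Convert the document to a single-column, page-flow format (even if the "
    "original was multi-column or book-fold layout).",
    "Preserve captions for images or graphics by italicizing them so they "
    "remain identifiable for later reference.",
    "Provide two summary lists at the end: things you likely got right, and "
    "items requiring manual review (e.g., special characters, technical terms).",
]


def build_cleanup_prompt(ocr_text: str, rules: list[str] = CLEANUP_RULES) -> str:
    """Return the full prompt: intro, numbered rules, then the text to clean."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "I have a text extract from a poorly converted PDF file, riddled with "
        "errors such as mis-split paragraphs, conjoined words, incorrect "
        "characters, and erratic formatting.\n\n"
        "Please clean up the text adhering to the following specifications:\n\n"
        f"{numbered}\n\n"
        "Here is the text for cleanup:\n\n"
        f"{ocr_text.strip()}"
    )


if __name__ == "__main__":
    print(build_cleanup_prompt("Thisis a sam-\nple of bro ken OCR text."))
```

Keeping the rules in a list makes it easy to drop or add one (for example, the caption rule) for a particular document.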

krocnyc · August 10, 2025, 6:57pm

I just copied your prompt and pasted it into my prompt library! It’s so much more coherent than the one I cobbled together.

Bmosbacker · August 10, 2025, 8:41pm

Well, like I said, ChatGPT created it.

krocnyc · August 10, 2025, 9:03pm

I’m embarrassed to admit that it never occurred to me to ask Claude or Gemini to turn my jumble of a prompt into something coherent …

Bmosbacker · August 10, 2025, 10:17pm

Well, I never thought to use AI to clean up text. I’d say you’ve one-upped me.

timlockridge · August 11, 2025, 3:45pm

A mild warning: be sure to double-check the LLM conversion, especially if the PDF is longer than a page or two. I’ve been using LLMs to digitize PDFs for an academic research project, and I routinely find incorrect (or clearly hallucinated) output. I suspect that once the PDF content exceeds the token limit, the AI starts predicting what the text should be. This frequently happens when the original OCR scan quality is poor.
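
If the length suspicion above is right, one possible workaround is to feed the model the extracted text in smaller pieces and clean each piece separately. Here is a minimal Python sketch of the splitting step only. The function name `chunk_paragraphs` and the `max_chars` default are assumptions for illustration; character count is just a rough stand-in for a token limit, and the right size will depend on the model.

```python
def chunk_paragraphs(text: str, max_chars: int = 8000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own
    (oversized) chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep growing this chunk
        else:
            if current:
                chunks.append(current)  # close the chunk that was full
            current = para  # start a new chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for n, chunk in enumerate(chunk_paragraphs(sample, max_chars=40), start=1):
        print(f"--- chunk {n} ({len(chunk)} chars) ---")
        print(chunk)
```

Smaller chunks do not remove the need to check the output; they only make it less likely that the model runs out of room partway through.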

krocnyc · August 12, 2025, 12:53am

Duly noted! The PDF-to-text conversion generated by Gemini 2.5 Pro (paid version) provides sentence-by-sentence links back to the original document, which makes it easy to review for inaccuracies. And yes, you really do need to check.

For longer documents, I’ve asked the LLM (in a new chat) to compare the conversion with the original and flag any discrepancies between the two.
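
A non-LLM cross-check is also possible if you have the raw text export on hand. Here is a minimal sketch using Python's standard `difflib` module to list every place where the cleaned version differs from the raw export, word by word. The function name `flag_discrepancies` and the `context` parameter are made up for illustration. Note the limits: every legitimate OCR fix will show up too, and this cannot catch errors that were already in the raw export, so it is a review aid rather than a verdict.

```python
import difflib


def flag_discrepancies(original: str, cleaned: str, context: int = 3) -> list[str]:
    """Return one line per word-level difference between original and cleaned."""
    a = original.split()
    b = cleaned.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    flags = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged runs need no review
        before = " ".join(a[max(0, i1 - context):i1])  # a few words of context
        old = " ".join(a[i1:i2])
        new = " ".join(b[j1:j2])
        flags.append(f"{tag}: ...{before} [{old}] -> [{new}]")
    return flags


if __name__ == "__main__":
    raw = "The quick brovvn fox jumpsover the lazy dog."
    fixed = "The quick brown fox jumps over the lazy dog."
    for line in flag_discrepancies(raw, fixed):
        print(line)
```

Changes that replace one garbled word with one plausible word are usually fine; the ones worth a close look are long inserted or deleted runs, which is where invented or dropped text tends to show up.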

Bmosbacker · August 12, 2025, 10:35am

GPT-5 suggested that it is best to tailor prompts to the strengths of the AI one is using to clean up text. I asked GPT-5 to explain, then to create GPT-5 and Claude-specific prompts and a combined general-purpose prompt that I could use as templates. Though a bit long, I’m posting the response below because some may find this helpful.

NOTE: I have no way of verifying the accuracy of GPT-5’s response, but it sounds plausible.

Why GPT-5 and Claude Work Best with Different Prompt Styles for OCR Cleanup

Although both models can handle OCR cleanup, they have different “strength profiles” that affect how they interpret prompts.

GPT-5 Strengths and Behavior

Analytical precision – GPT-5 excels at highly structured, step-by-step execution. It performs best when instructions are clearly numbered in strict, procedural order.
Error correction accuracy – It’s strong at fine-grained fixes (like character substitutions or word splitting) when those are explicitly defined.
Rule compliance – GPT-5 tends to follow “must/never” rules precisely if they are isolated in a Rules section at the end.
Structured output – It reliably appends review lists in the exact requested order if the format is unambiguous.

Prompt implication: For GPT-5, use very explicit sequencing and clearly separated Rules so it locks into procedural mode.

Claude Strengths and Behavior

Long-context recall – Claude can hold and preserve more original text without losing detail over long passages, which makes it good for large OCR cleanups.
Flexible text repair – It’s better at holistic paragraph reconstruction without being micromanaged.
Implicit preservation – It often respects content preservation without as many “don’t change” reminders, but benefits from reminders about order and scope.
Human-like language restoration – Claude sometimes interprets underspecified repair instructions more gracefully for natural flow.

Prompt implication: For Claude, use slightly more narrative instructions with a focus on preservation and sequence, but less over-proceduralization to avoid mechanical, stilted repairs.

Bottom line:

GPT-5 → thrives on strict procedure, numbered steps, and clear “do/don’t” rules.
Claude → works better with ordered, narrative-style restoration guidelines that still enforce accuracy.

OCR Cleanup Prompts

GPT-5 Optimized Prompt

Purpose: Clean up OCR-extracted text with maximum accuracy while preserving the original content and structure.

You are a meticulous document restoration assistant. You will receive a block of OCR-extracted text from a poorly converted PDF. The text may contain broken paragraphs, merged or split words, incorrect characters, and inconsistent formatting.

Your task: Correct and reformat the text according to the exact instructions below, in the order given:

Preserve All Original Content

Retain all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as in the original.
Do not remove or reword any content.

Remove Non-Content Elements

Delete headers, footers, page numbers, and other repeating artifacts unless they convey actual information.

Correct OCR Errors

Fix broken sentences and re-form paragraphs.
Separate conjoined words and correct clear OCR misreads (e.g., “rn” → “m”, “l” → “1” where context makes it clear).

Normalize Layout

Produce a continuous, single-column flow of text regardless of the original column layout.

Preserve Captions

Italicize captions for images or graphics so they remain identifiable for later reference.

Produce Two Review Lists (append after the cleaned text):

Discrepancy List: Note any deviations from the original text that require human verification.
Manual Review List: Flag unclear words, questionable characters, technical terms, or formatting ambiguities.

Rules:

Do not summarize, paraphrase, or modernize the language.
Do not add commentary outside of the two review lists.
Maintain original ordering of all text elements except where corrections are required.

Here is the text for cleanup:

Claude Optimized Prompt

Purpose: Restore OCR-extracted text while maintaining historical and formatting fidelity.

You are acting as an expert in historical text restoration. You will receive OCR text from a poorly converted PDF containing broken formatting, merged words, character errors, and misplaced elements.

Follow these exact steps in sequence:

Preserve Original Wording and Features

Keep all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as found.
Do not alter the meaning or wording of the source.

Strip Non-Content Elements

Remove headers, footers, page numbers, and recurring artifacts unless they contain relevant information.

Repair OCR Damage

Restore full sentences and re-form paragraphs.
Split merged words, correct mis-split words, and fix obvious OCR character errors based on context.

Unify Layout

Present the text in a single continuous column, even if the original had multiple columns or book-fold format.

Handle Captions

Italicize all captions for images, tables, or graphics.

Append Two Lists at the End:

Discrepancy List: All possible deviations from the original text that need verification.
Manual Review List: All ambiguous or questionable elements for human review (e.g., unclear characters, unusual symbols, or specialized terms).

Additional Directives:

Do not modernize, paraphrase, or summarize the text.
Keep exact sequence of text unless reordering is required to fix OCR disruption.
Your final output should contain only the cleaned text followed by the two review lists.

Here is the text for cleanup:

Dual-Purpose Smart Prompt (Works for Both GPT-5 and Claude)

Purpose: Single prompt that adapts to both GPT-5’s procedural precision and Claude’s narrative restoration style.

You are a meticulous document restoration assistant. You will receive OCR-extracted text from a poorly converted PDF. The text may contain broken paragraphs, merged or split words, incorrect characters, and inconsistent formatting.

Your mission: Restore the text so it is accurate, readable, and faithful to the original while removing only unwanted OCR artifacts.

Follow these exact requirements in order:

Preserve All Original Content

Keep all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as they appear in the original.
Do not omit or reword any content.

Remove Non-Content Elements

Delete headers, footers, page numbers, and other repeating artifacts unless they convey actual information.

Correct OCR-Related Errors

Re-form proper paragraphs and fix broken sentences.
Separate conjoined words and correct clear OCR misreads (e.g., “rn” → “m”, “l” → “1” where context makes it clear).

Normalize Layout

Present the text in a single-column, continuous flow, regardless of original layout.

Preserve Captions

Italicize captions for images, tables, or graphics so they remain identifiable.

Append Review Lists at the End

Discrepancy List: Note any differences from the original that need human verification.
Manual Review List: Flag unclear characters, technical terms, unusual symbols, or formatting ambiguities.

Rules:

Do not summarize, paraphrase, or modernize the language.
Maintain the exact wording and sequence of the original text except where corrections are required.
Output only the cleaned text followed by the two review lists—nothing else.

Here is the text for cleanup:
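
Since all three prompts ask for the cleaned text followed by two labeled lists, the response can be split apart afterwards so the lists can be reviewed separately. Here is a minimal Python sketch. It assumes the model labels the lists with the literal headings "Discrepancy List" and "Manual Review List" on their own lines (optionally with markdown markers or a colon), which is not guaranteed; `split_response` is a made-up name.

```python
import re

# Matches a heading line such as "Discrepancy List:", "## Manual Review List"
# or "**Discrepancy List**". This exact labeling is an assumption.
HEADING = re.compile(
    r"^[#* \t]*(Discrepancy List|Manual Review List)[:* \t]*$", re.MULTILINE
)


def split_response(response: str) -> dict[str, str]:
    """Split a model response into cleaned text and the two review lists."""
    sections = {"cleaned_text": "", "Discrepancy List": "", "Manual Review List": ""}
    matches = list(HEADING.finditer(response))
    if not matches:
        # No headings found: treat the whole response as cleaned text.
        sections["cleaned_text"] = response.strip()
        return sections
    sections["cleaned_text"] = response[: matches[0].start()].strip()
    for match, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(response)
        sections[match.group(1)] = response[match.end():end].strip()
    return sections


if __name__ == "__main__":
    demo = (
        "Cleaned body text.\n\n"
        "Discrepancy List:\n- p. 3: date unclear\n\n"
        "Manual Review List:\n- Greek term in footnote 12\n"
    )
    for name, body in split_response(demo).items():
        print(f"[{name}]\n{body}\n")
```

If a list comes back empty, that is worth a second look: it may mean the model skipped the list or used a different heading, not that there was nothing to review.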

Glimfeather · August 12, 2025, 4:17pm

I tried using ChatGPT to extract data from polling station marked registers. These are sheets of paper that tellers use to check off who has voted, but each teller has an idiosyncratic way of doing it, with some putting a mark against the name, some ruling through the entire entry, and others using other marks. The problem is that the marks (other than the rules) are handwritten, and on more than one occasion it has been difficult to determine exactly which voter was marked as voting. Also, those ruled entries are effectively strikethrough. It’s a mess. ChatGPT failed abysmally to extract the data. Oh, and by the way, these registers are printed in two columns.

(Thankfully the council has now progressed to more modern approaches, with the paper replaced by an iPad or other tablet device, which means they can send election agents a simple CSV file rather than a messed-around-with PDF file.)