GPT-5 suggested that it is best to tailor prompts to the strengths of the AI one is using to clean up text. I asked GPT-5 to explain, then to create GPT-5-specific and Claude-specific prompts, plus a combined general-purpose prompt, that I could use as templates. The response is a bit long, but I’m posting it below because some may find it helpful.
NOTE: I have no way of verifying the accuracy of GPT-5’s response, but it sounds plausible.
Why GPT-5 and Claude Work Best with Different Prompt Styles for OCR Cleanup
Although both models can handle OCR cleanup, they have different “strength profiles” that affect how they interpret prompts.
- GPT-5 Strengths and Behavior
  - Analytical precision – GPT-5 excels at highly structured, step-by-step execution. It performs best when instructions are clearly numbered in strict, procedural order.
  - Error correction accuracy – It’s strong at fine-grained fixes (like character substitutions or word splitting) when those are explicitly defined.
  - Rule compliance – GPT-5 tends to follow “must/never” rules precisely if they are isolated in a Rules section at the end.
  - Structured output – It reliably appends review lists in the exact requested order if the format is unambiguous.
Prompt implication: For GPT-5, use very explicit sequencing and clearly separated Rules so it locks into procedural mode.
- Claude Strengths and Behavior
  - Long-context recall – Claude can hold and preserve more original text without losing detail over long passages, which makes it good for large OCR cleanups.
  - Flexible text repair – It’s better at holistic paragraph reconstruction without being micromanaged.
  - Implicit preservation – It often respects content preservation without as many “don’t change” reminders, but benefits from reminders about order and scope.
  - Human-like language restoration – Claude sometimes interprets underspecified repair instructions more gracefully for natural flow.
Prompt implication: For Claude, use slightly more narrative instructions with a focus on preservation and sequence, but less over-proceduralization to avoid mechanical, stilted repairs.
Bottom line:
- GPT-5 → thrives on strict procedure, numbered steps, and clear “do/don’t” rules.
- Claude → works better with ordered, narrative-style restoration guidelines that still enforce accuracy.
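The distinction above could be wired into a script that picks a prompt style per model. This is only a sketch: `PROMPT_STYLES`, `pick_style`, and the style labels are hypothetical names chosen for illustration, and the mapping simply restates GPT-5's unverified claims.

```python
# Hypothetical mapping from model family to the prompt style described above.
# The labels restate the post's (unverified) claims; adjust as needed.
PROMPT_STYLES = {
    "gpt-5": "procedural",  # numbered steps, separate Rules section
    "claude": "narrative",  # ordered guidelines, fewer micro-rules
}


def pick_style(model_name: str, default: str = "dual-purpose") -> str:
    """Return the prompt style for a model name, or the fallback style."""
    name = model_name.strip().lower()
    for family, style in PROMPT_STYLES.items():
        if name.startswith(family):
            return style
    return default


print(pick_style("GPT-5"))
print(pick_style("some-other-model"))
```

Any model that matches neither family falls back to the dual-purpose style.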
OCR Cleanup Prompts
GPT-5 Optimized Prompt
Purpose: Clean up OCR-extracted text with maximum accuracy while preserving the original content and structure.
You are a meticulous document restoration assistant. You will receive a block of OCR-extracted text from a poorly converted PDF. The text may contain broken paragraphs, merged or split words, incorrect characters, and inconsistent formatting.
Your task: Correct and reformat the text according to the exact instructions below, in the order given:
1. Preserve All Original Content
   - Retain all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as in the original.
   - Do not remove or reword any content.
2. Remove Non-Content Elements
   - Delete headers, footers, page numbers, and other repeating artifacts unless they convey actual information.
3. Correct OCR Errors
   - Fix broken sentences and re-form paragraphs.
   - Separate conjoined words and correct clear OCR misreads (e.g., “rn” → “m”, “l” → “1” where context makes it clear).
4. Normalize Layout
   - Produce a continuous, single-column flow of text regardless of the original column layout.
5. Preserve Captions
   - Italicize captions for images or graphics so they remain identifiable for later reference.
6. Produce Two Review Lists (append after the cleaned text):
   - Discrepancy List: Note any deviations from the original text that require human verification.
   - Manual Review List: Flag unclear words, questionable characters, technical terms, or formatting ambiguities.
Rules:
- Do not summarize, paraphrase, or modernize the language.
- Do not add commentary outside of the two review lists.
- Maintain original ordering of all text elements except where corrections are required.
Here is the text for cleanup:
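A template like the one above ends with a slot for the OCR text. A minimal sketch of filling that slot follows, assuming the template is kept as a Python string. The function name `build_prompt` and the `<ocr_text>` delimiters are hypothetical choices for illustration, not part of GPT-5's suggested prompt.

```python
def build_prompt(template: str, ocr_text: str) -> str:
    """Append OCR text to a prompt template that ends with a cleanup slot."""
    body = ocr_text.strip()
    if not body:
        raise ValueError("ocr_text is empty")
    # The delimiters are an assumption: they help keep the instructions
    # visibly separate from the document being cleaned.
    return f"{template.rstrip()}\n\n<ocr_text>\n{body}\n</ocr_text>\n"


# Shortened stand-in for the full template text.
template = "Rules:\n- Do not summarize.\n\nHere is the text for cleanup:"
prompt = build_prompt(template, "  Tlie quick brovvn fox.  ")
print(prompt)
```

Keeping the template and the document text separate makes it easy to reuse one template across many pages.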
Claude Optimized Prompt
Purpose: Restore OCR-extracted text while maintaining historical and formatting fidelity.
You are acting as an expert in historical text restoration. You will receive OCR text from a poorly converted PDF containing broken formatting, merged words, character errors, and misplaced elements.
Follow these exact steps in sequence:
1. Preserve Original Wording and Features
   - Keep all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as found.
   - Do not alter the meaning or wording of the source.
2. Strip Non-Content Elements
   - Remove headers, footers, page numbers, and recurring artifacts unless they contain relevant information.
3. Repair OCR Damage
   - Restore full sentences and re-form paragraphs.
   - Split merged words, correct mis-split words, and fix obvious OCR character errors based on context.
4. Unify Layout
   - Present the text in a single continuous column, even if the original had multiple columns or book-fold format.
5. Handle Captions
   - Italicize all captions for images, tables, or graphics.
6. Append Two Lists at the End:
   - Discrepancy List: All possible deviations from the original text that need verification.
   - Manual Review List: All ambiguous or questionable elements for human review (e.g., unclear characters, unusual symbols, or specialized terms).
Additional Directives:
- Do not modernize, paraphrase, or summarize the text.
- Keep exact sequence of text unless reordering is required to fix OCR disruption.
- Your final output should contain only the cleaned text followed by the two review lists.
Here is the text for cleanup:
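Because the prompt asks for the cleaned text followed by two named lists, the reply can be split back into its parts. The sketch below assumes the model puts each list name on its own line, possibly with markdown or a colon around it. That layout is an assumption about the reply, not something the prompt guarantees, and `split_response` is a hypothetical helper.

```python
import re

# Matches a line holding only a list name, allowing decoration such as
# "**Discrepancy List:**" or "## Manual Review List".
HEADING = re.compile(
    r"^\W*(Discrepancy List|Manual Review List)\W*$", re.IGNORECASE
)


def split_response(response: str) -> dict[str, str]:
    """Split a reply into the cleaned text and the two review lists."""
    sections: dict[str, list[str]] = {"text": []}
    current = "text"
    for line in response.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # Start collecting lines under the newly found list name.
            current = match.group(1).lower()
            sections.setdefault(current, [])
            continue
        sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


reply = (
    "Cleaned paragraph one.\n\n"
    "Discrepancy List:\n- 'Tlie' changed to 'The'\n\n"
    "Manual Review List:\n- Unclear numeral in the second paragraph"
)
parts = split_response(reply)
print(parts["discrepancy list"])
```

If a list name shares its line with other words, this sketch leaves that line in the preceding section, so a quick manual check of the split is still worthwhile.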
Dual-Purpose Smart Prompt (Works for Both GPT-5 and Claude)
Purpose: Single prompt that adapts to both GPT-5’s procedural precision and Claude’s narrative restoration style.
You are a meticulous document restoration assistant. You will receive OCR-extracted text from a poorly converted PDF. The text may contain broken paragraphs, merged or split words, incorrect characters, and inconsistent formatting.
Your mission: Restore the text so it is accurate, readable, and faithful to the original while removing only unwanted OCR artifacts.
Follow these exact requirements in order:
1. Preserve All Original Content
   - Keep all headings, subheadings, footnotes, italicized foreign words, and non-Latin characters exactly as they appear in the original.
   - Do not omit or reword any content.
2. Remove Non-Content Elements
   - Delete headers, footers, page numbers, and other repeating artifacts unless they convey actual information.
3. Correct OCR-Related Errors
   - Re-form proper paragraphs and fix broken sentences.
   - Separate conjoined words and correct clear OCR misreads (e.g., “rn” → “m”, “l” → “1” where context makes it clear).
4. Normalize Layout
   - Present the text in a single-column, continuous flow, regardless of original layout.
5. Preserve Captions
   - Italicize captions for images, tables, or graphics so they remain identifiable.
6. Append Review Lists at the End
   - Discrepancy List: Note any differences from the original that need human verification.
   - Manual Review List: Flag unclear characters, technical terms, unusual symbols, or formatting ambiguities.
Rules:
- Do not summarize, paraphrase, or modernize the language.
- Maintain the exact wording and sequence of the original text except where corrections are required.
- Output only the cleaned text followed by the two review lists—nothing else.
Here is the text for cleanup:
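Since all three prompts forbid omitting or rewording content, it can help to check the model's output against the raw OCR text rather than trusting the review lists alone. The sketch below uses Python's standard `difflib` to list word-level differences. The helper names `words` and `changed_spans` are hypothetical, and this is one possible check, not something GPT-5 proposed.

```python
import difflib
import re


def words(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation and line breaks."""
    return re.findall(r"\w+", text.lower())


def changed_spans(original: str, cleaned: str) -> list[tuple[str, str, str]]:
    """Return (operation, original words, cleaned words) for each difference."""
    a, b = words(original), words(cleaned)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "The quick brovvn fox jumps over the lazy dog"
cleaned = "The quick brown fox leaps over the dog"
for change in changed_spans(original, cleaned):
    print(change)
```

Legitimate OCR fixes (such as "brovvn" to "brown") appear in the same report as unwanted rewording or dropped words, so a person still has to judge each entry. Removed headers and page numbers will also show up as deletions.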