How to Convert a Scanned PDF to Editable Text (OCR Guide 2026)
Turn images of text into actual text you can copy, edit, and search
By Ben Praveen J · March 24, 2026
You have a scanned PDF — maybe a contract your landlord emailed you, old tax forms, a research paper from a library archive, or meeting notes someone photographed on a whiteboard. You need to copy a paragraph, search for a name, or edit the content. But when you try to select text, nothing happens. Your cursor just drags a blue rectangle across the page like you are selecting an image.
That is because a scanned PDF is an image. It looks like a document, but to your computer it is just a picture of text — no different from a photograph of a street sign. To make that text selectable and editable, you need OCR.
What Makes Scanned PDFs Different
There are two fundamentally different types of PDFs, and the distinction matters:
- Native (digital) PDFs are created directly from software: exported from Word, generated by a web application, or produced by a design tool. They contain actual text data: characters, fonts, positions. You can select text, copy it, search it, and screen readers can read it aloud. The text is stored as character data, not pixels.
- Scanned (image-based) PDFs are created by scanning a physical document or photographing it. Each page is stored as a raster image — a grid of pixels. The PDF is essentially a container holding photographs. Even though you can see words on the page, the computer sees only colored dots arranged in patterns that happen to look like letters to human eyes.
This distinction explains why you cannot select text in a scanned PDF. There is no text to select — only an image of text. To bridge this gap, you need Optical Character Recognition.
How OCR Works
OCR (Optical Character Recognition) is the technology that converts images of text into machine-readable text. Here is what happens when an OCR engine processes your scanned PDF, step by step:
Step 1: Image Preprocessing
The OCR engine first cleans up the image to improve recognition accuracy. This includes converting to grayscale, adjusting contrast and brightness, straightening skewed pages (deskewing), removing speckles and noise, and sharpening blurry edges. A clean, high-contrast image with straight text lines is far easier to process than a dark, tilted scan with coffee stains.
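Two of these steps, grayscale conversion and contrast adjustment, can be sketched in plain Python on a tiny pixel grid. This is only an illustration: real engines work on full-size images with optimized libraries, and the function names `to_grayscale` and `stretch_contrast` are made up for this example.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) pixels to 0-255 gray values (BT.601 luma weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def stretch_contrast(gray_rows):
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    flat = [v for row in gray_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in gray_rows]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray_rows]


# A washed-out 2x3 "scan": dark-gray ink on a light-gray page.
scan = [
    [(200, 200, 200), (90, 90, 90), (200, 200, 200)],
    [(200, 200, 200), (90, 90, 90), (200, 200, 200)],
]
gray = to_grayscale(scan)
clean = stretch_contrast(gray)
print(gray[0])   # [200, 90, 200]
print(clean[0])  # [255, 0, 255]
```

After stretching, the ink is pure black and the page pure white, which is the kind of high-contrast input the later steps prefer.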
Step 2: Layout Analysis
The engine identifies the structure of the page: where are the text blocks, columns, headers, footers, tables, and images? This step determines the reading order and groups characters into words, words into lines, and lines into paragraphs. For simple single-column documents, this is straightforward. For multi-column layouts, forms, or pages with mixed content, layout analysis is one of the most challenging parts of OCR.
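As a rough sketch of the "words into lines" part of layout analysis, the snippet below groups word boxes by vertical position. The `(text, x, y)` tuple format, the `group_into_lines` name, and the 5-pixel tolerance are assumptions for illustration; real engines use richer geometry and handle columns, tables, and skew.

```python
def group_into_lines(words, tolerance=5):
    """Group (text, x, y) word boxes into text lines.

    A word joins the current line when its y position is within
    `tolerance` pixels of the first word placed on that line.
    """
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1]["y"]) <= tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    # Within each line, order words left to right by x position.
    return [" ".join(t for _, t in sorted(line["words"])) for line in lines]


# Word boxes in arbitrary order, with slightly uneven baselines.
words = [
    ("quick", 60, 101), ("The", 10, 100), ("fox", 120, 99),
    ("jumps", 10, 130), ("high", 70, 131),
]
print(group_into_lines(words))  # ['The quick fox', 'jumps high']
```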
Step 3: Character Recognition
This is the core of OCR. The engine examines each character-shaped region and determines which letter, number, or symbol it represents. OCR engines generally draw on two approaches, often in combination:
- Pattern matching: Compares the shape of each character against a library of known character templates. Fast but inflexible — struggles with unusual fonts or damaged characters.
- Feature extraction: Analyzes the geometric features of each character (lines, curves, intersections, angles) and classifies them using machine learning models. More robust and accurate, especially for diverse fonts and imperfect scans.
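To make pattern matching concrete, here is a toy version that compares a 3x3 glyph against three hand-written templates and picks the closest one. The templates, the `match_score` and `recognize` names, and the pixel-agreement score are all invented for this sketch; production engines use far larger glyphs and learned models.

```python
# Toy 3x3 templates: "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Fraction of pixels that agree between a glyph and a template."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return (label, score) for the best-matching template."""
    scored = ((label, match_score(glyph, tpl)) for label, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


noisy_t = ["111", "010", "011"]  # a "T" with one stray ink pixel
label, score = recognize(noisy_t)
print(label, round(score, 2))  # T 0.89
```

One stray pixel still leaves "T" as the best match here, but a badly damaged glyph or an unfamiliar font would quickly pull the scores together, which is the inflexibility described above.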
The engine also uses language context — dictionary lookups and statistical language models — to resolve ambiguous characters. If the pattern recognizer is unsure whether a character is "rn" or "m," the language model checks whether the resulting word exists in the dictionary and picks the more likely option.
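The "rn" versus "m" case can be sketched with a simple dictionary lookup. The confusion table, the four-word dictionary, and the function names below are hypothetical; real engines score candidates with statistical language models rather than accepting the first dictionary hit.

```python
import re
from itertools import product

# Hypothetical confusion table and toy dictionary, for illustration only.
CONFUSIONS = {"rn": ["rn", "m"], "m": ["m", "rn"]}
DICTIONARY = {"morning", "modern", "corner", "summer"}


def candidates(word):
    """Generate every spelling obtained by swapping 'rn' and 'm'."""
    parts = re.findall(r"rn|m|.", word)
    options = [CONFUSIONS.get(part, [part]) for part in parts]
    return ["".join(combo) for combo in product(*options)]


def correct(word):
    """Return the first candidate found in the dictionary, else the word unchanged."""
    if word in DICTIONARY:
        return word
    for candidate in candidates(word):
        if candidate in DICTIONARY:
            return candidate
    return word


print(correct("moming"))   # morning
print(correct("surnmer"))  # summer
```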
Step 4: Text Reconstruction
Finally, the recognized characters are assembled into structured text output: words separated by spaces, lines separated by line breaks, paragraphs grouped logically. The result is plain text (or formatted text) that you can copy, paste, search, and edit.
Factors That Affect OCR Accuracy
OCR accuracy varies dramatically depending on the quality of your source material. Here are the key factors:
| Factor | Ideal | Problematic |
| Scan resolution | 300 DPI or higher | Below 200 DPI |
| Contrast | Black text on white background | Light gray text, colored backgrounds |
| Font type | Standard fonts (Arial, Times, Calibri) | Decorative, script, or artistic fonts |
| Text size | 10pt or larger | Below 8pt (fine print, footnotes) |
| Page condition | Clean, flat, no marks | Creased, stained, highlighted |
| Orientation | Straight, properly aligned | Skewed, rotated, or curved |
| Language | English, common European languages | Mixed languages, non-Latin scripts |
Resolution is the single most important factor. At 300 DPI, a standard 12pt letter is represented by enough pixels for reliable recognition. At 150 DPI, the same letter has only a quarter of the pixel data, and character edges become ambiguous. If you are scanning a document specifically for OCR, always scan at 300 DPI or higher.
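The "quarter of the pixel data" figure follows from simple arithmetic, sketched below. It assumes the standard 72 points per inch and treats the 12pt em box as a square, which overstates the size of an actual letter but keeps the ratio the same; `glyph_pixels` is a name chosen for this example.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_pixels(point_size, dpi):
    """Approximate pixel height of a font's em box at a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


for dpi in (300, 150):
    height = glyph_pixels(12, dpi)
    print(f"{dpi} DPI: {height:.0f} px tall, {height * height:.0f} px in a square em box")

# Halving the resolution halves both width and height, so the area drops to 1/4.
ratio = glyph_pixels(12, 150) ** 2 / glyph_pixels(12, 300) ** 2
print(ratio)  # 0.25
```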
Step-by-Step: Convert Your Scanned PDF to Text
Here is how to extract text from a scanned PDF using GoToolsOnline:
- Open the PDF to Text converter.
- Upload your scanned PDF by dragging it onto the page or clicking to browse.
- The tool processes each page, running OCR on image-based pages and extracting native text from digital pages.
- Review the extracted text in the output panel. You can copy it to your clipboard or download it as a text file.
- Paste the text into your preferred editor (Word, Google Docs, Notion, etc.) and make any necessary corrections.
Common OCR Mistakes and How to Fix Them
Even high-quality OCR produces errors. Knowing what to watch for helps you proofread efficiently.
Character Confusion
Certain characters look nearly identical in many fonts, and OCR engines frequently mix them up:
- l (lowercase L) / 1 (one) / I (uppercase i) — The most common OCR error. "Illinois" might become "I11inois" or "Il1inois." Check any word containing these characters.
- O (uppercase o) / 0 (zero) — "2024" might become "2O24." Pay special attention to numbers in dates, amounts, and reference codes.
- rn / m — At low resolution, "rn" can look like "m." "morning" might become "moming" or vice versa.
- c / e and a / o — In degraded scans, rounded characters blur together.
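When proofreading long output, a small script can point you at likely l/1/I and O/0 mix-ups. The heuristic below, which flags any token that mixes letters and digits and contains one of those look-alike characters, is an assumption made for this sketch and not how any particular OCR tool works. It will also flag legitimate codes, so treat the result as a review list rather than a list of confirmed errors.

```python
import re

# Characters from the list above that are commonly swapped for one another.
LOOKALIKES = set("lI1O0")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a look-alike character."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_letter and has_digit and LOOKALIKES & set(token):
            flagged.append(token)
    return flagged


sample = "Filed in I11inois on March 3, 2O24 under case A17."
print(suspicious_tokens(sample))  # ['I11inois', '2O24', 'A17']
```

Note that "A17" is probably a genuine reference code; the script cannot tell, which is why a human still makes the final call.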
Table and Column Alignment
OCR engines read text in the order they detect it, which can scramble tables and multi-column layouts. A two-column page might be read straight across rather than column by column, interleaving text from both columns into nonsense. Similarly, table data may lose its row-column structure, producing a stream of values without clear association.
Fix: For tables, consider extracting each column separately by cropping the PDF first. For multi-column layouts, check that paragraphs are coherent and have not been mixed together.
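The interleaving problem is easy to see in a small sketch. Below, four text blocks from a two-column page are ordered first naively (top to bottom, then left to right) and then column by column. The `(text, x, y)` block format and the assumption that the page splits into exactly two columns at its horizontal midpoint are simplifications for illustration.

```python
def column_reading_order(blocks, page_width):
    """Order (text, x, y) blocks column by column on a two-column page.

    Assumes exactly two columns, divided at the page's horizontal midpoint.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b[1] < mid), key=lambda b: b[2])
    right = sorted((b for b in blocks if b[1] >= mid), key=lambda b: b[2])
    return [b[0] for b in left + right]


blocks = [
    ("L1", 50, 100), ("R1", 350, 100),
    ("L2", 50, 200), ("R2", 350, 200),
]

# Naive order reads straight across the page and interleaves the columns.
naive = [b[0] for b in sorted(blocks, key=lambda b: (b[2], b[1]))]
print(naive)                                         # ['L1', 'R1', 'L2', 'R2']
print(column_reading_order(blocks, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

Cropping each column into its own page before OCR has the same effect as the second ordering: the engine never sees both columns at once.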
Special Characters and Formatting
OCR typically loses formatting: bold, italic, underline, font size, and color are not preserved in plain text output. Special characters like em dashes, curly quotes, and accented letters may be replaced with incorrect substitutes. Bullet points may become random characters. Review any special characters carefully.
Headers, Footers, and Page Numbers
Repeated headers and footers from each page will appear inline in the extracted text. Page numbers may be inserted in the middle of paragraphs. You will need to manually remove these repetitions from the output.
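If the document is long, part of that cleanup can be scripted. The sketch below assumes you have the extracted text as one list of lines per page; it drops any line that recurs on most pages and any line that is only digits. The `strip_repeated_lines` name and the 60% threshold are arbitrary choices for this example, and the digit rule will also remove a legitimate line that happens to contain nothing but a number.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on at least `min_share` of pages, plus bare page numbers."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page})
    threshold = min_share * len(pages)

    cleaned = []
    for page in pages:
        for line in page:
            stripped = line.strip()
            if not stripped or stripped.isdigit() or counts[stripped] >= threshold:
                continue
            cleaned.append(stripped)
    return cleaned


pages = [
    ["ACME Lease Agreement", "The tenant agrees to pay", "1"],
    ["ACME Lease Agreement", "rent on the first of", "2"],
    ["ACME Lease Agreement", "each month.", "3"],
]
print(strip_repeated_lines(pages))
# ['The tenant agrees to pay', 'rent on the first of', 'each month.']
```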
When OCR Will Not Work Well
OCR has real limitations. In some cases, it is not the right tool for the job:
- Handwritten documents. OCR engines are designed for printed text. Handwriting — especially cursive — is fundamentally different: inconsistent letter shapes, connected characters, variable spacing. Even the best OCR achieves only 60-80% accuracy on neat handwriting, and drops below 50% on cursive or messy writing. For handwritten notes, manual transcription is usually faster and more accurate than correcting bad OCR output.
- Very poor scans. If the scan is dark, blurry, heavily creased, or low resolution (below 150 DPI), OCR will produce too many errors to be useful. Rescanning the document at higher quality is the best fix. If you do not have access to the original, try adjusting the image brightness and contrast before running OCR.
- Artistic or decorative fonts. Fonts designed for visual impact — calligraphy, graffiti, retro, or handwritten-style typefaces — use character shapes that deviate significantly from standard letterforms. OCR engines may fail to recognize them entirely. If possible, locate a digital version of the document or retype the content manually.
- Photographs of documents at angles. A photograph taken at an angle introduces perspective distortion — lines are no longer parallel, characters are stretched unevenly. OCR accuracy drops sharply. If you must photograph a document, shoot directly from above with the page flat and evenly lit.
- Mixed content: diagrams, equations, sheet music. OCR is designed for text. Mathematical equations, chemical formulas, musical notation, and flowcharts are not text — they are specialized notations that require domain-specific recognition engines.
Alternatives When OCR Falls Short
If OCR produces too many errors to be useful, consider these alternatives:
- Retype manually. For short documents (under 2-3 pages), retyping is often faster than running OCR and then correcting dozens of errors. Open the scanned PDF on one side of your screen and type in a text editor on the other.
- Use dictation software. Read the scanned document aloud and let speech-to-text software transcribe it. Modern dictation (built into macOS, Windows, Google Docs, and mobile keyboards) is surprisingly accurate for clear speech. This is especially effective for handwritten documents that defeat OCR.
- Request a digital copy. If the document originated from a computer (as most documents do today), ask the sender for the original digital file rather than the scanned version. The original Word document, Google Doc, or native PDF will contain perfect, selectable text.
Tips for Best OCR Results
- Scan at 300 DPI minimum. This is the single most impactful setting. If your scanner offers 400 or 600 DPI, use it for documents with small text.
- Use grayscale, not color. Color scans produce larger files but rarely improve OCR accuracy for ordinary dark-on-light text. Grayscale provides sufficient contrast information while keeping file sizes manageable.
- Keep the glass clean. Dust, fingerprints, and smudges on the scanner glass create artifacts that confuse OCR engines. Wipe the glass before scanning.
- Flatten the document. Creases and folds create shadows and distortion. Flatten the paper as much as possible before scanning.
- Check the scan before running OCR. Preview the scan and verify it is straight, evenly lit, and high contrast. Rescanning takes less time than correcting hundreds of OCR errors.
- Always proofread. No OCR engine is 100% accurate. Budget time to review the output, especially for names, numbers, and dates — the data points where errors matter most.
FAQ
- Can OCR convert handwritten text in a scanned PDF?
- OCR struggles with handwriting. Modern OCR engines can handle neat, printed-style handwriting with moderate accuracy (60-80%), but cursive, messy, or stylized handwriting produces unreliable results. For handwritten documents, manual transcription or dictation software is usually more practical.
- What scan resolution do I need for accurate OCR?
- 300 DPI is the recommended minimum for reliable OCR. At 200 DPI, accuracy drops noticeably for small text. At 150 DPI or lower, OCR will produce frequent errors. If you are scanning a document specifically for OCR, always use 300 DPI or higher.
- Why does my scanned PDF already have selectable text?
- Some scanners and scanning apps automatically run OCR during the scanning process, embedding a hidden text layer behind the page image. This is called a "searchable PDF." If the text is already selectable, you can copy it directly — no additional OCR step is needed.
- Is OCR 100% accurate?
- No. Even under ideal conditions (clean scan, 300+ DPI, standard fonts, good contrast), OCR typically achieves 95-99% character accuracy. That sounds high, but a 1,000-word document contains roughly 5,000 characters, so 95% accuracy means about 250 incorrect characters. Always proofread OCR output before using it in important documents.
- Can I convert a scanned PDF to an editable Word document?
- Yes. First extract the text using OCR with a PDF to text tool, then format it in Word. Alternatively, use a PDF to DOCX converter that includes OCR capability. The result will need formatting adjustments since OCR cannot perfectly recreate the original document layout.
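The accuracy figures in the FAQ above are easier to judge as error counts. This quick calculation assumes an average of five characters per word, which is a rough rule of thumb for English rather than a measured value; `expected_errors` is a name chosen for this example.

```python
def expected_errors(words, accuracy, chars_per_word=5):
    """Estimate the number of wrong characters; `chars_per_word` is an assumed average."""
    total_chars = words * chars_per_word
    return round(total_chars * (1 - accuracy))


for accuracy in (0.95, 0.99):
    errors = expected_errors(1000, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} wrong characters per 1,000 words")
# 95% accuracy: about 250 wrong characters per 1,000 words
# 99% accuracy: about 50 wrong characters per 1,000 words
```

Even at the top of the typical range, a 1,000-word document can contain dozens of wrong characters, so proofreading remains worthwhile.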