Optical Character Recognition (OCR) refers to technologies that allow images of text to be converted to actual text. OCR allows for the extraction of text content from old documents, infographics, photos of signs, PDFs of scanned files, and any situation in which text is embedded within an image.
Test: Image vs. Text
Within a PDF, it is important to determine whether the text is selectable, and therefore more accessible, or comes from a scan in which the text is actually part of an image.
Pre vs. Post OCR
The images below show a newspaper article PDF made from a scan and then the output of an OCR process. When text is embedded within an image, it cannot be copied and pasted into another document.


In the examples above, the left is a scanned PDF and the right a scanned PDF with a hidden text layer. On the left, the cursor is not able to select individual words, but it can on the right.
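For a batch of PDFs, the same selectable-text test can be roughly automated. The sketch below is an illustration only: it assumes the third-party pypdf library is installed, and the function names and the `min_chars` threshold are hypothetical choices, not part of any service described on this page.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yields at least min_chars non-whitespace characters.

    min_chars is an arbitrary, hypothetical threshold that ignores stray marks.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Collect the selectable text of each page, assuming pypdf is installed."""
    from pypdf import PdfReader  # imported lazily; only needed for real files

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# A scan with no text layer produces empty strings for every page.
print(has_text_layer(["", "  \n"]))
# A PDF with a text layer produces real words.
print(has_text_layer(["LATEST NEWS FROM THE SINKING SHIP."]))
```

A result of True only means that a text layer exists. It says nothing about whether that layer is accurate.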
Check Accuracy
If a PDF appears to be a scan but has selectable text, you should verify that the text layer is accurate. To verify accuracy, copy and paste a sample of text into Word or another text file. In some cases, the output may need to be corrected.
The examples below show an initial OCR output and a corrected version for the first two sentences of the article. Although AI continues to improve OCR results, it is always wise to check the output.
Initial OCR
LATEST NEWS FROM nlE SlNICING SHIP. CAPE RACE. N. F., Sunday nlhl, April 14,-At 10:2S o•ctoclc
to-nlfht the \Vhile Star line stumshlp Tila.nic called •• C. Q. D." to tbe Marconk wireless station bcre, tnd rcpor1ed having struck an fubcrg. The steamer said that immediate usbtance was rtqulrtd.
Corrected OCR
LATEST NEWS FROM THE SINKING SHIP.
CAPE RACE, N. F., Sunday night, April 14 – At 10:25 o’clock tonight the White Star line steamship Titanic called "C.Q.D." to the Marconi wireless station here, and reported having struck an iceberg. The steamer said that immediate assistance was required.
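When a corrected sample is available, the gap between the two versions can be estimated with a short script. This is a minimal sketch using Python's standard difflib module. The `word_accuracy` function is a hypothetical helper, and word-level matching is only one of several reasonable ways to score OCR output.

```python
import difflib


def word_accuracy(ocr_text, reference_text):
    """Fraction of reference words that also appear, in order, in the OCR text."""
    ocr_words = ocr_text.split()
    ref_words = reference_text.split()
    if not ref_words:
        return 1.0
    matcher = difflib.SequenceMatcher(None, ocr_words, ref_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Part of the second sentence from the samples above.
initial_ocr = (
    "rcpor1ed having struck an fubcrg. "
    "The steamer said that immediate usbtance was rtqulrtd."
)
corrected = (
    "reported having struck an iceberg. "
    "The steamer said that immediate assistance was required."
)

print(f"Word accuracy: {word_accuracy(initial_ocr, corrected):.0%}")
```

A score well below 100% on a short sample suggests that the whole document needs proofreading, not just a spot check.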
Service Parameters for OCR
Text Transcript vs. Hidden Text
OCR tools offer a choice of outputting a text file or embedding a hidden layer of text on top of a PDF image. An example of a tool that creates hidden text is Adobe Acrobat.
We recommend the text transcript option in most cases because errors are easier to detect and correct. You can also add headings, table headers and lists as needed within the document.
Note: If you are creating a PDF with hidden text, you should copy and paste the text into a separate file to verify its accuracy.
PDF vs. Image
Note that some services work with both images and scanned PDFs and others with scanned PDFs only. If you need to use a PDF-only service, you can print or export the image as a PDF.
Penn State Service Options
The following services are licensed to Penn State staff, students and instructors.
- Sensus Access
- Equatio (equations)
- Read and Write
  Note: Provides output as speech or in a Word file.
- Adobe Acrobat
  Note: Copy and paste text into a separate file to verify accuracy.
- Anthology Ally OCR PDF (Canvas)
  Note: Copy and paste text or convert to HTML to verify accuracy.
- Additional OCR Tools (University Libraries)
Image Management Tips
OCR tools depend on good image quality to provide optimal results. To improve your success rates, we recommend the following tips:
- Crop photos and images to include just text.
- Use rotation tools to ensure that text is perfectly horizontal.
- If text color does not meet contrast guidelines, use tools like Photoshop to make text darker.
- If text is relatively small, use zoom tools to make the image larger.
- If the file is an infographic with multiple sections, consider splitting the file into multiple sections.
Additional project tips are available from the University Libraries.
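The contrast tip above can be checked numerically before editing an image. The sketch below assumes that "contrast guidelines" means the WCAG 2.x contrast ratio, with 4.5:1 as the minimum for normal-size text. That reading is an assumption, and the function names are hypothetical. Colors are given as (red, green, blue) values from 0 to 255, sampled from the text and its background.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color, from 0 (black) to 1 (white)."""
    r, g, b = (_linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(text_rgb, background_rgb):
    """Contrast ratio between two colors, from 1:1 up to 21:1."""
    lum_a = relative_luminance(text_rgb)
    lum_b = relative_luminance(background_rgb)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def meets_minimum(text_rgb, background_rgb, minimum=4.5):
    """True if the pair reaches the assumed 4.5:1 minimum for normal text."""
    return contrast_ratio(text_rgb, background_rgb) >= minimum


# Faded gray print on a yellowed page, as might appear in an old scan.
print(meets_minimum((150, 150, 150), (235, 225, 200)))
```

If the check fails, darkening the text in an image editor before running OCR is likely to help.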
AI Tools
AI (Artificial Intelligence) tools that describe images can include powerful OCR algorithms. These tools, such as the ASU Image Description (ChatGPT) tool, are highly recommended for extracting text from images, photos, graphs, infographics, maps and charts.
The AI tools do not currently support conversion of multi-page documents.
Note: Some people have also had success with Microsoft Copilot, although others have encountered shortcomings.
Equations
Two tools designed to convert equation images to MathML or LaTeX are Equatio, which includes the MathPix engine, and MathPix.
Last Update: July 2, 2024