
Optical Character Recognition (OCR) refers to technologies that allow images of text to be converted to actual text. OCR can extract text content from old documents, infographics, photos of signs, PDFs of scanned files and any other situation in which text is embedded within an image.

Test: Image vs. Text

Within a PDF, it is important to determine whether the text is selectable, and therefore more accessible, or comes from a scan in which the text is actually part of an image.

Pre vs. Post OCR

The images below show a newspaper article PDF made from a scan and then the output of an OCR process. When text is embedded within an image, it cannot be copied and pasted into another document.

1912 newspaper article about the search for the Titanic. A blue rectangle covers half the page.  
The same article, but each word is individually highlighted.

In the examples above, the left image is a scanned PDF and the right image is a scanned PDF with a hidden text layer. On the left, the cursor cannot select individual words; on the right, it can.

Check Accuracy

If a PDF appears to be a scan but has selectable text, you should verify that the text layer is accurate. To verify accuracy, copy and paste a sample of the text into Word or another text editor. In some cases, the output may need to be corrected.
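As a rough aid to the manual check, a short script can point out words that look like OCR misreads. The sketch below is only a heuristic and not part of any OCR tool: the function name `flag_suspicious_tokens`, the three patterns and the set of allowed characters are all assumptions chosen for illustration. It will miss some errors and will flag some legitimate words (for example "14th"), so it does not replace reading the text.

```python
import re
import string

# Assumed heuristics; adjust them for the kind of document being checked.
LOWER_THEN_UPPER = re.compile(r"[a-z][A-Z]")             # e.g. "nlE", "SlNICING"
DIGIT_NEXT_TO_LETTER = re.compile(r"\d[A-Za-z]|[A-Za-z]\d")  # e.g. "10:2S"
# Characters expected in ordinary English prose (includes en dash and curly apostrophe).
ALLOWED = set(string.ascii_letters + string.digits + ".,;:!?'\"()-\u2013\u2019")


def flag_suspicious_tokens(text):
    """Return whitespace-separated tokens that match one of the misread patterns."""
    flagged = []
    for token in text.split():
        has_odd_char = any(ch not in ALLOWED for ch in token)
        if (
            LOWER_THEN_UPPER.search(token)
            or DIGIT_NEXT_TO_LETTER.search(token)
            or has_odd_char
        ):
            flagged.append(token)
    return flagged


# A fragment of uncorrected OCR output from a scanned newspaper article.
sample = "LATEST NEWS FROM nlE SlNICING SHIP. At 10:2S o•ctoclc"
print(flag_suspicious_tokens(sample))
```

Each flagged token is a candidate for review against the original image, not a confirmed error.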

The examples below show an initial OCR result and a corrected version for the first two sentences of the article. Although AI continues to improve OCR, it is always wise to check the output.

Initial OCR

LATEST NEWS FROM nlE SlNICING SHIP. CAPE RACE. N. F., Sunday nl􀄖hl, April 14,-At 10:2S o•ctoclc
to-nlfht the \Vhile Star line stumshlp Tila.nic called •• C. Q. D." to tbe Marconk wireless station bcre, tnd rcpor1ed having struck an fubcrg. The steamer said that immediate usbtance was rtqulrtd.

Corrected OCR

LATEST NEWS FROM THE SINKING SHIP.

CAPE RACE, N. F., Sunday night, April 14 – At 10:25 o’clock tonight the White Star line steamship Titanic called "C.Q.D." to the Marconi wireless station here, and reported having struck an iceberg. The steamer said that immediate assistance was required.
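Once a sample has been corrected by hand, the amount of correction can be estimated by comparing the two versions. The sketch below uses Python's standard `difflib` module; the helper names `similarity` and `changed_words` are hypothetical, and the similarity ratio is only a rough indicator, not a formal OCR accuracy metric.

```python
import difflib


def similarity(initial, corrected):
    """Character-level similarity between 0.0 (no overlap) and 1.0 (identical)."""
    return difflib.SequenceMatcher(None, initial, corrected).ratio()


def changed_words(initial, corrected):
    """Return (before, after) pairs for each run of words that differs."""
    a_words, b_words = initial.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return changes


# One sentence from the article, before and after manual correction.
initial = "The steamer said that immediate usbtance was rtqulrtd."
corrected = "The steamer said that immediate assistance was required."

print(round(similarity(initial, corrected), 2))
for before, after in changed_words(initial, corrected):
    print(f"{before!r} -> {after!r}")
```

A low ratio on a small sample suggests that the whole document needs careful proofreading rather than spot fixes.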

Service Parameters for OCR

Text Transcript vs. Hidden Text

OCR tools offer a choice between outputting a text file and embedding a hidden layer of text on top of a PDF image. Adobe Acrobat is one tool that creates hidden text.

We recommend the text transcript option in most cases because errors are easier to detect and correct. You can also add headings, table headers and lists as needed within the document.
Note: If you are creating a PDF with hidden text, you should copy and paste the text into a separate file to verify its accuracy.

PDF vs. Image

Note that some services work with both images and scanned PDFs and others with scanned PDFs only. If you need to use a PDF-only service, you can print or export the image as a PDF.

Penn State Service Options

The following services are licensed to Penn State staff, students and instructors.

Image Management Tips

OCR tools depend on good image quality to provide optimal results. To improve your success rates, we recommend the following tips:

  • Crop photos and images to include just text.
  • Use rotation tools to ensure that text is perfectly horizontal.
  • If text color does not meet contrast guidelines, use tools like Photoshop to make text darker.
  • If text is relatively small, use zoom tools to make the image larger.
  • If the file is an infographic with multiple sections, consider splitting the file into multiple sections.
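For the contrast tip, the difference between the text color and the background color can be checked numerically. The sketch below assumes that "contrast guidelines" refers to the WCAG contrast ratio, for which the AA threshold for normal-size text is 4.5:1. The function names are hypothetical, and the colors are given as (red, green, blue) values from 0 to 255, sampled from the image with any color picker.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to its linear value (WCAG formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """WCAG relative luminance of an (R, G, B) color."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(text_rgb, background_rgb):
    """WCAG contrast ratio, from 1.0 (same color) to 21.0 (black on white)."""
    l1 = relative_luminance(text_rgb)
    l2 = relative_luminance(background_rgb)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


# Gray text on a white page: just under the assumed 4.5:1 threshold.
ratio = contrast_ratio((119, 119, 119), (255, 255, 255))
print(f"{ratio:.2f}", "meets 4.5:1" if ratio >= 4.5 else "darken the text")
```

Text that falls below the threshold is a reasonable candidate for darkening before running OCR.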

Additional project tips are available from the University Libraries.

AI Tools

AI (Artificial Intelligence) tools that describe images can include powerful OCR algorithms. These tools, such as the ASU Image Description (Chat GPT) tool, are highly recommended for extracting text from images, photos, graphs, infographics, maps and charts.

The AI tools do not currently support conversion of multi-page documents.

Note: Some people have also had success with Microsoft Copilot, although others have encountered shortcomings.

Equations

Two tools designed to convert equation images to MathML or LaTeX are Equatio, which includes the MathPix engine, and MathPix.


Last Update: July 2, 2024