Optical Character Recognition (OCR) refers to technologies that convert images of text into actual text. OCR can extract text content from old documents, infographics, photos of signs, PDFs of scanned files and any other situation in which text is embedded within an image.

Service Parameters

Text Transcript vs. Hidden Text

OCR tools offer a choice between outputting a text file and embedding a hidden layer of text on top of a PDF image. An example of a tool that creates hidden text is Adobe Acrobat.

We recommend the text transcript option in most cases because errors are easier to detect and correct. You can also add headings, table headers and lists as needed within the document.
Note: If you are creating a PDF with hidden text, you should copy and paste the text into a separate file to verify its accuracy.
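The verification step in the note above can be partly automated. The following is a minimal sketch, assuming you have pasted the hidden text into a string and typed a short passage by hand to use as a reference. The function name accuracy_report and the sample strings are hypothetical and not part of any OCR tool; the comparison uses Python's standard difflib module.

```python
import difflib


def accuracy_report(ocr_text, reference_text):
    """Compare OCR output with a hand-checked reference, word by word.

    Returns a similarity score between 0 and 1 and a list of
    (expected, found) pairs for every span that differs.
    Hypothetical helper for illustration only.
    """
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)

    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            expected = " ".join(ref_words[i1:i2])
            found = " ".join(ocr_words[j1:j2])
            mismatches.append((expected, found))
    return matcher.ratio(), mismatches


if __name__ == "__main__":
    # Sample strings are made up for this example.
    reference = "Optical Character Recognition converts images"
    copied = "Optical Character Recogniton converts images"
    score, problems = accuracy_report(copied, reference)
    print(f"Similarity: {score:.2f}")
    for expected, found in problems:
        print(f"Expected '{expected}' but found '{found}'")
```

A score well below 1 or a long list of mismatches suggests the hidden text needs manual correction before the PDF is shared. Checking one or two representative passages is usually enough to judge the overall quality.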

PDF vs. Image

Some services accept both images and scanned PDFs; others accept scanned PDFs only. If you need to use a PDF-only service, you can print or export the image as a PDF.

Penn State Service Options

The following services are licensed to Penn State staff, students and instructors.

Image Management Tips

OCR tools depend on good image quality to provide optimal results. To improve your success rates, we recommend the following tips:

  • Crop photos and images to include just text.
  • Use rotation tools to ensure that text is perfectly horizontal.
  • If text color does not meet contrast guidelines, use tools like Photoshop to make text darker.
  • If text is relatively small, use zoom tools to make the image larger.
  • If the file is an infographic with multiple sections, consider splitting it into separate images, one per section.
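
The contrast tip above can be checked numerically before you edit the image. The following is a minimal sketch, assuming that "contrast guidelines" means the WCAG contrast ratio, with a threshold of 4.5:1 for normal-size text. The function names are hypothetical, and the RGB values would come from sampling the text and background colors with an eyedropper tool.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors, from 1 (none) to 21 (maximum)."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def meets_contrast(text_rgb, background_rgb, threshold=4.5):
    """True if the ratio reaches the threshold (assumed 4.5:1 for normal text)."""
    return contrast_ratio(text_rgb, background_rgb) >= threshold


if __name__ == "__main__":
    # Example colors are illustrative: light gray text on a white background.
    text, background = (170, 170, 170), (255, 255, 255)
    print(f"Contrast ratio: {contrast_ratio(text, background):.2f}")
    print("Meets threshold:", meets_contrast(text, background))
```

If the check fails, darkening the text before running OCR is likely to help, as the tip suggests.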

Additional project tips are available from the University Libraries.

Last Update: October 11, 2023