OCR Process for Theses
Purpose: To prepare searchable text for University of Maine theses and dissertations
scanned for online access as part of the Electronic Theses & Dissertations database. After the thesis
has been OCRed, the next step is to add it to Digital Commons.
- Open ABBYY. Go to "Open PDF File/Images" and navigate to the location of the PDF file.
- In the Open File dialogue box, make sure that the "Enable image processing" box is checked.
- Click on the PDF to open and begin reading it in ABBYY. Wait for ABBYY to finish opening and
reading all of the pages.
- Run a spell check on the title page and abstract pages until those pages have a 0% character
uncertainty level. Fix "special characters" (e.g., Greek letters, mathematical symbols, accented
letters) in the title page and abstract, if possible. (Note that if the entire thesis is in a
foreign language, such as French, the spell check language should be changed to that language.)
- Read the OCRed text of the title page to check for spelling errors that ABBYY may have missed.
(This is rare, but it's important that the title page be free of typos.) If the title or abstract
has any "special characters," read through the OCRed text of the abstract to double check that the
special characters were properly interpreted by ABBYY.
- For the rest of the pages in the thesis, reduce the character uncertainty level to 5% or lower
by taking the following steps:
  - If there is a text or table block drawn around content that is non-OCRable (e.g., maps,
    graphs, images), delete the entire block. (This also holds true for text blocks that consist
    entirely of mathematical equations or computer code, since it is often impossible or too
    time-consuming to enter the correct symbols.)
  - If there is a text block that includes both readable text and non-OCRable content, redraw
    the box to exclude the non-OCRable section and reread the text box. (With the box selected,
    Ctrl-Shift-B rereads the single box. Alternatively, Ctrl-R rereads the entire page.)
  - If all the text blocks are drawn around readable text, run a spell check until the character
    uncertainty level is at 5% or lower.
- When you have completed the spell check, save the pages as a PDF file with the same name as the
original plus a suffix such as "-OCR", so that the original file is kept as a short-term backup.
The standard naming convention is last name, first initial, thesis year. For example, a thesis
written by John Smith in 2012 would be named SmithJ2012.pdf. (In case of a naming conflict, such as
another thesis written by Jane Smith in 2012, the second thesis would be named SmithJ2012a.pdf.)
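
The naming convention above can be sketched in a few lines of Python. This is only an
illustration, not part of the ABBYY workflow: the function name thesis_filename and the
"taken" argument (the set of file names already in use) are hypothetical, and the sketch
assumes that further conflicts continue with b, c, and so on, which this document does not
state. It covers the standard name only, not the temporary "-OCR" suffix.

```python
import string


def thesis_filename(last_name, first_name, year, taken=()):
    """Build a name such as SmithJ2012.pdf from last name, first initial, and year.

    `taken` holds file names already in use. On a conflict, a lowercase
    letter is appended (SmithJ2012a.pdf); continuing with b, c, ... is an
    assumption made for this sketch.
    """
    base = f"{last_name}{first_name[0].upper()}{year}"
    # Try the plain name first, then the lettered variants in order.
    for suffix in [""] + list(string.ascii_lowercase):
        candidate = f"{base}{suffix}.pdf"
        if candidate not in taken:
            return candidate
    raise ValueError(f"no free file name left for {base}")


first = thesis_filename("Smith", "John", 2012)
second = thesis_filename("Smith", "Jane", 2012, taken={first})
print(first)
print(second)
```

Passing the names already in use lets the same function handle both the normal case and the
naming conflict described above.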
Created by: Library Staff