OCR Process for Town Reports
Purpose: To prepare searchable text for Maine town reports being
scanned for online access as part of the Kirtas Digitization Project.
Opening ABBYY and Setup
- Open the Abbyy FineReader OCR application at one of the designated
Kirtas workstations located in the lab, Technical Services or Special
- From the tool bar select "Details Batch View"
- Note: This view and other options from
the setup screen are saved from session to session; if these
options have already been selected once, you do not need to
select them every time you open Abbyy.
- The setup screen should have the following items checked. To
access setup, right-click the gray bar under "Batch."
- Page number
- Uncertain characters
- Spelling checked
- Error warning
- Source image (full path)
General Steps for OCR Processing
- Step 1. Open the menu under the "Scan and read" icon and select "Open and read."
- Step 2. Navigate to the folder where the files to be OCRed are
located. Highlight all files to be opened. (For the Town Reports project, this will be all
of the TIFF files in the "output" folder for a single volume. For a
thesis, this will be one PDF.) Select "Open."
- Depending on the type of file and
number of pages to be opened, this step can take anywhere from 5 minutes
to 1 hour to complete. It may be advisable to work on another task while waiting
for the files to be opened and read.
- Step 3. After all pages have been opened and read, check that the
uncertainty level for all pages is below 10%. The easiest way to do this
is to click on the "Uncertainty level" heading at the top of the "Batch"
section. Clicking on this heading will sort all of the pages from
greatest to least character uncertainty. If the first page has an uncertainty level
under 10%, then all pages in the batch have an uncertainty level under
10%, and you are finished with this step. If the top page is at 10% or above,
then this page and all other pages at that level must be edited and
spell checked until their uncertainty level is under 10%. For more
detail on this step, see the "Common Problems" section, below.
- Step 4. After all pages have an uncertainty level below 10%, return
the pages to numerical order by clicking on "Page number." (Note:
If the pages are not returned to numerical order before the file is
saved, the pages will stay out of order in the resulting PDF.) Run a complete
spell check on the title page(s) by selecting the page, then selecting the
"Check spelling" icon and checking all words on that page. When
you have finished the spell check, a check mark should appear next to
that page number on the "Batch" section of the screen. If
there is an index or table of contents, do a complete spell check of
these pages, as well.
- Step 5. Save the pages as a PDF by highlighting all the pages in one
volume, then selecting "Save Wizard" from the dropdown menu on the
"Save" icon. Select "Save Pages," then select "OK." Choose the location where the file will be saved. (Town
Reports documents should be saved to the desktop.) Select "PDF
document" from the dropdown list and name the file according to the
town and year, with an underscore between them (e.g., Augusta_1884). If the town
name consists of two or more words (e.g., "Deer Isle" or "Fort Kent"), remove
the space between the words (so "Deer Isle" becomes DeerIsle).
- Note: The following settings should be
selected under the "Save Wizard," "Formats Settings," "PDF"
tab. These settings are saved from session to session and do not
need to be reset every time
a file is saved.
- Keep original image size (checked)
- Save Mode: Text under the page image
- Enable tagged PDF (checked)
- Quality: High (for printing)
- Format: Automatic
- Font: Use standard fonts.
- Note: PDFs over 15 MB must be segmented into smaller files either by using the
PDF segmenting tool or by exporting smaller ranges of pages from
Abbyy. However, the Town Reports PDFs have typically been well under
this size, since each year is saved individually. Town reports
scanned in grayscale and under 200 pages may be assumed to be small
enough not to require segmentation.
- Step 5a. Town Reports documents also need to have a separate .txt
file exported from Abbyy. To create this file, highlight the pages to be
saved, select "Save Wizard," then "Save Pages," then "OK." Save the
files to the desktop using the same naming conventions, but select "Text
document" from the dropdown menu.
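The naming convention and the 15 MB limit from Step 5 can be double-checked after saving with a short script. The following is only a minimal sketch in Python and is not part of the Abbyy workflow: the function names are hypothetical, the ".pdf" extension is assumed for files saved as "PDF document," and "15 MB" is assumed to mean 15 x 1024 x 1024 bytes.

```python
from pathlib import Path

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 15 * 1024 * 1024


def report_basename(town: str, year: int) -> str:
    """Build the base file name: town name with spaces removed, underscore, year."""
    # str.split() with no arguments drops all whitespace, so "Deer Isle" -> "DeerIsle".
    return f"{''.join(town.split())}_{year}"


def check_outputs(folder: Path, town: str, year: int) -> list[str]:
    """Return a list of problems found for one volume's saved files (empty if none)."""
    base = report_basename(town, year)
    problems = []
    # Steps 5 and 5a call for both a PDF and a .txt file per volume.
    for ext in (".pdf", ".txt"):
        if not (folder / (base + ext)).is_file():
            problems.append(f"missing {base}{ext}")
    pdf = folder / (base + ".pdf")
    if pdf.is_file() and pdf.stat().st_size > MAX_PDF_BYTES:
        problems.append(f"{base}.pdf is over 15 MB and needs segmenting")
    return problems
```

For example, report_basename("Deer Isle", 1884) returns "DeerIsle_1884", and check_outputs(Path.home() / "Desktop", "Augusta", 1884) lists any missing or oversized files for that volume, assuming the desktop folder is at that path on the workstation.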
Common Problems that will need to be addressed using Step 3: Spellcheck
- Wrongly rotated image - occasionally Abbyy will interpret the file
in such a way as to rotate the page image. [Insert image samples of incorrect vs.
correctly rotated pages]
- Unrotate the image. This will automatically clear all
- Draw appropriate boxes manually [Specs on using draw tool?]
- Re-read the document
- Gothic text - Abbyy is unable to interpret the Gothic font often found on
cover pages and used for major headings such as titles.
- Use Check Spelling mode to edit manually
- Clamp mark obscuring text - the text of the document is blocked by the clamp and
unreadable by the software
- Use Check Spelling mode to manually correct the text as best you can
make it out from the image
- Offset columns of text lines. [insert image example here?]
- Use Check Spelling mode to draw new boxes
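The Step 3 check that decides which pages need the fixes listed above can be sketched in a few lines of Python. This is an illustration of the reasoning only: it assumes the per-page uncertainty percentages have been copied by hand from the "Batch" section into a dictionary, and the function name is hypothetical, not something Abbyy provides.

```python
THRESHOLD = 10.0  # percent; pages at or above this level must be edited


def pages_needing_review(uncertainty_by_page: dict[int, float]) -> list[int]:
    """Return page numbers at or above the threshold, most uncertain first."""
    # Sort from greatest to least uncertainty, as clicking the
    # "Uncertainty level" heading does in the Batch section.
    ranked = sorted(uncertainty_by_page.items(), key=lambda item: item[1], reverse=True)
    return [page for page, pct in ranked if pct >= THRESHOLD]
```

With made-up values, pages_needing_review({1: 12.5, 2: 3.0, 3: 10.0}) returns [1, 3]. An empty list means the top page is under 10%, so every page in the batch is under 10%.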
Tips for Working with Spell Check Mode
Confirming word by word vs. confirming whole phrases -- confirming phrases is a more
efficient use of time, but keep an eye out for stray characters at the
beginnings and ends of words. Stray marks, quotes, etc. are sometimes
interpreted as letters. If removed the software can properly interpret
Created by: Library Staff