Mouse Clix

 

By: Hobie Lunin

 

OCR? … What’s that?

 

 I have said from time to time that the miracles of the computer are endless, and OCR is just another miracle.  Well, super high tech stuff, anyway.  Say you have a sheet of text; maybe it is something you typed up a long time ago, way before you even heard of computers.  Now you want to store it in your computer as an editable text file.  Yes, you can do it.  All you need is a scanner and OCR software.  You can probably get it from the same disk that loaded the software for your scanner.  I did.  Mine is called TextBridge, a popular Optical Character Recognition (OCR) program.

 

So, you go to your scanner and set it for an OCR scan (or a similar instruction), if your software offers it, and after you scan, you can have the OCR software read the document.  In this process, the computer examines the letters in your text and decides which letter is which, and the next thing you know the text is in your word processor (Word).  It is just like any text you have typed into the program: you can edit it and you can store it on your hard drive as a text file.  So, why not just scan it and file it?  You can do that, but you cannot edit it and you must store it as a picture in one format or another.  The difference is that the picture takes far more hard drive space than a text document.  Well, aside from the file not being editable, why is so much space required?  A text file records only the letters (characters); the white background is just an assumption and takes no space.  A scanned document has code for all the white space, and it sure adds up, as some of you have probably already discovered.  In addition, the scan comes out in a grayish color with lots of black specks, etc.  There is no comparison with a regular text file, which is editable, cleaner, and smaller.
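
For those who like to see the arithmetic, here is a rough sketch in Python of the space difference.  Every figure in it is an assumption chosen for illustration (a letter-size page, a 300 dots-per-inch grayscale scan stored without compression, and about 3,000 typed characters); your scanner settings and picture format will give different numbers, and compressed formats will be smaller.

```python
# Rough storage comparison for one page. All figures below are assumptions.
PAGE_WIDTH_IN = 8.5      # letter-size page, in inches
PAGE_HEIGHT_IN = 11
DPI = 300                # assumed scan resolution (dots per inch)
BYTES_PER_PIXEL = 1      # assumed 8-bit grayscale, uncompressed
CHARS_ON_PAGE = 3000     # assumed typed page, one byte per character


def scan_size_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Every pixel is stored, including the white background."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def text_size_bytes(char_count):
    """Only the characters themselves are stored."""
    return char_count


scan_bytes = scan_size_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
text_bytes = text_size_bytes(CHARS_ON_PAGE)

print(f"Scanned picture: {scan_bytes:,} bytes")
print(f"Text file:       {text_bytes:,} bytes")
print(f"The picture is about {scan_bytes // text_bytes:,} times larger")
```

Under these assumptions the picture runs to several million bytes while the text needs only a few thousand.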

 

Hmmm, well how does it do that?  Here is an explanation. 

 

In its first pass of converting images to text, the software attempts to match each character with a pixel-to-pixel comparison against templates that it holds in memory.  These templates include complete fonts of letters, numbers, punctuation and other characters (like parentheses and question marks).  Of course, the program can be fooled, especially if the text is a poor image such as a copy (Xerox) of a copy, or if it uses a very unusual font.  On the other hand, any clean text using a standard font will be recognized easily.  At any rate, the result will need to be proofread and corrections made where necessary in the word processor.
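
To make the pixel-to-pixel idea concrete, here is a toy sketch in Python.  It is not how TextBridge or any particular product works inside; the three tiny 5-by-5 templates and the function names are made up for illustration.  Each scanned character is compared with every template, and the template that agrees on the most pixels wins.

```python
# Toy 5x5 templates ('#' = ink, '.' = paper). These are illustrative only;
# real OCR templates are much larger and cover whole fonts.
TEMPLATES = {
    "I": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "#####"],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
}


def matching_pixels(glyph, template):
    """Count the positions where the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template letter that agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda letter: matching_pixels(glyph, TEMPLATES[letter]))


# A scanned "T" with one stray black speck in the bottom-left corner.
scanned = ["#####",
           "..#..",
           "..#..",
           "..#..",
           "#.#.."]

print(recognize(scanned))
```

One speck still leaves the "T" template as the best match.  Add enough specks, as in a copy of a copy, and a wrong template can start to win, which is why poor originals get misread.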

 

OK, well how do we do this?  If you have a choice, pick the best copy of the text to scan.  Next, in addition to the scanner and scanning software, you will need OCR software.  TextBridge is popular, and while it was very expensive a few short years ago, it can now be found on the software disk included in the price of the scanner (and scanners are not expensive at all these days).

 

In my software, I can choose at the outset that I am scanning for Optical Character Recognition.  After the scan I can right click on the scanned text and click on “save as text.”  A yellow highlight appears as the OCR reads the text and evaluates it.  This takes a second or two at the most.  My software then allows me to call for Microsoft Word (which is installed in the computer).  In a second or two, the text appears in a Word page.  I usually click somewhere on it with my mouse to see if it edits and I am always pleased to see that it does.

 

The scan, which originally has a gray look to it in the background, suddenly becomes brilliant white with black letters, much cleaner than the scanned image was.  It is a rejuvenation of the text.  If your computer does not have the exact same font that the original text has, the program will pick something that is as close as possible.  The font height might also change if your computer does not have the exact font size of the text; this is especially likely if the original is a copy.  This may change the appearance on the page.  You may have to select a different font or font size to fill the page as it was in the original.  These changes just take moments.  Next, you will have to proofread the page.  You may discover that there are no errors at all, or that one character is being misread in most or all locations on the page.  You will have to change these, as you would normally edit a page.

 

Some OCR programs invoke a spell checker to help with their interpretation.  For instance, a word like “dominate” may be read as “clominate,” and the spell checker will ascertain that the former is the right word.
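
Here is one way such a check could work, sketched in Python.  This is only a guess at the general idea, not the method any particular OCR program uses.  The list of look-alike letter groups and the four-word dictionary are hypothetical stand-ins; a real program would use a full dictionary.

```python
# Hypothetical look-alike pairs: (what the OCR read, what was probably printed).
LOOKALIKES = [("cl", "d"), ("rn", "m"), ("1", "l"), ("0", "o")]

# A tiny stand-in word list; a real spell checker uses a full dictionary.
DICTIONARY = {"dominate", "nominate", "modern", "label"}


def correct(word, dictionary=DICTIONARY):
    """Return a dictionary word reachable by one look-alike swap, else the word unchanged."""
    if word in dictionary:
        return word
    for seen, meant in LOOKALIKES:
        start = word.find(seen)
        while start != -1:
            # Swap this one occurrence and see whether a real word results.
            candidate = word[:start] + meant + word[start + len(seen):]
            if candidate in dictionary:
                return candidate
            start = word.find(seen, start + 1)
    return word


print(correct("clominate"))
print(correct("modern"))
```

The first call turns "clominate" into "dominate" because swapping "cl" for "d" gives a real word.  The second leaves "modern" alone, even though it contains "rn", because it is already in the dictionary.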

 

My own experience is that the errors are very few.  After editing, or spell checking, the page can be saved to file.  When you are editing, look for confusion between 1 (one) and l (ell) (they sure do look alike, don’t they?).  You may need to make the correction.  Spell check will help, as a word containing the one (1) will show up as an error, provided you have not set your spell checker to ignore words with numbers in them.  (See Tools, Options and Spelling and Grammar.  Uncheck “ignore words with numbers.”)
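
If you save the OCR result as plain text, the same kind of check can be sketched in a few lines of Python.  This is an illustration, not a feature of Word or of any OCR program; the function name and the sample sentence are made up.  It lists every word that mixes letters and digits, which is where a stray 1 (one) standing in for an l (ell) tends to hide.

```python
import re

# Matches a whole word that contains at least one letter AND at least one digit.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_words(text):
    """Return the words that mix letters and digits, in order of appearance."""
    return MIXED.findall(text)


sample = "The fi1e was saved in 1998 on the o1d hard drive."
print(suspicious_words(sample))
```

The year 1998 is left alone because it has no letters in it.  Legitimate names that mix letters and numbers will be listed too, so treat the result as a list of places to look, not a list of errors.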

 

Note that some OCR programs will put a special character like “@” in words they cannot figure out, to alert you that the word needs to be edited.

 

Hobie Lunin can be reached at mouseclix2@yahoo.com

 

Previous articles can be seen at http://mouseclix.tripod.com