This paper presents a system for reading low-quality machine-printed characters. It is intended for reading computer printouts when the original data file is not available. The characters are assumed to belong to a set of known fonts and to be organized into tables or columns. The printers used for such documents are usually fast and the print quality is low, owing to worn ink ribbons and to damaged nozzles or print heads. Because standard OCR systems for machine-printed text show an error rate of about 15% on these sheets, a specific technique is needed. To cope with broken characters and character fragments, the system uses a two-step strategy. First, it tries to match the unknown character using a moving-window technique. Then, if this fails, it builds a new set of reference images from the characters already recognized in the document and repeats the first matching step. In this way, the correlation among damaged characters is exploited. The first step achieves a 2% error rate, and applying the second step lowers it to 0.15%. This low error rate is possible because the system adapts its behavior to the damaged characters produced by the printer. The average recognition time on a SUN SparcStation 10 is 15 ms per character, measured on about 100,000 characters contained in 50 documents.
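The two-step strategy can be illustrated with a minimal Python sketch. It is not the system described here: the pixel-agreement score, the acceptance threshold, the shift range, and the names `window_score`, `match`, and `recognize` are all assumptions chosen for illustration. Only the overall flow follows the text: match against the known fonts with a moving window, then rebuild the reference set from the characters already recognized and match the failures again.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def window_score(glyph: Image, ref: Image, dy: int, dx: int) -> float:
    """Fraction of reference pixels that agree with the glyph shifted by (dy, dx).

    Pixels that fall outside the glyph are treated as background.
    This score is an assumption; the paper does not specify the measure.
    """
    h, w = len(ref), len(ref[0])
    agree = 0
    for y in range(h):
        for x in range(w):
            gy, gx = y + dy, x + dx
            inside = 0 <= gy < len(glyph) and 0 <= gx < len(glyph[0])
            pixel = glyph[gy][gx] if inside else 0
            agree += int(pixel == ref[y][x])
    return agree / (h * w)


def match(glyph: Image, refs: Dict[str, List[Image]],
          max_shift: int = 1, threshold: float = 0.9) -> Tuple[Optional[str], float]:
    """Moving-window match: try every reference at every small shift.

    Returns (label, score); label is None when the best score is below threshold.
    """
    best_label, best = None, 0.0
    for label, images in refs.items():
        for ref in images:
            for dy in range(-max_shift, max_shift + 1):
                for dx in range(-max_shift, max_shift + 1):
                    score = window_score(glyph, ref, dy, dx)
                    if score > best:
                        best_label, best = label, score
    return (best_label if best >= threshold else None), best


def recognize(glyphs: List[Image], fonts: Dict[str, List[Image]],
              max_shift: int = 1, threshold: float = 0.9) -> List[Optional[str]]:
    """Two-step recognition over all glyphs of one document."""
    # Step 1: match every glyph against the known font references.
    labels = [match(g, fonts, max_shift, threshold)[0] for g in glyphs]

    # Step 2: build a new reference set from the glyphs recognized in step 1,
    # so that damage typical of this printer is represented in the references.
    adapted: Dict[str, List[Image]] = {}
    for glyph, label in zip(glyphs, labels):
        if label is not None:
            adapted.setdefault(label, []).append(glyph)

    # Repeat the matching step, for the failed glyphs only.
    if adapted:
        for i, label in enumerate(labels):
            if label is None:
                labels[i] = match(glyphs[i], adapted, max_shift, threshold)[0]
    return labels
```

In this sketch, a glyph too damaged to match a clean font image can still be accepted in the second pass if it resembles a less damaged glyph of the same class that was recognized in the first pass.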