bug-ocrad




From: Antonio Diaz Diaz
Subject: Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, methinks
Date: Thu, 02 Sep 2010 16:28:41 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.11) Gecko/20050905

Tilman Hausherr wrote:
Then I tested with production. I generally got better results (many
pages now do have useful output that didn't before), with one exception.
One of the images was a huge b/w photograph, thus, from an OCR point of
view, a huge amount of noise. The OCR needed several minutes (!). Thus,
although I can't look into your mind, I guess that the "if" statement
was probably meant as a safety against exactly that. However, that
safety measure also prevents the OCR of printed excel tables with grey
background cells. So instead of making an assumption about individual
areas, OCRAD ignores the whole file.

Ocrad can waste a lot of time trying to create text lines from noise just to produce a lot of garbage. This is what the "if" statement tries to prevent.

But in this case the problem lies in the image preprocessing done before the image is fed to ocrad. Those "grey background cells" are not grey at all, but full of speckles. This works well enough for the human eye, but makes life difficult for OCR programs.

Just see what one of those "grey background cells" looks like to ocrad:

.......O.......O..O....O..O.O..O..O....O..O....O..O....O..O....O..O.O..O....
...O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
..O....O..O....O..OO...O..O.O..O..O....O..O.O..O..O....O..O.O..O..O.O..O..O.
...O..O.O..O..O.O..O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......O.......O..O....O..O.O..O.......O..O....O..O.O..O..O.O..O..O.O..O....
.O.O..O..O.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......OOOOOOOOOOOOOOO.O..O.O..O..O.O..O..O.O..O..O.O..O.OOOO..O..O.O..O..O.
O..O..OOOOOOOOOOOOOOOOO.O..O..O.O..O..O.O..O..O.O..OOOO.OOOOOOOOOOOOO.O.O..O
....O..OOOOOOOOOOOOOOO.O.......O..O....O..O.O..O..OOO..O.OOOOOOOOOOOO..O..O.
O..OO.O....O.OOOO..O..O.O..O..O.OO.O..O.O..O..O.O.OOO.O.OOOOOOOOOOOOO.O.O..O
..O.O..O.....OOO....O..O....O..O.......O..O....O..OOO..O....O..O.OOOO..O..O.
OO.O..O.OO.O.OOOO..O..O.O..OOOOOO..O.OOOO..O..OOOOOOOOO.O..O..O.OOOO..O.O...
..O....O.....OOO..O....O.OOOOOOOOOO.OOOOO.O..OOOOOOOOOOO..O....OOOO....O..O.
O..O....O..O.OOOO..O..O.OOOOOOOOOOOO..OOOO.OOOOOOOOOOOO.O..O..OOOO.O..O.O..O
....O..O....OOOO..O.O..OOOOOO..OOOO...OOOOO.OOOO.OOOOOOO..O....OOO.....O..O.
...O..O....O.OOOO..O..OOOO.O..O.OOOO..OOOOOOOOO.O.OOO.O.O..O..OOO..O..O.O...
.......O.....OOO....O..OOOOOOOOOOOOO...OOOOOOO.O..OOO..O..O...OOO.O.O..O..O.
O..O..O..O.O.OOOO..O..OOOOOOOOOOOOOO..O.OOOOO.O.O..OO.O.O..O.OOOO..O..O.O..O
O.O....O.....OOO......OOOOOOOOOOOOOO...O.OOOO..O..OOO..O....OOOO..O....O..O.
O..O..O....O.OOOO..O..OOOO.O....O..O..O.OOOOOOO.O.OOO.O.O..OOOOOO..O..O.O..O
..O....O..O..OOO..O....OOOO....O.OO.O..OOOOOOO.O..OOO..O..O.OOOO..O.O..O..O.
O..O..O.O..O.OOOO..O..OOOOOO..O.OOOO..OOOO.OOOO.O.OOO.O.O..OOOO.O..O..O.O..O
.......O.....OOO....O..OOOOOOOOOOOO.O.OOOO..OOOO..OOOO.O..O.OOOO..O.O..O....
...O..O....O.OOOO..O..O.OOOOOOOOOOOO.OOOO..OOOOOO.OOOOO.O..OOOO.O..O..O.O..O
....O..O..O..OOO..O.O..O.OOOOOOOOOO.OOOO..O.OOOOO.OOOOOO..OOOOOO..O....O..O.
...O..O.O..O..O.O..O..O.O..OOOO.O..O..O.O..O..O.O..OOOO.O..OOOO.O..O..O.O...
....O..O.......O..O.O..O..O....O.......O..O.O..O..O.O..O..O.O..O..O.O..O..O.
.O.O..O.OO.O..O.OO.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
..O....O..O....O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.
...O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.
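
To put a number on the difference, here is a minimal Python sketch that counts connected groups of 'O' pixels in a bitmap written in the notation above. It is only an illustration and not ocrad's actual blob detection; the function name count_blobs, the use of 8-connectivity, and the two tiny sample bitmaps are assumptions made for the example.

```python
from collections import deque


def count_blobs(rows):
    """Count 8-connected groups of 'O' pixels in a list of equal-length strings."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for r in range(height):
        for c in range(width):
            if rows[r][c] != 'O' or seen[r][c]:
                continue
            # New, unvisited black pixel: flood-fill everything connected to it.
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and rows[ny][nx] == 'O' and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs


# Hypothetical samples: a speckled "grey" patch and a solid mark of similar size.
speckled = [
    "..O....O..",
    ".....O....",
    "O..O....O.",
    "......O...",
]
clean = [
    "..........",
    "..OOOO....",
    "..OOOO....",
    "..........",
]

print(count_blobs(speckled), count_blobs(clean))
```

Every isolated speckle counts as its own blob, so a dithered background produces a very large number of tiny candidates that an OCR program must consider or discard.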

Processing images for human consumption is very different from processing them for OCR. The right way to handle non-white backgrounds in the latter case is to binarize the image using a suitable threshold, or to feed the image to ocrad as a greymap or colormap.
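
As a rough sketch of what binarizing with a threshold means, the Python snippet below turns a greymap into the dot-and-O notation used above. It is not ocrad's own code; the function name binarize, the default threshold of 0.5, and the sample pixel values are assumptions chosen for illustration.

```python
def binarize(greymap, maxval=255, threshold=0.5):
    """Map each grey pixel to 'O' (black) or '.' (white).

    greymap   -- list of rows, each a list of ints in 0..maxval (0 is black)
    threshold -- fraction of maxval below which a pixel is treated as black
    """
    limit = threshold * maxval
    return ["".join('O' if pixel < limit else '.' for pixel in row)
            for row in greymap]


# Hypothetical cell: light grey background (200) with two dark text pixels (30).
cell = [
    [200, 200, 200, 200],
    [200,  30,  30, 200],
    [200, 200, 200, 200],
]

for line in binarize(cell):
    print(line)
```

With a real grey background, every background pixel falls on the white side of the threshold and only the text remains, whereas a background that was already dithered into speckles cannot be cleaned up this way.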

For images already processed like the one above, an option telling ocrad not to remove noise inside a wide blob may help in some cases.


Regards,
Antonio.


