From: Tilman Hausherr
Subject: Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, methinks
Date: Sun, 05 Sep 2010 21:18:16 +0200

Hello Antonio,

I'm aware that there is no such thing as grey in a b/w image (especially
considering that I printed it on a b/w printer before scanning it :-)),
which is why I also said "noisy" in an earlier mail. I also don't really
expect you to OCR the segment mentioned; my hopes are for the other ones
that don't have the "grey" noise.

What would help is a modification so that noise doesn't trigger a total
abort, but only a skip of the segment that's being a pain.
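
A minimal sketch of that idea, in Python rather than ocrad's C++. Everything here is hypothetical and not taken from ocrad's source: the names `isolated_fraction` and `segments_to_ocr`, the `max_isolated` limit, and the assumption that a segment is a list of strings using `O` for black and `.` for white, as in the dump quoted below. Each segment is judged on its own, and only the noisy ones are dropped:

```python
def is_black(rows, y, x):
    """True if (y, x) lies inside the segment and is a black pixel."""
    return 0 <= y < len(rows) and 0 <= x < len(rows[y]) and rows[y][x] == "O"


def isolated_fraction(rows):
    """Fraction of black pixels that have no black 4-neighbour (speckles)."""
    black = isolated = 0
    for y, row in enumerate(rows):
        for x, c in enumerate(row):
            if c != "O":
                continue
            black += 1
            neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
            if not any(is_black(rows, y + dy, x + dx) for dy, dx in neighbours):
                isolated += 1
    return isolated / black if black else 0.0


def segments_to_ocr(segments, max_isolated=0.5):
    """Keep the clean segments; skip noisy ones instead of rejecting the page."""
    return [s for s in segments if isolated_fraction(s) <= max_isolated]
```

The 0.5 limit is an arbitrary placeholder; a real check would have to be tuned against actual scans.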

Tilman

On Thu, 02 Sep 2010 16:28:41 +0200, Antonio Diaz Diaz wrote:

>Tilman Hausherr wrote:
>> Then I tested with production. I generally got better results (many
>> pages now do have useful output that didn't before), with one exception.
>> One of the images was a huge b/w photograph, thus, from an OCR point of
>> view, a huge amount of noise. The OCR needed several minutes (!). Thus,
>> although I can't look into your mind, I guess that the "if" statement
>> was probably meant as a safety against exactly that. However, that
>> safety measure also prevents the OCR of printed Excel tables with grey
>> background cells. So instead of making an assumption about individual
>> areas, OCRAD ignores the whole file.
>
>Ocrad can waste a lot of time trying to create text lines from noise 
>just to produce a lot of garbage. This is what the "if" statement tries 
>to prevent.
>
>But in this case the problem is in image preprocessing (before feeding 
>it to ocrad). Those "grey background cells" are not grey at all, but 
>full of speckles. This works well enough for the human eye, but makes 
>life difficult for OCR programs.
>
>Just see what one of those "grey background cells" looks like to ocrad:
>
>.......O.......O..O....O..O.O..O..O....O..O....O..O....O..O....O..O.O..O....
>...O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
>..O....O..O....O..OO...O..O.O..O..O....O..O.O..O..O....O..O.O..O..O.O..O..O.
>...O..O.O..O..O.O..O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
>.......O.......O..O....O..O.O..O.......O..O....O..O.O..O..O.O..O..O.O..O....
>.O.O..O..O.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
>.......OOOOOOOOOOOOOOO.O..O.O..O..O.O..O..O.O..O..O.O..O.OOOO..O..O.O..O..O.
>O..O..OOOOOOOOOOOOOOOOO.O..O..O.O..O..O.O..O..O.O..OOOO.OOOOOOOOOOOOO.O.O..O
>....O..OOOOOOOOOOOOOOO.O.......O..O....O..O.O..O..OOO..O.OOOOOOOOOOOO..O..O.
>O..OO.O....O.OOOO..O..O.O..O..O.OO.O..O.O..O..O.O.OOO.O.OOOOOOOOOOOOO.O.O..O
>..O.O..O.....OOO....O..O....O..O.......O..O....O..OOO..O....O..O.OOOO..O..O.
>OO.O..O.OO.O.OOOO..O..O.O..OOOOOO..O.OOOO..O..OOOOOOOOO.O..O..O.OOOO..O.O...
>..O....O.....OOO..O....O.OOOOOOOOOO.OOOOO.O..OOOOOOOOOOO..O....OOOO....O..O.
>O..O....O..O.OOOO..O..O.OOOOOOOOOOOO..OOOO.OOOOOOOOOOOO.O..O..OOOO.O..O.O..O
>....O..O....OOOO..O.O..OOOOOO..OOOO...OOOOO.OOOO.OOOOOOO..O....OOO.....O..O.
>...O..O....O.OOOO..O..OOOO.O..O.OOOO..OOOOOOOOO.O.OOO.O.O..O..OOO..O..O.O...
>.......O.....OOO....O..OOOOOOOOOOOOO...OOOOOOO.O..OOO..O..O...OOO.O.O..O..O.
>O..O..O..O.O.OOOO..O..OOOOOOOOOOOOOO..O.OOOOO.O.O..OO.O.O..O.OOOO..O..O.O..O
>O.O....O.....OOO......OOOOOOOOOOOOOO...O.OOOO..O..OOO..O....OOOO..O....O..O.
>O..O..O....O.OOOO..O..OOOO.O....O..O..O.OOOOOOO.O.OOO.O.O..OOOOOO..O..O.O..O
>..O....O..O..OOO..O....OOOO....O.OO.O..OOOOOOO.O..OOO..O..O.OOOO..O.O..O..O.
>O..O..O.O..O.OOOO..O..OOOOOO..O.OOOO..OOOO.OOOO.O.OOO.O.O..OOOO.O..O..O.O..O
>.......O.....OOO....O..OOOOOOOOOOOO.O.OOOO..OOOO..OOOO.O..O.OOOO..O.O..O....
>...O..O....O.OOOO..O..O.OOOOOOOOOOOO.OOOO..OOOOOO.OOOOO.O..OOOO.O..O..O.O..O
>....O..O..O..OOO..O.O..O.OOOOOOOOOO.OOOO..O.OOOOO.OOOOOO..OOOOOO..O....O..O.
>...O..O.O..O..O.O..O..O.O..OOOO.O..O..O.O..O..O.O..OOOO.O..OOOO.O..O..O.O...
>....O..O.......O..O.O..O..O....O.......O..O.O..O..O.O..O..O.O..O..O.O..O..O.
>.O.O..O.OO.O..O.OO.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
>..O....O..O....O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.
>...O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
>.......O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.
>
>Processing images for human consumption is very different from 
>processing them for OCR. The right way to handle non-white backgrounds 
>in the latter case is binarizing the image using a suitable threshold or 
>feeding the image to ocrad as a greymap or colormap.
>
>For images already processed like the one above, an option telling ocrad 
>not to remove noise inside a wide blob may help in some cases.
>
>
>Regards,
>Antonio.

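To illustrate the binarization suggested in the quoted message, here is a minimal sketch in Python. It assumes the scan is still available as a greymap, represented here as rows of integers from 0 (black) to 255 (white); the function name `binarize`, the default threshold of 128, and the sample values are illustrative assumptions, not anything ocrad defines. With a global threshold, a uniformly light-grey cell becomes plain white and only the dark strokes survive, instead of the speckle pattern shown in the dump:

```python
def binarize(grey, threshold=128):
    """Map pixels darker than `threshold` to 'O' (black), all others to '.'."""
    return ["".join("O" if p < threshold else "." for p in row) for row in grey]


# A light-grey table cell (value 200) containing a short dark stroke (value 30).
cell = [
    [200, 200, 200, 200],
    [200, 30, 30, 200],
    [200, 200, 200, 200],
]

print("\n".join(binarize(cell)))
```

This only helps before the image has been reduced to black and white; once the grey has been turned into speckles, as in the printed and rescanned page, the original grey levels are gone.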