From: | Antonio Diaz Diaz |
Subject: | Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, methinks |
Date: | Thu, 02 Sep 2010 16:28:41 +0200 |
User-agent: | Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.11) Gecko/20050905 |
Tilman Hausherr wrote:
Then I tested with production. I generally got better results (many pages now do have useful output that didn't before), with one exception. One of the images was a huge b/w photograph, thus, from an OCR point of view, a huge amount of noise. The OCR needed several minutes (!). Thus, although I can't look into your mind, I guess that the "if" statement was probably meant as a safety against exactly that. However, that safety measure also prevents the OCR of printed Excel tables with grey background cells. So instead of making an assumption about individual areas, OCRAD ignores the whole file.
Ocrad can waste a lot of time trying to create text lines from noise just to produce a lot of garbage. This is what the "if" statement tries to prevent.
But in this case the problem is in image preprocessing (before feeding it to ocrad). Those "grey background cells" are not grey at all, but full of speckles. This works well enough for the human eye, but makes life difficult for OCR programs.
Just see what one of those "grey background cells" looks like to ocrad:

.......O.......O..O....O..O.O..O..O....O..O....O..O....O..O....O..O.O..O....
...O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
..O....O..O....O..OO...O..O.O..O..O....O..O.O..O..O....O..O.O..O..O.O..O..O.
...O..O.O..O..O.O..O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......O.......O..O....O..O.O..O.......O..O....O..O.O..O..O.O..O..O.O..O....
.O.O..O..O.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......OOOOOOOOOOOOOOO.O..O.O..O..O.O..O..O.O..O..O.O..O.OOOO..O..O.O..O..O.
O..O..OOOOOOOOOOOOOOOOO.O..O..O.O..O..O.O..O..O.O..OOOO.OOOOOOOOOOOOO.O.O..O
....O..OOOOOOOOOOOOOOO.O.......O..O....O..O.O..O..OOO..O.OOOOOOOOOOOO..O..O.
O..OO.O....O.OOOO..O..O.O..O..O.OO.O..O.O..O..O.O.OOO.O.OOOOOOOOOOOOO.O.O..O
..O.O..O.....OOO....O..O....O..O.......O..O....O..OOO..O....O..O.OOOO..O..O.
OO.O..O.OO.O.OOOO..O..O.O..OOOOOO..O.OOOO..O..OOOOOOOOO.O..O..O.OOOO..O.O...
..O....O.....OOO..O....O.OOOOOOOOOO.OOOOO.O..OOOOOOOOOOO..O....OOOO....O..O.
O..O....O..O.OOOO..O..O.OOOOOOOOOOOO..OOOO.OOOOOOOOOOOO.O..O..OOOO.O..O.O..O
....O..O....OOOO..O.O..OOOOOO..OOOO...OOOOO.OOOO.OOOOOOO..O....OOO.....O..O.
...O..O....O.OOOO..O..OOOO.O..O.OOOO..OOOOOOOOO.O.OOO.O.O..O..OOO..O..O.O...
.......O.....OOO....O..OOOOOOOOOOOOO...OOOOOOO.O..OOO..O..O...OOO.O.O..O..O.
O..O..O..O.O.OOOO..O..OOOOOOOOOOOOOO..O.OOOOO.O.O..OO.O.O..O.OOOO..O..O.O..O
O.O....O.....OOO......OOOOOOOOOOOOOO...O.OOOO..O..OOO..O....OOOO..O....O..O.
O..O..O....O.OOOO..O..OOOO.O....O..O..O.OOOOOOO.O.OOO.O.O..OOOOOO..O..O.O..O
..O....O..O..OOO..O....OOOO....O.OO.O..OOOOOOO.O..OOO..O..O.OOOO..O.O..O..O.
O..O..O.O..O.OOOO..O..OOOOOO..O.OOOO..OOOO.OOOO.O.OOO.O.O..OOOO.O..O..O.O..O
.......O.....OOO....O..OOOOOOOOOOOO.O.OOOO..OOOO..OOOO.O..O.OOOO..O.O..O....
...O..O....O.OOOO..O..O.OOOOOOOOOOOO.OOOO..OOOOOO.OOOOO.O..OOOO.O..O..O.O..O
....O..O..O..OOO..O.O..O.OOOOOOOOOO.OOOO..O.OOOOO.OOOOOO..OOOOOO..O....O..O.
...O..O.O..O..O.O..O..O.O..OOOO.O..O..O.O..O..O.O..OOOO.O..OOOO.O..O..O.O...
....O..O.......O..O.O..O..O....O.......O..O.O..O..O.O..O..O.O..O..O.O..O..O.
.O.O..O.OO.O..O.OO.O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
..O....O..O....O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.
...O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O..O.O...
.......O..O....O..O....O..O.O..O..O.O..O..O.O..O..O.O..O..O.O..O....O..O..O.

Processing images for human consumption is very different from processing them for OCR. The right way to handle non-white backgrounds in the latter case is binarizing the image using a suitable threshold or feeding the image to ocrad as a greymap or colormap.
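As a rough illustration of thresholding a greyscale image instead of dithering it, here is a minimal Python sketch. It is not ocrad's code: the function name `binarize`, the threshold of 128, and the grey values in the patch are all assumptions made up for the example. A light grey cell background falls on the white side of the threshold, so it comes out clean instead of speckled, while the dark strokes survive.

```python
def binarize(pixels, threshold=128):
    """Map grey levels (0 = black, 255 = white) to 'O' (ink) or '.' (background).

    Hypothetical helper: any value darker than `threshold` counts as ink.
    """
    return [
        "".join("O" if value < threshold else "." for value in row)
        for row in pixels
    ]


# Made-up 4x8 patch: light grey background (200) crossed by a dark stroke (30).
patch = [
    [200, 200, 30, 30, 200, 200, 200, 200],
    [200, 200, 30, 30, 200, 200, 200, 200],
    [200, 200, 30, 30, 30, 30, 200, 200],
    [200, 200, 200, 200, 200, 200, 200, 200],
]

rows = binarize(patch)
for row in rows:
    print(row)
```

With a threshold between the background and stroke levels, the whole grey background becomes '.', so there is no speckle pattern left to be mistaken for noise. Choosing that threshold for a real scan depends on the image.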
For images that have already been processed like the one above, an option telling ocrad not to remove noise inside a wide blob may help in some cases.
Regards, Antonio.