Re: [Bug-ocrad] A few ocrad problems


From: Don Moir
Subject: Re: [Bug-ocrad] A few ocrad problems
Date: Tue, 11 Jun 2013 08:48:25 -0400

Hi Antonio,

4) Failure to detect merged ti, vi, im, ll in merged_ti_vi_im_ll.pbm

This one is the most difficult. The last version of ocrad has fixed some problems like those, but there are lots of them (even with more than two letters merged). I'll try to fix as many as I can, but I don't promise anything.

I am fairly new to OCR, but I tested several OCR programs before I tried ocrad. For my purposes ocrad has worked better than the others, although missed spaces and merged characters remain a problem (not just for ocrad, of course).

I like the idea of feature extraction, but I think I would like to generate the feature/rule set either on the fly or from database files. It seems to me that this could help with detecting merged characters: we know exactly what a capital T looks like, and a merged TT is just two capital T's together. I am not sure about this, as I have not looked into it too much. Also, some character sets, such as Old English style typefaces, might be hard to recognize with a fixed rule set.
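To make the "merged TT is just two T's together" idea concrete, here is a minimal sketch in Python. It is not how ocrad works: the bitmap format, the names `merge_templates`, `similarity` and `match_merged`, and the 0.9 threshold are all assumptions made up for illustration.

```python
# Hypothetical sketch: a template is a tuple of equal-length strings, '#' = ink.
T = (
    "#####",
    "..#..",
    "..#..",
    "..#..",
)
I = (
    "#",
    "#",
    "#",
    "#",
)


def merge_templates(left, right, overlap=0):
    """Join two templates side by side, OR-ing `overlap` shared columns."""
    if len(left) != len(right):
        raise ValueError("templates must have the same height")
    rows = []
    for l_row, r_row in zip(left, right):
        if overlap:
            shared = "".join(
                "#" if "#" in pair else "."
                for pair in zip(l_row[-overlap:], r_row[:overlap])
            )
            rows.append(l_row[:-overlap] + shared + r_row[overlap:])
        else:
            rows.append(l_row + r_row)
    return tuple(rows)


def similarity(a, b):
    """Fraction of identical pixels; 0.0 if the bitmaps differ in size."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    same = sum(p == q for x, y in zip(a, b) for p, q in zip(x, y))
    return same / total


def match_merged(blob, templates, threshold=0.9):
    """Return the letter pair whose merged template best fits the blob."""
    best_score, best_pair = 0.0, None
    for name1, t1 in templates.items():
        for name2, t2 in templates.items():
            if len(t1) != len(t2):
                continue
            score = similarity(blob, merge_templates(t1, t2))
            if score > best_score:
                best_score, best_pair = score, name1 + name2
    return best_pair if best_score >= threshold else None


blob = merge_templates(T, T)
print(match_merged(blob, {"T": T, "I": I}))
```

Note that the number of pairs grows with the square of the number of templates, and real glyphs would also need scaling and some tolerance for overlap, so this only illustrates the principle.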

Hopefully I will eventually have the time to look at it in more detail.

Thanks,

Don

----- Original Message ----- From: "Antonio Diaz Diaz" <address@hidden>
To: "Don Moir" <address@hidden>
Cc: <address@hidden>
Sent: Sunday, June 02, 2013 3:46 PM
Subject: Re: [Bug-ocrad] A few ocrad problems


Hello Don.

Don Moir wrote:
Results so far look good and better than the other sources I have tried.

Thanks for the feedback.


1) An orphan capital letter I fails to be detected.

Textline::recognize2 is supposed to become some kind of expert system for post-processing of recognized text, but it still lacks a lot of "rules". This is one of them. I'll add rules for isolated 'I' and 'UP' in the next release of ocrad.
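As an illustration only, a post-processing rule of this kind could look like the Python sketch below. It does not reflect the actual code of Textline::recognize2; the rule table `RULES`, the function `post_process`, and the assumption that the recognizer emitted a look-alike such as 'l' or '|' for the isolated letter are all hypothetical.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement).
RULES = [
    # A lone 'l' or '|' with whitespace (or line edge) on both sides
    # is rewritten as the capital letter 'I'.
    (re.compile(r"(?<!\S)[l|](?!\S)"), "I"),
]


def post_process(line):
    """Apply every rule in order to one line of recognized text."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


print(post_process("then l went up"))
```

Keeping the rules in a table means that a new rule is one more entry rather than one more branch in the code.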


3) Failure to detect a space character in latin_space.pbm.

This one is a little trickier. Currently ocrad measures the distance between "character boxes". It should measure the distance between the black blobs inside those character boxes, but this is more difficult to do. I plan to fix this in a future version of ocrad.
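The difference between the two measurements can be sketched in Python as follows. This is an assumption about what "distance between the black blobs" could mean (the smallest per-row gap between ink pixels), not ocrad's implementation; the pixel-set representation and the names `box_gap` and `blob_gap` are invented for the example.

```python
# Hypothetical sketch: a glyph is a set of (row, col) black-pixel coordinates.

def box_gap(left, right):
    """Horizontal gap between the bounding boxes of two glyphs, in pixels."""
    left_max_col = max(c for _, c in left)
    right_min_col = min(c for _, c in right)
    return right_min_col - left_max_col - 1


def blob_gap(left, right):
    """Smallest per-row gap between the black pixels of two glyphs.

    Only rows where both glyphs have ink are compared; returns None if
    the glyphs share no row.
    """
    right_edge = {}  # row -> rightmost black column of the left glyph
    for r, c in left:
        right_edge[r] = max(c, right_edge.get(r, c))
    left_edge = {}   # row -> leftmost black column of the right glyph
    for r, c in right:
        left_edge[r] = min(c, left_edge.get(r, c))
    shared = right_edge.keys() & left_edge.keys()
    if not shared:
        return None
    return min(left_edge[r] - right_edge[r] - 1 for r in shared)


# Two slanted strokes, like "//": the boxes overlap but the ink does not.
a = {(0, 3), (1, 2), (2, 1), (3, 0)}
b = {(0, 6), (1, 5), (2, 4), (3, 3)}
print(box_gap(a, b), blob_gap(a, b))  # prints "-1 2"
```

For the slanted strokes the boxes overlap by one column while the ink is two pixels apart on every row, which is the kind of case where a box-based measure and an ink-based measure disagree.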


4) Failure to detect merged ti, vi, im, ll in merged_ti_vi_im_ll.pbm

This one is the most difficult. The last version of ocrad has fixed some problems like those, but there are lots of them (even with more than two letters merged). I'll try to fix as many as I can, but I don't promise anything.


The attached zip contains 6 files:

Next time, please, send the images to my email address, not to the list. Thanks.


I am wondering whether possible merged characters should be added as special characters, like TT, ti, etc., so that in the future it is easy to add such combinations.

I have in fact removed some such combinations from the last version of ocrad. There are just too many of them and trying to recognize them worsens recognition results.


Best regards,
Antonio.



