#1 2020-04-16 21:29:09


Inconsistent results converting short strings

I'm not sure this is a bug, but it surprises me how inconsistent tesseract is across its page segmentation modes.

I need to convert single small images that say either "Yes" or "No", for example, but we also have many other, more complex strings. For longer strings, we find that --psm 6 does the best job. For the short yes/no strings, psm 6 does not always work, and sometimes produces extra garbage or results that are far from right. Using psm 8 works better SOMETIMES, but not always.

Here is a small PNG file containing the single word 'Yes':
https://drive.google.com/open?id=1RZAOd … xoDKRnklYB

When I run tesseract using pytesseract, I get these results:
>>> print(pytesseract.image_to_string("C:/temp/4000-rois_image151.png", config='--psm 6 --oem 3 -l eng+spa'))
Yes
>>> print(pytesseract.image_to_string("C:/temp/4000-rois_image151.png", config='--psm 8 --oem 3 -l eng+spa'))
Yes 2 —<‘iO

So "single word mode" (psm 8) produces more than one word: "Yes" plus other garbage. Text block mode (psm 6), on the other hand, produces the correct "Yes".
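For anyone who wants to reproduce the comparison, here is a minimal sketch that runs one image through both modes and returns the raw text for each. It reuses the same config string as the calls above; the helper names build_config and compare_modes are made up for illustration and are not part of pytesseract.

```python
def build_config(psm, oem=3, langs="eng+spa"):
    """Build the same kind of config string used in the calls above."""
    return f"--psm {psm} --oem {oem} -l {langs}"


def compare_modes(image_path, psm_modes=(6, 8)):
    """Run one image through several page segmentation modes.

    Returns a dict mapping psm -> OCR text with surrounding whitespace stripped.
    """
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    results = {}
    for psm in psm_modes:
        text = pytesseract.image_to_string(image_path, config=build_config(psm))
        results[psm] = text.strip()
    return results


# Example usage (path taken from the post):
# for psm, text in compare_modes("C:/temp/4000-rois_image151.png").items():
#     print(psm, repr(text))
```

Printing with repr() makes stray whitespace and control characters in the output visible.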

Here is another example:
https://drive.google.com/open?id=1YUZ22 … SzK2ZFG_D1

Again, the image contains just "Yes". Here is the result:
(single word mode)  --psm 8: "Yes 2 —i(i‘sOSOSOS™~™S"
(block of text mode) --psm 6:  "Yes"

Strange that single-word mode again produces more than one word while the block of text mode produces a better result.
The big problem is that it is not consistent. Sometimes single-word mode works better:

Here is the single word 'No'
https://drive.google.com/open?id=1Pcs1U … CeJZuws4V8

single word mode --psm 8: "No"
text block mode   --psm 6: "No ee"

These words are very short and it is important to get them right. We use Levenshtein distance on most strings, but these are so short that we have to treat them as a special case.
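To illustrate the kind of special case meant here, below is a rough sketch, not our production code. It assumes the wanted word comes first in the OCR output (true for the noisy examples in this post), compares only that first token against a small vocabulary, and accepts it if the edit distance is at most 1. The function names, the vocabulary and the threshold are all assumptions for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify_short(ocr_text, vocabulary=("Yes", "No"), max_distance=1):
    """Map noisy OCR text to a vocabulary word, or None if nothing is close.

    Only the first whitespace-separated token is compared (case-insensitive),
    so trailing garbage after the word is ignored.
    """
    tokens = ocr_text.split()
    if not tokens:
        return None
    first = tokens[0].lower()
    best = min(vocabulary, key=lambda word: levenshtein(first, word.lower()))
    if levenshtein(first, best.lower()) <= max_distance:
        return best
    return None
```

With words this short, a threshold of 1 is already lenient (a two-letter token that is one edit away from "No" would be accepted), so the threshold should be tuned against real data.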

Here is another example to make the point.
https://drive.google.com/open?id=1FQ4k4 … Jv-VzCigbM
The image says "No".
single word mode --psm 8: "No"
text block mode   --psm 6: "LS"

The results are just not consistent from image to image. I don't really want to convert each image both ways just to try to get a good result. I am hoping the root cause is something simple and stupid that I am doing.
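If running both modes does turn out to be necessary, one way to limit the cost is to run the second mode only when the first does not return one of the expected words. This is a sketch under that assumption; ocr_with_fallback and tesseract_ocr are hypothetical helper names, and the order (8 first, then 6) is only a guess based on the examples above.

```python
def ocr_with_fallback(ocr, image_path, psm_order=(8, 6), expected=("Yes", "No")):
    """Try each psm in order and return (text, psm) for the first exact match.

    `ocr` is any callable taking (image_path, psm) and returning a string.
    Returns (None, None) if no mode yields one of the expected words.
    """
    for psm in psm_order:
        text = ocr(image_path, psm).strip()
        if text in expected:
            return text, psm
    return None, None


def tesseract_ocr(image_path, psm):
    """Adapter around pytesseract, keeping the other settings from the post."""
    # Imported here so ocr_with_fallback can be tested without pytesseract.
    import pytesseract
    return pytesseract.image_to_string(
        image_path, config=f"--psm {psm} --oem 3 -l eng+spa")


# Example usage:
# text, psm = ocr_with_fallback(tesseract_ocr, "C:/temp/4000-rois_image151.png")
```

Passing the OCR function in as an argument keeps the fallback logic separate from tesseract, so it can be exercised with a fake function.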

Any hints??


Last edited by raylutz (2020-04-16 21:52:36)


#2 2020-04-17 14:55:52


Re: Inconsistent results converting short strings

I think we have figured this out. The last example, which mode 6 converts to "LS", fails because there is a tiny bit of a horizontal line at the bottom of the image.
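One simple way to keep such a line fragment away from tesseract is to crop a few pixels off every edge before OCR. The sketch below uses Pillow and assumes the stray line sits within a small fixed margin of the border; the margin of 3 pixels and the helper names inset_box and crop_border are illustrative guesses, not something we have validated.

```python
def inset_box(width, height, margin):
    """Return a (left, upper, right, lower) box shrunk by `margin` on every side."""
    if margin < 0 or 2 * margin >= min(width, height):
        raise ValueError("margin must be non-negative and less than half the image size")
    return (margin, margin, width - margin, height - margin)


def crop_border(image_path, margin=3):
    """Open an image and return a copy with `margin` pixels removed from each edge."""
    # Imported here so inset_box can be used without Pillow installed.
    from PIL import Image

    with Image.open(image_path) as img:
        cropped = img.crop(inset_box(img.width, img.height, margin))
        cropped.load()  # make sure pixel data is read before the file is closed
        return cropped


# Example usage (pytesseract accepts a PIL image as well as a path):
# import pytesseract
# text = pytesseract.image_to_string(crop_border("C:/temp/4000-rois_image151.png"),
#                                    config="--psm 6 --oem 3 -l eng+spa")
```

A fixed margin will also cut into the text if the word touches the edge of the image, so this only fits crops that have some padding around the word.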

