#1 2015-08-17 14:31:24


How can I add more Unicode characters?

For the past week I've been trying to add support for more UTF-8 characters, but haven't had much luck.

I'm testing my efforts by running a simple script:

echo "begin test"
echo "latin - hello world"
echo "cyrillic - Здравейте свят"
echo "kana - こんにちはセカイ"
echo "end test"

which results in the following console output in unmodified Tesseract:

begin test
latin - hello world
cyrillic - Здpaвeйтe cвят
kana -
end test

Latin and Cyrillic print fine, but kana is unprintable and comes out blank. I tried adding kana to the existing default font bitmaps, but there seems to be a 256-character limit, and every slot is already taken by the original Latin and Cyrillic character set. I then created an entirely separate set of font bitmaps and cfg files for kana, using a Japanese TTF font and a modified tessfont tool (with appropriate changes to #define CUBECTYPE and cube2unichars[256]), but the extra characters still remain unprintable.
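To make the 256-character limit concrete, here is a minimal sketch of how a fixed 8-bit glyph table behaves. The function name toSlot and the slot layout are hypothetical and are not the engine's actual table; the only point is that any code point without a slot falls through to "no glyph", which would explain a blank line for kana.

```cpp
#include <cstdint>

// Hypothetical 8-bit glyph table: slot 0 is reserved to mean "no glyph".
// This layout is invented for illustration and is NOT the engine's real mapping.
std::uint8_t toSlot(std::uint32_t cp)
{
    // Printable ASCII maps to itself.
    if (cp >= 0x20 && cp <= 0x7E)
        return static_cast<std::uint8_t>(cp);

    // Cyrillic capitals U+0410..U+042F get 32 slots starting at 0x80.
    if (cp >= 0x0410 && cp <= 0x042F)
        return static_cast<std::uint8_t>(0x80 + (cp - 0x0410));

    // Everything else, including kana, has no slot and cannot be drawn.
    return 0;
}
```

The hiragana and katakana blocks (U+3040 to U+30FF) span 192 code point positions, so they cannot be squeezed into a single byte-indexed table whose slots are already in use.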

So I tried ignoring scripts and printing to console directly from inside the engine:

conoutf(CON_INIT, "Здpaвeйтe cвят");
conoutf(CON_INIT, "こんにちはセカイ");

I did get some output, but it was gibberish, not only for kana but now also for Cyrillic:

ГöГňpaГńeГŘДàe cГńДíДà

(I get similar gibberish in the terminal window when using logoutf(), while using boring old printf() handles it just fine.)

In UTF-8, most Cyrillic characters take 2 bytes and all kana characters take 3 bytes, which matches the length of the gibberish strings: each byte seems to come out as its own character. So I'm guessing the problem is multi-byte character recognition. I'm looking through rendertext.cpp, but I'm at a loss as to how I could modify it to recognize additional multi-byte characters and use the appropriate non-default font bitmaps for them.
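For reference, this is a minimal standalone sketch of the multi-byte recognition described above: it groups UTF-8 bytes into code points before anything is looked up in a font. The names utf8SequenceLength and decodeUtf8 are my own and are not engine functions, and the sketch does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence started by lead byte b,
// or 0 if b cannot start a sequence (continuation byte or invalid lead).
int utf8SequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: e.g. Cyrillic
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: e.g. kana
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: beyond the BMP
    return 0;
}

// Decodes a UTF-8 byte string into code points. Malformed or truncated
// sequences produce U+FFFD and the decoder resynchronises one byte later.
std::vector<std::uint32_t> decodeUtf8(const std::string &s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size())
    {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        int len = utf8SequenceLength(lead);
        if (len == 0 || i + static_cast<std::size_t>(len) > s.size())
        {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        bool ok = true;
        for (int k = 1; k < len; ++k)
        {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok)
        {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        out.push_back(cp);
        i += static_cast<std::size_t>(len);
    }
    return out;
}
```

Comparing decodeUtf8(text).size() with text.size() for the test strings shows the difference between the number of characters and the number of bytes; gibberish with one glyph per byte would be consistent with the text being drawn without a step like this.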

This is really starting to annoy me, especially since I get the feeling I'm just being stupid and have overlooked something obvious. I know this isn't as "exciting" to work on as other features of the engine, but I really would appreciate any help on this matter.

