#1 2015-08-17 14:31:24

windsock
Member

How can I add more Unicode characters?

For the past week I've been trying to add support for more UTF-8 characters, but haven't had much luck.

I'm testing my efforts by running a simple script:

echo "begin test"
echo "latin - hello world"
echo "cyrillic - Здравейте свят"
echo "kana - こんにちはセカイ"
echo "end test"

which results in the following console output in unmodified Tesseract:

begin test
latin - hello world
cyrillic - Здpaвeйтe cвят
kana -
end test

Latin and Cyrillic print fine, but the kana line comes out blank. I tried adding kana to the existing default font bitmaps, but there seems to be a 256-character limit, and every slot is already taken by the original Latin and Cyrillic character set. I then created an entirely separate set of font bitmaps and cfg files for kana, using a Japanese TTF font and a modified tessfont tool (making appropriate changes to #define CUBECTYPE and cube2unichars[256]), but the extra characters still don't print.
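
To illustrate the kind of split I have in mind for getting around a 256-entry table, here is a minimal sketch that divides a Unicode code point into a 256-glyph "page" and an index within that page. The names glyphref and findglyph are hypothetical and not part of the engine or tessfont; how pages would actually map to font bitmaps is an assumption on my part.

```cpp
#include <cstdint>

// Hypothetical: identifies one glyph as (page, index), where each page
// holds at most 256 glyphs, matching the size of a single font table.
struct glyphref
{
    uint32_t page;  // which 256-glyph block the code point belongs to
    uint32_t index; // position inside that block, 0..255
};

// Split a code point into its page (upper bits) and index (low 8 bits).
glyphref findglyph(uint32_t codepoint)
{
    return { codepoint >> 8, codepoint & 0xFF };
}
```

With this split, the Cyrillic block (U+0400 to U+04FF) is page 0x04, and both hiragana and katakana (U+3040 to U+30FF) land in page 0x30, so a single extra 256-glyph table would be enough for the kana in my test script.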

So I tried bypassing scripts and printing to the console directly from inside the engine:

conoutf(CON_INIT, "Здpaвeйтe cвят");
conoutf(CON_INIT, "こんにちはセカイ");

I did get some output, but it was gibberish, not only for kana but now also for Cyrillic:

ГöГňpaГńeГŘДàe cГńДíДà
ЯßòЯàòЯßĞЯßĆЯߣЯàŚЯàĞЯàč

(I get similar gibberish in the terminal window when using logoutf(), while using boring old printf() handles it just fine.)

In UTF-8, each of these Cyrillic characters takes 2 bytes and each kana character takes 3 bytes, which matches the length of the gibberish strings, as if every byte were being printed as its own character. So I'm guessing the problem is with multi-byte character recognition. I'm looking through rendertext.cpp, but I'm at a loss as to how to modify it to recognize additional multi-byte characters and use the appropriate non-default font bitmaps for them.
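
For reference, this is roughly the decoding step I think is missing somewhere: a minimal sketch that turns a UTF-8 byte string into Unicode code points. It is not engine code; decodeutf8 is a hypothetical helper name, and it does not reject overlong encodings or surrogates, so treat it as an illustration rather than a complete decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 bytes into code points. A malformed or truncated sequence
// produces U+FFFD for the offending byte and decoding resumes at the next byte.
std::vector<uint32_t> decodeutf8(const std::string &s)
{
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while(i < s.size())
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        uint32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the character occupies.
        if(c < 0x80) { cp = c; len = 1; }                       // ASCII
        else if((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; } // e.g. Cyrillic
        else if((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; } // e.g. kana
        else if((c & 0xF8) == 0xF0) { cp = c & 0x07; len = 4; }
        else { out.push_back(0xFFFD); i++; continue; }          // stray byte

        // Each continuation byte must look like 10xxxxxx and adds 6 bits.
        bool ok = i + len <= s.size();
        for(std::size_t j = 1; ok && j < len; j++)
        {
            unsigned char cc = static_cast<unsigned char>(s[i + j]);
            if((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        if(!ok) { out.push_back(0xFFFD); i++; continue; }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Fed the two bytes 0xD0 0x97 it yields the single code point U+0417 (З), and the three bytes 0xE3 0x81 0x93 yield U+3053 (こ), instead of two or three separate one-byte characters.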

This is really starting to annoy me, especially since I get the feeling I'm just being stupid and have overlooked something obvious. I know this isn't as "exciting" to work on as other features of the engine, but I really would appreciate any help on this matter.
