Apache OpenOffice (AOO) Bugzilla – Issue 75632
[surrogate pair] Japanese word break does not work properly for surrogate pair characters.
Last modified: 2013-08-07 15:01:20 UTC
- Word break. We use a dictionary for Japanese word breaking. There are currently no surrogate characters in the dictionary, so this should not be a problem for now. But it does not seem to work properly: when I press Ctrl+arrow key, the cursor does not move, and the word count under 'Tools' reports 0 words and double the number of characters. (Karl Hong)
Fixed.
SBA->Naoyuki: Please verify in CWS i18n31. Thank you. Reassigned to Naoyuki.
Naoyuki -> Karl: I cannot verify this as fixed. (i18n31 solsparc debug)
Steps:
1. Open a new Writer doc.
2. Enter the following (the last character is the surrogate-paired char U+2840C): abc日本語kanji𨌠
3. Tools -> Word count
Words: 3 <---- expected 4
Characters: 13 <---- expected 12
It seems one surrogate-paired char is counted as two characters and as no word character.
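The 12 versus 13 discrepancy above is what one would expect if characters are counted as UTF-16 code units rather than as code points. A minimal Python sketch of that arithmetic follows; the helper names `utf16_units` and `surrogate_pair` are hypothetical and only illustrate the encoding, they are not part of Writer or the i18n module.

```python
# Sketch only: shows why one supplementary character counts as two UTF-16 units.
# The helper names are illustrative assumptions, not OpenOffice APIs.

def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

def surrogate_pair(code_point):
    """Split a supplementary code point (> U+FFFF) into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# Sample string from the report, with the last character written as U+2840C.
sample = "abc日本語kanji" + "\U0002840C"

code_points = len(sample)       # Python strings are indexed by code point
units = utf16_units(sample)     # what a UTF-16 based length would report
high, low = surrogate_pair(0x2840C)

print(code_points, units, hex(high), hex(low))  # 12 13 0xd861 0xdc0c
```

The expected count of 12 matches the code point count, while the reported 13 matches the UTF-16 code unit count.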
SBA: Issue reopened.
This may just as well be a Writer problem, depending on how it determines the counts. At least for the character count, it is probably not prepared to encounter surrogates, so this would be independent of the i18n implementation and a separate issue. I don't know how Writer determines the word count at that place. Frank?
There seem to be two remaining issues:
1. The word count is incorrect with Naoyuki's sample string.
2. The character count is incorrect with the sample string.
Issue 2 is clearly a Writer issue, since the character count is calculated by adding the string lengths of the single words. Please file a separate issue for this. Issue 1 looks like a break iterator issue; it seems the mode WordType::WORD_COUNT does not work correctly.
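To illustrate why adding up per-word string lengths inflates the character count, here is a small Python sketch. It assumes the lengths are measured in UTF-16 code units and uses a hypothetical four-word segmentation of the sample string; neither assumption is taken from the Writer source.

```python
# Sketch only: a character count built by summing per-word lengths.
# ASSUMPTIONS: lengths are UTF-16 code units; the segmentation below is the
# expected four-word split, written by hand for illustration.

def utf16_len(text):
    """Length of text in UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2

words = ["abc", "日本語", "kanji", "\U0002840C"]

count_by_units = sum(utf16_len(w) for w in words)   # surrogate pair counts twice
count_by_code_points = sum(len(w) for w in words)   # each character counts once

print(count_by_units, count_by_code_points)  # 13 12
```

Any sum of UTF-16 lengths overcounts by one for each supplementary character, regardless of where the word boundaries fall.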
Filed issue 79562 for the character count with surrogate pairs.
Word count mode had a problem skipping white space. I have fixed it, but the word count still does not include surrogate characters, even though the StarBasic test program below gets the correct boundary in word count mode.
bk=createUnoService("com.sun.star.i18n.BreakIterator")
dim l as new com.sun.star.lang.Locale
l.Language="ja"
s="韠䩶䩮𩈚𩅅"
t=com.sun.star.i18n.WordType.WORD_COUNT
b=bk.nextWord(s, 0, l, t, true)
print b.startPos, b.endPos
s contains a list of surrogate characters, and the program prints "2 4". I will update 79562 for Writer to fix it.
Ready for QA.
SBA: Issue 79562 is about character count. I understand from Karl's last comment that he wanted to add information about word count to that one. Or do we need another issue? SBA->Naoyuki: Can we regard this one (word break does not work for surrogate pairs) as verified for this CWS (as far as Karl and the break iterator are concerned)?
I confirmed that the BreakIterator in i18n31 recognizes one surrogate-paired char as one word with Karl's Basic example, although the printed char count is wrong ("2, 4" should be "1, 2"). I agree with marking this issue as verified (keep 79562 for the char count issue).
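The difference between "2, 4" and "1, 2" can be read as the same boundary expressed in UTF-16 code unit offsets versus character (code point) offsets. The Python sketch below converts one into the other; the function name and the test string are assumptions for illustration, using a string made up entirely of supplementary characters as Karl's comment describes.

```python
# Sketch only: map a UTF-16 code unit offset to a code point index.
# The function name and the sample string are illustrative assumptions.

def utf16_offset_to_code_point_index(text, unit_offset):
    """Return the index of the first code point starting at or after unit_offset."""
    units = 0
    for index, ch in enumerate(text):
        if units >= unit_offset:
            return index
        # Supplementary characters occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

# Hypothetical string of three supplementary characters (two units each).
s = "\U0002921A\U00029145\U0002840C"

start = utf16_offset_to_code_point_index(s, 2)
end = utf16_offset_to_code_point_index(s, 4)
print(start, end)  # 1 2
```

Under that assumption the boundary reported as 2..4 in code units covers exactly one character, the second one, which is consistent with one surrogate pair being treated as one word.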
rbircher -> sba and the rest: What is the status of this issue? Close, reopen, or something else? Adding myself to CC.
SBA->naoyuki: If you can re-verify this in the current DEV300 master, please do so and close (better late than never :-). However, a sample doc that lets non-Japanese speakers understand and verify the surrogate pair thingie would be a good idea as well. Please comment, thank you.
I've verified this with DEV300m21 Writer. Writer Tool -> Character count shows a word count of 3 with the sample string below (the displayed character count of 8 is wrong, though): <surrogate paired char>Book<surrogate paired char> Although the character count problem 79562 is not fixed yet, I am closing this, as the word count seems fine now.