Issue 75632 - [surrogate pair] Japanese word break does not work properly for surrogate pair characters.
Status: CLOSED FIXED
Alias: None
Product: Internationalization
Classification: Code
Component: code
Version: OOo 2.2 RC2
Hardware: All All
Importance: P4 Trivial
Target Milestone: ---
Assignee: naoyuki
QA Contact: issues@l10n
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-22 06:38 UTC by naoyuki
Modified: 2013-08-07 15:01 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description naoyuki 2007-03-22 06:38:21 UTC
- Word break. We use a dictionary for Japanese word breaks. The dictionary
currently contains no surrogate characters, so this should not be a problem yet.
But it does not seem to work properly: when I press Ctrl+arrow, the cursor does
not move, and the word count under 'Tools' reports 0 words and double the number
of characters. (Karl Hong)
Comment 1 karl.hong 2007-07-05 08:31:34 UTC
Fixed. 
Comment 2 stefan.baltzer 2007-07-10 13:23:56 UTC
SBA->Naoyuki: Please verify in CWS i18n31. Thank you.
Reassigned to Naoyuki.
Comment 3 naoyuki 2007-07-12 08:22:08 UTC
Naoyuki -> Karl

I cannot verify this as fixed. (i18n31 solsparc debug)
Steps
1. open new Writer doc
2. enter (the last character is the surrogate-pair character U+2840C)
abc日本語kanji𨐌
3. Tools -> Word count
Words: 3 <---- expected 4
Characters: 13 <---- expected 12

It seems the surrogate-pair character is counted as two characters and is not counted as a word.


Comment 4 stefan.baltzer 2007-07-12 10:21:46 UTC
SBA: Issue reopened.
Comment 5 ooo 2007-07-12 10:49:02 UTC
This may as well be a problem of the Writer, depending on how it tries to
determine the counts. At least for the character count it probably is not
prepared to encounter surrogates, so this would be independent of the i18n
implementation and a separate issue. Don't know how Writer determines the word
count at that place. Frank?
Comment 6 frank.meies 2007-07-12 11:27:39 UTC
There seem to be two remaining issues:

1. Word count is incorrect with Naoyuki's sample string
2. Character count is incorrect with the sample string

2. Is clearly a Writer issue, since the character count is calculated by adding
the string lengths of the single words. Please file a separate issue for this.

1. Looks like a break iterator issue, seems like the mode WordType::WORD_COUNT
does not work correctly.
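
A minimal sketch of why adding up per-word string lengths can overcount, assuming the lengths are measured in UTF-16 code units. The helper name utf16_len and the segmentation into four words are hypothetical, chosen to match the sample string and the expected counts from Comment 3. This is not Writer's actual code.

```python
def utf16_len(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

# Hypothetical segmentation of the Comment 3 sample into the four expected words.
# The last word is the single supplementary character U+2840C.
words = ["abc", "日本語", "kanji", "\U0002840C"]

# Summing UTF-16 lengths counts the surrogate pair as two characters.
units = sum(utf16_len(w) for w in words)

# Summing code points counts it once.
code_points = sum(len(w) for w in words)

print(units, code_points)
```

Under these assumptions, the first sum reproduces the observed character count and the second the expected one.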
Comment 7 naoyuki 2007-07-13 09:50:11 UTC
Filed issue 79562 for the character count of surrogate pairs.
Comment 8 karl.hong 2007-07-17 22:13:17 UTC
Word count mode had a problem in skipping white space. I have fixed it, but the
word count still does not include surrogate characters, even though the StarBasic
test program below gets the correct boundary in word count mode.

bk=createUnoService("com.sun.star.i18n.BreakIterator")
dim l as new com.sun.star.lang.Locale
l.Language="ja"
s="韠䩶䩮𩈚𩅅"

t=com.sun.star.i18n.WordType.WORD_COUNT

b=bk.nextWord(s, 0, l, t, true)

print b.startPos, b.endPos

s contains a list of surrogate characters and it prints "2 4". 

I will update issue 79562 so that it gets fixed in Writer.
Comment 9 karl.hong 2007-07-17 22:21:30 UTC
Ready for QA.
Comment 10 stefan.baltzer 2007-07-20 13:35:06 UTC
SBA: Issue 79562 is about character count. I understand from Karl's last comment
that he wanted to add information about word count to that one. Or do we need
another issue?

SBA->Naoyuki: Can we regard this one (word break does not work for surrogate
pairs) as verified for this CWS (as far as Karl and the break iterator are
concerned)?

Comment 11 naoyuki 2007-07-23 06:46:06 UTC
I confirmed with Karl's Basic example that BreakIterator in i18n31 recognizes one
surrogate paired char as one word, although the printed char count is wrong
("2, 4" should be "1, 2"). I agree with marking this issue as verified (keeping
79562 for the char count issue).
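
The relationship between "2, 4" and "1, 2" can be sketched as a conversion from UTF-16 code unit offsets to code point offsets. The function name utf16_to_code_point_offset is hypothetical, and the test string is simply U+2840C repeated twice rather than Karl's original string. This is an illustration, not part of the BreakIterator API.

```python
def utf16_to_code_point_offset(text: str, utf16_offset: int) -> int:
    """Map an offset in UTF-16 code units to an offset in code points.

    An offset that falls inside a surrogate pair is rounded up to the
    next code point boundary.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters above U+FFFF need two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

# Two supplementary characters: four UTF-16 code units, two code points.
s = "\U0002840C" * 2

start = utf16_to_code_point_offset(s, 2)
end = utf16_to_code_point_offset(s, 4)
print(start, end)
```

For a string made only of supplementary characters, each boundary in code units is twice the corresponding boundary in characters.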

Comment 12 Raphael Bircher 2008-06-18 22:56:59 UTC
rbircher -> sba and the rest

What is the status of this issue? Close, reopen, or something else?

Adding myself to CC.
Comment 13 stefan.baltzer 2008-06-19 15:25:16 UTC
SBA->naoyuki: If you can re-verify this in current DEV300 master, please do so
and close (better late than never :-). However, a sample doc that lets non-Japanese
speakers understand and verify the surrogate pair thingie would be a good idea as well.
Please comment, thank you.
Comment 14 naoyuki 2008-06-30 09:41:53 UTC
I've verified this with DEV300m21 Writer.

Writer's Tools -> Word count shows a word count of 3 for the sample string
below. (The character count shown, 8, is wrong, though.)

<surrogate paired char>Book<surrogate paired char>

Although the character count problem (issue 79562) is not fixed yet, I am closing
this one because the word count seems fine now.