Apache OpenOffice (AOO) Bugzilla – Issue 51661
Quote marks in 2.0 Hebrew wordbreaking
Last modified: 2013-08-07 15:03:05 UTC
Hebrew wordbreaking in 1.1 does not treat a double quote mark as the end of a word. This is correct behavior. In 2.0 beta m104 this is no longer the case, and a quote mark *is* treated as a word-breaker. See the thread at: http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=935705 for details. I have tried to integrate my changes to 1.1 into m104, but unsuccessfully. I'm posting the changes I made to m104 so that others can examine them and find out what's wrong or missing.
Created attachment 27755 [details] Changes to existing breakiterator code
Created attachment 27756 [details] New file rules for Hebrew wordbreaking
Grabbing issue.
Hi Karl, As I won't find the time to dive into this in the next days/weeks, could you please have a look at this one? See also the mails on the dev@l10n list in the thread starting with the message mentioned above. If there is an easy solution, just add it to my CWS locales201. Thanks Eike
I assume that both edit and dictionary modes need to treat the double quote as part of the word. I created two files, edit_word_he.txt and dict_word_he.txt. I tested them with another language; if someone could upload a Hebrew file with double quotes for me to test, that would be great. Thanks in advance.
Created attachment 29837 [details] Hebrew test file with quotes for Karl
Thanks, Alan. Ready for QA. re-open issue and reassign to oc@openoffice.org
reassign to oc@openoffice.org
reset resolution to FIXED
Hi Stefan, please take over re-open issue and reassign to sba@openoffice.org
reassign to sba@openoffice.org
Karl, there still seems to be a problem when I try to spellcheck the sample document. I'm attaching a screenshot. Note that in the 6th and 7th lines, toward the left, there are two identical words, one above the other, that are marked as misspelled. Those words have double-quotes in the middle, but the red line stops at the double-quote. It should continue past the quote, to the end of the word. The same problem exists in the text in the top-right cell of the table. Also in the left cell of the table's second row.
Created attachment 29960 [details] Sample doc - note misspelled words broken by a double-quote
SBA->ayaniger: When I compare the CWS build and an OOo installation WITHOUT this break iterator patch, I see no difference in the treatment of quotes. I will attach a document with a couple of "quotes" (single and double). Their Unicode IDs are 2018, 2019, 201B, 201C, 201E, 05F2, 05F4, 05D9, 05F3. Please comment: (1) Which ones should be treated as "character" and which ones as "quote" (=word limiter)? (2) Which ones are commonly used (= can be inserted directly) when typing Hebrew? Since I see no difference between the two builds, I must regard this issue as "not fixed". -> Back to NEW and reassigned to Karl. re-open issue and reassign to khong@openoffice.org
reassign to khong@openoffice.org
Reopened.
Created attachment 30187 [details] 9 different quote characters in Hebrew words
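For reference, the code points listed for the test document can be looked up by name and general category with Python's standard `unicodedata` module. This is only an illustrative sketch and is not part of the OpenOffice code base; the list adds U+0022, the ASCII double quote this issue is about, and the helper name `describe` is made up for the example.

```python
import unicodedata

# Code points from the test document, plus U+0022 (the ASCII double quote
# that this issue is about). All values are hexadecimal Unicode code points.
CODE_POINTS = [0x0022, 0x2018, 0x2019, 0x201B, 0x201C, 0x201E,
               0x05F2, 0x05F4, 0x05D9, 0x05F3]


def describe(cp):
    """Return 'U+XXXX <general category> <character name>' for a code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch)}"


for cp in CODE_POINTS:
    print(describe(cp))
```

The category column separates letters (`Lo`) from punctuation (`P*`), which is a useful first cut when deciding which characters a break iterator should keep inside a word.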
Karl->SBA, None of your quotes is the one they want. They want the English, or ASCII, double quote (0022). You can see it as $MidLetter in the attachment "New file rules for Hebrew wordbreaking". I made both word-type modes, dictionary and edit, take (0022) as a mid letter. In Alan's attached document, HebrewQuoteTest.odt, when you travel by word (Ctrl+Arrow key), you will see that (") is part of a word as a mid letter. Karl->Alan, I don't have a Hebrew spellchecker, so I could not see what you see in your screenshot. To test word breaking in the spellchecker, which uses DICTIONARY_WORD mode, here is a StarBasic program; you can change the language and get different word boundaries:

Sub Main
    bi = createUnoService("com.sun.star.i18n.BreakIterator")
    dim lo as new com.sun.star.lang.Locale
    lo.Language = "he"
    ty = com.sun.star.i18n.WordType.DICTIONARY_WORD
    st = "aa" + chr$(34) + "b"
    bd = bi.getWordBoundary(st, 0, lo, ty, TRUE)
    print st, bd.startPos, bd.endPos
    st = "כע" + chr$(34) + "ז"
    bd = bi.getWordBoundary(st, 0, lo, ty, TRUE)
    print st, bd.startPos, bd.endPos
End Sub
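The mid-letter idea can also be illustrated outside OpenOffice. The following Python sketch is not the breakiterator implementation or its rule file; it is a simplified model that assumes a word is a run of Unicode letters and that a mid-letter character joins two runs only when it has a letter on both sides. The names `word_spans` and `MID_LETTERS` are hypothetical.

```python
import re

# Assumption for this sketch: only U+0022 (ASCII double quote) is a mid letter.
MID_LETTERS = '"'

# Matches a single Unicode letter (a word character that is not a digit or "_").
LETTER = r"[^\W\d_]"


def word_spans(text, mid_letters=MID_LETTERS):
    """Return (start, end) spans of words in text.

    A mid-letter character is kept inside a word only when it sits between
    two letters; a leading or trailing one stays outside the word.
    """
    if mid_letters:
        pattern = rf"{LETTER}+(?:[{re.escape(mid_letters)}]{LETTER}+)*"
    else:
        pattern = rf"{LETTER}+"
    return [m.span() for m in re.finditer(pattern, text)]


print(word_spans('aa"b'))                  # [(0, 4)]: quote is part of the word
print(word_spans('aa"b', mid_letters=""))  # [(0, 2), (3, 4)]: quote breaks the word
print(word_spans('כע"ז'))                  # [(0, 4)]
```

With the mid-letter rule enabled, both sample strings from the StarBasic test come back as a single four-character word, which is the behavior the spellchecker is expected to follow.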
ayaniger->sba, Yes, Karl is correct, we are referring to the English ASCII double-quote. However, it would be proper to treat all the other characters you listed in the same way, as "characters", and not as word-breakers. Take a look at Jonathan Ben-Avraham's background explanation, which I quoted in my comments to Issue 51772. ayaniger->Karl, Word travel using Ctrl-<Arrow> does jump over the quote marks. I ran your StarBasic program, and saw the results, which also show that the quote marks do not break the word. Nevertheless, in spell checking the word is broken at the quote marks. I am attaching Hebrew dictionaries and dictionary.lst, which you can install in share/dict/ooo, so you can take a look.
Created attachment 30203 [details] Hebrew dictionary files and dictionary.lst
SBA: I am correcting the status to "Fixed". Thomas Lange is digging a little into Karl's code in order to find out why the Hebrew spellchecker is not accepting the entire word (with ASCII 0022) while cursor travelling behaves like "this is one word". The outcome will probably lead to another issue that will not be fixed within this CWS.
SBA: Verified in CWS i18n20. Follow up is issue 51772.
SBA: OK in Master (and still OK in OOo 2.02). Closed