Apache OpenOffice (AOO) Bugzilla – Issue 74002
%xx notation in a hyperlink does not work properly
Last modified: 2017-05-20 11:30:58 UTC
Problem
A hyperlink does not work under certain circumstances. If the URL of a hyperlink includes characters in %xx and/or &#xx; notation that are encoded in something other than UTF-8, unexpected garbage characters may appear.

Example
Look at the following URL, encoded in EUC-JP:
http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3
There are 9 Japanese characters in it:
%A5%BB %A5%F3 %A5%C8 %A1%A6 %A5%D8 %A5%EC %A5%F3 %A5%BA %BB%B3
According to Section 3.9 "Unicode Encoding Forms" [1], the following byte sequences can be mistakenly recognized as UTF-8:
%C8%A1 %D8%A5 %F3%A5%BA%BB
In the attached example file, the sequences mentioned above are forcibly converted into characters.

One possible solution:
1. Try to convert the string in %xx and/or &#xx; notation into UCS-2.
2a. If a conversion error occurs, leave the string untouched.
2b. If the conversion into UCS-2 succeeds, substitute the converted string for the original.

For the example above, the first byte, %A5, cannot start a UTF-8 sequence, so it cannot be converted into UCS-2. Therefore the string, that is, the whole URL, should remain in its original form.

Acknowledgements
toumatsu first reported this phenomenon at the OOo FAQ site [2], and then kimotomasaya and M.Kamataki confirmed it.

[1] http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404
[2] http://oooug.jp/faq/index.php?faq%2F4%2F227
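For illustration only, here is a minimal Python sketch that scans the EUC-JP example above for byte runs that happen to form a well-formed multi-byte UTF-8 sequence. It is not OOo code; the helper names `is_one_utf8_char` and `utf8_lookalikes` are made up for this sketch, and the left-to-right scan is an assumption about how such look-alikes would be picked up.

```python
from urllib.parse import unquote_to_bytes


def is_one_utf8_char(chunk):
    """Return True if chunk is exactly one well-formed UTF-8 character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


def utf8_lookalikes(escaped):
    """List the %xx runs that form a valid multi-byte UTF-8 sequence.

    Hypothetical helper: scans left to right and skips one byte
    whenever no valid sequence starts at the current position.
    """
    data = unquote_to_bytes(escaped)
    found = []
    i = 0
    while i < len(data):
        step = 1
        for n in (2, 3, 4):  # possible multi-byte sequence lengths
            chunk = data[i:i + n]
            if len(chunk) == n and is_one_utf8_char(chunk):
                found.append("".join("%%%02X" % b for b in chunk))
                step = n
                break
        i += step
    return found


path = "%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
print(utf8_lookalikes(path))
```

Under these assumptions the scan reports the same three sequences listed in the example: %C8%A1, %D8%A5 and %F3%A5%BA%BB.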
Created attachment 42564 [details] A sample bugdoc.
@ es: Is this yours?
ES->OF: no, this is not a Writer issue (it happens in all modules). Reassigning to MBA.
.
Stephan, can you please shed some light on this?
@tora: I am not sure I fully understand yet what your problem is. Both hyperlinks in 74002.ods work for me (they at least bring up existing web pages in the browser, hopefully the right ones). I assume your problem is that the second hyperlink, as displayed by OOo, contains a mixture of %xx encodings and (bogus) interpretations of some of those %xxs as UTF-8 characters (where in fact they all represent EUC-JP characters).

That is the consequence of how INetURLObject's DECODE_TO_IURI DecodeMechanism works (see tools/inc/urlobj.hxx:1.36 l. 227--232). It was assumed that people would like to see URLs displayed with readable characters rather than %xx escapes (as with the first hyperlink in 74002.ods), and that most URLs would encode UTF-8 data (again, as with the first hyperlink in 74002.ods). To do so, it uses the heuristic of interpreting any %xx escapes that can be interpreted as UTF-8 data as such, and leaving any other %xxs alone.

A fix might be to change DECODE_TO_IURI to not interpret any %xx as UTF-8 data if at least one contained %xx cannot be interpreted as UTF-8. For the given 74002.ods, that modified heuristic would cause the first hyperlink to still be displayed with readable characters rather than %xxs, and the second hyperlink to be displayed with only the original %xxs and no bogus UTF-8-interpreted characters. Would that suit your needs?
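A rough Python sketch of the modified all-or-nothing heuristic, for discussion only. It is not the INetURLObject implementation: `display_form` is a hypothetical name, the UTF-8 sample URL is made up (it is not the first hyperlink in 74002.ods), and the sketch ignores that a real decoder would presumably keep reserved characters such as %2F escaped.

```python
from urllib.parse import unquote_to_bytes


def display_form(url):
    """Return a readable form of url, or url unchanged.

    Hypothetical helper: the %xx escapes are decoded only if the
    whole unescaped byte string is well-formed UTF-8. A single
    escape that does not fit leaves every escape untouched.
    """
    try:
        return unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return url


# Made-up URL whose escapes are valid UTF-8: shown with readable characters.
utf8_url = "http://xxx/%E6%97%A5%E6%9C%AC"
print(display_form(utf8_url))

# EUC-JP URL from this issue: %A5 cannot start a UTF-8 sequence,
# so the URL is returned exactly as it was given.
eucjp_url = "http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
print(display_form(eucjp_url) == eucjp_url)
```

Compared with decoding escape by escape, this never produces a mixture of decoded characters and leftover %xxs, at the cost of showing a URL fully escaped when even one escape is not UTF-8.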
@sb: Exactly, you understand user's needs.
ok, accepted
Reset assignee to the default "issues@openoffice.apache.org".