Issue 74002 - %xx notation in a hyperlink does not work properly
Summary: %xx notation in a hyperlink does not work properly
Status: ACCEPTED
Alias: None
Product: General
Classification: Code
Component: code
Version: OOo 2.1
Hardware: All All
Importance: P3 Trivial with 2 votes
Target Milestone: AOO Later
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-01-30 07:26 UTC by tora3
Modified: 2017-05-20 11:30 UTC
CC List: 5 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
A sample bugdoc. (7.72 KB, application/vnd.oasis.opendocument.spreadsheet)
2007-01-30 07:28 UTC, tora3

Description tora3 2007-01-30 07:26:46 UTC
Problem
A hyperlink does not work under certain circumstances.
If the URL of a hyperlink includes characters in %xx and/or &#xx; notation 
that are encoded in something other than UTF-8, unexpected garbage 
characters may appear.

Example
Look at the following URL encoded in EUC-JP:
http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3 

There are 9 Japanese characters in it:
%A5%BB
%A5%F3
%A5%C8
%A1%A6
%A5%D8
%A5%EC
%A5%F3
%A5%BA
%BB%B3 

According to Section 3.9 "Unicode Encoding Forms" [1], the following 
byte sequences in this URL are well-formed UTF-8 and can therefore be 
mistaken for UTF-8 characters:
%C8%A1
%D8%A5
%F3%A5%BA%BB
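
The list above can be reproduced with a short script. This is only a rough sketch in Python, not OOo code; the function name utf8_lookalikes is made up for illustration. It scans the unescaped bytes of the example path and collects every multi-byte sequence that a strict UTF-8 decoder accepts as a single character.

```python
from urllib.parse import unquote_to_bytes


def utf8_lookalikes(raw: bytes) -> list[bytes]:
    """Collect multi-byte sequences in raw that are well-formed UTF-8.

    Hypothetical helper for illustration only.
    """
    found = []
    i = 0
    while i < len(raw):
        for length in (2, 3, 4):
            chunk = raw[i:i + length]
            try:
                text = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            # Accept only a full-length chunk that forms exactly one character.
            if len(chunk) == length and len(text) == 1:
                found.append(chunk)
                i += length
                break
        else:
            i += 1  # this byte does not start a well-formed sequence
    return found


# The EUC-JP encoded path from the example URL.
raw = unquote_to_bytes(
    "%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
)
for seq in utf8_lookalikes(raw):
    print("".join("%{:02X}".format(b) for b in seq))
```

For the example path it reports exactly the three sequences listed above, while the same bytes decode cleanly as 9 characters when read as EUC-JP.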

In the attached example file, the sequences mentioned above are forcibly 
converted into characters. 

One possible solution:
 1. Try to convert a string in %xx and/or &#xx; notation into UCS-2.
 2a. If a conversion error occurs, leave the string untouched. 
 2b. If the conversion into UCS-2 is successful, substitute the string 
     with the converted one.

For the example above, the first byte, %A5, is not a valid leading byte in 
UTF-8, thus it cannot be converted into UCS-2. Therefore, the string, i.e. 
the whole URL, should remain in its original form.
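
The proposal can be sketched as follows. This is a minimal illustration in Python under several assumptions, not a patch for OOo: the function name is hypothetical, a Python string stands in for UCS-2, only %xx escapes are handled (not &#xx;), and every escape is unescaped, including reserved characters such as %2F that a real implementation would probably keep escaped.

```python
from urllib.parse import unquote_to_bytes


def decode_escapes_all_or_nothing(url: str) -> str:
    """Return url with its %xx escapes decoded as UTF-8, or unchanged on error.

    Hypothetical helper for illustration only.
    """
    try:
        # Steps 1 and 2b: strict decoding either converts every escape...
        return unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        # Step 2a: ...or the whole string is left exactly as it was.
        return url


euc_jp_url = (
    "http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
)
utf8_url = "http://xxx/%E5%B1%B1"

print(decode_escapes_all_or_nothing(euc_jp_url))
print(decode_escapes_all_or_nothing(utf8_url))
```

With this approach the EUC-JP URL is returned unchanged, because %A5 already fails strict decoding, while a URL whose escapes are all valid UTF-8 is shown with readable characters.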

Acknowledgements
toumatsu first reported this phenomenon at the OOo FAQ site [2], and then 
kimotomasaya and M.Kamataki confirmed it.

[1] http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404 
[2] http://oooug.jp/faq/index.php?faq%2F4%2F227
Comment 1 tora3 2007-01-30 07:28:05 UTC
Created attachment 42564
A sample bugdoc.
Comment 2 Olaf Felka 2007-01-30 08:53:35 UTC
@ es: Is this yours?
Comment 3 eric.savary 2007-01-30 10:55:52 UTC
ES->OF: no, this is not a Writer issue (happens in all modules). Reassigning
to MBA.
Comment 4 eric.savary 2007-01-30 10:56:35 UTC
.
Comment 5 Mathias_Bauer 2007-02-02 10:20:17 UTC
Stephan, can you please shed some light on this?
Comment 6 Stephan Bergmann 2007-02-02 10:53:14 UTC
@tora:  I do not know whether I fully understand yet what your problem is.  Both
hyperlinks in 74002.ods work for me (they at least bring up existing web pages
in the browser, hopefully the right ones).  I assume your problem is that the
second hyperlink, as displayed by OOo, contains a mixture of %xx encodings and
(bogus) interpretations of some of those %xxs as UTF-8 characters (where in fact
they all represent EUC-JP characters).

That is the consequence of how INetURLObject's DECODE_TO_IURI DecodeMechanism
works (see tools/inc/urlobj.hxx:1.36 l. 227--232).  It was assumed that people
would like to see URLs displayed with readable characters rather than %xx
escapes (as with the first hyperlink in 74002.ods), and that most URLs would
encode UTF-8 data (again, as with the first hyperlink in 74002.ods).  To do so,
it uses the heuristic of interpreting any %xx escapes as representing UTF-8 data
that can be so interpreted, and leaving any other %xxs alone.
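
To illustrate the heuristic described above, here is a rough sketch in Python. It is not the INetURLObject code, only an approximation of the described behaviour under my own assumptions; the names ESCAPE_RUN, decode_run_greedily and display_form are made up. Within each run of %xx escapes, it turns any escapes that form one well-formed multi-byte UTF-8 character into that character and leaves every other escape alone.

```python
import re

# A run of one or more consecutive %xx escapes.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_run_greedily(run: str) -> str:
    """Decode the parts of an escape run that look like UTF-8 (hypothetical)."""
    raw = bytes.fromhex(run.replace("%", ""))
    out = []
    i = 0
    while i < len(raw):
        for length in (2, 3, 4):
            chunk = raw[i:i + length]
            try:
                text = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            if len(chunk) == length and len(text) == 1:
                out.append(text)  # looks like UTF-8: show the character
                i += length
                break
        else:
            out.append("%{:02X}".format(raw[i]))  # keep the escape as it is
            i += 1
    return "".join(out)


def display_form(url: str) -> str:
    """Return the URL as it would be displayed under this sketch."""
    return ESCAPE_RUN.sub(lambda m: decode_run_greedily(m.group(0)), url)


print(display_form(
    "http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
))
```

For the EUC-JP URL from the description, the result is a mixture: most escapes stay as they are, but %C8%A1, %D8%A5 and %F3%A5%BA%BB are replaced by unrelated characters.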

A fix might be to change DECODE_TO_IURI to not interpret any %xx as UTF-8 data
if at least one contained %xx cannot be interpreted as UTF-8.  For the given
74002.ods, that modified heuristic would cause the first hyperlink to still be
displayed with readable characters rather than %xxs, and the second hyperlink
to be displayed with only the original %xxs and no bogus UTF-8-interpreted
characters.  Would that suit your needs?
Comment 7 tora3 2007-02-02 14:59:21 UTC
@sb: Exactly, you understand the user's needs. 
Comment 8 Stephan Bergmann 2007-02-02 15:50:38 UTC
ok, accepted
Comment 9 Marcus 2017-05-20 11:30:58 UTC
Reset assignee to the default "issues@openoffice.apache.org".