74002 – %xx notation in a hyperlink does not work properly

Summary: %xx notation in a hyperlink does not work properly

Status:	ACCEPTED

Alias:	None

Product:	General
Classification:	Code
Component:	code
Version:	OOo 2.1
Hardware:	All All

Importance:	P3 Trivial with 2 votes
Target Milestone:	AOO Later
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2007-01-30 07:26 UTC by tora3
Modified:	2017-05-20 11:30 UTC
CC List:	5 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
A sample bugdoc. (7.72 KB, application/vnd.oasis.opendocument.spreadsheet) 2007-01-30 07:28 UTC, tora3	no flags

Description tora3 2007-01-30 07:26:46 UTC

Problem
A hyperlink does not work under certain circumstances.
If the URL of a hyperlink includes characters in %xx and/or &#xx; notation 
that are encoded in something other than UTF-8, unexpected garbage characters 
may appear.

Example
Look at the following URL encoded in EUC-JP:
http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3 

There are 9 Japanese characters in it:
%A5%BB
%A5%F3
%A5%C8
%A1%A6
%A5%D8
%A5%EC
%A5%F3
%A5%BA
%BB%B3 
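
A minimal sketch in Python of the two readings of these escapes. It assumes 
nothing beyond the standard library; the variable names are illustrative only. 
It decodes the escapes as EUC-JP, then checks whether the same bytes form 
well-formed UTF-8 as a whole.

```python
from urllib.parse import unquote

# Path segment copied from the example URL above.
ESCAPED = "%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"

# Interpret the escaped bytes as EUC-JP, the encoding the URL was written in.
as_euc_jp = unquote(ESCAPED, encoding="euc_jp", errors="strict")
print(len(as_euc_jp))  # 9, matching the count above

# The same bytes, taken as a whole, are not well-formed UTF-8.
raw = bytes.fromhex(ESCAPED.replace("%", ""))
try:
    raw.decode("utf-8")
    whole_string_is_utf8 = True
except UnicodeDecodeError:
    whole_string_is_utf8 = False
print(whole_string_is_utf8)  # False
```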

According to Section 3.9 "Unicode Encoding Forms" [1], the following 
byte sequences can be mistaken for UTF-8:
%C8%A1
%D8%A5
%F3%A5%BA%BB

In the attached example file, the sequences mentioned above are forcibly 
converted into characters. 
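
The accidental matches can be reproduced with a short Python sketch. It assumes 
a simple left-to-right scan that accepts the first 2-, 3- or 4-byte run that is 
well-formed UTF-8; that scan order is an assumption made for illustration, not 
a description of the actual OOo code, and find_utf8_lookalikes is a 
hypothetical helper name.

```python
# Bytes of the example URL's path segment (EUC-JP encoded).
RAW = bytes.fromhex("A5BBA5F3A5C8A1A6A5D8A5ECA5F3A5BABBB3")


def find_utf8_lookalikes(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, byte_run) for every multi-byte run that is well-formed UTF-8.

    Hypothetical helper: scans left to right and, at each offset, tries runs of
    2, 3 and 4 bytes with Python's strict UTF-8 decoder.
    """
    hits = []
    i = 0
    while i < len(data):
        match = None
        for length in (2, 3, 4):
            chunk = data[i:i + length]
            if len(chunk) < length:
                break  # not enough bytes left for this or any longer run
            try:
                text = chunk.decode("utf-8")  # strict: raises on ill-formed input
            except UnicodeDecodeError:
                continue
            if len(text) == 1:  # exactly one character encoded by the whole run
                match = chunk
                break
        if match is None:
            i += 1
        else:
            hits.append((i, match))
            i += len(match)
    return hits


hits = find_utf8_lookalikes(RAW)
for offset, run in hits:
    escaped = "".join(f"%{b:02X}" for b in run)
    print(offset, escaped, f"U+{ord(run.decode('utf-8')):04X}")
```

Under these assumptions the scan reports exactly the three runs listed above, 
each of which straddles the boundary between two EUC-JP characters.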

One possible solution:
 1. Try to convert a string in %xx and/or &#xx; notation into UCS-2.
 2a. If a conversion error occurs, leave the string untouched. 
 2b. If the conversion into UCS-2 is successful, substitute the string 
     with the converted one.

For the example above, the first byte, %A5, is not a valid first byte of a 
UTF-8 sequence, so it cannot be converted into UCS-2. Therefore the string, 
that is the whole URL, should remain in its original form.
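
The steps above can be sketched in Python as an all-or-nothing conversion. 
This is only an illustration under stated assumptions: decode_for_display is a 
hypothetical name, the sketch handles %xx escapes only (not &#xx; notation), it 
leaves escapes of ASCII bytes such as %2F alone so that the URL keeps its 
meaning, and it produces a Python str rather than UCS-2.

```python
import re

# Runs of escapes whose bytes are >= 0x80; ASCII escapes are left untouched.
_HIGH_ESCAPE_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def decode_for_display(url: str) -> str:
    """Hypothetical helper: decode %xx escapes only if all of them fit UTF-8."""
    decoded_runs = []
    for run in _HIGH_ESCAPE_RUN.findall(url):
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            decoded_runs.append(raw.decode("utf-8"))  # step 1: try to convert
        except UnicodeDecodeError:
            return url  # step 2a: any error leaves the whole URL untouched
    # Step 2b: every run converted, so substitute them in order.
    replacements = iter(decoded_runs)
    return _HIGH_ESCAPE_RUN.sub(lambda m: next(replacements), url)


euc_jp_url = "http://xxx/%A5%BB%A5%F3%A5%C8%A1%A6%A5%D8%A5%EC%A5%F3%A5%BA%BB%B3"
print(decode_for_display(euc_jp_url) == euc_jp_url)  # True: left as-is
```

With this approach a URL whose escapes are all well-formed UTF-8 is shown with 
readable characters, while the EUC-JP example keeps its original %xx form.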

Acknowledgements
toumatsu first reported this phenomenon at the OOo FAQ site [2], and then 
kimotomasaya and M.Kamataki confirmed it.

[1] http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G7404 
[2] http://oooug.jp/faq/index.php?faq%2F4%2F227

Comment 1 tora3 2007-01-30 07:28:05 UTC

Created attachment 42564
A sample bugdoc.

Comment 2 Olaf Felka 2007-01-30 08:53:35 UTC

@ es: Is this yours?

Comment 3 eric.savary 2007-01-30 10:55:52 UTC

ES->OF: no, this is not a Writer issue (it happens in all modules). Reassigning
to MBA.

Comment 4 eric.savary 2007-01-30 10:56:35 UTC

Comment 5 Mathias_Bauer 2007-02-02 10:20:17 UTC

Stephan, can you please shed some light on this?

Comment 6 Stephan Bergmann 2007-02-02 10:53:14 UTC

@tora:  I do not know whether I fully understand yet what your problem is.  Both
hyperlinks in 74002.ods work for me (they at least bring up existing web pages
in the browser, hopefully the right ones).  I assume your problem is that the
second hyperlink, as displayed by OOo, contains a mixture of %xx encodings and
(bogus) interpretations of some of those %xxs as UTF-8 characters (where in fact
they all represent EUC-JP characters).

That is the consequence of how INetURLObject's DECODE_TO_IURI DecodeMechanism
works (see tools/inc/urlobj.hxx:1.36 l. 227--232).  It was assumed that people
would like to see URLs displayed with readable characters rather than %xx
escapes (as with the first hyperlink in 74002.ods), and that most URLs would
encode UTF-8 data (again, as with the first hyperlink in 74002.ods).  To do so,
it uses the heuristic of interpreting any %xx escapes as representing UTF-8 data
that can be so interpreted, and leaving any other %xxs alone.

A fix might be to change DECODE_TO_IURI to not interpret any %xx as UTF-8 data
if at least one contained %xx cannot be interpreted as UTF-8.  For the given
74002.ods, that modified heuristic would cause the first hyperlink to still be
displayed with readable characters rather than %xxs, and the second hyperlink
to be displayed with only the original %xxs and no bogus UTF-8-interpreted
characters.  Would that suit your needs?

Comment 7 tora3 2007-02-02 14:59:21 UTC

@sb: Exactly, you understand the user's needs.

Comment 8 Stephan Bergmann 2007-02-02 15:50:38 UTC

ok, accepted

Comment 9 Marcus 2017-05-20 11:30:58 UTC

Reset assignee to the default "issues@openoffice.apache.org".