471 – utf-8 HTML Document opens correctly but is destroyed when saving

Summary: utf-8 HTML Document opens correctly but is destroyed when saving

Status:	CLOSED DUPLICATE of issue 971

Alias:	None

Product:	Writer
Classification:	Application
Component:	code
Version:	619
Hardware:	PC Windows 2000

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	eric.savary
QA Contact:	issues@www

URL:
Keywords:

Depends on:
Blocks:

Reported:	2001-02-23 16:08 UTC by issues@www
Modified:	2003-12-06 14:52 UTC
CC List:	1 user

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Description issues@www 2001-02-23 16:08:48 UTC

In OpenOffice 619 I opened an HTML document encoded in UTF-8 containing Cyrillic
text. The document is displayed correctly.
But when I save the document it gets destroyed. The UTF-8 characters are gone,
replaced by something looking like this:
EJ G5;>2cÂ„:J 4c;0 aÂ†1;8G0
and the HTML header contains a "charset=iso8859-1" tag, which is obviously not
correct.
I tried to save the document in another file format, but all file types available
in the "Save as..." dialog box (except "Text (encoded)" -> "Unicode") destroy
the document without a warning!
I expect that an HTML document opened as UTF-8 will be saved as UTF-8, too.

Comment 1 Dieter.Loeschky 2001-03-07 08:48:17 UTC

I think it's your task

Comment 2 jp 2001-03-13 08:46:17 UTC

To address this problem, the user can now define the character set in which the 
HTML document is saved. The original (load) character set is used only for the 
import. The option is found in build 624 under tools/options/filtersettings/html.

Comment 3 issues@www 2001-06-02 21:06:52 UTC

In my opinion, three things still have to be changed:

1. What happens to characters in the document that are not part of the 
character set selected in the HTML export options? (E.g., I have a document 
containing both German and Russian text, and I try to save it as iso-8859-1.) 
I think those characters have to be saved in the HTML file as numeric HTML 
entities (&#1234;).

2. Currently, when I try to save the German-Russian document mentioned above as 
HTML and the HTML export charset is not set to utf-8, the document is saved 
quietly without any error message. Only after reopening the saved document do I 
notice that all non-Latin1 characters are destroyed.
I think there should at least be a warning displayed to the user, saying that 
the selected export charset does not cover all characters in the document.

3. In my opinion, "tools/options/filtersettings/html" is not the right place 
for the character set option of the HTML export filter, because if I have 
documents in several languages I have to change the setting for every document.
I think the charset setting should be available directly in the "Save as..." 
dialog, as it is for the file type "Text (encoded)". The default value should be 
the charset that best matches the characters in the current document (e.g. 
iso8859-1 if there are only Western characters in the document, KOI8-R if there 
are only Russian characters, UTF-8 if there are characters from more than one 
8-bit charset, and so on).

(These comments I have submitted to issue #971, too, because the problem in 
#471 and #971 seems to be the same. Should we mark it as duplicate?)
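
To make points 2 and 3 concrete, here is a minimal Python sketch (not OOo code) of how an exporter could detect characters that the selected charset cannot represent, and how it could suggest a default charset. The function names and the candidate list are hypothetical and chosen only for illustration; the candidates are simply the three charsets named in point 3.

```python
# Illustrative sketch only: helper names and the candidate list are assumptions,
# not part of the OpenOffice HTML export filter.
CANDIDATES = ("iso8859-1", "koi8-r", "utf-8")


def unencodable_chars(text, charset):
    """Return the sorted distinct characters of `text` that `charset` cannot encode."""
    missing = set()
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


def suggest_charset(text, candidates=CANDIDATES):
    """Return the first candidate charset that can encode every character of `text`."""
    for charset in candidates:
        if not unencodable_chars(text, charset):
            return charset
    # UTF-8 can encode any Unicode text, so it is a safe last resort.
    return "utf-8"


if __name__ == "__main__":
    doc = "Grüße, Привет"
    lost = unencodable_chars(doc, "iso8859-1")
    if lost:
        # This is where a warning dialog (point 2) could be shown before saving.
        print("Warning: cannot be saved as iso8859-1:", "".join(lost))
    print("Suggested default charset:", suggest_charset(doc))
```

A non-empty result from `unencodable_chars` is the condition under which the warning from point 2 would be shown; `suggest_charset` corresponds to the default value proposed in point 3.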

Comment 4 jp 2001-06-05 16:20:01 UTC

To point 
1.: Why? The character set has only 256 characters, and if a character cannot be 
converted to it, it cannot be written as a &# value either, because such a value 
never refers to another character set.

2.: This can be implemented in the future.

3.: I think not, because not every user wants to define each time which 
character set he wants to export his documents in. This is defined only once 
and that's all. And if anybody knows that he must use different character sets, 
where is the problem in choosing UTF8?

Comment 5 issues@www 2001-06-06 10:29:57 UTC

To point 1:
The HTML 4 spec distinguishes between the "character set", which is always UCS 
(the full Unicode repertoire) for every HTML 4 document, and the "character 
encoding" of a single document. What the "charset" header or meta tag 
specifies is the "character encoding" in terms of the HTML spec, not 
the "character set"!
(See http://www.w3.org/TR/html4/charset.html)
Concerning the character encoding, the HTML 4 spec says:

"Authoring tools (e.g., text editors) may encode HTML documents in the 
character encoding of their choice, and the choice largely depends on the 
conventions used by the system software. These tools may employ any convenient 
encoding that covers most of the characters contained in the document, provided 
the encoding is correctly labeled. Occasional characters that fall outside this 
encoding may still be represented by character references. These always refer 
to the document character set, not the character encoding."

IMHO this means that any Unicode character may be present in any HTML document. 
All characters that are defined in the encoding given in the "charset" 
meta tag may be represented as defined in this encoding, while all other 
characters are to be represented by entities.
If a document is labelled as "iso8859-1", this means that all Latin1 characters 
may appear in their 8-bit encoding as defined by iso8859-1, but the document may 
also contain other characters, which have to be represented by named or numeric 
entities.

For the OOo HTML export filter this means that it should preserve every 
character in the document in every encoding. Characters that do not match the 
encoding the user selected must be represented by named or numeric entities.
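
As an illustration of this behaviour (a sketch in Python, not the OOo filter itself), the standard `xmlcharrefreplace` error handler of `str.encode` does exactly this: characters covered by the chosen encoding are written as bytes of that encoding, and all others become numeric character references. The function name `export_html_bytes` is hypothetical.

```python
import html


def export_html_bytes(text, charset):
    """Encode `text` in `charset`, writing unencodable characters as &#NNNN; references."""
    return text.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "Grüße, Привет"
    data = export_html_bytes(sample, "iso8859-1")
    # German letters stay single Latin-1 bytes; Cyrillic "П" (U+041F) becomes &#1055;
    print(data)

    # A reader that decodes the bytes as iso8859-1 and resolves the character
    # references gets the original text back, so nothing is lost.
    restored = html.unescape(data.decode("iso8859-1"))
    print(restored == sample)
```

The numeric references refer to Unicode code points regardless of the declared encoding, which is why the round trip works. Escaping of markup characters such as `&` and `<` is a separate step that this sketch does not cover.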

Comment 6 issues@www 2001-06-19 07:55:05 UTC

This is probably a dup of issue #971 but since I'm not a developer I don't dare 
to mark it as such.

When resolving issue #971 one should consider my last comment of 2001-06-06 
concerning charsets and encodings in HTML.

Comment 7 eric.savary 2001-07-09 09:20:51 UTC

Closing as duplicate.
A reference to your last comments has been added to issue #971.

*** This issue has been marked as a duplicate of 971 ***

Comment 8 eric.savary 2003-01-20 13:23:37 UTC

closed