Apache OpenOffice (AOO) Bugzilla – Issue 472
Large file (80+pgs) with HTML in Endnotes Crashes
Last modified: 2007-09-23 15:14:16 UTC
The file at http://www.nihonlinks.com/OpenOffice/CURRENT.sdw will consistently crash all versions of OpenOffice and StarOffice. The file is fairly big (80+ pages). I don't think it's size related. The endnotes contain HTML. I think it's not parsing correctly. In any event it'll crash anything. The file is copyrighted.. preliminary draft of a paper. It is not to be distributed except for testing.. ;)
Not even storage viewing tools were able to open the file, so the storage itself is broken. The file was created with StarOffice 5.2 (569); that much was visible, but we can't do more than anyone else could (saving the text contents by using an editor). If you could tell us EXACTLY how to reproduce this (without a head crash or an electricity blackout at the right time while saving or copying the file ;-) then we might try to find out what went wrong. Otherwise this file is simply lost. Please comment.
Well, I will try to reproduce it again, but I'm pretty sure pasting HTML text with extended characters did it. I'll try pasting the HTML I think I was using into a blank document. Last time it was Icelandic. I think this time it was French. Always the French.. ;) Anyway, I'll try it with the same HTML. I'm pretty sure there's an interesting bug somewhere. I can get an HTML export, and displaying that does not crash. Saving the HTML back out in a StarOffice format still produces a file that crashes on load. Would you like to see the HTML (http://www.nihonlinks.com/OpenOffice/CURRENT.html)? (BTW, the endnotes get all destroyed in the HTML, otherwise I'd be a totally happy recovered camper.) BTW, my editor has "advised" me to start using Word for the rest of the document.. ug.. ;)
The problem is that the text encoding of the document is RTL_TEXTENCODING_EUC_JP and the ByteString -> UnicodeString conversion routine removes characters from the string. This must never happen!
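To illustrate the kind of failure described above, here is a minimal sketch in Python. It is not the StarOffice code: Python's built-in euc_jp codec stands in for the ByteString -> UnicodeString routine, and the function names and the sample bytes (Japanese text followed by a Latin-1 "é", as pasted French HTML might contain) are assumptions made up for the example. It only shows how a converter that silently skips bytes it cannot decode returns a shorter string than one that substitutes a placeholder.

```python
# Sketch only: Python's euc_jp codec stands in for the real conversion routine.
# Sample input (hypothetical): valid EUC-JP text followed by Latin-1 "café".
# The trailing 0xE9 byte is an incomplete multi-byte sequence in EUC-JP.
RAW = "日本語".encode("euc_jp") + "café".encode("latin-1")


def decode_dropping(data: bytes) -> str:
    """Decode as EUC-JP and silently drop undecodable bytes (the lossy case)."""
    return data.decode("euc_jp", errors="ignore")


def decode_replacing(data: bytes) -> str:
    """Decode as EUC-JP and put U+FFFD where a byte cannot be decoded."""
    return data.decode("euc_jp", errors="replace")


def decodes_cleanly(data: bytes) -> bool:
    """Return True only if every byte is valid EUC-JP."""
    try:
        data.decode("euc_jp", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    lossy = decode_dropping(RAW)
    kept = decode_replacing(RAW)
    print(len(lossy), len(kept), decodes_cleanly(RAW))
```

If any other part of the file format stores character counts or offsets computed before the conversion, a string that comes back shorter than expected would leave those values pointing at the wrong place, which is one plausible way such a document could fail on load. Substituting a placeholder at least keeps one output character per undecodable byte in this example.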
Jeurgen, Thanks for the info! I'm not sure I understand what's going on though. The strings are encoded internally as EUC_JP according to the environment variable LANG's value? or are they internally Unicode? Is the ByteString to Unicode string conversion barfing because the strings being passed are malformed in some way? If the text is being entered from the clipboard there needs to be a way to apply conversion on the clipboard text, no?
TH->JP: See the Interface-Announcement, where you get a function to ignore multi-byte charsets. Please also discuss this with, or inform, the other developers (Malte, Niklas, ...) who could have the same problem when they save old SO file formats.
Please change the code of the binary import/export filter as described in the changes mail.
The bug document no longer exists, therefore I can't verify whether it loads correctly with my changes.
Closing.