Friday, April 4, 2014

Parsing unicode with values > 32767 from RTF files






I'm working on an RTF parser and having some difficulty handling Unicode.


The RTF spec states that "Unicode values greater than 32767 must be expressed as negative numbers" (http://ift.tt/1mKj4fK), so to get the Unicode numerical value we add 65536 to those negative numbers.
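For reference, here is a minimal sketch of that conversion as I understand it. The function name `rtf_u_to_code_point` is my own, not something from the spec, and it assumes the \u parameter is a signed 16-bit number.

```python
def rtf_u_to_code_point(n):
    """Map the signed 16-bit parameter of \\uN to a Unicode value (0..65535)."""
    if not -32768 <= n <= 32767:
        raise ValueError("\\u parameter must fit in a signed 16-bit integer")
    # Negative parameters stand for values above 32767, so add 65536 to them.
    return n + 65536 if n < 0 else n


print(rtf_u_to_code_point(32767))   # non-negative values pass through unchanged
print(rtf_u_to_code_point(-32768))  # negative values are shifted up by 65536
```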


I was testing that scenario by setting up a document containing Unicode characters 32767 and 32768. Word (v2011 on Mac) produces the following RTF syntax for those two characters:



\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73


For the second one, -32768+65536 is 32768 as expected. So the \uNNNN commands make sense.


My problem is with the hex text escape sequences, like the \'97\'73 at the end. I don't understand why they are there. I could code my parser to ignore escapes that are chained onto the end of a \uNNNN command like that. But I compared with the RTF output of TextEdit, and for the second character it outputs only the text escape sequences:



\uc0\u32767 \'97\'73
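To make the "ignore what follows \uNNNN" idea concrete, here is a rough sketch. It assumes that \ucN gives the number of fallback characters to skip after each \uN (default 1), which is my reading of the spec. The tokenizer regex and the name `unicode_chars` are hypothetical, and groups, escaped backslashes, and other real-world RTF details are not handled.

```python
import re

# Hypothetical minimal tokenizer: \'hh hex escapes, control words with an
# optional numeric parameter (plus one delimiting space), or any single char.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|(.)", re.S)


def unicode_chars(rtf):
    """Return (values from \\uN, hex-escaped bytes that were not skipped)."""
    uc = 1      # assumed default: one fallback character after each \uN
    skip = 0    # fallback characters still to ignore
    values, leftover = [], bytearray()
    for hexbyte, word, param, char in TOKEN.findall(rtf):
        if word == "uc" and param:
            uc = int(param)
        elif word == "u" and param:
            n = int(param)
            values.append(n + 65536 if n < 0 else n)
            skip = uc
        elif skip > 0:
            skip -= 1   # this token is fallback text for the preceding \uN
        elif hexbyte:
            leftover.append(int(hexbyte, 16))
    return values, bytes(leftover)


word_sample = r"\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73"
textedit_sample = r"\uc0\u32767 \'97\'73"
print(unicode_chars(word_sample))
print(unicode_chars(textedit_sample))
```

Under that assumption, the Word sample yields both values from the \u commands with nothing left over, while the TextEdit sample leaves the two hex-escaped bytes as ordinary text that still has to be decoded somehow.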


The \'97\'73 in the TextEdit output seems to be trying to be a double-byte Unicode escape sequence, and that kind of \' text escape is in hexadecimal. But 0x9773 is 38771, not 32768, so I don't understand how I can extract the desired Unicode value from that data. Any ideas?
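Here is the check I ran, as a small sketch. The helper name `hex_escapes_to_bytes` is mine. The second half is pure guesswork: it assumes the two bytes might be text in some legacy double-byte code page rather than a raw number, and the list of candidate encodings is arbitrary.

```python
import re


def hex_escapes_to_bytes(rtf):
    """Collect the bytes written as \\'hh escapes in an RTF fragment."""
    return bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf))


raw = hex_escapes_to_bytes(r"\'97\'73")
as_int = int.from_bytes(raw, "big")
print(hex(as_int), as_int)  # 0x9773 38771, which is not 32768

# Assumption: the bytes may be characters in a legacy code page.
# These candidates are guesses, not something taken from the RTF spec.
for name in ("shift_jis", "gbk", "big5", "cp949"):
    try:
        decoded = raw.decode(name)
        print(name, [hex(ord(c)) for c in decoded])
    except UnicodeDecodeError:
        print(name, "cannot decode these bytes")
```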








