3D grphique: Parsing unicode with values

Friday, 4 April 2014

Parsing unicode with values > 32767 from RTF files


I'm working on an RTF parser and having some difficulty handling Unicode.

The RTF spec states that "Unicode values greater than 32767 must be expressed as negative numbers" (http://ift.tt/1mKj4fK), so to recover the Unicode value from such a negative number we add 65536 to it.

I tested that scenario by setting up a document containing Unicode characters 32767 and 32768. Word (v2011 on Mac) produces the following RTF syntax for those two characters:


\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73

For the second one, -32768+65536 is 32768 as expected. So the \uNNNN commands make sense.
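The conversion described above can be sketched in a few lines of Python. The function name rtf_u_to_code_point is hypothetical, and the range check assumes the \u argument is always a signed 16-bit number, as the quoted spec sentence implies.

```python
def rtf_u_to_code_point(n: int) -> int:
    """Map the signed 16-bit argument of an RTF \\uN command to 0..65535."""
    # Assumption: the argument always fits in a signed 16-bit integer.
    if not -32768 <= n <= 32767:
        raise ValueError(f"\\u argument out of signed 16-bit range: {n}")
    # Negative arguments stand for values above 32767, so add 65536.
    return n + 65536 if n < 0 else n


print(rtf_u_to_code_point(32767))   # 32767
print(rtf_u_to_code_point(-32768))  # 32768
```

Non-negative arguments pass through unchanged, so the same function handles both characters in the Word sample.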

My problem is with the text escape sequences, like the \'97\'73 at the end. I don't understand why they're there. I could code my parser to ignore escapes that are chained onto the end of a \uNNNN command like that. But I compared with the RTF output of TextEdit, which outputs only the text escape sequences for the second character:
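As a rough illustration of the "ignore what is chained onto \uNNNN" idea, here is a minimal Python sketch. It assumes that \ucN gives the number of bytes to skip after each following \uN (with a default of 1), which is how the RTF spec describes \uc. The names TOKEN and unicode_values are hypothetical, and only \'hh escapes are counted as skippable bytes; plain-text fallback characters, groups, and code-page decoding are not handled.

```python
import re

# Order matters: \ucN and \uN must be tried before the generic control word.
TOKEN = re.compile(
    r"\\uc(\d+) ?"            # \ucN: bytes to skip after each \uN
    r"|\\u(-?\d+) ?"          # \uN: signed 16-bit Unicode value
    r"|\\'([0-9a-fA-F]{2})"   # \'hh: one byte written in hexadecimal
    r"|\\[a-z]+-?\d* ?"       # any other control word (ignored here)
)


def unicode_values(rtf: str) -> list:
    """Collect the values of \\uN commands, skipping the bytes that follow them."""
    uc = 1      # assumed default when no \ucN has been seen
    skip = 0    # bytes still to be skipped after the last \uN
    values = []
    for match in TOKEN.finditer(rtf):
        uc_arg, u_arg, hex_byte = match.groups()
        if uc_arg is not None:
            uc = int(uc_arg)
        elif u_arg is not None:
            n = int(u_arg)
            values.append(n + 65536 if n < 0 else n)
            skip = uc
        elif hex_byte is not None and skip > 0:
            skip -= 1   # this byte belongs to the preceding \uN
        # Hex bytes seen while skip == 0 are left undecoded in this sketch.
    return values


word_sample = r"\u32767\'5f\loch\af556\hich\af31506\dbch\f556 \uc2\u-32768\'97\'73"
print(unicode_values(word_sample))  # [32767, 32768]
```

Under these assumptions the Word sample yields both characters. For a sample that begins with \uc0, nothing is skipped, so any \'hh bytes after the \uN command are simply left undecoded by this sketch.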


\uc0\u32767 \'97\'73

That looks like an attempt at a double-byte Unicode escape sequence, and that kind of \' text escape is in hexadecimal. But 0x9773 is 38771, not 32768, so I don't understand how to extract the desired Unicode value from that data. Any ideas?
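The arithmetic is easy to double-check in Python. This snippet only combines the two escaped bytes into one 16-bit number in either byte order; the variable names are illustrative.

```python
high, low = 0x97, 0x73   # the two bytes from \'97\'73

as_big_endian = (high << 8) | low      # 0x9773
as_little_endian = (low << 8) | high   # 0x7397

print(as_big_endian)      # 38771
print(as_little_endian)   # 29591
```

Neither byte order gives 32768, so the two bytes cannot be a plain 16-bit encoding of the code point.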

Asked by Adam Murray
