Monday, April 6, 2015

Encoding of PDF text string






I am working on a parser for PDF text extraction. When a page's content stream is Flate-encoded (zlib compression, the FlateDecode filter), my code is able to decompress it, and the output (the decoded stream object) looks something like this:



BT
56.8 721.3 Td
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET
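
For reference, the decompression step that produces output like the above can be sketched as follows. This is only a minimal illustration assuming Python's standard zlib module, not the actual parser code; the helper name inflate_stream is hypothetical, and the demo compresses the sample itself just to have some FlateDecode-style input to work with.

```python
import zlib


def inflate_stream(raw: bytes) -> bytes:
    """Decompress the body of a FlateDecode (zlib/deflate) stream.

    Hypothetical helper: assumes `raw` holds exactly the bytes between
    the `stream` and `endstream` keywords and that no predictor or
    additional filter is in use.
    """
    return zlib.decompress(raw)


# Round-trip demo using the content stream shown in the question.
content = (
    b"BT\n"
    b"56.8 721.3 Td\n"
    b"/F2 12 Tf\n"
    b"[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ\n"
    b"ET\n"
)
compressed = zlib.compress(content)   # stand-in for the bytes read from the PDF
decoded = inflate_stream(compressed)
print(decoded.decode("ascii"))
```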


I am interested in the string array (the operand of TJ). It seems to contain multiple hex-encoded strings, but the corresponding hex values do not make sense as text. Instead they form a sequence like 01, 02, 03, ..., which looks like some sort of LZ77-style compression.
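
To make the structure of that operand concrete, here is a small sketch that splits the TJ array into its hex strings and numeric position adjustments. It is an illustration under assumptions, not a full PDF tokenizer: the function name parse_tj_array is hypothetical, literal strings in parentheses are not handled, and the final step assumes one-byte character codes, which depends on the font selected by Tf (/F2 here). As far as the PDF specification goes, such bytes are character codes interpreted through that font's encoding, so the sketch stops at listing the codes.

```python
import re

TJ_ARRAY = "[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]"

# Matches either a hex string <...> (group 1) or a number (group 2).
TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|([-+]?\d*\.?\d+)")


def parse_tj_array(operand: str) -> list:
    """Split a TJ operand into bytes objects (strings) and floats (adjustments)."""
    items = []
    for hex_part, num_part in TOKEN.findall(operand):
        if num_part:
            # Numbers are position adjustments in thousandths of text space units.
            items.append(float(num_part))
        else:
            digits = "".join(hex_part.split())  # whitespace inside <...> is ignored
            if len(digits) % 2:
                digits += "0"                   # an odd final digit is padded with 0
            items.append(bytes.fromhex(digits))
    return items


items = parse_tj_array(TJ_ARRAY)

# Assumption: one byte per character code (true for simple fonts only).
codes = [byte for item in items if isinstance(item, bytes) for byte in item]
print(items)
print(codes)
```

Turning these codes into readable text would then require looking at the font dictionary for /F2 (for example its /ToUnicode CMap or /Encoding entry), which is outside the scope of this sketch.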


Do PDFs have multiple levels of compression? How can I get plain text from the string array above?









