[bug #58206] [PATCH] fix PDFPIC issue with determining size of pdfs cont
From: G. Branden Robinson
Subject: [bug #58206] [PATCH] fix PDFPIC issue with determining size of pdfs containing images
Date: Fri, 21 Jan 2022 01:31:37 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Update of bug #58206 (project groff):
Status: Need Info => In Progress
_______________________________________________________
Follow-up Comment #14:
I'm mostly unblocked.
The problem with the original file (the "angular 1280x800" thing) appears
to be that it had a Title property that was encoded in UTF-16BE.
$ xxd angular-1280-800.pdf | sed -n '/459f0/,/45a30/p'
000459f0: 3c3c 0a2f 5469 746c 6520 3c30 3036 3130 <<./Title <00610
00045a00: 3036 4530 3036 3730 3037 3530 3036 4330 06E00670075006C0
00045a10: 3036 3130 3037 3230 3032 4430 3033 3130 0610072002D00310
00045a20: 3033 3230 3033 3830 3033 3030 3032 4430 03200380030002D0
00045a30: 3033 3830 3033 3030 3033 3030 3030 303e 038003000300000>
You don't see that a lot these days, with the success of the global campaign
to exterminate big-endian desktop (and mobile) computing.
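To make the dump above easier to read, here is a minimal Python sketch that decodes the hex string from the /Title entry. The constant name TITLE_HEX is mine; its value is copied from the xxd output, and the decoding assumes the string really is UTF-16BE with no byte order mark, as the dump suggests.

```python
# Hex string copied from the /Title entry in the xxd dump above
# (angle brackets removed). TITLE_HEX is a name chosen for this sketch.
TITLE_HEX = (
    "0061006E00670075006C00610072002D"
    "0031003200380030002D"
    "0038003000300000"
)

# Each character occupies two bytes, high byte first (big-endian).
raw = bytes.fromhex(TITLE_HEX)
title = raw.decode("utf-16-be")

# The string carries a trailing U+0000, which shows up as "0000" in the dump.
print(repr(title))
print(title.rstrip("\0"))
```

Every other byte is NUL, which is what makes the title awkward for tools that expect plain text.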
So this is what pdfinfo ends up doing with that.
$ pdfinfo angular-1280-800.pdf | xxd
00000000: 5469 746c 653a 2020 2020 2020 2020 2020 Title:
00000010: 0061 006e 0067 0075 006c 0061 0072 002d .a.n.g.u.l.a.r.-
00000020: 0031 0032 0038 0030 002d 0038 0030 0030 .1.2.8.0.-.8.0.0
00000030: 0000 0a50 726f 6475 6365 723a 2020 2020 ...Producer:
00000040: 2020 2068 7474 7073 3a2f 2f69 6d61 6765 https://image
00000050: 6d61 6769 636b 2e6f 7267 0a43 7265 6174 magick.org.Creat
00000060: 696f 6e44 6174 653a 2020 204d 6f6e 2041 ionDate: Mon A
00000070: 7072 2032 3020 3034 3a33 333a 3434 2032 pr 20 04:33:44 2
00000080: 3032 3020 4145 5354 0a4d 6f64 4461 7465 020 AEST.ModDate
00000090: 3a20 2020 2020 2020 204d 6f6e 2041 7072 : Mon Apr
000000a0: 2032 3020 3034 3a33 333a 3434 2032 3032 20 04:33:44 202
000000b0: 3020 4145 5354 0a54 6167 6765 643a 2020 0 AEST.Tagged:
000000c0: 2020 2020 2020 206e 6f0a 5573 6572 5072 no.UserPr
000000d0: 6f70 6572 7469 6573 3a20 6e6f 0a53 7573 operties: no.Sus
000000e0: 7065 6374 733a 2020 2020 2020 206e 6f0a pects: no.
000000f0: 466f 726d 3a20 2020 2020 2020 2020 2020 Form:
00000100: 6e6f 6e65 0a4a 6176 6153 6372 6970 743a none.JavaScript:
00000110: 2020 2020 206e 6f0a 5061 6765 733a 2020 no.Pages:
00000120: 2020 2020 2020 2020 310a 456e 6372 7970 1.Encryp
00000130: 7465 643a 2020 2020 2020 6e6f 0a50 6167 ted: no.Pag
00000140: 6520 7369 7a65 3a20 2020 2020 2031 3238 e size: 128
00000150: 3020 7820 3830 3020 7074 730a 5061 6765 0 x 800 pts.Page
00000160: 2072 6f74 3a20 2020 2020 2020 300a 4669 rot: 0.Fi
00000170: 6c65 2073 697a 653a 2020 2020 2020 3238 le size: 28
00000180: 3539 3337 2062 7974 6573 0a4f 7074 696d 5937 bytes.Optim
00000190: 697a 6564 3a20 2020 2020 206e 6f0a 5044 ized: no.PD
000001a0: 4620 7665 7273 696f 6e3a 2020 2020 312e F version: 1.
000001b0: 330a 3.
In other words, it simply blasts the encoded bytes to its own output in utter
indifference to the character encoding used by the output device. For an
information-extraction tool whose entire purpose is human-readable output,
that seems a dubious decision to me.
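As an illustration only, and not what PDFPIC does, here is a Python sketch of a consumer that treats the pdfinfo output as raw bytes, so the NUL bytes in the Title line cannot interfere with finding the page size. The sample data is an abbreviated copy of the dump above; the function name page_size is hypothetical.

```python
import re

# Abbreviated copy of the pdfinfo output shown above, kept as raw bytes.
# The Title value is UTF-16BE, so it contains NUL bytes.
title = "angular-1280-800\0".encode("utf-16-be")
raw = b"".join([
    b"Title:          ", title, b"\n",
    b"Pages:          1\n",
    b"Page size:      1280 x 800 pts\n",
])

def page_size(data: bytes):
    """Return (width, height) in points, or None if no Page size line is found."""
    m = re.search(
        rb"^Page size:\s+(\d+(?:\.\d+)?) x (\d+(?:\.\d+)?) pts",
        data,
        re.MULTILINE,
    )
    if m is None:
        return None
    return float(m.group(1).decode("ascii")), float(m.group(2).decode("ascii"))

print(page_size(raw))
```

Matching on bytes sidesteps the question of which character encoding the output is in, because only the ASCII "Page size" line is ever interpreted.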
But we're stuck with it for the time being (unless a PDFPIC user wants to
migrate to Deri's alternative in comment #7, which works at a lower level by
leveraging the output driver).
I'll see if I can force a UTF-16 Title property onto gnu.eps so that I can
craft a proper regression test.
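For reference, a small Python sketch of how a Title value in the same form as the problematic file could be generated. The helper name pdf_hex_title is hypothetical, and how the resulting string would be attached to gnu.eps is left open here; the sketch covers only the encoding step, under the assumption that the goal is a UTF-16BE hex string without a byte order mark, as in the dump above.

```python
def pdf_hex_title(text: str) -> str:
    """Encode text as UTF-16BE and format it as a PDF hex string.

    Hypothetical helper for this sketch; it emits no byte order mark,
    matching the /Title entry seen in the problematic file.
    """
    return "<" + text.encode("utf-16-be").hex().upper() + ">"

# Reproduces the /Title value from the xxd dump, trailing U+0000 included.
print(pdf_hex_title("angular-1280-800\0"))

# A short title that might suit a test file; the text "gnu" is only an example.
print(pdf_hex_title("gnu"))
```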
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?58206>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/