bug-ghostscript

8.01 losing searchable text selecting pdf pages


From: Karl Berry
Subject: 8.01 losing searchable text selecting pdf pages
Date: Sun, 21 Mar 2004 12:54:32 -0500

With GNU Ghostscript 8.01 under GNU/Linux (Red Hat 9), selecting pages
from a pdf file seems to lose any searchable text that might be there.
Here's what I mean:

- If you view the second page of the attached in.pdf with xpdf (among
  other viewers), you can search for, say, "vac" and find the string.
  The pdftotext program that comes with xpdf can also extract the
  main text.  (The file was created with an HP 6100 scanner.)

- Then I select the second page using gs:

    gs -q -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf \
       -dFirstPage=2 -dLastPage=2 in.pdf -c quit

- Now, viewing out.pdf (also attached), no text is searchable, and
  pdftotext doesn't find any text.
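
The steps above could be scripted roughly as follows, to make the
comparison easy to repeat.  This is only a sketch: the helper names
(select_pages_cmd, has_text, check_searchable) are made up for
illustration, and it assumes a pdftotext that accepts "-" as the
output file to write to stdout.

```shell
# Print the page-selection command from this report for a page range.
# select_pages_cmd is a hypothetical helper name, not part of Ghostscript.
# Arguments: first-page last-page input.pdf output.pdf
select_pages_cmd() {
    printf 'gs -q -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=%s -dFirstPage=%s -dLastPage=%s %s -c quit\n' \
        "$4" "$1" "$2" "$3"
}

# Succeed if stdin contains at least one non-whitespace character.
# (pdftotext emits a form feed per page even when a page has no text,
# so a plain "is the output empty" test would not be enough.)
has_text() {
    grep -q '[^[:space:]]'
}

# Report whether pdftotext extracts any text from the given pdf.
# Assumption: "pdftotext file -" sends the extracted text to stdout.
# If pdftotext is missing or fails, this also reports "no text found".
check_searchable() {
    if pdftotext "$1" - 2>/dev/null | has_text; then
        echo "$1: text found"
    else
        echo "$1: no text found"
    fi
}
```

With these defined, "select_pages_cmd 2 2 in.pdf out.pdf" prints the gs
command used above, and running check_searchable on in.pdf and then on
out.pdf would show whether the text survived the page selection.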

This may be related to the warnings that gs emits when processing in.pdf:

   **** Warning: Fonts with Subtype = /TrueType should be embedded.
                 But Times-Roman is not embedded.
   **** Warning: Fonts with Subtype = /TrueType should be embedded.
                 But Times-Italic is not embedded.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Adobe PDF Library 5.0 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

I do not know enough about pdf to know whether this is fixable, but I
wanted to report it, since the resulting pdfs can't be searched, which
makes them not great for posting on the web, for example.

BTW, I did find an alternate method of selecting pages from a pdf which
preserves searchable text, using the ConTeXt program texexec with
--pdfselect, but gs is a lot faster.

Thanks,
karl

P.S. Is anyone there?  I reported a problem with ps2pdf misconverting
some figures back on March 8, but didn't get an acknowledgement.


