Re: [lmi] Testable alternative to compressed PDFs


From: Greg Chicares
Subject: Re: [lmi] Testable alternative to compressed PDFs
Date: Sat, 10 Feb 2018 23:29:18 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

On 2018-02-10 22:34, Vadim Zeitlin wrote:
> On Sat, 10 Feb 2018 20:41:05 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> 3. Write uncompressed PDFs.
> GC> 
> GC> I didn't even realize this was possible. Can you show me how to
> GC> do it, so that I can experiment with it?
> 
>  Before writing a more detailed reply, let me just answer this quickly:
> 
> ---------------------------------- >8 --------------------------------------
> diff --git a/pdf_writer_wx.cpp b/pdf_writer_wx.cpp
> index b72b51b9f..ad5240806 100644
> --- a/pdf_writer_wx.cpp
> +++ b/pdf_writer_wx.cpp
> @@ -74,6 +74,7 @@
>      pdf_dc_.SetMapMode(wxMM_POINTS);
>  
>      pdf_dc_.StartDoc(wxString()); // Argument is not used.
> +    pdf_dc_.GetPdfDocument()->SetCompression(false);
>      pdf_dc_.StartPage();
>  
>      // Use a standard PDF Helvetica font (without embedding any custom fonts in
> ---------------------------------- >8 --------------------------------------
> 
> With this patch, the generated PDF won't be compressed and you will be able
> to find all the text strings in it. The images will still be compressed and
> I don't think it's possible to change this easily (PDF doesn't support XPM,
> which is the only bitmap format readable as text that I know...), but I
> also think it shouldn't really matter.

Thanks. Yes, bitmaps shouldn't matter. Let me paste some specimen
sections of the PDF I just created:

BT /F1 10.00 Tf ET
BT /F1 8.00 Tf ET

Many lines like that. My guess is that they're loading fonts.

Here's part of a footnote:

BT 1 0 0 -1 24.00 335.00 Tm 0 Tr (Premiums ) Tj ET
BT 1 0 0 -1 67.00 335.00 Tm 0 Tr (are ) Tj ET
BT 1 0 0 -1 83.00 335.00 Tm 0 Tr (assumed ) Tj ET
BT 1 0 0 -1 122.00 335.00 Tm 0 Tr (to ) Tj ET
BT 1 0 0 -1 132.00 335.00 Tm 0 Tr (be ) Tj ET
BT 1 0 0 -1 145.00 335.00 Tm 0 Tr (paid ) Tj ET
BT 1 0 0 -1 165.00 335.00 Tm 0 Tr (on ) Tj ET
BT 1 0 0 -1 178.00 335.00 Tm 0 Tr (annnual ) Tj ET
BT 1 0 0 -1 213.00 335.00 Tm 0 Tr (basis ) Tj ET

Here's a row of numeric values:

BT 1 0 0 -1 57.00 324.00 Tm 0 Tr (1) Tj ET
BT 1 0 0 -1 104.00 324.00 Tm 0 Tr (46) Tj ET
BT 1 0 0 -1 136.00 324.00 Tm 0 Tr (20,000) Tj ET
BT 1 0 0 -1 183.00 324.00 Tm 0 Tr (14,997) Tj ET
BT 1 0 0 -1 238.00 324.00 Tm 0 Tr (14,997) Tj ET
BT 1 0 0 -1 280.00 324.00 Tm 0 Tr (1,000,000) Tj ET
BT 1 0 0 -1 365.00 324.00 Tm 0 Tr (17,696) Tj ET
BT 1 0 0 -1 420.00 324.00 Tm 0 Tr (17,696) Tj ET
BT 1 0 0 -1 462.00 324.00 Tm 0 Tr (1,000,000) Tj ET

I don't think this is what we want:

 - The numeric section quoted above does bear some superficial
resemblance to our '.test' files in that there's one value per
line: thus, if only the fifth one changes, it's easy to isolate
that change when comparing files. However, the '.test' files
already print every value that can possibly be used in a PDF,
and there's no advantage in showing one line per value in any
other format.

 - The text section writes one word per line. For footnotes,
our '.test' files don't do that; but we could make them do so
if we wanted.

 - If every system test is expected to match perfectly, and
any deviation is an error, then uncompressed PDFs would work
well enough. In practice, though, results are often expected
to change in regular ways--for example, we might change the
footnote above thus:
- Premiums are assumed to be paid
+ It is assumed that payments are made
and we'd probably test it something like this:
 - copy the saved touchstone files to a scratch directory
 - change those copies, e.g. with sed, to reflect the intention
 - compare those altered copies to actual results
Actually, what we do is far more complicated than that, but
you get the idea. I don't think these uncompressed PDFs would
be suitable for such techniques.
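
To make that objection concrete, here is a minimal sketch in Python (not lmi code; the helper name 'transform' and the sample strings are hypothetical) of the copy, alter, compare idea. It applies the footnote change once to a flat-text touchstone and once to one-word-per-line content shaped like the specimen above:

```python
# Intended change, expressed as a single substitution (like a sed one-liner).
OLD = "Premiums are assumed to be paid"
NEW = "It is assumed that payments are made"


def transform(touchstone: str) -> str:
    """Return an altered copy of a touchstone reflecting the intended change."""
    return touchstone.replace(OLD, NEW)


# Hypothetical flat-text touchstone, and hypothetical output of the new code.
flat_touchstone = "Premiums are assumed to be paid on annual basis\n"
flat_actual = "It is assumed that payments are made on annual basis\n"

# Uncompressed-PDF style: one word per text object, as in the specimen.
pdf_touchstone = "".join(
    f"BT 1 0 0 -1 {x:.2f} 335.00 Tm 0 Tr ({word} ) Tj ET\n"
    for x, word in [(24, "Premiums"), (67, "are"), (83, "assumed"),
                    (122, "to"), (132, "be"), (145, "paid")]
)

# On flat text the transformed touchstone matches the new output exactly.
flat_matches = transform(flat_touchstone) == flat_actual

# In the PDF stream the phrase never occurs contiguously, so the same
# substitution silently changes nothing.
pdf_changed = transform(pdf_touchstone) != pdf_touchstone

print("flat text: transformed touchstone matches actual:", flat_matches)
print("PDF stream: substitution had any effect:", pdf_changed)
```

Even a word-by-word substitution would presumably not rescue the PDF case: the new phrase has a different number of words with different widths, so the x coordinates of everything after it on the line would change as well.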

> GC> >  So, depending on what exactly do we want to do, outputting text might not
> GC> > be the best solution.

Options 2 and 3 in an earlier email don't meet our needs.
Option 1 (flat-text output) is the only option identified
so far that can really be considered.

> GC> > Before starting to do it, it would, IMHO, be better
> GC> > to clearly understand what are we going to do with the generated text
> GC> > files. Would you know the answer to this already and, if so, could you
> GC> > please explain it to me?
> GC> 
> GC> It's hard to say much more than I did above...
> 
>  I'd like to know how will the outputs be verified. For example, if it will
> be done manually, then uncompressed PDF is not really appropriate because
> even though it is readable, for some values of "readable", I can't
> seriously claim it's easy to read. OTOH if we just want to run "diff" on
> the outputs, then I think it might be sufficient, i.e. the diff obtained in
> case of failure could be sufficiently informative.

Differences might very well be failures--if we changed the PDF code in a
way that was intended to be a pure refactoring, then any difference is
an anomaly, which none of our other (existing) tests can find because
they don't use PDFs. This is an important use case: we often refactor,
and with strong system tests we can refactor boldly, so that material
changes can often be smaller, and thus easier to review and to test.

Differences might also be intentional effects of desired changes. This
is where uncompressed PDFs fall short. But exactly what sort of desired
changes would automated PDF testing find?

 - not changes in text or numbers: '.test' files already cover that

 - not changes in fine details of layout: that requires human eyes

 - not errors such as I attempted to represent in ASCII with letters
   from a Georgian alphabet in earlier messages: eyes required here too

So what's left? Plenty:

 - failure to show a conditional footnote properly, perhaps due to a
   mistaken change in a MST file

 - wrong number of columns; wrong order of columns

 - wrong page numbers, headers, etc.
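
Checks like the column ones become mechanical once rows can be recovered from the output. As an illustration only, the sketch below uses the specimen 'Tj' lines quoted earlier, because that is the one concrete format shown in this thread; it is not a proposal to test against uncompressed PDFs. The function name 'rows_from_stream' is hypothetical, and the pattern assumes every text object looks exactly like the specimen.

```python
import re
from collections import defaultdict

# Matches lines like: BT 1 0 0 -1 57.00 324.00 Tm 0 Tr (1) Tj ET
# Assumption: no other matrix, rendering mode, or string escapes occur.
TJ_LINE = re.compile(
    r"^BT 1 0 0 -1 (?P<x>[\d.]+) (?P<y>[\d.]+) Tm 0 Tr \((?P<text>.*)\) Tj ET$"
)


def rows_from_stream(stream: str) -> dict[float, list[str]]:
    """Group text objects by baseline (y), ordered left to right (by x)."""
    by_y = defaultdict(list)
    for line in stream.splitlines():
        m = TJ_LINE.match(line.strip())
        if m:
            by_y[float(m["y"])].append((float(m["x"]), m["text"]))
    return {y: [text for _, text in sorted(cells)] for y, cells in by_y.items()}


specimen = """\
BT 1 0 0 -1 57.00 324.00 Tm 0 Tr (1) Tj ET
BT 1 0 0 -1 104.00 324.00 Tm 0 Tr (46) Tj ET
BT 1 0 0 -1 136.00 324.00 Tm 0 Tr (20,000) Tj ET
BT 1 0 0 -1 183.00 324.00 Tm 0 Tr (14,997) Tj ET
BT 1 0 0 -1 238.00 324.00 Tm 0 Tr (14,997) Tj ET
BT 1 0 0 -1 280.00 324.00 Tm 0 Tr (1,000,000) Tj ET
BT 1 0 0 -1 365.00 324.00 Tm 0 Tr (17,696) Tj ET
BT 1 0 0 -1 420.00 324.00 Tm 0 Tr (17,696) Tj ET
BT 1 0 0 -1 462.00 324.00 Tm 0 Tr (1,000,000) Tj ET
"""

rows = rows_from_stream(specimen)
row = rows[324.0]
print(len(row), "columns:", row)
```

A wrong number of columns would show up as a different row length, and a wrong order as a different sequence, without anyone having to read the raw stream.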

Or maybe we swapped the order of two footnotes, and we can see that
clearly in a diff (diff, meld, whatever)...and, furthermore, we could
potentially use our *nix toolkit to write little one-liners that
express the intended transformation, such that comparing observed
output to a transformed touchstone will show a perfect match (or
just isolated differences that indicate anomalies either in our
one-liners or in the code we're testing).
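
A minimal sketch of that idea in Python, using only the standard library: the intended transformation (swapping two footnotes) is applied to the touchstone, and the result is compared with the observed output. The footnote and page-number text, the flat one-item-per-line layout, and the helper name 'swap_lines' are all hypothetical.

```python
import difflib


def swap_lines(text: str, first: str, second: str) -> str:
    """Swap two whole lines, each identified by its leading text."""
    lines = text.splitlines(keepends=True)
    i = next(n for n, line in enumerate(lines) if line.startswith(first))
    j = next(n for n, line in enumerate(lines) if line.startswith(second))
    lines[i], lines[j] = lines[j], lines[i]
    return "".join(lines)


# Hypothetical flat-text touchstone and observed output.
touchstone = (
    "Footnote A: values are not guaranteed.\n"
    "Footnote B: premiums are assumed to be paid annually.\n"
    "Page 1 of 3\n"
)
actual = (
    "Footnote B: premiums are assumed to be paid annually.\n"
    "Footnote A: values are not guaranteed.\n"
    "Page 1 of 4\n"  # an anomaly that the intended swap does not explain
)

# Express the intention, then compare the transformed touchstone to reality.
expected = swap_lines(touchstone, "Footnote A", "Footnote B")
diff = list(difflib.unified_diff(
    expected.splitlines(), actual.splitlines(),
    fromfile="transformed touchstone", tofile="actual",
    lineterm="", n=0,
))
print("\n".join(diff))
```

Because the swap is already accounted for in the transformed touchstone, the footnotes drop out of the comparison and only the unexplained page-number line remains to be investigated.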

I'm sure we'll find more use cases, but I think those examples are
representative enough.


