
[pdf-devel] Problems in the stream implementation


From: Juan Pedro Bolivar Puente
Subject: [pdf-devel] Problems in the stream implementation
Date: Wed, 01 Oct 2008 22:23:58 +0200
User-agent: Mozilla-Thunderbird 2.0.0.16 (X11/20080724)

Hi,

While writing the LZW filter implementation, some doubts arose (I'm a
corner-case maniac, sorry :)):

Marchesi, why didn't you use circular buffers, as we discussed at the
GHM, instead of the fill-rewind buffers you are using now? I was
wondering how a filter has to behave when it needs to leave some data in
the input buffer. That can happen quite often, since the "compression
ratio" of most algorithms can easily go over 1.0 (they generate more
data than they take).

I have just crawled through the code a bit to find a solution without
spamming the list, and I have found a very bad bug: a filter cannot
generate more data than it takes! The reason is simple: the only way to
achieve that is to use a circular buffer.

Let's look at a deadly function:

pdf_status_t
pdf_stm_filter_apply (pdf_stm_filter_t filter,
                      pdf_bool_t finish_p)
{
  pdf_status_t ret;

  pdf_stm_buffer_rewind (filter->out);
  ret = PDF_OK;

  while ((!pdf_stm_buffer_full_p (filter->out))
         && (ret == PDF_OK))
    {
      /* If the input buffer is empty, refill it */
      if (pdf_stm_buffer_eob_p (filter->in))
        {
          ret = pdf_stm_filter_get_input (filter, finish_p);
        }

      if (ret != PDF_ERROR)
        {
          /* Generate output */
          ret = filter->impl.apply_fn (filter->params,
                                       filter->state,
                                       filter->in,
                                       filter->out,
                                       finish_p);
        }
    }

  return ret;
}

The recursion gets here and applies the filter chain forwards. When the
output buffer is full it returns; that's OK. But what happens if the
filter had to leave some data in the input buffer because it generated
too much output? When this function is invoked again, the input buffer
will be reset! All the remaining input data is lost...

Another buggy function is pdf_stm_finish. It directly applies the
filter chain with finish_p set to PDF_TRUE, but this is again wrong when
the compression factor is > 1.0. One has to loop, running the filter
chain until the filters return EOF; only then can one be sure they will
behave properly with finish_p = PDF_TRUE.


Consider that with only 8 input bytes a PCX filter can generate 4 GB of data...

To prove that I'm right I've made a simple test: give the pdf_stm in
the pdf_filter utility a human-understandable cache of 16 bytes, then
run this:

  $ echo "0123456789aqwertyuiop" | ./pdf-filter --ahexenc
    3031323334353637>

That output is completely wrong; the correct answer (which a larger
cache size does produce) would have been:

  $ echo "0123456789aqwertyuiop" | ./pdf-filter --ahexenc
    303132333435363738396171776572747975696F700A>


I have not prepared a patch for this because I know that jemarch wants
to maintain the stm infrastructure code himself. But if you want, I can
prepare a patch with a fancy circular buffer and this problem solved by
tomorrow afternoon.

Best,
JP
