[pdf-devel] Re: Modifications on pdf_token_read to get token boundaries

From:	Michael Gold
Subject:	[pdf-devel] Re: Modifications on pdf_token_read to get token boundaries
Date:	Mon, 25 May 2009 19:58:34 -0400
User-agent:	Mutt/1.5.18 (2008-05-17)

On Tue, May 26, 2009 at 00:53:46 +0200, address@hidden wrote:
...
> We need to be able to determine the boundaries of a token that has
> been read, for error reporting. We cannot rely on the stm used by the
> token reader to determine the beginning position of a read token,
> since it skips whitespace characters.

This behaviour could be changed by either
 - adding a flag that causes token_read to return whitespace as a token;
   or,
 - adding a function/flag to advance to the beginning of the next token.
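As a rough sketch of the second option: the caller first advances past
whitespace, notes the position, and only then reads the token. Everything
below (toy_reader_t, toy_skip_whitespace, toy_read_word) is hypothetical
and works on an in-memory buffer; it is not the existing stm or token
reader API, and it ignores PDF delimiters and comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory reader; not the GNU PDF stm API. */
typedef struct {
  const char *buf;
  size_t len;
  size_t pos;                   /* offset of the next unread byte */
} toy_reader_t;

/* PDF whitespace characters: NUL, HT, LF, FF, CR, SP. */
static int toy_is_space (unsigned char c)
{
  return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Advance to the first byte of the next token and return its offset. */
static size_t toy_skip_whitespace (toy_reader_t *r)
{
  assert (r != NULL);
  while (r->pos < r->len && toy_is_space ((unsigned char) r->buf[r->pos]))
    r->pos++;
  return r->pos;
}

/* Consume a run of non-whitespace bytes; return its length
   (0 if the reader is at whitespace or at the end of input). */
static size_t toy_read_word (toy_reader_t *r)
{
  size_t start;
  assert (r != NULL);
  start = r->pos;
  while (r->pos < r->len && !toy_is_space ((unsigned char) r->buf[r->pos]))
    r->pos++;
  return r->pos - start;
}
```

With this shape, the beginning of a token is whatever toy_skip_whitespace
returns, and the end is the reader position after toy_read_word.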

> We would need to expand the pdf_token_read to communicate both the
> beginning position and the end position in the stm of the last read
> token. It could be done using two extra parameters:
> 
> pdf_status_t pdf_token_read (pdf_token_reader_t reader,
>                              pdf_u32_t flags,
>                              pdf_size_t *beginning_pos,
>                              pdf_size_t *end_pos,
>                              pdf_token_t *token);
> 
> If NULLs are passed then the parameters are not filled.
> 
> An alternative would be to expand the pdf_token_t ADT to include such
> information, but I think it would not be quite appropriate, since it
> is not part of the semantics of the token.

True, I'd rather not include it in the token structure.

> Would this modification be ok with you?

I'm not sure about the API.  If the extra parameters will only be used
in the case of an error, maybe a new function could be added to access
the positions of the last token (to keep pdf_token_read simple); or the
stream methods could be called directly if the caller could manually
skip whitespace.
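To illustrate the "new function" idea, here is a minimal sketch in which
the reader remembers the bounds of the last token and a separate accessor
reports them, so the read call keeps its short signature. The names
(demo_reader_t, demo_token_read, demo_last_token_bounds) are made up for
this example and operate on an in-memory buffer; the real reader and token
types would differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader that records where the last token began and ended. */
typedef struct {
  const char *buf;
  size_t len;
  size_t pos;                   /* offset of the next unread byte */
  size_t last_begin;            /* offset of the last token's first byte */
  size_t last_end;              /* offset one past the last token's final byte */
} demo_reader_t;

/* PDF whitespace characters: NUL, HT, LF, FF, CR, SP. */
static int demo_is_space (unsigned char c)
{
  return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Read one whitespace-delimited token.
   Returns 1 if a token was read, 0 at end of input. */
static int demo_token_read (demo_reader_t *r)
{
  assert (r != NULL);
  while (r->pos < r->len && demo_is_space ((unsigned char) r->buf[r->pos]))
    r->pos++;
  if (r->pos >= r->len)
    return 0;
  r->last_begin = r->pos;
  while (r->pos < r->len && !demo_is_space ((unsigned char) r->buf[r->pos]))
    r->pos++;
  r->last_end = r->pos;
  return 1;
}

/* Report the bounds of the last token read; NULL outputs are not filled. */
static void demo_last_token_bounds (const demo_reader_t *r,
                                    size_t *begin, size_t *end)
{
  assert (r != NULL);
  if (begin != NULL)
    *begin = r->last_begin;
  if (end != NULL)
    *end = r->last_end;
}
```

A caller that only cares about positions on error would call
demo_last_token_bounds from its error path and nowhere else.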

Also, what would beginning_pos and end_pos mean exactly?  Are they based
on the byte positions of the underlying stream before filtering, or on
the number of bytes actually seen by the tokeniser (after filtering)?
The physical stream position (e.g. as reported by ftell) might not be
useful; for example, if a decompression filter operates on blocks of
data, the tokeniser could produce many tokens from a single block while
the physical position does not advance at all.

-- Michael

Current Thread

[pdf-devel] Modifications on pdf_token_read to get token boundaries, jemarch, 2009/05/25
- [pdf-devel] Re: Modifications on pdf_token_read to get token boundaries, Michael Gold <=
  - Re: [pdf-devel] Re: Modifications on pdf_token_read to get token boundaries, David Vazquez, 2009/05/25
    - Re: [pdf-devel] Re: Modifications on pdf_token_read to get token boundaries, Michael Gold, 2009/05/26
