Re: [pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions


From: Michael Gold
Subject: Re: [pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions
Date: Tue, 20 Oct 2009 16:16:51 -0400
User-agent: Mutt/1.5.20 (2009-06-14)

On Mon, Oct 19, 2009 at 13:18:29 -0700, Pierre Filot wrote:
> 1- COMMENTS:
> 
> test1: comments are ignored (similarly to white space) (already done)
> test2: macro "PDF_TOKEN_RET_COMMENT" has to be defined if we need to return
> them
> test3: two exceptions: %PDF-n.m and %%EOF
> 
> Should we return them all the time, whether or not the macro is defined?
> Or is that an issue left to the caller?

It's up to the caller to set the flag when looking for these comments
(assuming they're using the token reader to find them).  The token
reader doesn't have any way to know whether the comments have special
meaning; they're just normal comments if they occur in the middle of a
stream.

> test4: test long comments 
> 
> question 1: "The comment consists of all characters between the percent sign
> and the end of the line" but in the code (handle_char()) when we are
> detecting '%' we are storing it ? 

This looks like a bug.

> question 2: Do we need to store every character when we ignore comments?
> Would it be more efficient to decide whether to ignore or keep them in the
> handle_char function instead of flush_token?

I'll look into this; changing it might make certain error cases easier to
handle, but isn't likely to provide any noticeable speedup.

> 2- BOOLEAN:
> 
> test1: keywords true and false

Booleans are not handled by the tokeniser (they're treated as keywords,
and the parser will deal with them).

> 3- INTEGER: 
> 
> test 1: one or more digits with optional sign
> test 2: limits [-2^31, +2^31 - 1]
> 
> 4- REAL: 
> 
> test 1: one or more digits with an optional sign and a leading, trailing,
> or embedded decimal point
> test 2: limits [-3.403 x 10^38, +3.403 x 10^38]
> test 3: 5 is the number of significant decimal digits of precision in the
> fractional part

This is a minimum limit, which is more applicable to implementations
using a fixed-point representation; GNU PDF can exceed 5 digits of
precision.
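
For the integer limits in test 2 of (3), here's a standalone sketch of the
boundary check those tests should exercise (plain C; in_pdf_int_range is a
hypothetical helper, not part of the GNU PDF API):

  #include <assert.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* Hypothetical helper: does the text fit the PDF 32-bit integer range? */
  static int
  in_pdf_int_range (const char *text)
  {
    long long v = strtoll (text, NULL, 10);
    return v >= INT32_MIN && v <= INT32_MAX;
  }

  int main (void)
  {
    assert (in_pdf_int_range ("+2147483647"));   /* +2^31 - 1 */
    assert (in_pdf_int_range ("-2147483648"));   /* -2^31 */
    assert (!in_pdf_int_range ("+2147483648"));  /* one past the maximum */
    assert (!in_pdf_int_range ("-2147483649"));  /* one past the minimum */
    return 0;
  }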

> 5- STRING:
> 
> *literal characters enclosed with "()"
> test 1: unbalanced parentheses forbidden
> test 2: In a string, if the character immediately following a REVERSE
> SOLIDUS (\) is not one of n, r, t, b, f, (, ), \ or digits specifying an
> octal value, the REVERSE SOLIDUS should be ignored. (already done)
> test 3: In a string, an end-of-line marker appearing within a literal string
> without a preceding REVERSE SOLIDUS shall be treated as a byte value of
> (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN
> (0Dh), a LINE FEED (0Ah), or both. (almost done; \n alone still to be tested)
> test 4: "\LF", "\CR", and "\CR+LF" are not considered part of the string
> (still to be tested: "\LF" and "\CR")

This contradicts test 3 -- the EOL is part of the string, but always
treated as LF.
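
To make test 3 concrete, here's a minimal standalone sketch of the EOL
normalisation the reader should apply inside literal strings (plain C, not
the GNU PDF code; escape handling is omitted):

  #include <assert.h>
  #include <stddef.h>

  /* Replace CR, LF and CR+LF inside a raw literal-string body with a
     single 0x0A byte; returns the normalised length.  This only
     illustrates the EOL rule of test 3. */
  static size_t
  normalise_eol (const char *in, size_t len, char *out)
  {
    size_t i, n = 0;
    for (i = 0; i < len; i++)
      {
        if (in[i] == '\r')
          {
            out[n++] = '\n';
            if (i + 1 < len && in[i + 1] == '\n')
              i++;                       /* CR+LF collapses to one LF */
          }
        else
          out[n++] = in[i];
      }
    return n;
  }

  int main (void)
  {
    char out[8];
    assert (normalise_eol ("a\r\nb", 4, out) == 3 && out[1] == '\n');
    assert (normalise_eol ("a\rb", 3, out) == 3 && out[1] == '\n');
    assert (normalise_eol ("a\nb", 3, out) == 3 && out[1] == '\n');
    return 0;
  }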

> test 5: High-order overflow in an octal character representation \ddd in a
> string should be ignored by the tokeniser. (done)
> test 6: In an octal character representation \ddd in a string, three octal
> digits shall be used, with leading zeros as needed, if the next character of
> the string is also a digit. Otherwise one or two octal digits can be used.
> (This can only be tested with pdf-token-write().)
> 
> question 1: would it be useful to differentiate hexadecimal and literal
> strings as token types? That way we could check that there are no unbalanced
> parentheses in literal strings (test 1).

I don't think it's useful -- there are many equivalent representations
of a string, like (A), (\101), and <41>, which should all produce the
same token.

Unbalanced parentheses are valid in any type of string if they're
properly escaped.  There's already a check for extra closing
parentheses, under this comment:
  /* this shouldn't occur outside the STRING and COMMENT states */

Extra opening parentheses can't really be detected (except when we reach
EOF with an open string).
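
As a standalone illustration (plain C, not the GNU PDF decoder), all three
spellings reduce to the same byte, so the unit tests should compare decoded
payloads rather than raw spellings:

  #include <assert.h>

  int main (void)
  {
    int from_octal = 1 * 64 + 0 * 8 + 1;   /* (\101): octal 101 = 0x41 */
    int from_hex = 4 * 16 + 1;             /* <41>:   hex 41 = 0x41 */
    assert (from_octal == 'A' && from_hex == 'A');   /* same as (A) */
    return 0;
  }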

> question 2: The limit of 32767 characters is valid only inside content
> streams. Couldn't strings be longer elsewhere? (see Appendix C) Should we
> introduce a continuation as in comments?

Older versions of the PDF spec gave a limit of 65535 bytes for all
strings, but newer ones don't seem to have any limit outside of content
streams.  I didn't notice this while writing the token reader, but I
guess I'll need to fix the code to dynamically allocate the storage.

Each token returned by pdf_token_read should be usable by itself, so the
continuation feature doesn't make sense for strings.  It won't be needed
for comments either if the buffer can be dynamically reallocated.
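
Here's a minimal sketch of the usual doubling-realloc pattern such a fix
could use (illustration only, not the actual GNU PDF buffer code):

  #include <assert.h>
  #include <stdlib.h>

  /* Growable byte buffer; sketch only, not pdf_token_buffer_new. */
  typedef struct { char *data; size_t size, capacity; } buffer_t;

  static int
  buffer_append (buffer_t *buf, char ch)
  {
    if (buf->size == buf->capacity)
      {
        size_t new_cap = buf->capacity ? buf->capacity * 2 : 64;
        char *p = realloc (buf->data, new_cap);
        if (p == NULL)
          return -1;        /* caller reports an out-of-memory status */
        buf->data = p;
        buf->capacity = new_cap;
      }
    buf->data[buf->size++] = ch;
    return 0;
  }

  int main (void)
  {
    buffer_t buf = { NULL, 0, 0 };
    size_t i;
    /* Exceed the old 32767/65535 limits without a fixed-size array. */
    for (i = 0; i < 100000; i++)
      assert (buffer_append (&buf, 'x') == 0);
    assert (buf.size == 100000);
    free (buf.data);
    return 0;
  }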

> *hexadecimal characters enclosed with "<>"
> test 1: In a hexadecimal string, SPACE, HORIZONTAL TAB, CARRIAGE RETURN, LINE
> FEED and FORM FEED shall be ignored by the tokeniser.
> test 2: In a hexadecimal string, if there is an odd number of digits, the
> final digit shall be assumed to be 0. (already done)
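
Both rules are easy to pin down with a standalone check (plain C, not the
GNU PDF decoder): whitespace inside <> is skipped, and a trailing odd digit
is padded with 0.

  #include <assert.h>
  #include <stddef.h>

  /* Decode the body of a hex string (without the <>), skipping the
     white-space characters from test 1 and padding a trailing odd
     digit with 0 as in test 2.  Returns the number of output bytes;
     digits are assumed valid for brevity. */
  static size_t
  decode_hex (const char *in, size_t len, unsigned char *out)
  {
    size_t i, n = 0;
    int hi = -1;
    for (i = 0; i < len; i++)
      {
        int c = in[i], d;
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f')
          continue;                  /* SP, HT, CR, LF, FF are ignored */
        d = (c >= '0' && c <= '9') ? c - '0' : (c | 0x20) - 'a' + 10;
        if (hi < 0)
          hi = d;                    /* first digit of the pair */
        else
          {
            out[n++] = (unsigned char) ((hi << 4) | d);
            hi = -1;
          }
      }
    if (hi >= 0)
      out[n++] = (unsigned char) (hi << 4);   /* odd digit: assume 0 */
    return n;
  }

  int main (void)
  {
    unsigned char out[4];
    assert (decode_hex ("4 1", 3, out) == 1 && out[0] == 0x41);  /* <4 1> */
    assert (decode_hex ("414", 3, out) == 2 && out[1] == 0x40);  /* <414> */
    return 0;
  }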
> 
> 
> 6- NAMES:
> 
> test 1:In a name, A NUMBER SIGN (#) shall (MUST) be written by using its
> 2-digit hexadecimal code (23), preceded by a NUMBER SIGN.
> test 2: In a name, any character that is a regular character (other than
> NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal
> code, preceded by the NUMBER SIGN. (it would be useful to automatically test
> every possible regular character and its octal equivalent).
> 
> Do you mean to check that every 2-digit hexadecimal code gives the right
> regular character? Why do you talk about octal values?

I don't understand what the text means, but names don't have octal
escapes.
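
For reference, name escapes are always the two-hex-digit form #dd; a
standalone check of the decoding (plain C, not the GNU PDF code):

  #include <assert.h>

  /* Value of one hex digit; input assumed to be a valid digit. */
  static int
  hexval (int c)
  {
    return (c >= '0' && c <= '9') ? c - '0' : (c | 0x20) - 'a' + 10;
  }

  int main (void)
  {
    /* /A#20B is the three-byte name "A B"; #23 escapes '#' itself. */
    assert (((hexval ('2') << 4) | hexval ('0')) == ' ');
    assert (((hexval ('2') << 4) | hexval ('3')) == '#');
    return 0;
  }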

> test 3: In a name, any character that is not a regular character shall (MUST)
> be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN
> only. (test negative cases with non-regular characters directly included in
> the name).
> 
> this test only concerns pdf-token-write, right? Because in pdf-token-read,
> any non-regular character (white space or delimiter) ends the NAME token.

It looks like this only applies to the writer, but of course the ability
of the token reader to parse escaped characters should be tested too.

> test 4: In a name, regular characters that are outside the range EXCLAMATION
> MARK(21h) to TILDE (7Eh) should (RECOMMENDED) be written using the
> hexadecimal notation. (test negative cases)
> 
> I don't see what I should do here that isn't already covered by test 2.

Test 2 allows these characters to be written escaped or unescaped.
Test 4 should verify that the token writer escapes them.

> test 5: The token SOLIDUS (a slash followed by no regular characters)
> introduces a unique valid name defined by the empty sequence of characters.

"//" should also be tested (two tokens for name "").

> test 6: the null character is forbidden (test pdf_token_name_new()), as is
> #00 (test pdf-token-read())
> 
> Question 1: The test verifying that NAME tokens don't contain null
> characters, done at token creation in pdf_token_name_new, introduces
> redundancy, since the same check is already made when reading the stream
> (pdf_token_read). We could instead let only the pdf_token_read and
> pdf_token_write functions verify that. This question also applies to
> COMMENT (eol characters) and KEYWORD tokens (non-regular characters).

The token reader and writer should verify the data is valid, but I think
it's fine to check in the token constructors too.  It should make
debugging easier for application writers that construct tokens manually
and pass invalid data for some reason.

If we didn't check, we'd have to update the documentation to remove
PDF_EBADDATA as a return value.
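
As a hedged sketch of such a constructor-level test (the prototype of
pdf_token_name_new is an assumption on my part; only the function name and
PDF_EBADDATA appear above):

  #include <check.h>

  /* Assumed prototype, for illustration only -- the real signature may
     differ:
       pdf_status_t pdf_token_name_new (const pdf_char_t *data,
                                        pdf_size_t size,
                                        pdf_token_t *token);  */
  START_TEST (name_new_rejects_embedded_null)
  {
    pdf_token_t token;
    /* "ab\0c" carries an embedded null, forbidden in names (test 6). */
    fail_unless (pdf_token_name_new ((pdf_char_t *) "ab\0c", 4, &token)
                 == PDF_EBADDATA);
  }
  END_TEST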

> GENERAL QUESTIONS:
> question 1: in pdf-token.c, why do we add a null character at the end of
> some tokens (Names) and not others (Comments, Strings)? (see
> pdf_token_buffer_new and its pdf_bool_t nullterm)

Tokens that can't contain null characters use null-terminated strings
for the convenience of C programmers.  The null character is not part of
the token, and is not included in the length.

Comments and strings can contain embedded null characters, so software
must not treat them as null terminated.  Maybe we should append a
non-null character to make such bugs obvious.
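
Here's a tiny standalone illustration of the bug class (not GNU PDF code):
treating a token payload as null-terminated silently truncates it.

  #include <assert.h>
  #include <string.h>

  int main (void)
  {
    /* A comment/string token may carry embedded nulls ... */
    const char payload[] = { 'a', '\0', 'b' };
    /* ... so strlen sees 1 byte while the token really has 3. */
    assert (strlen (payload) == 1);
    assert (sizeof payload == 3);
    return 0;
  }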

> question 2: should we create a test function (START_TEST) for each case
> (test1, test2...), one per token type (COMMENTS, BOOLEAN...), or can we
> group them inside the same function, as has already been done in
> torture/unit/base/token/pdf-token-read.c?

Using separate functions would make failures easier to track, but it's
probably fine to put a few related tests into a single function.
Don't group multiple token types in a function unless necessary.

pdf_token_read_001 might be doing too much, but it intentionally
includes multiple token types to verify the tokeniser can split them
apart properly.  Feel free to duplicate some of the tests done here;
it will be easier to verify test coverage if the tests are properly
organised by token type.
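
For the layout, one possible Check skeleton (the names below are
placeholders) keeps one test case per token type inside a single suite,
which makes per-type coverage easy to see:

  #include <check.h>

  /* Placeholder test bodies; real ones would call pdf_token_read etc. */
  START_TEST (comment_long)
  {
  }
  END_TEST

  START_TEST (integer_limits)
  {
  }
  END_TEST

  Suite *
  token_suite (void)
  {
    Suite *s = suite_create ("token");
    TCase *tc_comments = tcase_create ("comments");
    TCase *tc_integers = tcase_create ("integers");

    tcase_add_test (tc_comments, comment_long);
    tcase_add_test (tc_integers, integer_limits);
    suite_add_tcase (s, tc_comments);
    suite_add_tcase (s, tc_integers);
    return s;
  }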

Thanks for working on this.

-- Michael
