pdf-devel

Re: [pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions


From: Pierre Filot
Subject: Re: [pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions
Date: Wed, 21 Oct 2009 14:04:38 -0700 (PDT)


>On Mon, Oct 19, 2009 at 13:18:29 -0700, Pierre Filot wrote:
>> 1- COMMENTS:
>>
>> test1: comments are ignored (similarly to white space) (already done)
>> test2: macro "PDF_TOKEN_RET_COMMENT" has to be defined if we need to return
>> them
>> test3: two exceptions: - %PDF-n.m
>>                        - %%EOF
>>
>> should we return them every time, whether or not the macro is defined?
>> Or is it an issue left to the caller?

>It's up to the caller to set the flag when looking for these comments
>(assuming they're using the token reader to find them).  The token
>reader doesn't have any way to know whether the comments have special
>meaning; they're just normal comments if they occur in the middle of a
>stream.

ok !

>> test4: test long comments
>>
>>question 1: "The comment consists of all characters between the percent sign
>>and the end of the line", but in the code (handle_char()), when we detect
>>'%' we store it?
>
>This looks like a bug.
>
>>question 2: Do we need to store every character in the case where we ignore
>>comments? Would it be more efficient to decide whether to ignore or keep
>> them in the handle_char function instead of in flush_token?
>
>I'll look into this; changing it might make certain error cases easier to
>handle, but isn't likely to provide any noticeable speedup.
>
>> 2- BOOLEAN:
>>
>> test1: keyword true and false
>>
>Booleans are not handled by the tokeniser (they're treated as keywords,
>and the parser will deal with them).

Yep! I am just checking that the keywords are effectively returned.

>> 3- INTEGER:
>>
>>test 1: one or more digits with optional sign
>> test 2: Limit [-2^31 ; +2^31 - 1]
>>
>> 4- REAL:
>>
>> test 1: one or more digits with optional sign with a leading, trailing,
>> embedded decimal point
>> test 2: Limit [-3.403 x 10^38 ; +3.403 x 10^38]
>> test 3: 5 is the number of significant decimal digits of precision in
>> fractional part

>This is a minimum limit, which is more applicable to implementations
>using a fixed-point representation; GNU PDF can exceed 5 digits of
>precision.

ok !
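A range check like the integer one in test 2 can be sketched in a few lines; the function name is hypothetical and not part of the real tokeniser API:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch: validate that a numeric token fits the 32-bit
 * integer range [-2^31, 2^31 - 1] referred to in the tests above. */
static int fits_pdf_integer(const char *tok)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(tok, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return 0;                        /* overflow or trailing junk */
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

The explicit INT32 comparison matters on platforms where long is 64 bits, since strtol alone will not report ERANGE there.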

>> 5- STRING:
>>
>> *literal characters enclosed with "()"
>> test 1: unbalanced parentheses forbidden
>> test 2: In a string, if the character immediately following a REVERSE
>> SOLIDUS (\) is not one of n, r, t, b, f, (, ), \ or numbers specifying an
>> octal value, the REVERSE SOLIDUS should be ignored. (already done)
>> test 3: In a string, an end-of-line marker appearing within a literal string
>> without a preceding REVERSE SOLIDUS shall be treated as a byte value of
>> (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN
>> (0Dh), a LINE FEED (0Ah), or both. (almost done; left to be tested: \n alone)
>> test 4: "\LF" "\CR""\CR+LF" are not considered part of the string (left to
>> be tested "\LF" "\CR")
>
>This contradicts test 3 -- the EOL is part of the string, but always
>treated as LF.

I meant here a backslash followed by an EOL:

"The backslash and the end-of-line marker following it are not considered part of the string. For example:
  ( These \
  two strings \
  are the same . )
  ( These two strings are the same . )" (PDF reference, version 1.7, section 3.2 page 55)


>> test 5: High-order overflow in an octal character representation \ddd in a
>> string should be ignored by the tokeniser. (done)
>> test 6: In an octal character representation \ddd in a string, three octal
>> digits shall be used, with leading zeros as needed, if the next character of
>> the string is also a digit. Otherwise it can use one or two octal digits.(can
>> only be tested on pdf-token-write())
>>
>> question 1: would it be useful to differentiate hexadecimal and literal
>> string as token types. Like that we could check that there is not unbalanced
>> parentheses in literal strings (test 1).
>
>I don't think it's useful -- there are many equivalent representations
>of a string, like (A), (\101), and <41>, which should all produce the
>same token.
>
>Unbalanced parentheses are valid in any type of string if they're
>properly escaped.  There's already a check for extra closing
>parentheses, under this comment:
>/* this shouldn't occur outside the STRING and COMMENT states */
>
>Extra opening parentheses can't really be detected (except when we reach
>EOF with an open string).


Well, I was thinking of unbalanced parentheses within a literal string.
The thing is that a writer would need to check whether it is dealing with balanced parentheses, which don't need to be escaped, or unbalanced ones, which need a backslash. ((A)) doesn't need to be escaped, since its parentheses are balanced, but (A)) would need to be written as (A\)).
Another option is to escape every parenthesis within a literal string when writing it to a stream, but I don't know whether that goes against the specification:
"balanced pairs of parentheses within a string require no special treatment."


>> question 2: Limit fixed at 32767 characters is valid only inside content
>> streams. Couldn't it be longer ? (see Appendix C). Should we introduce a
>> continuation as in comments ?
>
>Older versions of the PDF spec gave a limit of 65535 bytes for all
>strings, but newer ones don't seem to have any limit outside of content
>streams.  I didn't notice this while writing the token reader, but I
>guess I'll need to fix the code to dynamically allocate the storage.
>
>Each token returned by pdf_token_read should be usable by itself, so the
>continuation feature doesn't make sense for strings.  It won't be needed
>for comments either if the buffer can be dynamically reallocated.

ok !
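The dynamic allocation mentioned above is usually done with a doubling buffer; here is a generic sketch (the struct and function names are illustrative, not the real pdf_token internals):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of a dynamically reallocated token buffer: grow by doubling
 * so long strings outside content streams are not capped at a fixed
 * limit.  Illustrative only. */
struct buf { unsigned char *data; size_t len, cap; };

static int buf_append(struct buf *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        unsigned char *p;
        while (cap < b->len + n)
            cap *= 2;                /* amortised O(1) per appended byte */
        p = realloc(b->data, cap);
        if (!p)
            return -1;               /* caller reports an allocation error */
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```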

>> *hexadecimal characters enclosed with "<>"
>> test 1: In a hexadecimal string, SPACE, HORIZONTAL TAB, CARRIAGE RETURN, LINE
>> FEED and FORM FEED shall be ignored by the tokeniser.
>> test 2: In a hexadecimal string, if there is an odd number of digits, the
>> final digit shall be assumed to be 0.(already done)
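Both hex-string rules (whitespace ignored, odd digit padded with 0) fit in a short decoder; this sketch takes the digits with the <> delimiters already stripped, and the name is illustrative:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Sketch of hex-string decoding for tests 1-2 above.  Illustrative,
 * not the real tokeniser; assumes the input contains only hex digits
 * and white space. */
static size_t decode_hex(const char *in, unsigned char *out)
{
    size_t n = 0;
    int have = 0, hi = 0;
    for (; *in; ++in) {
        int c = (unsigned char)*in;
        int v;
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f')
            continue;                          /* test 1: whitespace ignored */
        v = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        if (!have) { hi = v; have = 1; }
        else { out[n++] = (unsigned char)(hi << 4 | v); have = 0; }
    }
    if (have)
        out[n++] = (unsigned char)(hi << 4);   /* test 2: odd digit -> 0 pad */
    return n;
}
```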
>>
>>
>> 6- NAMES:
>>
>> test 1:In a name, A NUMBER SIGN (#) shall (MUST) be written by using its
>> 2-digit hexadecimal code (23), preceded by a NUMBER SIGN.
>> test 2: In a name, any character that is a regular character (other than
>> NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal
>> code, preceded by the NUMBER SIGN. (would be useful to automatically test
>> for every possible regular character and its octal equivalence).
>>
>> Do you mean to check that all 2-digit hexadecimal code gives the right
>> regular character ? why do you talk about octal values ? 
>
>I don't understand what the text means, but names don't have octal
>escapes.
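The '#' escape the tests describe is a 2-digit hexadecimal code; a standalone sketch of decoding it (hypothetical function, not the real reader) looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Sketch of name-token '#' decoding (tests 1-3 above): "#xx" is a
 * 2-digit hex escape, so "A#20B" reads as the name "A B" and "#23"
 * as "#".  Returns the decoded length, or -1 on a malformed escape.
 * Illustrative only. */
static int decode_name(const char *in, char *out)
{
    size_t n = 0;
    while (*in) {
        if (*in == '#') {
            int hi = hexval(in[1]);
            int lo = (hi >= 0) ? hexval(in[2]) : -1;
            if (lo < 0)
                return -1;          /* malformed or truncated escape */
            out[n++] = (char)(hi << 4 | lo);
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

This also covers test 6's negative case naturally: "#00" decodes to a null byte, which the reader can then reject.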
>
>> test 3: In a name, any character that is not a regular character shall (MUST)
>> be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN
>> only. (test negative cases with non-regular characters directly included in
>> the name).
>>
>> this test only concerns pdf-token-write, right ? Because in pdf-token-read,
>> any non-regular character (white space or delimiter) ends the NAME token.
>
>It looks like this only applies to the writer, but of course the ability
>of the token reader to parse escaped characters should be tested too.

ok !

>> test 4: In a name, regular characters that are outside the range EXCLAMATION
>> MARK(21h) to TILDE (7Eh) should (RECOMMENDED) be written using the
>> hexadecimal notation. (test negative cases)
>>
>> I don't see what I should do here that hasn't been done before (test 2)?
>
>Test 2 allows these characters to be written escaped or unescaped.
>Test 4 should verify that the token writer escapes them.

ok got it !

>> test 5: The token SOLIDUS (a slash followed by no regular characters)
>> introduces a unique valid name defined by the empty sequence of characters.
>
>"//" should also be tested (two tokens for name "").

ok !

>> test 6: null character is forbidden (test pdf_token_name_new()) as well as
>> #00 (test pdf-token-read())
>>
>> Question 1: the test verifying that NAME tokens don't contain null
>> characters, done at token creation in pdf_token_name_new, introduces
>> redundancy, since this is already verified when reading the stream
>> (pdf_token_read). We could instead let only the pdf_token_read and
>> pdf_token_write functions verify it. This question also applies to COMMENT
>> (eol characters) and KEYWORD tokens (non-regular characters).
>
>The token reader and writer should verify the data is valid, but I think
>it's fine to check in the token constructors too.  It should make
>debugging easier for application writers that construct tokens manually
>and pass invalid data for some reason.
>
>If we didn't check, we'd have to update the documentation to remove
>PDF_EBADDATA as a return value.
>
>> GENERAL QUESTIONS:
>> questions 1: in pdf-token.c, why do we add a null character at the end of
>> some tokens(Names) and not others (Comments, Strings) (see
>> pdf_token_buffer_new and its pdf_bool_t nullterm)
>
>Tokens that can't contain null characters use null-terminated strings
>for the convenience of C programmers.  The null character is not part of
>the token, and is not included in the length.
>
>Comments and strings can contain embedded null characters, so software
>must not treat them as null terminated.  Maybe we should append a
>non-null character to make such bugs obvious.
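The nullterm behaviour described above can be sketched generically; names and the signature here are illustrative, not the real pdf_token_buffer_new:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the nullterm convention: the terminator is appended for
 * name-like tokens purely as a C convenience and is never counted in
 * the length, while comment/string tokens get no terminator because
 * they may contain embedded null bytes.  Illustrative only. */
static unsigned char *token_buffer_new(const void *data, size_t len,
                                       int nullterm, size_t *out_len)
{
    unsigned char *p = malloc(len + (nullterm ? 1 : 0));
    if (!p)
        return NULL;
    memcpy(p, data, len);
    if (nullterm)
        p[len] = '\0';          /* convenience only; not part of the token */
    *out_len = len;             /* length never includes the terminator */
    return p;
}
```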

Sorry, but I don't see how appending a non-null character helps to detect such a bug?

>> question 2: should we create a test function (START_TEST) for each case
>> (test1, test2...), or per token evaluated (COMMENTS, BOOLEAN...), or can we
>> regroup them inside the same function as it has already been done in
>> torture/unit/base/token/pdf-token-read.c
>
>Using separate functions would make failures easier to track, but it's
>probably fine to put a few related tests into a single function.
>Don't group multiple token types in a function unless necessary.
>
>pdf_token_read_001 might be doing too much, but it intentionally
>includes multiple token types to verify the tokeniser can split them
>apart properly.  Feel free to duplicate some of the tests done here;
>it will be easier to verify test coverage if the tests are properly
>organised by token type.

ok !


Also, I just spotted that pdf_token_reader_new does not actually return PDF_EBADDATA, even though the manual lists it as a possible return value.

Thanks :D
/Pierre

