[freetype2] master 17545d4bf: Avoid n^2 scanning for binary data.

From:	Werner Lemberg
Subject:	[freetype2] master 17545d4bf: Avoid n^2 scanning for binary data.
Date:	Fri, 8 Mar 2024 11:49:22 -0500 (EST)

branch: master
commit 17545d4bf72175a8ea8020dcbd4d462234d2b5d0
Author: Ben Wagner <bungeman@gmail.com>
Commit: Werner Lemberg <wl@gnu.org>

    Avoid n^2 scanning for binary data.
    
    When creating a CID parser the location of the 'StartData' or '/sfnts'
    tokens needs to be known.  However, the token parser requires that the
    entire document be in memory and flattening the entire stream into memory is
    to be avoided.
    
    To avoid forcing the entire stream into memory, previously this code would
    scan through the stream looking for 'StartData' or '/sfnts' as strings.
    However, these strings could have been in a comment or string token, so the
    stream would be read into memory up to that point and the parser run to
    check that these strings were actually tokens.  This forced a parser restart
    from the beginning each time; as a result, data with many 'StartData'
    non-tokens would take n^2 time to check.
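
The quadratic cost described above can be illustrated with a simplified cost
model.  This is only a sketch, not FreeType code: the helper names
`restart_cost` and `single_pass_cost` and the `hits` array (offsets of
candidate strings, measured from the start of the PostScript section) are
hypothetical, and the model assumes that none of the candidates is a real
token.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes parsed when the parser restarts from the beginning after */
/* every candidate string: each candidate at offset hits[i] costs */
/* a full parse of the first hits[i] bytes.                       */
static unsigned long
restart_cost( const unsigned long*  hits,
              size_t                n )
{
  unsigned long  total = 0;
  size_t         i;


  assert( hits != NULL || n == 0 );

  for ( i = 0; i < n; i++ )
    total += hits[i];

  return total;
}


/* Bytes parsed when the scan first locates the last candidate and */
/* the parser then runs once over everything before it.  `hits' is */
/* assumed to be sorted in ascending order.                        */
static unsigned long
single_pass_cost( const unsigned long*  hits,
                  size_t                n )
{
  assert( hits != NULL || n == 0 );

  return n ? hits[n - 1] : 0;
}
```

With k evenly spaced candidates, the first function grows like k^2 while the
second grows like k, which is the difference this commit is after.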
    
    * src/cid/cidparse.c (cid_parser_new): Change algorithm to make the initial
    scan look for the last possible 'StartData' or '/sfnts' string in the
    stream.  The stream is read forward instead of backward because a typical
    CID font has a single 'StartData' toward the beginning of the data, and it
    is much faster to read the data from beginning to end than from end to
    beginning.  For memory-based fonts the limit is set to the end of the stream
    since the stream is already in memory.  Then the parser is run once to look
    for 'StartData' or '/sfnts' tokens.  If they are found the parser is re-set
    to reflect this new information.
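
The forward scan for the last candidate string can be sketched as follows.
This is a minimal, standalone illustration and not the code in
`cid_parser_new`: the function name `last_keyword_end`, the `CHUNK` and
`MAX_KW` constants, and the use of `memcpy` in place of a real stream read
are all assumptions made for the example.  Unlike the patch, it returns the
offset directly after the keyword (without the extra `+ 1` for the following
whitespace) and returns 0 if nothing is found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK   256   /* bytes read per iteration (hypothetical)  */
#define MAX_KW   64   /* longest keyword supported by this sketch */

/* Scan `data' front to back in CHUNK-sized reads and return the   */
/* offset just past the LAST occurrence of `kw', or 0 if there is  */
/* none.  Only a small buffer is held in memory at any time.       */
static size_t
last_keyword_end( const unsigned char*  data,
                  size_t                size,
                  const char*           kw )
{
  unsigned char  buf[CHUNK + MAX_KW];
  size_t         kw_len = strlen( kw );
  size_t         carry  = 0;   /* bytes kept from the previous chunk */
  size_t         pos    = 0;   /* offset of the next unread byte     */
  size_t         found  = 0;


  assert( kw_len > 0 && kw_len <= MAX_KW );

  while ( pos < size )
  {
    size_t  n = size - pos < CHUNK ? size - pos : CHUNK;
    size_t  avail, base, i;


    /* stands in for a stream read into the buffer */
    memcpy( buf + carry, data + pos, n );
    pos += n;

    avail = carry + n;
    base  = pos - avail;       /* stream offset of buf[0] */

    /* later matches overwrite earlier ones, so the last one wins */
    for ( i = 0; i + kw_len <= avail; i++ )
      if ( memcmp( buf + i, kw, kw_len ) == 0 )
        found = base + i + kw_len;

    /* keep the tail so a keyword straddling two chunks is seen */
    carry = avail < kw_len - 1 ? avail : kw_len - 1;
    memmove( buf, buf + avail - carry, carry );
  }

  return found;
}
```

The returned offset plays the role of the parser limit: everything before it
is pinned in memory once, and the token parser then decides which candidate,
if any, is a real token.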
    
    Reported as
    
      https://issues.chromium.org/issues/40201695
---
 src/cid/cidparse.c | 67 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 21 deletions(-)

diff --git a/src/cid/cidparse.c b/src/cid/cidparse.c
index 1e7c9d2bf..73a3ade89 100644
--- a/src/cid/cidparse.c
+++ b/src/cid/cidparse.c
@@ -90,10 +90,15 @@
     if ( error )
       goto Exit;
 
-  Again:
-    /* now, read the rest of the file until we find */
-    /* `StartData' or `/sfnts'                      */
+    if ( !stream->read ) {
+      /* just parse memory-based streams */
+      offset = stream->size;
+    }
+    else
     {
+      /* Find the last `StartData` or `/sfnts`.  The parser requires */
+      /* contiguous memory; attempt to pin as little as necessary.   */
+
       /*
        * The algorithm is as follows (omitting the case with less than 256
        * bytes to fill for simplicity).
@@ -119,7 +124,8 @@
       FT_Byte*  p           = buffer;
 
 
-      for ( offset = FT_STREAM_POS(); ; offset += 256 )
+      offset = 0;
+      while ( 1 )
       {
         FT_ULong  stream_len;
 
@@ -127,7 +133,7 @@
         stream_len = stream->size - FT_STREAM_POS();
 
         read_len = FT_MIN( read_len, stream_len );
-        if ( FT_STREAM_READ( p, read_len ) )
+        if ( read_len && FT_STREAM_READ( p, read_len ) )
           goto Exit;
 
         /* ensure that we do not compare with data beyond the buffer */
@@ -141,20 +147,23 @@
                ft_strncmp( (char*)p, STARTDATA, STARTDATA_LEN ) == 0 )
           {
             /* save offset of binary data after `StartData' */
-            offset += (FT_ULong)( p - buffer ) + STARTDATA_LEN + 1;
-            goto Found;
+            offset = FT_STREAM_POS() - read_len - read_offset
+                     + (FT_ULong)( p - buffer ) + STARTDATA_LEN + 1;
           }
           else if ( p[1] == 's'                                   &&
                     ft_strncmp( (char*)p, SFNTS, SFNTS_LEN ) == 0 )
           {
-            offset += (FT_ULong)( p - buffer ) + SFNTS_LEN + 1;
-            goto Found;
+            offset = FT_STREAM_POS() - read_len - read_offset
+                     + (FT_ULong)( p - buffer ) + SFNTS_LEN + 1;
           }
         }
 
-        if ( read_offset + read_len < STARTDATA_LEN )
+        if ( read_offset + read_len <= STARTDATA_LEN )
         {
-          FT_TRACE2(( "cid_parser_new: no `StartData' keyword found\n" ));
+          if ( offset )
+            goto Found;
+
+          FT_TRACE2(( "cid_parser_new: no `StartData` keyword found\n" ));
           error = FT_THROW( Invalid_File_Format );
           goto Exit;
         }
@@ -171,9 +180,9 @@
     }
 
   Found:
-    /* We have found the start of the binary data or the `/sfnts' token. */
-    /* Now rewind and extract the frame corresponding to this PostScript */
-    /* section.                                                          */
+    /* We have found an efficient range to look for the binary data or    */
+    /* `/sfnts' token.  Now rewind and extract the frame corresponding to */
+    /* this PostScript section.                                           */
 
     ps_len = offset - base_offset;
     if ( FT_STREAM_SEEK( base_offset )                  ||
@@ -187,8 +196,8 @@
     parser->root.limit     = parser->root.cursor + ps_len;
     parser->num_dict       = FT_UINT_MAX;
 
-    /* Finally, we check whether `StartData' or `/sfnts' was real --  */
-    /* it could be in a comment or string.  We also get the arguments */
+    /* Find the first real `StartData' or `/sfnts' -- the last one    */
+    /* could be in a comment or string.  We also get the arguments    */
     /* of `StartData' to find out whether the data is represented in  */
     /* binary or hex format.                                          */
 
@@ -216,6 +225,7 @@
       {
         T1_TokenRec  type_token;
         FT_Long      binary_length;
+        FT_ULong     found_offset;
 
 
         parser->root.cursor = arg1;
@@ -234,6 +244,24 @@
             parser->binary_length = (FT_ULong)binary_length;
         }
 
+        /* set the real values for the parser, if different */
+        found_offset = (FT_ULong)( cur - parser->postscript )
+                       + STARTDATA_LEN + 1;
+        if ( found_offset != offset )
+        {
+          FT_FRAME_RELEASE( parser->postscript );
+
+          ps_len = found_offset - base_offset;
+          if ( FT_STREAM_SEEK( base_offset )                  ||
+               FT_FRAME_EXTRACT( ps_len, parser->postscript ) )
+            goto Exit;
+
+          parser->data_offset    = found_offset;
+          parser->postscript_len = ps_len;
+          parser->root.base      = parser->postscript;
+          parser->root.cursor    = parser->postscript;
+          parser->root.limit     = parser->root.cursor + ps_len;
+        }
         goto Exit;
       }
       else if ( cur[1] == 's'                                   &&
@@ -251,11 +279,8 @@
       cur  = parser->root.cursor;
     }
 
-    /* we haven't found the correct `StartData'; go back and continue */
-    /* searching                                                      */
-    FT_FRAME_RELEASE( parser->postscript );
-    if ( !FT_STREAM_SEEK( offset ) )
-      goto Again;
+    FT_TRACE2(( "cid_parser_new: no `StartData` token found\n" ));
+    error = FT_THROW( Invalid_File_Format );
 
   Exit:
     return error;
