emacs-bug-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#72086: closed (scm_i_utf8_string_hash overruns buffer when len is ze


From: GNU bug Tracking System
Subject: bug#72086: closed (scm_i_utf8_string_hash overruns buffer when len is zero)
Date: Wed, 11 Dec 2024 17:57:02 +0000

Your message dated Wed, 11 Dec 2024 11:56:03 -0600
with message-id <87ldwmcde4.fsf@trouble.defaultvalue.org>
and subject line Re: scm_i_utf8_string_hash overruns buffer when len is zero
has caused the debbugs.gnu.org bug report #72086,
regarding scm_i_utf8_string_hash overruns buffer when len is zero
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs@gnu.org.)


-- 
72086: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=72086
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems
--- Begin Message --- Subject: scm_i_utf8_string_hash overruns buffer when len is zero Date: Fri, 12 Jul 2024 19:43:18 -0500
The first patch attempts to fix that, and the second is an optimization
when then input is ASCII (since we already have the information we need
to detect that):

>From 619e3d3afec2c116007d9cb2ad32a500fb32a7dd Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb@defaultvalue.org>
Date: Sun, 30 Jun 2024 22:41:40 -0500
Subject: [PATCH 1/2] scm_i_utf8_string_hash: don't overrun when len is zero

When the length is zero, the previous code would include the byte after
the end of the string in the hash.  Fix that (the wide and narrow
hashers also guard against it via "case 0"), and while we're there,
switch to u8_mbtouc since the unsafe variant is now the same (see the
info pages), and don't bother mutating length for the trailing bytes,
since we don't need to.

libguile/hash.c (scm_i_utf8_string_hash): switch to u8_mbtouc, and avoid
overrun when len == 0.
---
 libguile/hash.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index a038a11bf..d92f60df8 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -195,32 +195,34 @@ scm_i_utf8_string_hash (const char *str, size_t len)
   /* Handle most of the key.  */
   while (length > 3)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       a += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       b += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       c += u32;
       mix (a, b, c);
       length -= 3;
     }
 
   /* Handle the last 3 elements's.  */
-  ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-  a += u32;
-  if (--length)
+  if (length)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-      b += u32;
-      if (--length)
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
+      a += u32;
+      if (length > 1)
         {
-          ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-          c += u32;
+          ustr += u8_mbtouc (&u32, ustr, end - ustr);
+          b += u32;
+          if (length > 2)
+            {
+              ustr += u8_mbtouc (&u32, ustr, end - ustr);
+              c += u32;
+            }
         }
+      final (a, b, c);
     }
 
-  final (a, b, c);
-
   if (sizeof (unsigned long) == 8)
     ret = (((unsigned long) c) << 32) | b;
   else
-- 
2.43.0

>From c6a888888101a893820f38561898e7c0390dd9d2 Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb@defaultvalue.org>
Date: Mon, 1 Jul 2024 20:56:57 -0500
Subject: [PATCH 2/2] scm_i_utf8_string_hash: optimize ASCII

Since we already compute the char length, use that to detect all ASCII
strings and handle those the same way we handle latin-1.

libguile/hash.c (scm_i_utf8_string_hash): when byte_len == char_len,
(i.e. fixed-width ASCII) optimize hashing via existing narrow path.
---
 libguile/hash.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index d92f60df8..fe950e358 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -169,25 +169,29 @@ scm_i_latin1_string_hash (const char *str, size_t len)
 unsigned long 
 scm_i_utf8_string_hash (const char *str, size_t len)
 {
-  const uint8_t *end, *ustr = (const uint8_t *) str;
-  unsigned long ret;
-
-  /* The length of the string in characters.  This name corresponds to
-     Jenkins' original name.  */
-  size_t length;
-
-  uint32_t a, b, c, u32;
-
   if (len == (size_t) -1)
     len = strlen (str);
 
-  end = ustr + len;
-
+  const uint8_t *ustr = (const uint8_t *) str;
   if (u8_check (ustr, len) != NULL)
     /* Invalid UTF-8; punt.  */
     return scm_i_string_hash (scm_from_utf8_stringn (str, len));
 
-  length = u8_mbsnlen (ustr, len);
+  /* The length of the string in characters.  This name corresponds to
+     Jenkins' original name.  */
+  size_t length = u8_mbsnlen (ustr, len);
+
+  if (len == length) // ascii, same as narrow_string_hash above
+    {
+      unsigned long ret;
+      JENKINS_LOOKUP3_HASHWORD2 (str, len, ret);
+      ret >>= 2; /* Ensure that it fits in a fixnum.  */
+      return ret;
+    }
+
+  const uint8_t * const end = ustr + len;
+  uint32_t a, b, c, u32;
+  unsigned long ret;
 
   /* Set up the internal state.  */
   a = b = c = 0xdeadbeef + ((uint32_t)(length<<2)) + 47;
-- 
2.43.0

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

--- End Message ---
--- Begin Message --- Subject: Re: scm_i_utf8_string_hash overruns buffer when len is zero Date: Wed, 11 Dec 2024 11:56:03 -0600
Version: 3.0.11

A slightly different version of the fix is now in main.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4


--- End Message ---

reply via email to

[Prev in Thread] Current Thread [Next in Thread]