bug-sed
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#30520: sed: silently fail substitution for 2GB+ single-line file


From: Assaf Gordon
Subject: bug#30520: sed: silently fail substitution for 2GB+ single-line file
Date: Mon, 19 Feb 2018 13:32:02 -0700
User-agent: NeoMutt/20170113 (1.7.2)

Hello,

On Mon, Feb 19, 2018 at 06:51:04AM +0000, YushiOMOTE wrote:
> When I pass a 2GB+-single-line to sed for substitution, sed exits with
> success status code but it silently fails substitution.
> 
> ```
> $ sed `s/aaa/bbb/g` 20GB+-single-line-file.txt > out.txt && echo OK
> OK
> (...but nothing gets substituted even though the target string exists)

Thank you for reporting this issue, and providing such detailed
analysis. I am able to reproduce this issue.

Few minor comments:

> In summary, the cause seems that sed isn't checking error return values
> from address@hidden

Technically, it's from gnulib, not glibc.
The code is in 
https://opengrok.housegordon.com/source/xref/gnulib/lib/regexec.c#369
the function 're_search_stub'.

> What is happening in detail seems as follows.
> 
> At the beginning of `do_subst` function (sed/execute.c),
> [...] 
> `line.length` (unsigned long), is set to the value larger than INT_MAX. And
> when `match_regex` (sed/regexp.c) calls `re_search`,
> 
> ```
>     ret = re_search (&regex->pattern, buf, buflen, buf_start_offset,
>                      buflen - buf_start_offset,
>                      regsize ? regarray : NULL);
> ```
> 
> `buflen` here is equal to `line.length` (the value larger than INT_MAX) but
> the corresponding argument of `re_search` is `int`. The value is converted
> to a negative number. Then, `re_search` returns -1 due to its error
> checking. But the return value is not treated by sed and no errors are
> raised.

>From a cursory look, it seems "re_search_stub" does not have error
checking for this specific case (length<0), and it continues with its
string search, but finds nothing - thus returning -1.
It returns "-1" both when the pattern is not found, and when there
are such value errors.

Later code in 'regexp.c:match_regex()' does check the returned value
like so:
   return (ret > -1);

Therefore, "match_regex" returns FALSE in this case,
and "do_subst" simply concludes that the string was not found.
Sed then terminates without performing any replacement,
and without reporting any errors.


> A possible fix could be adding an error checking for `re_search` (like this
> patch).

I'm not sure this would work.
The code already checks for -1, which is used to indicate "pattern not found".

The gnulib implementation of "re_search" is described here:
https://www.gnu.org/software/gnulib/manual/html_node/GNU-Searching.html
It says:
   "If no match is found, re_search returns -1.
    If a match is found, it returns the index where the match began.
    If an internal error happens, it returns -2."

However it seems that "internal errors" only relates to failed memory 
allocations.
Out of range values are treated by "re_search" as simple "pattern not found",
as evident by this check:

   /* Check for out-of-range.  */
   if (BE (start < 0 || start > length, 0))
     return -1;

In https://opengrok.housegordon.com/source/xref/gnulib/lib/regexec.c#381


> Or, we can check if the value of line length exceeds INT_MAX before
> reaching `re_search`. (`grep` checks it and returns with error status code)

This is likely the simplest solution for now, without changing gnulib itself.


regards,
 - assaf





reply via email to

[Prev in Thread] Current Thread [Next in Thread]