bug#34133: Huge memory usage and output size when using "H" and "G"
From: Hongxu Chen
Subject: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sun, 20 Jan 2019 14:38:18 +0800
Hi Assaf,
While I still think this is a defect in sed, I agree that programmers
should ensure their scripts will not result in a bomb. What I am wondering
is whether there could be some built-in check to avoid this (e.g., roughly,
flagging a "G" directly followed by an "H").
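As a sketch of the kind of heuristic I mean (crude and external, with
script.sed as a placeholder name; a real built-in check would have to
parse the script, and this pattern misses commands separated by newlines
or guarded by different addresses):

  grep -Eq 'G[[:space:];]*H' script.sed &&
      echo 'warning: "G" followed by "H" may grow exponentially' >&2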
In addition, the error messages could be more meaningful in certain
scenarios. I believe different optimizations are involved in the following
scripts, but the stderr output could be improved.
sed '/b/e; G; H;' input >/dev/null   # instant exit, fine
sed '/a/e; G; H;' input >/dev/null   # many lines of `sh: 1: Syntax error: ";" unexpected`
sed '/a/x; G; H;' input >/dev/null   # instant exit, fine
sed '/b/x; G; H;' input >/dev/null   # long wait
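As an aside, the growth itself is easy to observe at safe sizes; here is
a small sketch showing that the output of "G;H" roughly doubles with each
added input line:

  for n in 5 10 15 20; do
    printf '%2d lines -> ' "$n"
    seq "$n" | sed 'G;H' | wc -c
  done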
Generally, the *semantics* of sed scripts should be much clearer. To be
honest, I was really confused about their meaning until I had thought
about them for quite a while.
Best Regards,
Hongxu
On Sun, Jan 20, 2019 at 11:37 AM Assaf Gordon <address@hidden> wrote:
> Hello,
>
> On 2019-01-19 7:23 p.m., Hongxu Chen wrote:
> > We think the way sed works may be vulnerable to attacks.
>
> Not more than any other program that uses memory or disk
> based on its input.
>
> > If the user downloads some sed scripts and runs them *without root
> > privilege*, the host machine may soon run out of memory;
>
> First,
> root privileges have nothing to do with it.
> A non-privileged user can consume as much memory as they want,
> and a poorly configured machine will be brought to its knees.
>
> Well configured machines will not allow user-space programs
> to cripple them (but I readily admit that such configuration
> is not trivial to achieve).
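>
> (As one concrete example, a sketch, assuming a Bourne-style shell where
> "ulimit -v" caps a process's virtual memory in kilobytes:
>
>   # cap the subshell at ~512MB so a runaway script fails fast
>   ( ulimit -v 524288; seq 30 | sed 'G;H' >/dev/null )
>
> Under such a limit, sed stops with an allocation error instead of
> dragging the whole machine down.)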
>
> Second,
> Any user who downloads any program from an untrusted source is
> expected to know what they're doing.
> If they don't, then they can cause a lot of damage.
>
> This has nothing to do with sed.
>
> > in my case, the machine actually hangs and I have to restart
> > it.
>
> Yes, many common installations are not configured to handle
> memory exhaustion.
>
> > The problem may be more severe when the machine is hosting some
> > service, or itself performs sed-based work such as text processing
> > (which may be rare), even inside some sandbox. The issue may also be
> > triggered unwittingly, causing surprise and trouble.
>
> That is all true, but has nothing to do with sed.
>
> >
> > > Any program that keeps the input in memory is vulnerable to
> > > unbounded input size
> >
> > I think the input size is not big; and it can be made even smaller,
> > as long as more "G;H"s are appended to the script.
> > Maybe sed could do some kind of flushing to avoid the memory usage?
> >
>
> I'll rephrase my words:
>
> Your sed program has O(2^N) space requirements.
> Any program that has exponential behavior (be it space or time)
> will quickly run into pathological cases.
>
> So while your input file is not too big (N=30 lines),
> it leads to huge memory requirements.
>
> I recommend reading some background about complexity
> (most of these deal with time complexity, but the same applies to
> space):
> http://bigocheatsheet.com/
> https://en.wikipedia.org/wiki/Time_complexity#Exponential_time
> https://en.wikipedia.org/wiki/Computational_complexity
>
> To illustrate my point further, here are similar examples in awk and
> perl that would choke on your input:
>
> awk '{ buf = buf $1 ; buf = buf buf } END { print buf }'
> perl -ne '$buf .= $_ ; $buf .= $buf ; print $buf' < input
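>
> To watch the doubling without risking your machine, run the awk version
> on a tiny input first (a sketch; it prints just the buffer length):
>
>   printf 'a\nb\nc\n' |
>     awk '{ buf = buf $1 ; buf = buf buf } END { print length(buf) }'
>
> Each input line roughly doubles length(buf) (this prints 14), so 30
> lines of input put you at the 2^30 scale.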
>
> Just as you wrote above, a non-root user who runs these on your input
> will consume a huge amount of memory. This is why I said it has nothing
> to do with sed.
>
> To see the unfolding of exponential growth in action,
> try the following C program (which goes to show it is not just
> awk/perl/sed scripts):
>
> /* Compile with:
>      gcc -o 1 1.c -lm
>
>    Run with:
>      seq 30 | ./1
> */
> #define _GNU_SOURCE
> #define _POSIX_C_SOURCE 200809L
> #include <stdlib.h>
> #include <stdio.h>
> #include <math.h>
>
> int main (void)
> {
>   char *line = NULL;  /* getline() needs a NULL pointer ...           */
>   size_t n = 0;       /* ... and a zero buffer size on the first call */
>   size_t linenum = 0;
>   size_t buf_size = 0;
>   ssize_t i;
>
>   while ((i = getline (&line, &n, stdin)) != -1) {
>     ++linenum;
>     buf_size += n;    /* append one line's buffer ...                 */
>     buf_size *= 2;    /* ... then double it, as 'G;H' does            */
>     printf ("line %zu (%zu bytes), 2^%zu == %g, buf_size = %zu bytes\n",
>             linenum, n, linenum, pow (2, linenum), buf_size);
>   }
>
>   free (line);
>   return 0;
> }
>
>
> The above does not actually allocate memory; it just shows how much
> would be allocated. Run it with your input and see for yourself:
>
> line 1 (120 bytes), 2^1 == 2, buf_size = 240 bytes
> line 2 (120 bytes), 2^2 == 4, buf_size = 720 bytes
> line 3 (120 bytes), 2^3 == 8, buf_size = 1680 bytes
> [...]
> line 30 (120 bytes), 2^30 == 1.07374e+09, buf_size = 257698037520 bytes
>
> If you tried to do "malloc(buf_size)" it would consume all your memory
> (if you have less than 256GB).
>
> ----
>
> This applies not just to memory (RAM), but to disk space as well
> (which conceptually is just a different type of storage).
>
> Try this example:
>
> printf a > in
> for i in $(seq 20); do cat in in > out ; mv out in ; done ;
>
> The input file starts with 1 byte.
> After 20 rounds, the file size will be 1MB.
> If you try 30 rounds, the file will be 1GB.
> The "20" and "30" here correspond to your input size (number of lines).
> A small change in the input leads to a large change in the output
> (which is why "exponential" programs are considered bad).
>
> If you are still not convinced, try 40 rounds - that will attempt
> to create a 1TB file. If your disk is smaller, it will become full
> (which is just like memory exhaustion).
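>
> A variant of the loop that prints the size after each round, so you can
> watch the doubling as it happens (20 rounds, topping out at 1MB, is
> safe):
>
>   printf a > in
>   for i in $(seq 20); do cat in in > out && mv out in && wc -c < in; done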
>
> ----
>
> I hope this resolves the issue.
>
> regards,
> - assaf