bug-standards
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: GNU Coding Standards, automake, and the recent xz-utils backdoor


From: Jacob Bachmeyer
Subject: Re: GNU Coding Standards, automake, and the recent xz-utils backdoor
Date: Mon, 01 Apr 2024 22:51:57 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.22) Gecko/20090807 MultiZilla/1.8.3.4e SeaMonkey/1.1.17 Mnenhy/0.7.6.0

Jose E. Marchesi wrote:
Jose E. Marchesi wrote:
[...]
I agree that distcheck is good but not a cure all.  Any static
system can be attacked when there is motive, and unit tests are
easily gamed.
The issue seems to be releases containing binary data for unit tests,
instead of source or scripts to generate that data.  In this case,
that binary data was used to smuggle in heavily obfuscated object
code.
As a side note, GNU poke (https://jemarch.net/poke) is good for
generating arbitrarily complex binary data from clear textual
descriptions.
While it is suitable for that use, at last check poke is itself very
complex, complete with its own JIT-capable VM.  This is good for
interactive use, but I get nervous about complexity in testsuites,
where simplicity can greatly aid debugging, and it /might/ be possible
to hide a backdoor similarly in a poke pickle.  (This seems to be a
general problem with powerful interactive editors.)

Yes, I agree simplicity it is very desirable, in testsuites and actually
everywhere else.  I also am not fond of dragging in dependencies.

Exactly---I am sure that poke is great for interactive use, but a self-contained solution is probably better for a testsuite.

But I suppose we also agree in that it is not possible to assembly
non-trivial binary data structures in a simple way, without somehow
moving the complexity of the encoding into some sort of generator, which
will not be simple.  The GDB testsuite, for example, ships with a DWARF
assembler written in around 3000 lines of Tcl.  Sure, it is simpler than
poke and doesn't drag in additional dependencies.  But it has to be
carefully maintained and kept up to date, and the complexity is there.

The problem for a compression tool testsuite is that compression formats are (I believe) defined as byte-streams or bit-streams. Further, the generator(s) must be able to produce /incorrect/ output as well, in order to test error handling.

Further, GNU poke defines its own specialized programming language for
manipulating binary data.  Supplying generator programs in C (or C++)
for binary test data in a package that itself uses C (or C++) ensures
that every developer with the skills to improve or debug the package
can also understand the testcase generators.

Here we will have to disagree.

IMO it is precisely the many and tricky details on properly marshaling
binary data in general-purpose programming languages that would have
greater odds to lead to difficult to understand, difficult to maintain
and possibly buggy or malicious encoders.  The domain specific language
is here an advantage, not a liability.

This you need to do in C to encode and generate test data for a single
signed 32-bit NUMBER in an output file in a _more or less_ portable way:

  void generate_testdata (off_t offset, int endian, int number)
  {
    int bin_flag = 0, fd;

  #ifdef _WIN32
    int bin_flag = O_BINARY;
  #endif
    fd = open ("testdata.bin", bin_flag, S_IWUSR);
    if (fd == -1)
      fatal ("error generating data.");
if (endian == BIG)
      {
        b[0] = (number >> 24) & 0xff;
        b[1] = (number >> 16) & 0xff;
        b[2] = (number >> 8) & 0xff;
        b[3] = number & 0xff;
      }
    else
      {
        b[3] = (number >> 24) & 0xff;
        b[2] = (number >> 16) & 0xff;
        b[1] = (number >> 8) & 0xff;
        b[0] = number & 0xff;
      }

    lseek (fd, offset, SEEK_SET);
    for (i = 0; i < 4; ++i)
      write (fd, &b[i], 1);
    close (fd);
  }

While that is a nice general solution, (aside from neglecting the declaration "uint8_t b[4];"; with "int b[4];", the code would only work on a little-endian processor; with no declaration, the compiler will reject it) a compression format would be expected to define the endianess of stored values, so the major branch in that function would collapse to just one of its alternatives. Compression formats are generally defined as streams, so a different decomposition of the problem would likely make more sense: (example untested)

   void emit_int32le (FILE * out, int value)
   {
     unsigned int R, i;

     for (R = (unsigned int)value, i = 0; i < 4; R = R >> 8, i++)
       if (fputc(R & 0xff, out) == EOF)
         fatal("error writing int32le");
   }
Other code handles opening OUT, or OUT is actually stdout and we are writing down a pipe or the shell handled opening the file. (The main function can easily check that stdout is not a terminal and bail out if it is.) Remember that I am suggesting test generator programs, which do not need to be as general as ordinary code, nor do they need the same level of user-friendliness, since they are expected to be run from scripts that encode the precise knowledge of how to call them. (That this version is also probably more efficient by avoiding a syscall for every byte written is irrelevant for its intended use.)

This is the Poke equivalent:

  fun generate_testdata = (offset<uint<64>,B> off, int<32> endian, int<32> 
number) void:
  {
    var fd = open ("testdata.bin");
    set_endian (endian);
    int<32> @ fd : off = number;
    close (fd);
  }

And thanks to the DSL, this scales nicely to more complex structures,
such as an ELF64 relocation instead of a signed 32-bit integer:

  fun generate_testdata = (offset<uint<64>,B> off, int<32> endian, int<32> 
number) void:
  {
    type Elf64_RelInfo =
      struct Elf64_Xword
      {
        uint<32> r_sym;
        uint<32> r_type;
      };
type Elf64_Rela =
      struct
      {
        offset<uint<64>,B> r_offset;
        Elf64_RelInfo r_info;
        offset<int<64>,B> r_addend;
      };
var fd = open ("got32reloc.bin");
    set_endian (endian);
    Elf64_Rela @ 0#B
      = Elf64_Rela { r_info = Elf64_RelInfo { r_sym = 0xff00, r_type = 3 } }
    close (fd);
  }

Indeed Poke's language is excellent for concisely describing complex structures accurately, closely resembling C's syntax, but if I understand correctly, compression tool test data has few complex structures and is mostly byte-streams. Also, how did poke know to send the structure to "got32reloc.bin"? Was it simply the most-recently-opened file (or perhaps the top entry on a stack of recently-opened files)? Were the OFF and NUMBER parameters intentionally unused, accidentally unused, or somehow implicitly used in the second example?


-- Jacob



reply via email to

[Prev in Thread] Current Thread [Next in Thread]