Re: readarray leaves a NULL char embedded in each element
From: Greg Wooledge
Subject: Re: readarray leaves a NULL char embedded in each element
Date: Mon, 24 Jun 2024 14:06:39 -0400
On Mon, Jun 24, 2024 at 10:50:15 -0600, Rob Gardner wrote:
> Description:
> When using space or newline as a delimiter with readarray -d,
> elements in the array have the delimiter replaced with NULL,
> which is left embedded in each element of the array.
This isn't possible. Bash doesn't allow the storing of NUL bytes in
variables, and further, Unix/Linux doesn't permit passing NUL bytes as
command-line arguments to programs.
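A quick way to see the first half of that claim, as a minimal sketch (the variable name x is just for illustration): command substitution drops NUL bytes before the value is ever stored. Recent bash versions also print a warning on stderr when this happens.

```bash
# printf emits three bytes: 'A', NUL, 'B'.
# Bash discards the NUL while capturing the output, so only "AB" is stored.
x=$(printf 'A\0B')

declare -p x      # declare -- x="AB"
echo "${#x}"      # 2
```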
> This
> causes incorrect behavior when using array elements as arguments to
> sub-processes.
(Bash cannot pass a NUL byte as an argument.)
> I first noticed the problem when trying to use an array element as
> part of an argument to sed:
> readarray -d ' ' x << "A B"
> sed -e s/X/${x[0]}/
First point, your readarray command is using the wrong redirection
operator. I'm fairly sure you meant to write <<< instead of <<. Using
the here-string operator <<<, we can see that the first array element
retains the space delimiter (because -t was not used), and the second
retains the newline character, which is added by <<<.
hobbit:~$ readarray -d ' ' x <<< "A B"
hobbit:~$ declare -p x
declare -a x=([0]="A " [1]=$'B\n')
Second point, your sed command is not using quotes.
> This caused sed to complain "unterminated `s' command".
The space at the end of x[0] causes word splitting to occur, due to the
lack of quotes. The s/X/A part becomes one argument, and the / part
becomes a second argument.
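Here's a small sketch that counts the argument words instead of running sed, to show the difference quoting makes. The count_args helper is made up for this illustration, and the default IFS is assumed.

```bash
readarray -d ' ' x <<< "A B"      # x[0] is "A " (trailing space retained)

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args s/X/${x[0]}/)       # split into "s/X/A" and "/"
quoted=$(count_args "s/X/${x[0]}/")       # one word: "s/X/A /"

echo "unquoted: $unquoted, quoted: $quoted"   # unquoted: 2, quoted: 1
```

With quotes, sed would receive a single, complete s command (with a trailing space in the replacement text, since -t wasn't used).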
> Using "read -a" instead of readarray produces correct results.
That one uses IFS to separate and trim the input fields. The default
IFS contains a space, so none of the array elements contains a space.
Therefore, your lack of quoting probably doesn't cause any additional
word splitting.
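For comparison, a minimal sketch of what read -a stores for the same input, assuming the default IFS (the array name y is arbitrary):

```bash
# read splits the line on IFS and strips the separators from each field.
read -r -a y <<< "A B"

declare -p y      # declare -a y=([0]="A" [1]="B")
```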
> With a simple C program to print out the characters in argv[1], one
> can see that a NULL character is left in the argument. Program:
> #include <stdio.h>
> #include <string.h>
> void main(int argc, char *argv[])
> {
> int i, n;
> if (argc > 1) {
> n = strlen(argv[1]);
> for (i=0; i<n+2; i++) printf("%d ", argv[1][i]);
> }
> }
I'm not entirely clear on what this C program is doing. As far as I
can tell, each argv[1][i] is a char that gets promoted to int when it's
passed to printf, so %d prints that byte's numeric value. The loop
runs to n+2, so it also reads the terminating NUL and one byte past it.
Sorry, it's been ages since I did C.
> $ readarray -d ' ' X <<< "A B C"
> $ read -d ' ' -a Y <<< "A B C"
> $ readarray -td ' ' Z <<< "A B C"
> $ ./printarg ${X[0]}A
> 65 0 65 $
In this command, ${X[0]} is a capital A plus a space character. You're
not using quotes, so ${X[0]}A becomes the two argument words "A" and "A".
hobbit:~$ readarray -d ' ' X <<< "A B C"
hobbit:~$ declare -p X
declare -a X=([0]="A " [1]="B " [2]=$'C\n')
hobbit:~$ printf '<%s> ' ${X[0]}A ; echo
<A> <A>
Your C program appears to look only at the first argument word, "A",
and ignores the second word. It takes strlen("A"), which is 1, and
adds 2 to it, getting 3. Thus, it loops 3 times, and thus, we see
the three numbers it writes to stdout.
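The argument words are stored internally as NUL-terminated strings, so
it's no surprise that the second loop iteration prints a 0. The
third loop iteration is reading a byte from beyond the end of the
argument string, and C makes no promise about what lives there. On
Linux it's likely the first byte of whatever is stored next (here,
plausibly the second "A" word), unless I'm misreading the situation.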
> $ ./printarg ${Y[0]}A
> 65 65 0 83 $
Here, Y[0] contains "A", so you're passing "AA" as your sole argument.
The argument's string length is 2, so you're looping 4 times. The
numbers 65 65 0 are from the internal storage of the argument words, and
the 83 is garbage from beyond the end of the string.
> $ ./printarg ${Z[0]}A
> 65 65 0 83 $
Here, Z[0] is "A" instead of "A ", because you used -t to trim the space.
So you're passing "AA" as your argument, just like the previous call.
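You can confirm what -t does with declare -p, same as before. Note that the last element still carries the newline added by <<<, because that newline is not the delimiter:

```bash
# -t removes the trailing delimiter (a space here) from each element.
readarray -td ' ' Z <<< "A B C"

declare -p Z      # declare -a Z=([0]="A" [1]="B" [2]=$'C\n')
```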
So, in a nutshell, this is what I believe you need to see:
1) readarray without -t retains the delimiter, even if it's a space
or newline. It does not convert the delimiter to a NUL byte.
2) Unquoted ${X[0]} when X[0] ends with a space causes word splitting
to occur, so anything after the ${X[0]} will become a new word
(assuming IFS hasn't been modified).
3) Arguments passed to a program via the Unix kernel are NUL-terminated
strings. Therefore, the NUL byte can't be part of the argument
itself. It's a signpost that the argument string has ended.
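Putting the three points together, here's one way to get clean elements and pass them safely. This is only a sketch: the array name arr is arbitrary, and feeding the input through printf '%s' is just one option for avoiding the newline that <<< appends.

```bash
# -t strips the space delimiter; printf '%s' supplies the input with no
# trailing newline, so the last element is clean too.
readarray -td ' ' arr < <(printf '%s' "A B C")

declare -p arr                        # declare -a arr=([0]="A" [1]="B" [2]="C")

# Quote the expansion so it always stays a single argument word.
printf '<%s> ' "${arr[0]}A" ; echo    # <AA>
```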