bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
From: Assaf Gordon
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Thu, 11 May 2017 22:05:33 -0400
Hello,
> On May 11, 2017, at 18:39, Dick Dunbar <address@hidden> wrote:
>
> To round out this discussion:
> I wanted a simple filter to ensure filename paths didn't contain spaces.
There's a nuance here to verify:
Did you want a filter to ensure none of your files have spaces (e.g. detect
whether some files do have spaces and then fail),
or
did you want a robust way to use the 'mv' command (as below), even in
the case of files with spaces?
If you just wanted to detect files with spaces,
something like this would work:
find -type f -print0 | grep -qz '[[:space:]]' && echo have-files-with-spaces
If you wanted to print files that have spaces, something like this would
work:
find -type f -print0 | grep -z '[[:space:]]' | tr '\0' '\n'
> For example:
> find /foo -maxdepth 1 -atime +366 -print0 |
> xargs -0 sh -c 'mv "$@" /archive' move
I'm not sure what the purpose of 'move' is in the above command.
But if you wanted to move all the found files to the directory /archive,
even if they have spaces in them, a more efficient way would be:
find /foo -maxdepth 1 -atime +366 -print0 | \
xargs -0 --no-run-if-empty mv -t /archive
This GNU extension (-t DEST) works great with find/xargs,
as xargs by default adds the parameters at the end of the command line,
and "-t" means the destination directory is specified first.
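To try the 'mv -t' pattern without touching real data, here is a minimal sketch. It assumes GNU find, xargs and mv; the temporary directories and file names are placeholders I made up, standing in for /foo and /archive, and the -atime test is left out so that the freshly created files match.

```shell
set -eu
# Hypothetical sandbox: src stands in for /foo, archive for /archive.
work=$(mktemp -d)
mkdir "$work/src" "$work/archive"
touch "$work/src/a" "$work/src/c d"

# -mindepth 1 keeps the starting directory itself out of the list.
# --no-run-if-empty is an xargs option: mv is skipped when find prints nothing.
find "$work/src" -mindepth 1 -maxdepth 1 -type f -print0 |
  xargs -0 --no-run-if-empty mv -t "$work/archive"

# Count what arrived in the destination (neither test name contains a newline).
moved=$(ls "$work/archive" | wc -l)
echo "moved $moved file(s)"
rm -rf "$work"
```

Both files, including 'c d', should end up in the archive directory, because xargs appends the names after '-t DEST' as separate arguments.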
> So why are there different flags to indicate null-terminated lines?
> find -print0
> xargs -0
> sed -z
>
> Seems silly. To make a non-breaking-code-change,
> why not add "-z" to the find and xargs command so they are compatible?
Putting aside the naming convention for a moment (remember that each program is
developed by different people) - I'll focus on find/xargs - which are part of
the same package (findutils) and developed by the same people.
These two are designed to work closely together - that's why they have
"-print0" and "-0".
The whole point of the following construct:
find [criteria] -print0 | xargs -0 ANY-PROGRAM
is that 'ANY-PROGRAM' doesn't need to understand NUL line endings at all.
The main reason find and xargs need the NULs is to ensure
file names are not broken by whitespace or even newlines. But once xargs reads
the entire filename, it passes each filename as a single parameter to
ANY-PROGRAM,
and so there's no need to worry any more about filenames with whitespace.
This useful construct breaks down if ANY-PROGRAM is 'sh', which might then
do further parameter splitting based on whitespace.
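A small sketch of that breakdown, assuming GNU find and xargs and a scratch directory of my own invention. It compares a quoted "$@" with an unquoted $@ inside 'sh -c'. Note that the first word after the script ('sh' here, 'move' in the command quoted earlier) fills $0 of the inner shell, so the file names start at $1.

```shell
set -eu
# Hypothetical sandbox with one plain name and one name containing a space.
work=$(mktemp -d)
touch "$work/a" "$work/c d"

# Quoted "$@": each file name stays a single argument in the inner shell.
quoted=$(cd "$work" && find . -type f -print0 |
  xargs -0 sh -c 'for f in "$@"; do echo "$f"; done' sh | wc -l)

# Unquoted $@: the inner shell re-splits './c d' into './c' and 'd'.
unquoted=$(cd "$work" && find . -type f -print0 |
  xargs -0 sh -c 'for f in $@; do echo "$f"; done' sh | wc -l)

echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
rm -rf "$work"
```

The quoted form should count 2 arguments and the unquoted form 3, so the splitting comes from the inner shell script, not from xargs.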
> And ... because we're dealing with the same issue of executables
> creating stream data, why doesn't sed/awk/grep have an option
> to deal with null delimited lines such that "$" would find them.
I'm not sure I understand: sed and grep have "-z" exactly for this purpose.
(also: sort -z , perl -0).
gawk has a slightly different syntax, where you simply set the RS (input
record separator) to NUL:
find -type f -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'
But remember that when you use 'sed -z', the output also uses NULs as
line-terminators,
so it won't look good on the terminal or in a file.
> Having sed recognize \r, \n, \0 as end of line might cause some
> breakage if you have to deal with data that has embedded nulls.
Instead of thinking in general "data that has embedded nulls",
it'll be easier to consider concrete cases.
Text files do not have embedded NULs (by definition; otherwise they are not
text files). So standard text programs (sed/grep/awk) do not need to deal with
NULs as line separators.
The main use case of having NUL as line separator is precisely with "find
-print0".
In this case, either use "xargs -0" and then the actual program doesn't need
to worry about NULs at all, or use the GNU extensions (e.g. 'sed -z' or
'grep -z').
> Had to check:
> find . -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/"
>
> Doesn't work. One very long string of null terminated filenames is returned.
It works perfectly:
1. sed without -z treats newlines (\n) as line terminators.
2. 'find -print0' did not generate any '\n' characters at all.
3. 'sed' read the entire input (i.e. all files separated by NULs),
treated it as one line, and added quotes at the beginning and the end
of the entire buffer.
4. NULs were kept as-is, and are printed on your terminal.
Example:
$ touch a b 'c d'
$ find -type f -print0 | sed -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
27 2e 2f 61 00 2e 2f 62 00 2e 2f 63 20 64 00 27
' . / a \0 . / b \0 . / c d \0 '
> So we now know that sed does not check for \0 as a line terminator.
> And the sed -z flag produces the same long string.
>
> find . -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/"
It also produces the correct output:
This time, because of the '-z', sed indeed reads each filename until the NUL,
and adds quotes around each file.
But it also uses NULs as line terminators on the OUTPUT,
so newline characters are not used at all.
Notice that each file is surrounded by quotes, exactly as you've asked:
$ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | od -An -tx1c
27 2e 2f 61 27 00 27 2e 2f 62 27 00 27 2e 2f 63
' . / a ' \0 ' . / b ' \0 ' . / c
20 64 27 00
d ' \0
The missing piece is that after you've processed each file using 'sed -z',
if you want to print them to the terminal, you still need to convert NULs to
newlines:
$ find -type f -print0 | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n'
'./a'
'./b'
'./c d'
Or, if you wanted to use sed/grep as an intermediate filter between 'find' and
'xargs', then something like this:
find [criteria] -print0 | grep -z [REGEX] | xargs -0 ANYPROGRAM
find [criteria] -print0 | sed -z [SCRIPT] | xargs -0 ANYPROGRAM
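As a concrete instance of the filter pattern, here is a hedged sketch assuming GNU find, grep and xargs. The directory, the file names and the '\.txt$' pattern are placeholders chosen for illustration; 'basename' merely plays the role of ANYPROGRAM.

```shell
set -eu
# Hypothetical sandbox: two names that should pass the filter, one that should not.
work=$(mktemp -d)
touch "$work/keep.txt" "$work/also keep.txt" "$work/skip.log"

# grep -z reads and writes NUL-terminated records, so it can sit between
# 'find -print0' and 'xargs -0'; '$' matches just before each NUL.
# 'xargs -n 1' runs basename once per file name.
kept=$(cd "$work" && find . -type f -print0 |
  grep -z '\.txt$' |
  xargs -0 --no-run-if-empty -n 1 basename |
  sort)

printf '%s\n' "$kept"
rm -rf "$work"
```

Only the two '.txt' names should be printed, and the name with a space arrives at basename as one argument.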
In most of my examples above, whitespace doesn't actually cause problems -
because sed/grep will not be confused by whitespace and won't break the line
(it is mostly shell argument parsing that will get terribly confused by
whitespace,
and also "xargs" with certain parameters).
The real 'kick' is that using NULs allows handling files that have embedded
newlines.
Consider the following:
$ touch a b 'c d' "$(printf 'e\nf')"
$ ls -log
total 0
-rw-r--r-- 1 0 May 12 01:43 a
-rw-r--r-- 1 0 May 12 01:43 b
-rw-r--r-- 1 0 May 12 01:43 c d
-rw-r--r-- 1 0 May 12 01:43 e?f
The last file has an embedded newline, which will mess up a plain
'find | xargs' pipeline:
## incorrect output: the 'e\nf' file is broken, 'echo' is executed
## wrong number of times with non-existing file names:
$ find -type f | xargs -I% echo ==%==
==./e==
==f==
==./a==
==./b==
==./c d==
Using 'xargs -0' will solve it. This output is correct, but perhaps confusing
when displayed on the terminal:
$ find -type f -print0 | xargs -0 -I% echo ==%==
==./e
f==
==./a==
==./b==
==./c d==
And similarly with 'sed -z':
$ find -type f -print0 | sed -z -e 's/^/<<</' -e 's/$/>>>/' | tr '\0' '\n'
<<<./e
f>>>
<<<./a>>>
<<<./b>>>
<<<./c d>>>
One last tip:
Sometimes you want to find and operate on files based on their content
instead of their attributes (e.g. with 'grep').
Here too, a file with spaces or newlines will cause trouble:
$ echo yes > "$(printf 'hello\nworld')"
$ ls -log
total 4
-rw-r--r-- 1 4 May 12 01:57 hello?world
If you wanted to find all files containing 'yes',
grep alone would print confusing output:
$ grep -l yes *
hello
world
And using it with "xargs" will fail:
$ grep -l yes * | xargs -I% echo 'handling file ===%==='
handling file ===hello===
handling file ===world===
Grep has a separate option (upper case -Z) to print the matched filenames
with a NUL instead of a newline. This enables correct handling:
$ grep -lZ yes * | xargs -0 -I% echo 'handling file ===%==='
handling file ===hello
world===
And later:
$ grep -lZ yes * | xargs -0 mv -t /destination
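To check the 'grep -lZ | xargs -0 mv -t' combination end to end, here is a minimal sketch under the same assumptions as before (GNU grep, xargs, mv and find). The scratch directories and the file named 'plain' are invented for the test; they stand in for the current directory and /destination.

```shell
set -eu
# Hypothetical sandbox: src holds the candidates, dest stands in for /destination.
work=$(mktemp -d)
mkdir "$work/src" "$work/dest"
# One matching file whose name contains a newline, and one non-matching file.
echo yes > "$work/src/$(printf 'hello\nworld')"
echo no > "$work/src/plain"

# grep -lZ ends each matching file name with NUL; xargs -0 passes every
# name to mv as a single argument. '--' guards against names starting with '-'.
(cd "$work/src" && grep -lZ yes -- * |
  xargs -0 --no-run-if-empty mv -t "$work/dest")

# Count files by printing one 'x' per file, which is safe with newlines in names.
moved=$(find "$work/dest" -type f -printf x | wc -c)
left=$(find "$work/src" -type f -printf x | wc -c)
echo "moved=$moved left=$left"
rm -rf "$work"
```

One file should be moved and one left behind, even though the moved file's name spans two lines.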
Hope this helps,
regards,
- assaf