Re: split behavior
From: Roger McNichols
Subject: Re: split behavior
Date: Mon, 14 Sep 2009 11:13:53 -0500 (CDT)
I found a machine with the old version of split.
home:~> uname -a
Linux home 2.2.13 #4 Thu May 8 23:11:31 CDT 2003 i686 unknown
home:~>
home:~> split --version
split (GNU textutils) 1.22
home:~>
Here's the result of
home:~> cat /var/log/messages | split -2 - /tmp/x.
It is not exactly as I recalled. Instead of adding zz the first time, it ends
the two-letter suffixes at yz and prefixes the next group with a single z,
and only then starts adding zz... Anyway:
x.aa
x.ab
x.ac
x.ad
x.ae
x.af
x.ag
x.ah
x.ai
x.aj
...
x.yv
x.yw
x.yx
x.yy
x.yz
x.zaaa
x.zaab
x.zaac
x.zaad
x.zaae
x.zaaf
...
x.zyzt
x.zyzu
x.zyzv
x.zyzw
x.zyzx
x.zyzy
x.zyzz
x.zzaaaa
x.zzaaab
x.zzaaac
x.zzaaad
...
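
The pattern in the listing can be written down as a small generator. This is only a sketch inferred from the file names above, not taken from the textutils source: it assumes each group gets one more leading 'z' and one more trailing letter than the previous one, and that the first letter after the 'z' prefix only runs from 'a' to 'y'. The name old_style_suffixes and the base_len parameter are made up for illustration.

```python
from itertools import count, islice, product
from string import ascii_lowercase


def old_style_suffixes(base_len=2):
    """Yield suffixes in the order inferred from the listing (hypothetical helper)."""
    for level in count():
        prefix = "z" * level          # "", "z", "zz", ...
        body_len = base_len + level   # 2, 3, 4, ... letters after the prefix
        # The first body letter stops at 'y'; a leading 'z' marks the next group.
        for first in ascii_lowercase[:-1]:
            for rest in product(ascii_lowercase, repeat=body_len - 1):
                yield prefix + first + "".join(rest)


if __name__ == "__main__":
    # Print the suffixes around the first rollover (yz -> zaaa).
    for suffix in islice(old_style_suffixes(), 648, 652):
        print("x." + suffix)
```

Under these assumptions the names stay in plain lexical order across each rollover, so a shell glob such as x.* would list the pieces in the order they were written.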
___________________________
Roger J. McNichols, Ph.D.
Chief Scientist
BioTex, Inc.
8058 El Rio St.
Houston, TX 77054
713.741.0111 (o)
713.741.0122 (f)
832.338.4371 (m)
----- Pádraig Brady <address@hidden> wrote:
> Roger McNichols wrote:
> >
> > Thanks for the feedback.
> >
> >
> >> Do you mean select the appropriate suffix length based on size,
> >> or do you mean the zzaa, zzab scheme? The former wouldn't
> >> help when processing a pipe for example so I'd probably
> >> stick with the latter method for consistency.
> >
> > Currently, split (at least 5.2.1) DOES pick the suffix size based on the file
> > size when used as "split -<#> file" and the file size is known.
>
> I checked the repo and can't see code supporting that.
> Perhaps you've got a locally modified `split` ?
>
> > But as you point out, if the file is a pipe you may still run out of suffixes
> > if the file size changes after invocation of split, or if split is used in
> > the "split -<#> -" (reads stdin) mode, a 2-letter suffix is all you get unless
> > you specify a length.
> > Now I suppose that maybe the discussion went something like:
> > >> what if an unknown-sized input stream is the input?
> > >> well then just use -a 100 and you will never* run out...
> > (*note 26^100 is pretty big)
> >
> > Anyway, I propose to develop a new commandline option that would invoke the
> > 'old' suffix formation behavior. And even though aa ... zaa ... zzaa ...
> > instead of aa .. zzaa ... zzzzaa (as well as many other schemes) would work
> > just as well,
>
> Bzzt. zaa would sort before zb
> In general one needs to append 'z'*suffix_len which would default to 2 if not
> specified.
> One would need to consider this behaviour with digit suffixes also.
>
> > I propose to utilize the 'old' one for the added advantage of reverse
> > compatibility.
>
> OK. While I like the scheme it would be really nice to see what we're being
> compatible with. I.E. it would be great if you found where the old split you
> used came from.
>
> > That way any code that relied on the old scheme for counting would be able
> > to be re-functionalized with a simple addition of a commandline argument.
> >
> >> if the suffix len is specified and is too small.
> >> Otherwise we use the zzaa, zzab method as described before.
> >
> > This is also a good idea, but it might override the user's intention, which
> > could be to use split to detect a file that was more than 676*N lines long or
> > to use it with the -1 option and only write out the first 676 lines of the
> > input
>
> That's exceedingly unlikely. It would be great to have the "unlimited"
> behaviour by default I think. As mentioned before we could have the "limited"
> behaviour if POSIXLY_CORRECT is set.
>
> > (who knows why, but we're fixing a fix that broke something else, right?)
>
> I can't see the code for the old behaviour so I wouldn't assume that.
>
> cheers,
> Pádraig.
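
The sorting objection quoted above ("zaa would sort before zb") can be checked with a short script. This is a sketch under assumptions, not either participant's implementation: the "naive" list uses every two-letter suffix and then continues with 'z' plus two letters, while the "reserved" list is one reading of the 'z'*suffix_len idea, holding back "zz" and using it as the prefix for the next group. The helper names fixed and first_out_of_order are hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def fixed(n):
    """All n-letter lowercase suffixes in generation order."""
    return ["".join(p) for p in product(ascii_lowercase, repeat=n)]


def first_out_of_order(names):
    """Return the first adjacent pair that breaks lexical order, or None."""
    for earlier, later in zip(names, names[1:]):
        if later < earlier:
            return earlier, later
    return None


# Naive extension: aa .. zz, then zaa, zab, ...
naive = fixed(2) + ["z" + s for s in fixed(2)]

# Assumed 'z' * suffix_len extension with suffix_len = 2: aa .. zy, then zzaa, zzab, ...
reserved = fixed(2)[:-1] + ["zz" + s for s in fixed(2)[:-1]]

if __name__ == "__main__":
    print(first_out_of_order(naive))
    print(first_out_of_order(reserved))
```

In the naive list the name zaa is generated after zz yet sorts between za and zab, which is the breakage being pointed out; the reserved list has no such pair because nothing generated earlier begins with "zz".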