[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Counting words, fast!
From: |
Dennis Williamson |
Subject: |
Re: Counting words, fast! |
Date: |
Fri, 19 Mar 2021 13:30:16 -0500 |
On Fri, Mar 19, 2021, 12:31 PM Koichi Murase <myoga.murase@gmail.com> wrote:
> 2021年3月17日(水) 5:32 Jesse Hathaway <jesse@mbuki-mvuki.org>:
> >
> > Ben Hoyt wrote a fun blog post on counting words,
> > https://benhoyt.com/writings/count-words, I wrote a Bash version with
> > as many optimizations as I could find[1], I would love any suggestions
> > on any substantial optimizations, the code is attached. The full
> > repo and test file is here: https://github.com/benhoyt/countwords
> >
> > Yours kindly, Jesse
> >
> > [1]: https://github.com/benhoyt/countwords/pull/66
>
> Maybe, it is a bit late but some comments.
>
> * Bash indexed arrays are implemented by a sparse linked list (of
> key-value pairs), i.e., Bash indexed arrays are already a kind of
> sorted dictionary of (integer-)key & value. So, you don't have to
> construct the associative array and sort its keys by yourself, but you
> can just construct an indexed array and print it by "${array[@]}". A
> non-trivial thing is that the problem requires inputting the result in
> descending order with respect to the frequency. For this one needs to
> revert the array elements.
>
> * Usually shorter variable names are faster in Bash scripts. The
> speedup by this one is small for this script because this is not the
> bottleneck of this script, but there is indeed a difference.
>
> * Usually, `mapfile' is much faster than `read'. But somehow in this
> case, the implementation by `read -N 65536' seems to be faster than
> `mapfile' implementation. Because ${arr[*]} seems to have the
> quadratic time complexity O(N^2) under some conditions, there is a
> finite suitable size of a block to process at once which is something
> between 32768..65536 in this problem. `mapfile' cannot control the
> data size of the processed input through its options.
>
> * «while LANG=C IFS= read ...; do ...» is slow because Bash
> reinitializes the locales of the C library every time any shell
> variables LANG or LC_* is changed. One can instead set LANG outside
> the while statement and restore it after the while statement.
>
> I haven't thoroughly tested it, but here's my optimization (to the
> latest version using read -N 65536):
>
> --------------------
> #!/bin/bash
>
> # load
> declare -iA h
> set -f
> LANG=C E=
> until [[ $E ]]; do
> IFS= read -N 65536 -r a || E=1
> IFS= read -r b || E=1
> for w in ${a@L}${b@L}; do
> h[$w]+=1
> done
> done
>
> # construct outputs
> for w in "${!h[@]}"; do
> f=${h[$w]}
> o[f]+=$w' '$f$'\n'
> done
>
> # reverse
> ((i=0,j=${#o[@]}-1))
> while ((j>=0)); do r[i++]=${o[j--]}; done
>
> printf '%s' "${r[@]}"
> --------------------
>
> --
> Koichi
>
Your optimizations trim another >30% off the best times of the other Bash
versions.
>
- Re: Counting words, fast!, (continued)
- Re: Counting words, fast!, Greg Wooledge, 2021/03/16
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/16
- Re: Counting words, fast!, Dennis Williamson, 2021/03/16
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/17
- Re: Counting words, fast!, Dennis Williamson, 2021/03/17
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/17
- Re: Counting words, fast!, Greg Wooledge, 2021/03/17
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/17
Re: Counting words, fast!, Koichi Murase, 2021/03/19
- Re: Counting words, fast!,
Dennis Williamson <=
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/19
- Re: Counting words, fast!, Koichi Murase, 2021/03/19
- Re: Counting words, fast!, Koichi Murase, 2021/03/19
- Re: Counting words, fast!, Lawrence Velázquez, 2021/03/20
- Re: Counting words, fast!, Jesse Hathaway, 2021/03/22