coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Dictionary data structures for path name strings


From: Pádraig Brady
Subject: Re: Dictionary data structures for path name strings
Date: Wed, 21 Sep 2016 13:45:19 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 21/09/16 09:53, Stefan Vargyas wrote:
> Dear GNU'ers,
> 
> Early at the start of thread [2] by Rasmus Borup Hansen, I was thinking
> of playing a bit around the problem of online dictionary data structures
> for path name strings, using Bentley and Sedgewick's [1] ternary search
> trees (or tries).
> 
> Path name sets obtained e.g. from 'find' have an interesting property which
> TSTs can make use of: the strings belonging to such sets share a great deal
> of common prefixes. Later, on the same discussion thread, Pádraig Brady [3]
> noted this fact too.
> 
> However, only this summer, I actually did give substance to these ideas [4].
> 
> The results obtained so far are not spectacular, yet not negligible either:
> for two sets of path names, the gains in *node* memory consumption relative
> to (1) the size of the input and to (2) the node memory usage of Gnulib's
> generic hash table implementation (which is currently used by 'cp') were as
> shown by the following table:
> 
>   ----- ----------------- ----- -----
>    set       platform      (1)   (2)
>   ----- ----------------- ----- -----
>     1    32-bit pointers   35%   42%
>     2    32-bit pointers   27%   36%
>   ----- ----------------- ----- -----
>     1    32-bit offsets    23%   38%
>     2    32-bit offsets    13%   32%
>   ----- ----------------- ----- -----

Cool. So about a third less mem.
I presume that's dependent on the depth of the paths in question?

> The first of the path name sets has cardinality 396K and size 26M, while
> the second one -- cardinality 811K and size 47M.
> 
> The '32-bit offset' platform is to be understood as a 64-bit platform where
> Path-Set is using 32-bit offsets instead of 64-bit pointers for referencing
> the node structures within the node structure themselves. 
> 
> Read the story to its full extent in the file 'doc/path-set.txt' of [4].
> 
> I will gladly provide you guys with more insights into the innards of the
> implementation code and the choices made therein.
> 
> Happy hacking!
> 
> Sincerely,
> 
> Stefan Vargyas.
> 
> 
> References:
> 
> [1] Jon L. Bentley, Robert Sedgewick:
>     Fast Algorithms for Sorting and Searching Strings
>     8th Symposium on Discrete Algorithms, 1997, 360-369.
> 
>     http://www.cs.princeton.edu/~rs/strings/paper.pdf
> 
> [2] Rasmus Borup Hansen: My experience with using cp to copy a lot of files
>       (432 millions, 39 TB)
>     Coreutils Archives, 11 Aug 2014
> 
>     https://lists.gnu.org/archive/html/coreutils/2014-08/msg00012.html
> 
> [3] Pádraig Brady: Re: My experience with using cp to copy a lot of files
>       (432 millions, 39 TB)
>     Coreutils Archives, 14 Sep 2014
> 
>     http://lists.gnu.org/archive/html/coreutils/2014-09/msg00020.html:
> 
>       BTW there is a lot of redundancy in all the stored paths,
>       which could be improved by using a prefix compression method.
> 
> [4] Path-Set: Dictionary Data Structures for Path Name Strings
> 
>     http://www.nongnu.org/path-set/

404 ?

>     For about three weeks, Savannah administation seems to have stalled its
>     activity. The second package I submitted for Savannah's approval can be
>     obtained from the site by:
> 
>       $ wget -O path-set-Sep-16-2016.tar.bz2 \
>       https://savannah.gnu.org/support/download.php?file_id=38530

Lots of info here. Good work!
Having this available as an abstracted data structure seems very useful.
Would it make sense to include it in gnulib for easier
integration with various projects?

thanks!
Pádraig



reply via email to

[Prev in Thread] Current Thread [Next in Thread]