[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] git-download: Speed up 'git-predicate'.
From: |
Christopher Baines |
Subject: |
Re: [PATCH] git-download: Speed up 'git-predicate'. |
Date: |
Mon, 19 Jun 2017 08:24:24 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 |
On 07/06/17 13:52, Ludovic Courtès wrote:
> Christopher Baines <address@hidden> skribis:
>
>> Adjust 'git-predicate' to use data structures that perform better when used
>> with git repositories with a large number of files.
>>
>> Previously when matching either a regular file or directory, 'git-predicate'
>> would search a list with a length equal to the number of files in the
>> repository. As a search operation happens for roughly every file in the
>> repository, this meant that the time taken to use 'git-predicate' to traverse
>> all the files in a repository was roughly exponential with respect to the
>> number of files in the repository.
>>
>> Now, for matching regular files or symlinks, 'git-predicate' uses a vhash
>> using the inode value as the key. This should perform roughly in constant
>> amount of time, instead of linear with respect to the number of files in the
>> repository.
>>
>> For matching directories, 'git-predicate' now uses a tree structure stored in
>> association lists. To check if a directory is in the tree, the tree is
>> traversed from the root. The time complexity of this depends on the shape of
>> the tree, but it should be an improvement on searching through the list of
>> all
>> files.
>
> Great, more than welcome it seems. :-)
>
> Do you know how much the inode optimization vs. the tree structure
> contributes to the performance.
I've got some more data. I ran the test script for smart-answers, and
let it complete this time:
real 97m21.291s
user 120m50.400s
sys 0m31.020s
Just applying the inode optimization gives this result:
real 90m28.116s
user 100m44.784s
sys 0m18.524s
I'm going to assume that using the tree structure for directories makes
up the rest of the difference. This will vary between repositories
though, I think smart answers has a unusually large directory to file ratio.
>> * guix/git-download.scm (git-predicate): Use different data structures to
>> speed up 'git-predicate' with a large number of files.
>
> [...]
>
>> + (define (create-directory-tree files)
>> + (define (directory-lists->tree directory-lists)
>> + (map (lambda (top-level-dir)
>> + (cons top-level-dir
>> + (directory-lists->tree
>> + (filter-map
>> + (lambda (directory-list)
>> + (if (eq? (length directory-list) 1)
>> + #f
>> + (cdr directory-list)))
>> + ;; Find all the directory lists under this
>> top-level-dir
>> + (filter
>> + (lambda (directory-list)
>> + (equal? (car directory-list)
>> + top-level-dir))
>> + directory-lists)))))
>> + (delete-duplicates
>> + (map car directory-lists))))
>> +
>> + (directory-lists->tree
>> + (filter-map (lambda (path)
>> + (let ((split-path (string-split path #\/)))
>> + ;; If this is a file in the top of the repository?
>> + (if (eq? (length split-path) 1)
>> + #f
>> + ;; drop-right to remove the filename, as it's
>> + ;; just the directory tree that's important
>> + (drop-right (string-split path #\/) 1))))
>> + files)))
>> +
>> + (define (directory-in-tree? directory tree)
>> + (define (directory-list-in-tree? directory-list tree)
>> + (if (eq? (length directory-list) 1)
>> + (list? (member (car directory-list)
>> + (map car tree)))
>> + (and=> (find (match-lambda
>> + ((top-level-dir . subtree)
>> + (equal? top-level-dir
>> + (car directory-list))))
>> + tree)
>> + (match-lambda
>> + ((top-level-dir . subtree)
>> + (directory-list-in-tree? (cdr directory-list)
>> + subtree))))))
>> +
>> + (directory-list-in-tree? (string-split directory #\/)
>> + tree))
>
> Note that ‘length’ and ‘list?’ are O(n). I think ‘directory-in-tree?’
> should be written using ‘match’, which would avoid that altogether.
I've sent an updated patch now, and I think I've made some progress
towards this.
> Likewise, the (map car …) call for ‘match’. :-)
I'm stuck on this bit though, in the latest patch it reads:
(list? (member top-directory (map car tree))
The list? call is just to turn the list or #f returned by member to #t
or #f. The (map car tree) converts the tree to a list of top level
directories. This bit of code is used when the last directory in the
input list has been reached (e.g. when checking for foo/bar/baz
top-directory would be baz) so the last check to perform is to check if
baz exists at the current level of the tree. Any suggestions on
restructuring this?
> I find the tree implementation hard to grasp. Perhaps it would help to
> move it outside of the ‘git-predicate’ function and perhaps decompose it
> a bit more? Thoughts?
I've moved it directly above git-predicate for now.
>> + (inodes-vhash (alist->vhash
>> + (map
>> + (lambda (file)
>> + (let ((stat
>> + (lstat (string-append directory "/"
>> file))))
>> + (cons (stat:ino stat) (stat:dev stat))))
>> + files)))
>
> I would call it ‘inodes’ simply. Also, we could use ‘list->set’ from
> (guix sets) here.
I've made the inodes-vhash -> inodes rename, but I was undecided about
using (guix sets), is there a reason you recommended it?
Thnaks for your review,
Chris
signature.asc
Description: OpenPGP digital signature