[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Deduplication performance
From: |
Ludovic Courtès |
Subject: |
Deduplication performance |
Date: |
Sat, 27 Sep 2014 17:43:18 +0200 |
User-agent: |
Gnus/5.130011 (Ma Gnus v0.11) Emacs/24.3 (gnu/linux) |
Hello Guix of the hackathon! :-)
Every time core packages like libc or Bash are changed, we end up
rebuilding (or re-downloading) the world, and storing multiple copies of
mostly-identical packages. The daemon implements a simple deduplication
strategy, where identical files found in the store are hard-linked
together (via the /gnu/store/.links directory), so as to achieve
single-instance storage.
I’ve upgraded my system to current master (with the Bash fix), and I
still have a bunch of old profile generations, so I wanted to see how
efficient deduplication is. Some characteristics of my system:
--8<---------------cut here---------------start------------->8---
$ ls -ld /gnu/store/*-bash-4.3 /gnu/store/*-bash-4.3.25 | wc -l
9
$ ls -ld /gnu/store/*-glibc-2.{20,19} | wc -l
4
$ guix package -I | wc -l
178
$ guix package --list-generations | grep ^Gen | wc -l
31
--8<---------------cut here---------------end--------------->8---
The attached script has allowed me to see how much deduplication is
taking place in my store.
The first question is: how much disk space is actually saved by
deduplication?
--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (saved-space (links))
$43 = 15128046966
scheme@(guile-user)> (/ $43 GiB 1.)
$44 = 14.089091649278998
--8<---------------cut here---------------end--------------->8---
That’s 14 GiB saved by deduplication (!).
We can plot the cumulative distribution function (CDF) of the number of
hard links or each deduplicated file:
It’s a bit hard to read, but we see that most than half of the files
under /gnu/store/.links have 1 or 2 hard links, and 80% have 4 hard
links or more.
The CDF of the size of deduplicated files is like this:
Half of the deduplicated files are 4 KiB or less, and 90% of the
deduplicated files are between 0 and 64 KiB.
The code for that is attached below; it uses Andy’s cool Guile-Charting.
Happy hacking!
Ludo’.
(use-modules (charting)
((guix store) #:select (%store-prefix))
(ice-9 ftw)
(ice-9 match)
(srfi srfi-1)
(srfi srfi-9))
(define-record-type <deduplicated-file>
(deduplicated-file name size links)
deduplicated-file?
(name deduplicated-file-name)
(size deduplicated-file-size)
(links deduplicated-file-link-count))
(define %links-directory
(string-append (%store-prefix) "/.links"))
(define (links)
"Return a list of <deduplicated-file>."
(file-system-fold (const #t)
(lambda (file stat result) ;leaf
(cons (deduplicated-file file
(stat:size stat)
(stat:nlink stat))
result))
(lambda (directory stat result) ;down
result)
(lambda (directory stat result) ;up
result)
(const #f) ;skip
(lambda (file stat errno result)
(error "i/o error" file errno))
'()
%links-directory
lstat))
(define KiB (expt 2 10))
(define MiB (* KiB KiB))
(define GiB (* KiB MiB))
(define (saved-space files)
"Return the total amount of saved space given FILES, a list of
<deduplicated-file>."
(fold (lambda (df result)
(match df
(($ <deduplicated-file> name size links)
(+ result (* size (- links 1))))))
0
files))
(define (cumulative-distribution files property)
"Return a list of (VALUE . COUNT) pairs representing the number of FILES
whose PROPERTY is VALUE or less."
(define (file<? df1 df2)
(< (property df1) (property df2)))
(fold (lambda (df result)
(match result
(((value . count) . rest)
(let ((current (property df)))
(if (= value current)
(alist-cons value (+ count 1) rest)
(alist-cons current (+ count 1) result))))
(_
(alist-cons (property df) 1 result))))
'()
(sort files file<?)))
(define* (plot-distribution distribution output
#:key subtitle max-x group-name x-axis-label)
"Plot DISTRIBUTION, and produce file OUTPUT."
(define (log2 n)
(let loop ((result 1))
(if (zero? (ash n (- result)))
(- result 1)
(loop (+ 1 result)))))
(define (format-log2-tick tick)
;; (string-append "2^"
;; (number->string (log2 (inexact->exact tick))))
(number->string (inexact->exact tick)))
(define (adjust-items total)
(lambda (x)
(match x
;; XXX: Filter out the two cases that would give us a numerical
;; overflow.
((0 . _) #f)
((1 . _) #f)
((value . count)
(and (or (not max-x) (< value max-x))
(cons value (* 100. (/ count total))))))))
(match distribution
(((_ . total) . rest)
(let ((percent (filter-map (adjust-items total) distribution)))
(make-scatter-plot #:title (string-append "cumulative distribution by "
subtitle)
#:data `((,group-name ,@percent))
#:x-axis-label x-axis-label
#:y-axis-label "%"
#:tick-label-formatter format-log2-tick
#:log-x-base 2
#:min-x 1
#:max-y 101
#:write-to-png output)))))
signature.asc
Description: PGP signature
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- Deduplication performance,
Ludovic Courtès <=