guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Deduplication performance


From: Ludovic Courtès
Subject: Deduplication performance
Date: Sat, 27 Sep 2014 17:43:18 +0200
User-agent: Gnus/5.130011 (Ma Gnus v0.11) Emacs/24.3 (gnu/linux)

Hello Guix of the hackathon!  :-)

Every time core packages like libc or Bash are changed, we end up
rebuilding (or re-downloading) the world, and storing multiple copies of
mostly-identical packages.  The daemon implements a simple deduplication
strategy, where identical files found in the store are hard-linked
together (via the /gnu/store/.links directory), so as to achieve
single-instance storage.

I’ve upgraded my system to current master (with the Bash fix), and I
still have a bunch of old profile generations, so I wanted to see how
efficient deduplication is.  Some characteristics of my system:

--8<---------------cut here---------------start------------->8---
$ ls -ld /gnu/store/*-bash-4.3 /gnu/store/*-bash-4.3.25 | wc -l
9
$ ls -ld /gnu/store/*-glibc-2.{20,19} | wc -l
4
$ guix package -I | wc -l
178
$ guix package --list-generations | grep ^Gen | wc -l
31
--8<---------------cut here---------------end--------------->8---

The attached script has allowed me to see how much deduplication is
taking place in my store.

The first question is: how much disk space is actually saved by
deduplication?

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (saved-space (links))
$43 = 15128046966
scheme@(guile-user)> (/ $43 GiB 1.)
$44 = 14.089091649278998
--8<---------------cut here---------------end--------------->8---

That’s 14 GiB saved by deduplication (!).

We can plot the cumulative distribution function (CDF) of the number of
hard links or each deduplicated file:

PNG image

It’s a bit hard to read, but we see that most than half of the files
under /gnu/store/.links have 1 or 2 hard links, and 80% have 4 hard
links or more.

The CDF of the size of deduplicated files is like this:

PNG image

Half of the deduplicated files are 4 KiB or less, and 90% of the
deduplicated files are between 0 and 64 KiB.

The code for that is attached below; it uses Andy’s cool Guile-Charting.

Happy hacking!

Ludo’.

(use-modules (charting)
             ((guix store) #:select (%store-prefix))
             (ice-9 ftw)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-9))

(define-record-type <deduplicated-file>
  (deduplicated-file name size links)
  deduplicated-file?
  (name    deduplicated-file-name)
  (size    deduplicated-file-size)
  (links   deduplicated-file-link-count))

(define %links-directory
  (string-append (%store-prefix) "/.links"))

(define (links)
  "Return a list of <deduplicated-file>."
  (file-system-fold (const #t)
                    (lambda (file stat result)      ;leaf
                      (cons (deduplicated-file file
                                               (stat:size stat)
                                               (stat:nlink stat))
                            result))
                    (lambda (directory stat result) ;down
                      result)
                    (lambda (directory stat result) ;up
                      result)
                    (const #f)                      ;skip
                    (lambda (file stat errno result)
                      (error "i/o error" file errno))
                    '()
                    %links-directory
                    lstat))

(define KiB (expt 2 10))
(define MiB (* KiB KiB))
(define GiB (* KiB MiB))

(define (saved-space files)
  "Return the total amount of saved space given FILES, a list of
<deduplicated-file>."
  (fold (lambda (df result)
          (match df
            (($ <deduplicated-file> name size links)
             (+ result (* size (- links 1))))))
        0
        files))

(define (cumulative-distribution files property)
  "Return a list of (VALUE . COUNT) pairs representing the number of FILES
whose PROPERTY is VALUE or less."
  (define (file<? df1 df2)
    (< (property df1) (property df2)))

  (fold (lambda (df result)
          (match result
            (((value . count) . rest)
             (let ((current (property df)))
               (if (= value current)
                   (alist-cons value (+ count 1) rest)
                   (alist-cons current (+ count 1) result))))
            (_
             (alist-cons (property df) 1 result))))
        '()
        (sort files file<?)))

(define* (plot-distribution distribution output
                            #:key subtitle max-x group-name x-axis-label)
  "Plot DISTRIBUTION, and produce file OUTPUT."
  (define (log2 n)
    (let loop ((result 1))
      (if (zero? (ash n (- result)))
          (- result 1)
          (loop (+ 1 result)))))

  (define (format-log2-tick tick)
    ;; (string-append "2^"
    ;;                (number->string (log2 (inexact->exact tick))))
    (number->string (inexact->exact tick)))

  (define (adjust-items total)
    (lambda (x)
      (match x
        ;; XXX: Filter out the two cases that would give us a numerical
        ;; overflow.
        ((0 . _) #f)
        ((1 . _) #f)
        ((value . count)
         (and (or (not max-x) (< value max-x))
              (cons value (* 100. (/ count total))))))))

  (match distribution
    (((_ . total) . rest)
     (let ((percent (filter-map (adjust-items total) distribution)))
       (make-scatter-plot #:title (string-append "cumulative distribution by "
                                                 subtitle)
                          #:data `((,group-name ,@percent))
                          #:x-axis-label x-axis-label
                          #:y-axis-label "%"
                          #:tick-label-formatter format-log2-tick
                          #:log-x-base 2
                          #:min-x 1
                          #:max-y 101
                          #:write-to-png output)))))

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]