[Monotone-devel] database compaction

monotone-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] database compaction

From:	Markus Schiltknecht
Subject:	[Monotone-devel] database compaction
Date:	Wed, 17 Oct 2007 21:35:35 +0200
User-agent:	Icedove 1.5.0.12 (X11/20070730)

Hi,

I've recently been trying to implement some of the ideas from thedatabase compaction wiki page [1]. I'd like some feedback beforecontinuing. You can find code in the new branchn.v.m.experiment.db-compaction. Note, that there are currently *threeheads* for different things. Please do not merge them, yet!



== Compact Heights ==

As the wiki says, uleb128 wouldn't work too well, because compressedheights couldn't be compared with memcmp(). As I've had to do a lot withJPEGs recently, I've stumbled across Huffman encoding. Given ahand-crafted Huffman Tree, the memcmp() property could be maintained.But even though that idea fascinated me, and I've already begun tocompile statistics about distribution of height values, I had to realizethat the potential gain is very limited. As an example, my currentworking database is 618.1 MiB. Trimming all heights to a single-bytevalue - which is way too optimistic - results in a 612.8 MiB file.Maximum savings: less than 1% :-( Thus overly complicated things likeHuffman are definitely not worth the effort. I didn't write any code here.



== Putting heights in the revisions table ==

I'm not sure why exactly heights haven't been stored in the revisionstable from day one on, but I certainly think it's a good idea. It saveslookups and increases locality in case one needs the heights. And Idon't think it hurts much if you don't. To check confirm or reject theseassumptions, please refer to revision 45501e62.

Does anybody know any hard facts about sqlite3's per-row header size?There's the row id, which is a 64bit. If that's the per-row header sizewe store 8 bytes of headers and 40 bytes of hex hashes. On my workingdatabase with 54311 revisions, that should give a 2.5 MiB, plus yetanother index on heights(revision) we save.

I didn't complete the schema migration, but some SQL code is alreadythere. Manual migration results in 610.8 MiB, a saving of about 1.2%.Not that much either, but I'd vote for that small change anyway.



== Storing hashes in binary ==

Now, this one could really save us space and disc I/O. Every hashcurrently takes 40 bytes because it's stored in hex. Binary hashes couldsave 50% for every hash in the database. And we maintain lots of hashesthere!

I've added encode_hexenc() and decode_hexenc() calls in database.ccwhere appropriate. I didn't change the internal type of revision_id. Sothis really only changes the database storage format. One could possiblyremove lots of these hex conversions if we were using binary hashesinternally, too.

Anyway, please see revision fd4fcc55, which already passes most testsand does a perfect schema migration. I think I've converted pretty muchall hashes. This time no estimates, only bare numbers: 565.6 MiB, 8.4%saving here.

BTW: querying the database needs some more keystrokes to not cripple theterminal with binary output. But it's simple enough, IMO. An example:


  SELECT hex(id) FROM revisions
    WHERE id = x'fd4fcc55bcd79b26fa77ebb12e07b89b49e6826b';


== Using rowids or lookaside tables ==

The wiki has several paragraphs about using rowids or lookaside tables.I don't particularly like that idea, because a) it increases the numberof lookups needed, b) savings are dubious and c) it is database specific(okay, that probably doesn't count...). I might rethink b) as soon as weswitch to longer hashes, but 20 bytes hardly justify the extra rowheaders (and indices) necessary.



== A better, combined index for revision_certs ==

As Ralf and Lapo recently figured out, the indices on revision_certsaren't optimal. In revision 212d7e28 I've replaced the combined uniqueindex on (name, id, value, keypair, signature) with one on (name, value,id, keypair, signature). That allows us to drop the extra index on(name, value) called revision_certs__name_value index.

That change saves us about 2.7% at 600.9 MiB. Complete schema migrationis already included.

As a quick check, I've again run the commit and annotate benchmarks. Seebelow for the results. For the commit benchmark, all three variants showsmall performance gains, mostly below 5% - but only rarely worse thanreference. The annotate benchmark gives a slightly different picture:while 'better-index' and 'included-heighs' aren't any faster, ratherslightly slower, the binary hashes make for a > 10% gain. (I'd haveexpected the 'included-heights' to gain on annotate, because I thoughtfewer seeks were necessary. But as long as the database fits in memory,that might not really be true. Other explanations?).

I'd like to merge these changes and implement a single schema migrationstep. That shouldn't be too much work anymore.


What do you think? Any objections, comments, questions?

Regards

Markus


[1]: http://www.venge.net/mtn-wiki/DatabaseCompaction

[2]: the benchmark results. These were the monotone revisions measured:

ref:              4c5b60a4cb72f99d6ec676bb80b7cdc041099685
included_heights: 45501e62d710ba58d60fcbd203966dc653911a17
better_index:     212d7e289d89fac869924e7753a06871fb73c204
bin_hashes:       fd4fcc55bcd79b26fa77ebb12e07b89b49e6826b

examples/commit.sh, for the 5 MB test case I've run 20 iterations, for50 MB 10, and 500 only 3 iters:

address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 5MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv5MBtxt-ref-memtime.csv 5MBtxt-included_heights-memtime.csv 5MBtxt-better_index-memtime.csv 5MBtxt-bin_hashes-memtime.csv p

commit-avg-resident-MiB                   3.47                                
3.47                            3.49                          3.46  0.99
    commit-avg-size-MiB                   9.41                                
9.43                            9.40                          9.40  0.99
commit-max-resident-MiB                   5.66                                
5.62                            5.65                          5.66  0.00
    commit-max-size-MiB                  10.95                               
10.93                           10.94                         10.96  0.00
     commit-num-samples                 180.35                              
174.95                          182.05                        181.30  0.69
     commit-system-time                   0.03                                
0.02                            0.02                          0.02  0.11
       commit-user-time                   0.44                                
0.43                            0.44                          0.43  0.65
       commit-wall-time                   0.98                                
0.94                            0.97                          0.98  0.64

address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 5MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv5MBrand-ref-memtime.csv 5MBrand-included_heights-memtime.csv 5MBrand-better_index-memtime.csv 5MBrand-bin_hashes-memtime.csv p

commit-avg-resident-MiB                    3.50                                 
3.53                             3.54                           3.59  0.87
    commit-avg-size-MiB                    9.45                                 
9.45                             9.49                           9.51  0.90
commit-max-resident-MiB                    5.73                                 
5.71                             5.73                           5.74  0.05
    commit-max-size-MiB                   11.02                                
11.00                            11.02                          11.04  0.00
     commit-num-samples                  187.60                               
180.85                           179.45                         178.05  0.57
     commit-system-time                    0.02                                 
0.02                             0.03                           0.02  0.64
       commit-user-time                    0.46                                 
0.45                             0.45                           0.44  0.01
       commit-wall-time                    1.01                                 
0.97                             0.96                           0.96  0.64

address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 50MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv50MBtxt-ref-memtime.csv 50MBtxt-included_heights-memtime.csv 50MBtxt-better_index-memtime.csv 50MBtxt-bin_hashes-memtime.csv p

commit-avg-resident-MiB                    2.87                                 
2.79                             2.67                           2.67  0.21
    commit-avg-size-MiB                    9.17                                 
8.90                             9.01                           8.72  0.07
commit-max-resident-MiB                    5.54                                 
5.52                             5.53                           5.55  0.03
    commit-max-size-MiB                   10.81                                
10.82                            10.83                          10.85  0.29
     commit-num-samples                  307.00                               
278.20                           298.20                         291.40  0.41
     commit-system-time                    0.02                                 
0.03                             0.03                           0.03  0.61
       commit-user-time                    0.47                                 
0.46                             0.46                           0.44  0.04
       commit-wall-time                    1.65                                 
1.56                             1.68                           1.58  0.61

address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 50MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv50MBrand-ref-memtime.csv 50MBrand-included_heights-memtime.csv 50MBrand-better_index-memtime.csv 50MBrand-bin_hashes-memtime.csv p

commit-avg-resident-MiB                     2.72                                
  2.85                              2.70                            2.90  0.16
    commit-avg-size-MiB                     8.83                                
  9.02                              8.63                            9.04  0.14
commit-max-resident-MiB                     5.76                                
  5.71                              5.73                            5.77  0.02
    commit-max-size-MiB                    10.93                                
 10.93                             10.94                           10.95  0.38
     commit-num-samples                   284.60                                
285.00                            284.20                          284.00  1.00
     commit-system-time                     0.03                                
  0.04                              0.03                            0.04  0.85
       commit-user-time                     0.46                                
  0.47                              0.47                            0.44  0.38
       commit-wall-time                     1.64                                
  1.55                              1.59                            1.56  0.80
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# 
../compare.py 
500MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv
                        500MBtxt-ref-memtime.csv 
500MBtxt-included_heights-memtime.csv 500MBtxt-better_index-memtime.csv 
500MBtxt-bin_hashes-memtime.csv     p
commit-avg-resident-MiB                     9.62                                
  9.62                              9.80                            9.65  0.84
    commit-avg-size-MiB                    15.56                                
 15.54                             15.73                           15.60  0.80
commit-max-resident-MiB                    20.76                                
 20.76                             20.79                           20.33  0.69
    commit-max-size-MiB                    26.30                                
 26.26                             26.15                           26.04  0.78
     commit-num-samples                   512.00                                
505.00                            481.33                          482.00  0.34
     commit-system-time                     0.06                                
  0.07                              0.08                            0.08  0.13
       commit-user-time                     1.35                                
  1.34                              1.32                            1.32  0.06
       commit-wall-time                     2.70                                
  2.71                              2.54                            2.57  0.32
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# 
../compare.py 
500MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv
                        500MBrand-ref-memtime.csv 
500MBrand-included_heights-memtime.csv 500MBrand-better_index-memtime.csv 
500MBrand-bin_hashes-memtime.csv     p
commit-avg-resident-MiB                     10.00                               
    9.56                              10.02                             9.93  
0.59
    commit-avg-size-MiB                     15.95                               
   15.52                              15.88                            15.85  
0.58
commit-max-resident-MiB                     22.55                               
   22.18                              22.43                            22.69  
0.90
    commit-max-size-MiB                     27.97                               
   27.69                              27.83                            28.18  
0.84
     commit-num-samples                    490.67                               
  499.67                             505.67                           477.33  
0.77
     commit-system-time                      0.09                               
    0.09                               0.07                             0.08  
0.23
       commit-user-time                      1.27                               
    1.27                               1.27                             1.28  
0.73
       commit-wall-time                      2.64                               
    2.67                               2.69                             2.55  
0.79

examples/annotate.sh, with 5 iterations for each binary. All binariesgot a copy of my working database, migrated to their schema.

address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/a-ref-memtime.csv
                          ref/a-ref-memtime.csv 
included-heights/a-ref-memtime.csv better-index/a-ref-memtime.csv 
bin-hashes/a-ref-memtime.csv     p
annotate-avg-resident-MiB                 20.35                              
20.27                          20.31                        19.78  0.05
    annotate-avg-size-MiB                 26.15                              
26.09                          26.11                        25.59  0.05
annotate-max-resident-MiB                 31.22                              
31.18                          31.21                        31.21  0.00
    annotate-max-size-MiB                 36.69                              
36.68                          36.68                        36.70  0.00
     annotate-num-samples               3193.00                            
3204.40                        3256.60                      2993.40  0.00
     annotate-system-time                  1.08                               
1.06                           1.07                         0.99  0.05
       annotate-user-time                  4.25                               
4.28                           4.24                         5.66  0.00
       annotate-wall-time                 18.55                              
18.66                          18.76                        16.83  0.00
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/b-ref-memtime.csv
                          ref/b-ref-memtime.csv 
included-heights/b-ref-memtime.csv better-index/b-ref-memtime.csv 
bin-hashes/b-ref-memtime.csv     p
annotate-avg-resident-MiB                 10.06                              
10.16                          10.28                         9.54  0.01
    annotate-avg-size-MiB                 15.96                              
16.08                          16.13                        15.45  0.01
annotate-max-resident-MiB                 16.71                              
16.87                          16.70                        15.34  0.00
    annotate-max-size-MiB                 22.16                              
22.32                          22.14                        20.85  0.00
     annotate-num-samples               1183.60                            
1290.20                        1232.80                      1034.00  0.01
     annotate-system-time                  0.24                               
0.24                           0.25                         0.21  0.28
       annotate-user-time                  0.82                               
0.82                           0.79                         0.97  0.00
       annotate-wall-time                  7.01                               
7.43                           7.07                         5.96  0.00
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/c-ref-memtime.csv
                          ref/c-ref-memtime.csv 
included-heights/c-ref-memtime.csv better-index/c-ref-memtime.csv 
bin-hashes/c-ref-memtime.csv     p
annotate-avg-resident-MiB                 11.15                              
11.43                          11.16                        11.14  0.66
    annotate-avg-size-MiB                 17.04                              
17.34                          17.05                        17.02  0.58
annotate-max-resident-MiB                 19.32                              
19.60                          19.31                        17.68  0.00
    annotate-max-size-MiB                 24.79                              
25.02                          24.72                        23.14  0.00
     annotate-num-samples               1438.40                            
1393.60                        1427.60                      1305.40  0.40
     annotate-system-time                  0.38                               
0.37                           0.38                         0.35  0.28
       annotate-user-time                  1.46                               
1.46                           1.47                         1.76  0.00
       annotate-wall-time                  8.42                               
8.18                           8.29                         7.19  0.10
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/d-ref-memtime.csv
                          ref/d-ref-memtime.csv 
included-heights/d-ref-memtime.csv better-index/d-ref-memtime.csv 
bin-hashes/d-ref-memtime.csv     p
annotate-avg-resident-MiB                 23.02                              
23.33                          23.12                        22.54  0.10
    annotate-avg-size-MiB                 28.84                              
29.16                          28.93                        28.37  0.09
annotate-max-resident-MiB                 32.78                              
32.57                          32.75                        32.39  0.00
    annotate-max-size-MiB                 38.16                              
38.07                          38.14                        37.88  0.00
     annotate-num-samples               3800.20                            
3878.00                        3903.20                      3529.40  0.01
     annotate-system-time                  0.95                               
1.04                           1.02                         0.98  0.23
       annotate-user-time                  7.52                               
7.55                           7.60                         8.98  0.00
       annotate-wall-time                 21.77                              
22.05                          22.19                        19.44  0.00

[Prev in Thread]

Current Thread

[Next in Thread]

[Monotone-devel] database compaction, Markus Schiltknecht <=
- Re: [Monotone-devel] database compaction, Thomas Moschny, 2007/10/17
  - Re: [Monotone-devel] database compaction, Markus Schiltknecht, 2007/10/18
- Re: [Monotone-devel] database compaction, Zack Weinberg, 2007/10/17
  - Re: [Monotone-devel] database compaction, Markus Schiltknecht, 2007/10/18
    - Re: [Monotone-devel] database compaction, Thomas Moschny, 2007/10/18
    - Re: [Monotone-devel] database compaction, Markus Schiltknecht, 2007/10/18
- Re: [Monotone-devel] database compaction, Nathaniel Smith, 2007/10/17
  - Re: [Monotone-devel] database compaction, Thomas Moschny, 2007/10/18
  - Re: [Monotone-devel] database compaction, Markus Schiltknecht, 2007/10/18

Prev by Date: Re: [Monotone-devel] cygwin "pre-0.37 manually operated" buildbot report
Next by Date: Re: [Monotone-devel] database compaction
Previous by thread: [Monotone-devel] cygwin "pre-0.37 manually operated" buildbot report
Next by thread: Re: [Monotone-devel] database compaction
Index(es):
- Date
- Thread