[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Monotone-devel] database compaction
From: |
Markus Schiltknecht |
Subject: |
Re: [Monotone-devel] database compaction |
Date: |
Thu, 18 Oct 2007 08:33:35 +0200 |
User-agent: |
Icedove 1.5.0.10 (X11/20070328) |
Hi,
Zack Weinberg wrote:
Paul Crowley had a suggestion for an encoding that would work and
might be simple enough to be worth implementing, even though (as you
say) the space savings are quite small relative to the total database
size.
I remember having seen some bit encoding. Do you have any pointers to
Paul's suggestion? A real, optimized Huffman thing would be, troublesome
because it's bit based, so the resulting bit string could be 11 bits
long and thus not end on byte boundaries. Can sqlite store bit strings
of a certain length efficiently?
Anyway, one could also come up with a simpler, byte based encoding.
Something like that:
height encoded
values representation
0 - 127 0xxxxxxx
128 - 16511 10xxxxxx xxxxxxxx
16512 - 2113664 110xxxxx xxxxxxxx xxxxxxxx
...
FYI: I've copied a sample height value distribution list below.
Another idea which just popped up: do we need to store the heights in a
single row per revision at all? It looks more like sqlite is loading it
all into memory once and then using SQL to query it seems rather expensive.
Wouldn't it be better if we stored all the heights in a single blob,
loading it all into memory in one step? That would allow much better
compression using libz, because lots of height sequences are repetitive.
Up to now, we need heights for annotate and log, right? Both presumably
use more that one rev_height per invocation...
== Using rowids or lookaside tables ==
The wiki has several paragraphs about using rowids or lookaside tables.
I don't particularly like that idea, because a) it increases the number
of lookups needed, b) savings are dubious and c) it is database specific
(okay, that probably doesn't count...). I might rethink b) as soon as we
switch to longer hashes, but 20 bytes hardly justify the extra row
headers (and indices) necessary.
We really do want to do this, but only in contexts where table A
stores a hash which is used as the sole key in a SELECT on table B.
For example: file_deltas.base, revision_ancestry.(parent,child),
rosters.id, roster_deltas.(id,base) and maybe revision_certs.id. In
that context, doing it reduces lookups, because what sqlite does if
asked to SELECT * FROM b WHERE id = 'hash' is look up 'hash' in the
autoindex for id, which tells it the rowid, and then do a second
lookup in the actual table.
Doh! Really? I see, sqlite is very much not like Postgres... That must
hurt as soon as those things don't fit into memory anymore, no?
Anyway, thank you for pointing this out. I'm still somewhat skeptical,
because your example only shows the winning side of the coin. There are
other queries, for example getting the parent of a revision. Currently,
that's an index scan on the autoindex for revision_ancestry(child)
(followed by the rowid lookup). If we use rowids in revision_ancestry,
we'd first have to lookup the child revision's rowid in revisions, then
lookup the rowid of the parent revision in revision_ancestry to finally
get the real revision id from revisions by a rowid lookup. That seems
rather expensive, or am I missing something because of thinking in terms
of traditional databases?
Regards
Markus
Distribution of the different height values in my monotone database at
some point in time. Others, like the xaraya, openembedded or botan
repositories were similar.
height number of
value occurrences
0, 57888
1, 25721
2, 5737
3, 11909
4, 6889
5, 620
6, 360
7, 501
8, 792
9, 3329
10, 678
11, 555
12, 420
13, 7606
14, 456
15, 296
16, 162
17, 259
18, 504
19, 294
20, 252
21, 225
22, 275
23, 6014
24, 4675
25, 112
26, 310
27, 1761
28, 388
29, 97
30, 6838
31, 336
32, 502
33, 89
34, 200
35, 538
36, 667
37, 391
38, 75
39, 78
40, 86
41, 460
42, 721
43, 131
44, 60
45, 85
46, 70
47, 1314
48, 178
49, 65
50, 58
51, 1973
52, 57
53, 67
54, 107
55, 87
56, 58
57, 203
58, 5467
59, 39
60, 75
61, 46
62, 78
63, 950
64, 47
65, 128
66, 34
67, 51
68, 43
69, 45
70, 33
71, 110
72, 32
73, 46
74, 240
75, 50
76, 32
77, 32
78, 34
79, 59
80, 24
81, 31
82, 577
83, 25
84, 1157
85, 32
86, 75
87, 32
88, 29
89, 24
90, 22
91, 21
92, 24
93, 22
94, 25
95, 48
96, 23
97, 123
98, 31
99, 25
100, 20
101, 54
102, 29
103, 21
104, 18
105, 20
106, 18
107, 46
108, 19
109, 18
110, 16
111, 17
112, 42
113, 26
114, 168
115, 29
116, 21
117, 18
118, 15
119, 18
120, 92
121, 19
122, 15
123, 16
124, 15
125, 15
126, 20
127, 16
128, 30
129, 1257
130, 17
131, 18
132, 15
133, 15
134, 112
135, 448
136, 18
137, 13
138, 219
139, 75
140, 13
141, 11
142, 535
143, 11
144, 10
145, 10
146, 11
147, 10
148, 10
149, 11
150, 11
151, 119
152, 11
153, 9
154, 9
155, 9
156, 9
157, 9
158, 9
159, 9
160, 10
161, 9
162, 10
163, 37
164, 11
165, 9
166, 11
167, 15
168, 11
169, 8
170, 134
171, 13
172, 8
173, 9
174, 8
175, 9
176, 10
177, 8
178, 13
179, 8
180, 28
181, 12
182, 12
183, 15
184, 7
185, 7
186, 15
187, 7
188, 8
189, 7
190, 138
191, 8
192, 64
193, 7
194, 8
195, 10
196, 13
197, 8
198, 982
199, 7
200, 777
201, 7
202, 7
203, 7
204, 7
205, 7
206, 21
207, 3836
208, 11
209, 20
210, 6
211, 290
212, 8
213, 10
214, 6
215, 7
216, 5
217, 4
218, 4
219, 7
220, 5
221, 4
222, 6
223, 24
224, 4
225, 4
226, 1645
227, 3
228, 3
229, 4
230, 3
231, 3
232, 3
233, 3
234, 3
235, 3
236, 3
237, 3
238, 3
239, 3
240, 3
241, 3
242, 3
243, 3
244, 3
245, 3
246, 3
247, 3
248, 3
249, 3
250, 3
251, 3
252, 3
253, 3
254, 4
255, 3
256, 3
257, 3
258, 3
259, 3
260, 3
261, 3
262, 3
263, 3
264, 3
265, 3
266, 2
267, 2
268, 2
269, 2
270, 2
271, 2
272, 2
273, 2
274, 2
275, 6882
276, 2
277, 2
278, 2
279, 2
280, 2
281, 2
282, 2
283, 2
284, 2
285, 2
286, 2
287, 2
288, 2
289, 2
290, 2
291, 2
292, 2
293, 2
294, 2
295, 2
296, 2
297, 3
298, 2
299, 2
300, 2
301, 2
302, 2
303, 2
304, 2
305, 2
306, 2
307, 2
308, 2
309, 2
310, 2
311, 2
312, 2
313, 2
314, 2
315, 2
316, 2
317, 2
318, 2
319, 2
320, 2
321, 2
322, 2
323, 2
324, 2
325, 2
326, 2
327, 2
328, 2
329, 2
330, 2
331, 2
332, 2
333, 2
334, 2
335, 2
336, 2
337, 2
338, 2
339, 2
340, 2
341, 2
342, 2
343, 2
344, 2
345, 2
346, 2
347, 2
348, 2
349, 2
350, 2
351, 2
352, 2
353, 2
354, 2
355, 2
356, 2
357, 2
358, 2
359, 2
360, 2
361, 2
362, 2
363, 2
364, 2
365, 2
366, 3
367, 2
368, 2
369, 2
370, 4
371, 2
372, 2
373, 2
374, 2
375, 2
376, 6
377, 2
378, 2
379, 2
380, 2
381, 2
382, 2
383, 2
384, 2
385, 2
386, 2
387, 2
388, 2
389, 2
390, 2
391, 2
392, 2
393, 2
394, 2
395, 2
396, 2
397, 2
398, 2
399, 2
400, 2
401, 2
402, 2
403, 2
404, 2
405, 2
406, 2
407, 1
408, 1
409, 1
410, 1
411, 1
412, 1
413, 1
414, 1
415, 1
416, 1
417, 1
418, 1
419, 1
420, 1
421, 1
422, 1
423, 1
424, 1
425, 1
426, 1
427, 1
428, 1
429, 1
430, 1
431, 1