Re: [rdiff-backup-users] Proposal: Storing excess file information
From: Ben Escoto
Subject: Re: [rdiff-backup-users] Proposal: Storing excess file information
Date: Mon, 02 Dec 2002 12:58:55 -0800
Ok, I just ran a few tests, pretending that I was trying to write and
read the user and group ownership information of 500000 files in
order. I compared Bud Bruegger's suggestion of shelve and a normal
text file.
First the sizes of the files:
-rw-r--r-- 1 ben ben 82911232 Dec 2 12:21 shelf.db
-rw-r--r-- 1 ben ben 1252698 Dec 2 12:17 text.gz
and now the times (ran each one twice):
Writing the text file:
real 0m23.902s
user 0m23.650s
sys 0m0.150s
real 0m24.811s
user 0m23.820s
sys 0m0.180s
Reading from that text file:
real 0m47.725s
user 0m47.310s
sys 0m0.080s
real 0m47.443s
user 0m47.360s
sys 0m0.080s
Writing using python's shelve module:
real 1m51.196s
user 0m34.540s
sys 0m24.880s
real 2m49.347s
user 0m29.180s
sys 0m23.810s
Reading from that shelf file:
real 1m37.679s
user 1m25.480s
sys 0m7.370s
real 1m32.652s
user 1m24.510s
sys 0m7.710s
So it seems the database format took longer to read and write. At
least in the writing case this was apparently because it took up much
more space on disk and disk writes are relatively slow.
The shelve format would have done better at random access. This
doesn't happen much, but could happen at the start of a selective
restore. However, running zcat text.gz | grep 'File <filename>' took about
half a second even for filenames near the bottom of the file, so in theory
there shouldn't be a big speed issue.
Dave Steinberg mentioned cdb. This is probably faster than
shelve (assuming the python interface is reasonably fast), but I doubt
it will be much faster than the text version. Also, it doesn't seem to
be released under a GPL-compatible license, which is a showstopper for
me.
The script used is below. The text reading code could use much
improvement as it is surely faster to read in large blocks instead of
line by line (as was shown by the scan_text function).
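A rough, untested sketch of what such a block-based reader might look
like (it assumes the same imports and globals as the script below):

def read_text_blocks():
    """Read gzipped text file in 64k blocks instead of line by line"""
    fin = gzip.GzipFile("text.gz", "rb")
    leftover = ""
    while 1:
        buf = fin.read(64 * 1024)
        if not buf: break
        # Split the block into lines, carrying any partial last line
        # over into the next block
        lines = (leftover + buf).split("\n")
        leftover = lines.pop()
        for line in lines:
            if line.startswith("    User "):
                assert line[9:] == user
    assert not fin.close()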
So anyway I think I will just use a flat text format, with all the
metadata in one file. It seems fast enough, takes up very little
disk space, and the format should be easy to process on any platform,
or read by hand.
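For reference, each record as written by write_text() in the script
below looks like this (user and group values are just the made-up test
data):

File foo/482323
    User larry
    Group losers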
--
Ben Escoto
import shelve, re, gzip, sys
user = "larry"
group = "losers"
count = 500000
def write_shelf():
    """Write shelve DB"""
    d = shelve.open("shelf.db")
    for i in xrange(count):
        d["foo/" + str(i)] = {"user": user, "group": group}
        if i % 100000 == 0: print i
    d.close()

def read_shelf():
    """Read every file from shelf"""
    d = shelve.open("shelf.db")
    for i in xrange(count):
        assert d["foo/" + str(i)]['user'] == user
        if i % 100000 == 0: print i
    d.close()

def write_text():
    """Write gzipped text file"""
    fout = gzip.GzipFile("text.gz", "wb")
    for i in xrange(count):
        filename = "foo/" + str(i)
        fout.write("File %s\n" % filename)
        fout.write("    User %s\n" % user)
        fout.write("    Group %s\n" % group)
        if i % 100000 == 0: print i
    assert not fout.close()

def read_text():
    """Read gzipped text file"""
    fin = gzip.GzipFile("text.gz", "rb")
    while 1:
        line = fin.readline()
        if not line: break
        if line.startswith("File "): filename = line[5:-1]
        elif line.startswith("    User "):
            line_user = line[9:-1]
            assert line_user == user
    assert not fin.close()

def scan_text():
    """Scan the text file for a given filename"""
    fin = gzip.GzipFile("text.gz", "rb")
    while 1:
        buf = fin.read(64 * 1024)
        assert buf
        if re.search("File " + "foo/482323", buf): break
    assert not fin.close()

eval(sys.argv[1])()