cvs (or something!) on very large scales with non-source code objects
From: Nigel Kerr
Subject: cvs (or something!) on very large scales with non-source code objects
Date: Fri, 01 Feb 2002 10:24:37 -0500
User-agent: Gnus/5.090005 (Oort Gnus v0.05) Emacs/20.7 (sparc-sun-solaris2.8)

good folk,
i ask this forum because i'm not at all sure where to start looking for
ideas on how to address my problems. cvs may not be the right tool
for what i have, but any ideas or suggestions or redirections to other
fora are welcome and desired.
i have several million objects ("very large scales"): roughly half of
them are bitonal TIFF files, scanned page images of printed material;
the other half are OCR'd text of those same TIFF files. there are a
relatively small number of other kinds of files: metadata about chunks
of these data, and auxiliary images of parts of some of the pages.
right now the top level chunks of this corpus number about 3,000, with
sub-chunks inside those top-level chunks.
at any moment, it might be discovered that there is an error or
problem with any of these objects that will need to be fixed:
  - the TIFF file might be bad/corrupt/unclear
  - the ocr'd text might be bad/corrupt/unclear
  - the metadata might be found to be wrong
  - the auxiliary images might be bad/corrupt/unclear
we might make a change to a small number of things at a time, or we
might make a batch change to thousands of things at a time. back when
we had fewer than 500 top-level chunks, our life was relatively easy:
we had a home-grown edit-history-type system that basically:
  - moved the old file FILE to FILE.yyyy.mm.dd.hh.mm
  - moved the new version of FILE into place
  - wrote in a date-stamped log file a message meaning "i changed
    this!", where the message is phrased differently depending on what
    got changed.
  - used the doughty mirror perl script on our different machines to
    get the changed data from the master to the slave machines.
we're still using that system. we get about 400,000 new items a month
in between 30 and 50 new top-level chunks (a top-level chunk varies
considerably in size). the increases in size of our corpus will never
slow down.
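
The archive, replace, and log steps of the edit-history system described above can be sketched in Python as follows. This is only an illustration, not the actual home-grown system: the function name `replace_with_history`, the log line layout, and the use of local time for the stamp are all assumptions, and the mirroring step to the slave machines is left out.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def replace_with_history(current: Path, new_version: Path, log_file: Path,
                         message: str, now: Optional[float] = None) -> Path:
    """Archive `current`, move `new_version` into place, append a log line.

    Hypothetical helper: names and log format are assumptions.
    Returns the path of the archived old version.
    """
    # Stamp in the FILE.yyyy.mm.dd.hh.mm style (local time is an assumption).
    stamp = time.strftime("%Y.%m.%d.%H.%M", time.localtime(now))
    archived = current.with_name(f"{current.name}.{stamp}")

    # Step 1: move the old file FILE to FILE.yyyy.mm.dd.hh.mm
    shutil.move(str(current), str(archived))
    # Step 2: move the new version of FILE into place
    shutil.move(str(new_version), str(current))
    # Step 3: append an "i changed this!" message to the log file
    with log_file.open("a", encoding="utf-8") as log:
        log.write(f"{stamp} {current} {message}\n")
    return archived
```

Because the stamp only has minute resolution, two replacements of the same file within one minute would reuse the same archive name, so a real system would need to guard against that.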
our stated *goals* for using this system are two-fold:
  - a method for communicating from the master to the slave machines
    about what has changed, and what they should try to update.
  - a record of what all has changed ever, so that if we had to start
    from original source media (the cd-roms the data arrive to us on),
    we could, and only update what needed updating.
i don't have much problem with the first goal: we need some
communication method from master to slave. i am increasingly nervous
about the second goal as we get larger and larger, and am looking for
other ways to address or consider that problem.
it might be that we:
  - give up on "record of what all has changed ever", and try to go
    for "record of what all has changed since the last time we had a
    complete checkpoint of our corpus", and keep using our change
    system, and give up on the "restore from original media" idea.
  - use a version control system that can efficiently handle millions
    of things (which would be?!) changing, and the master-to-slave
    transport of changes.
  - keep going about things as we have, and just hope we never have to
    restore from scratch.
  - something else?
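
One way to picture the "record of what has changed since the last complete checkpoint" option is a checksum manifest taken at each checkpoint and compared against the current tree. The sketch below is an assumption for illustration, not something the existing system does: `build_manifest` and `diff_manifests` are hypothetical names, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB blocks so large TIFFs are not loaded whole.
                for block in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(block)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(old: Dict[str, str],
                   new: Dict[str, str]) -> Dict[str, List[str]]:
    """List paths added, removed, or changed between two manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }
```

The resulting diff could also serve as the message from the master to the slave machines about what to update, though hashing several million files on every run would be slow, so a real setup would probably build manifests per top-level chunk.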
anyone here approached this kind of problem, know someone who has, or
have any ideas about it? people/places i can seek advice from?
anything is appreciated, thank you.
cheers,
nigel kerr
address@hidden