Re: [Gnu-arch-users] Re: give us a hand with arch


From:	Andrea Arcangeli
Subject:	Re: [Gnu-arch-users] Re: give us a hand with arch
Date:	Sat, 27 Sep 2003 12:09:41 +0200
User-agent:	Mutt/1.4.1i
On Fri, Sep 26, 2003 at 07:04:25PM -0700, Tom Lord wrote:
> 
> 
>     > From: Andrea Arcangeli <address@hidden>
> 
>     > There is no way I can maintain an huge regexp.
> 
> Want some advice?
> 
> Stop.  
> 
> Slow down.
> 
> Stop posting so much.
> 
> You don't quite know what you're talking about yet.

That's true, I don't understand it all yet, and I'm definitely not an
experienced user yet. The case I was examining is what I consider an
extreme case: I have to handle around 300 branches, all chained one on
top of the other, and insert new branches in the middle of the chain,
which is something I have no clear solution for yet (i.e. the "re-tag"
idea). It's quite complex stuff. I doubt many people are using arch for
this scenario today, I don't think it's the usual usage of parallel
distributed development, and I will postpone it until after the first
problems get fixed. You're completely right that we have to do things in
order and prioritize. For now, the more normal usage with star-merge is
quite clear to me.

NOTE: I'm also not demanding that arch handle the 300 branches case.
So far I didn't even consider that it was possible at all with any
other product (even in the non open source area). It may be just too
slow to manage that many branches, but I will research whether that's
feasible over time, and what I can do to make it handier. I'm not
really in a hurry.

I'm sorry if I spammed, I was quite excited, and I was trying to
understand everything that was going on.

Especially all the explanations from Pau were _just_ very useful to me
to better understand how arch can be used for my purpose (i.e. the pure
branches). My original assumption of using the patchsets as if they were
pure branches was definitely half broken ;)

But you're very right, I need to slow down a lot and focus on each thing
one by one, slowly, and prioritize. Otherwise I risk wasting your time
with questions that are too easy. It was my intention anyway to
prioritize after my first steps. Also note the long questions were sent
to Pau; I didn't want to waste your time with easy or uninteresting
questions.

So now, let's start with the current showstopper and ignore everything
else. You may want to read only point 1 (and possibly 3), think about
it, discuss it and go to points 2/4 much, much later (you don't need to
read 2/4 now).

------------------------------------------------------------------

1) the commit has to be strict for kernel usage. You simply can't
expect a regexp to statically catch all the junk forever; that's not
feasible, as I stated in the quote above. I feel like you're
disappointed that we need this feature, but I don't think it's fixable
by hoping everybody does the right thing. The need for the strict commit
is exactly to let people easily find out when they made a mistake and
left garbage in the tree, so they can do the right thing before doing a
commit.
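
To make the request concrete, here is a minimal sketch of what a strict commit check could look like. It is not how tla works today; the function names and the idea of passing the explicitly added files as a plain set are assumptions made for illustration only.

```python
import os


def untracked_files(tree_root, tracked):
    """Return sorted relative paths under tree_root that were never added."""
    found = []
    for dirpath, dirnames, filenames in os.walk(tree_root):
        # Do not descend into the revision-control metadata directory.
        dirnames[:] = [d for d in dirnames if d != "{arch}"]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), tree_root)
            rel = rel.replace(os.sep, "/")
            if rel not in tracked:
                found.append(rel)
    return sorted(found)


def strict_commit_check(tree_root, tracked):
    """Split the tree into files to commit and files to leave alone.

    Only explicitly added files are committed; everything else is
    reported so the user can spot garbage before committing.
    """
    left_alone = untracked_files(tree_root, tracked)
    to_commit = sorted(
        p for p in tracked if os.path.isfile(os.path.join(tree_root, p))
    )
    return to_commit, left_alone
```

With something like this, no regexp has to be maintained: a file is either in the explicit set or it is reported and ignored.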

Tons of branches and architectures are developed independently, and they
sometimes even use different coding style policies, let alone naming
conventions for generated .c source files, which don't exist at all and
which wouldn't make much sense to enforce (coding style differences are
discouraged these days, but we still often merge drivers developed in
weird manners). In a development as distributed as the kernel's,
enforcing a regexp is not feasible. Or at least there will always be the
risk of an exception, the regexp will eventually become obsolete, and
this theoretical huge regexp will definitely differ a lot from 2.4 to
2.6 to 2.7.

Personally I need the strict commit anyway to trivially validate the
'distclean' functionality. The strict commit feature will decrease the
probability of adding junk to the tree by mistake to ~zero IMHO. I
recall that in the past (2.3 or 2.4) we had junk: for example, in some
kernel there was a problem where after a distclean you couldn't reverse
the previous patch anymore, or similar. Somebody forgot to update
distclean. I also had junk myself recently when I applied a patch with
-p0 instead of -p1 (and I tend to re-read patches).

I have to admit the current commit already tells me which files are
being added, so even the current non-strict commit would still be better
than no revision control. But personally I also like being able to write
a new random file without having to worry about garbage in the tree if
I don't want to commit it explicitly. Maybe it's a temporary python
script that does some calculation/simulation and is useful to me during
development, and I don't need to store it in /tmp or similar. I can
store it next to the file involved, which is the most on-topic place in
the filesystem to store it (assuming the commit is strict and I don't
risk committing garbage).

> To be sure: you've some catching up to do.  Very recent features --
> some even since 1.1pre5, come into play here.  You'll need some help

These very recent features are addressed in the docs too, right? I'm
still stuck at pre4 but I'm going to upgrade right now to the rpm that
Pau sent me. You're very right that I'd better study the new features
before asking more questions on the complex "branch" matter. Eventually
I'll sort out the final problems while doing experiments in practice; I
just wanted to be more or less on the right track first, and those
discussions were useful to me.

You may want to stop reading this email here and continue later.
I can post you the rest of the email (if still relevant) over time after
point 1 is addressed. Of course I will try to contribute with code too
as time permits and/or if you need.

------------------------------------------------------------------
2) the second showstopper to address that I mentioned (at least for my
needs, but it will boost everyone) is the checkout (i.e. get) through
_hardlinks_ from a cached _unpacked_ tree. The cacherev should generate
an unpacked tree, and it should itself check previous unpacked trees and
fork from them with hardlinks to preserve the inodes as much as possible.

90% of the code in the kernel will never change across all the branches,
and I need only 1 inode allocated in the whole filesystem for this huge
amount of code that never changes. an unpacked tarball of the kernel
takes 200-500M hence the absolute need for hardlinks everywhere and the
avoidance of any tar xzf of a 200-500M payload during any checkouts of
the pure branches.
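
As a rough illustration of the checkout through hardlinks, the sketch below forks a new working tree from an already unpacked one without copying any file data. The function name is hypothetical, it ignores symlinks, and it assumes both trees live on the same filesystem (hardlinks cannot cross filesystems).

```python
import os


def hardlink_tree(src, dst):
    """Recreate the directory layout of src under dst, hard-linking files.

    Directories are created fresh; every regular file in dst shares its
    inode with the corresponding file in src, so no file data is copied.
    """
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = dst if rel == "." else os.path.join(dst, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name),
                    os.path.join(target_dir, name))
```

One consequence of this design: a file that is modified in place changes in every tree that shares the inode, so anything that applies a patch to such a tree has to write a new file and rename it over the old one rather than edit the existing inode.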

I understand Miles wants this too, so maybe this is not a stupid idea
already addressed by the current pre5 code. I will soon check if this
can be already done in pre5 somehow.

I don't see other showstoppers at the moment.

------------------------------------------------------------------
3) The third issue in my scale of priority (and this isn't in the
showstopper area anymore) is the tagging mode.

Since I need the strict commit in the first place, I shouldn't have to
know or care about the tagging mode at all when I'm in strict commit
mode. To provide the strict commit feature, obviously I have to tell
arch every time I move stuff around, and as usual when I add or remove
stuff from the tree. Since I have to do that anyway for the strict
commit behaviour, it'd be very wasteful for me to be asked to manage
the tagging in a non-transparent manner like it happens right now.

Note, more or less I'm also claiming the current explicit tagging mode
is underpowered at the moment. So the way to go is probably to boost the
explicit tagging to the power it should have to be completely
transparent to the user, and then to enable the explicit mode by default
(IMHO it would be reasonable to have it as the only tagging option) when
the strict commit feature is enabled on a project.

Last night I suggested generating ids randomly as a function of my-id
and various other sources. But we don't need to work randomly; it seems
we can be mathematically safe even while doing distributed development
thanks to the my-id. I overlooked this possibility last night.  The
arch-tag for a file would be address@hidden where 1234 is a
sequence number per-project.  It will never be decreased and it can grow
without bound (well, a 64bit long long would be more than enough, in
practice even a long would be ok ;).

So the idea is: assign transparently and automatically, to each file, a
unique id my-id_sequence-number during its first import or its first
add-tag (which for me means cvs add in terms of strict commit), and
store it together with the base-0 tarball in the first patchset and
replicate it in the {arch} directory during `tla get` after the tar xzf
of the base-0 tarball. (It may be possible to avoid storing the
information in the base-0 tarball if we rely on ordering and we know
the sequence number had to start with 0, but this is an unimportant
detail.) This will generate a dictionary/hash that resolves from
name to ID (maybe both ways, dunno). The commit and inventory commands
will depend on this data structure when strict commit is enabled.
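
A minimal sketch of such a dictionary, assuming IDs of the form my-id_sequence-number as described above. The class name, method names and the example my-id are hypothetical, and this version keeps the mapping in both directions.

```python
class TagTable:
    """Per-project table mapping file names to permanent unique IDs."""

    def __init__(self, my_id):
        self.my_id = my_id
        self.next_seq = 0        # only ever grows, never reused
        self.id_by_name = {}
        self.name_by_id = {}

    def add(self, path):
        """Assign a fresh ID to a newly added file and return it."""
        if path in self.id_by_name:
            raise ValueError(f"{path} already has an ID")
        file_id = f"{self.my_id}_{self.next_seq}"
        self.next_seq += 1
        self.id_by_name[path] = file_id
        self.name_by_id[file_id] = path
        return file_id

    def move(self, old_path, new_path):
        """Rename a file; its ID stays the same."""
        if new_path in self.id_by_name:
            raise ValueError(f"{new_path} already has an ID")
        file_id = self.id_by_name.pop(old_path)
        self.id_by_name[new_path] = file_id
        self.name_by_id[file_id] = new_path
        return file_id

    def delete(self, path):
        """Forget a file; its sequence number is not handed out again."""
        file_id = self.id_by_name.pop(path)
        del self.name_by_id[file_id]
        return file_id
```

Because the my-id prefix differs per developer and the sequence number never decreases, two people adding files independently cannot produce the same ID, which is the safety property argued for above.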

Every time we do a tla move-tag (which I'd call more simply `tla move`
with strict commit enabled) we update the hash accordingly and record in
the {arch} directory (separately from the hash) that one unique ID
changed name. Conceptually the information that `tla move` has to store
separately could be something like:

        address@hidden mm/memory.c
        address@hidden mm/memory_lowlevel.c
        address@hidden mm/memory_highlevel.c

that indicates the ID 1234 changed from memory.c to
memory_lowlevel.c and a new file was added with tla add later.

Then when we do a commit we must store only the above short metadata
in the repository, so that during `tla get` we can patch the database
according to what we did before the commit. This will replay the
`tla move` operation in the patchset.
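
The replay step could be sketched as follows. This assumes each stored record is an (ID, name) pair, that a record for an already known ID means a rename, and that a record for an unknown ID means an add; the function name and the example IDs in the usage note are made up for illustration.

```python
def replay_moves(name_by_id, records):
    """Apply (file_id, path) records from a patchset to an ID table.

    Returns the updated table and the list of (old_path, new_path)
    renames that were replayed. The input table is left untouched.
    """
    result = dict(name_by_id)
    renames = []
    for file_id, path in records:
        old = result.get(file_id)
        if old is not None and old != path:
            renames.append((old, path))
        result[file_id] = path
    return result, renames
```

Fed the three records from the example above (with hypothetical IDs `jane@example.org_1234` twice and `jane@example.org_1235` once), this yields one rename from mm/memory.c to mm/memory_lowlevel.c and one newly known file, mm/memory_highlevel.c.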

I'm fairly sure all this support must be more or less already
implemented for the explicit tagging, that's the whole point of the
move-tag add-tag delete-tag commands after all. So it should be very
easy to convert the explicit tagging to a true transparent
auto-tag-mode, not to require any user intervention anymore. The user
will be simply forced to use delete-tag/add-tag/move-tag every time he
wants to make a change, but the explicit mode already requires it.

I'm ignoring the mixture between different modes and backwards
compatibility for simplicity; that's a separate matter anyway.

Does this make any sense? What I'd like to know is whether you can see
any valid reason to even consider giving the user an option to use
'tagline' or 'explicit' or 'names' tag-modes, when you can solve the
problem completely and _transparently_ when 'strict commit' is enabled
(with a really powerful and transparent explicit mode)?

(and in a mathematically safe way, unlike tagline and explicit, where
you never know whether two users add a new file with the same arch-tag,
since they're not forcing the my-id in front of their tag for example,
while the implicit transparent auto-tagging is smart enough to always do
the right thing)

So in short, I believe only people using the current non-strict commits
will still have to worry about the tag-modes at all. And once the
non-strict commit users see how much more powerful the full-powered
explicit mode is, they will probably want to use it instead of the
others, since it gives them more reliability and less overhead, with the
only downside of having to use `tla add-tag/delete-tag/move-tag` (which I
would rename to `tla add/del/move` to be more intuitive for the strict
commit needs, which aren't about tags: as said, with strict commits the
whole tag-mode thing vanishes).

Now on this third part, it could be one of the half broken ideas and I
could be missing something, you're warned ;).

------------------------------------------------------------------
4) the last bit I'm mentioning here (and it's very low priority, much
lower than 3) is about the gzip. I believe that should be made optional
or subject to a size of the tar patchset exceeding a certain limit
(possibly configurable, different people may have different speed/space
ratio needs so it's very reasonable to leave it configurable).

You know small files compress much less. So the compression is more
worthwhile on the big files. So if you care about speed and not space
you should compress the tar only when the file is very big.
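
A sketch of the size-based policy, assuming the patchset tar is available as bytes. The function name and the 64 KiB default are arbitrary placeholders, not values taken from arch.

```python
import gzip

# Hypothetical default; the whole point is that it should be configurable.
DEFAULT_THRESHOLD = 64 * 1024


def pack_patchset(tar_bytes, threshold=DEFAULT_THRESHOLD):
    """Return (suffix, payload), gzipping only patchsets at or above threshold."""
    if len(tar_bytes) >= threshold:
        return ".tar.gz", gzip.compress(tar_bytes)
    # Small patchsets are stored as-is to save the compression time.
    return ".tar", tar_bytes
```

Someone who cares about space over speed would simply set the threshold to 0, which compresses everything as happens today.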

And if any temporary data has to be stored in a filesystem, that
filesystem should be /dev/shm on linux (trapping SIGSEGV to warn the
user that he should check that no garbage is left in /dev/shm). We
should also be very careful not to follow symlinks, for the usual
security reasons (it's the same as /tmp). Using /dev/shm may boost
performance since it never invokes any real fs operation and it works
purely in the kernel cache layer, so it runs at full cache speed.
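
For illustration, a small sketch of picking /dev/shm for scratch data with a fallback to the normal temporary directory. The helper names and the directory prefix are hypothetical; the safety comes from Python's `tempfile` module, which creates the directory with a unique name and owner-only permissions.

```python
import os
import tempfile


def scratch_base(preferred="/dev/shm"):
    """Return preferred if it is a writable directory, else the default tmp dir."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK | os.X_OK):
        return preferred
    return tempfile.gettempdir()


def make_scratch_dir(preferred="/dev/shm"):
    """Create a private scratch directory that is removed when the context exits."""
    return tempfile.TemporaryDirectory(prefix="arch-scratch-",
                                       dir=scratch_base(preferred))
```

Used as `with make_scratch_dir() as d: ...`, the directory is cleaned up on normal exit and on exceptions, but not if the process is killed outright, which is why the warning about leftover garbage still matters.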

Right now it's just fine to have tar.gz always enabled because it'll
save some network bandwidth compared to pure tar as you said, but note
that it's extremely inefficient *especially* for the network to have all
the patchsets compressed in small tar.gz packages (even though the
tar.gz files already save something compared to .tar).  In the long run
you really don't want to compress any _single_ tar during the network
syncs; you want to pack together all the pure .tar (not .tar.gz)
patchsets from base-0 to the last one, and pump all of them at once in a
stream through a '| gzip -1'.  This way the compression ratio will be
far more significant (I could imagine up to 3-5 times more) since it'll
be like compressing a huge tar archive, not many tiny ones. Compressing
tiny things leads to a very small compression ratio. This efficient
network compression mode may of course require a dedicated server, or a
cgi script. Actually rsync may also be interesting: it will do the
compression as described above natively with -z, and all we have to be
careful about is to give it .tar and not .tar.gz to eat.
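
The effect can be measured with a few lines of Python. This sketch uses zlib at level 1 as a stand-in for `gzip -1` (same deflate algorithm, slightly smaller container header), and the function names are made up for the comparison.

```python
import zlib


def per_item_size(items, level=1):
    """Total bytes when every item is compressed on its own."""
    return sum(len(zlib.compress(item, level)) for item in items)


def streamed_size(items, level=1):
    """Total bytes when all items go through one compression stream."""
    compressor = zlib.compressobj(level)
    total = 0
    for item in items:
        total += len(compressor.compress(item))
    total += len(compressor.flush())
    return total
```

Calling both functions on a list of small, similar payloads (for example a few dozen short diffs) shows the single stream coming out smaller, because later items can reuse what the compressor already saw in earlier ones. The actual gain depends entirely on the data, so the 3-5 times figure above remains a guess.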
------------------------------------------------------------------

This email is bigger than ever, but these are ordered ideas,
prioritized, and there is no hurry at all to address them. Before
attempting any production usage I will need at the very least point 1)
to be addressed first, and I certainly can help with that if needed.
Point 2) I feel is needed in order for me to try it with the full
2.4-aa, since I don't have enough disk space for 300 >200M branches
checked out at the same time (60G total). With point 2 the 60G could be
reduced to a few gigs, and thanks to the almost immediate generation of
the pure branches, I likely won't need to keep all the pure branches
checked out at the same time anymore to edit them fast.

Andrea - If you prefer relying on open source software, check these links:
            rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
            http://www.cobite.com/cvsps/
            svn://svn.kernel.org/linux-2.[46]/trunk