
Re: [Pika-dev] hackerlab fsplay.[ch] status


From: Tom Lord
Subject: Re: [Pika-dev] hackerlab fsplay.[ch] status
Date: Thu, 6 May 2004 23:38:24 -0700 (PDT)

    > From: Andreas Rottmann <address@hidden>
    > >     > I've started porting the GLib memchunk implementation to
    > >     > hackerlab. 

    > > Why?

    > [snip]

    > Now that I did speed tests on alloc and free memchunks, I was
    > not all that convinced they make sense anymore (they are
    > noticeably slower than malloc()/free()). However, I have now
    > optimized the memchunk so as not to use binary trees but bit
    > masking and posix_memalign(3) (suggested by Owen Taylor on
    > #gtk+) to find the area for a block, and now the speed is about
    > the same as for malloc()/free().

Sure.  You can definitely stack a "malloc-like-thingie" on top of
malloc and get comparable performance, especially if your measurements
are for a random sequence of allocations and frees.  Your
"malloc-like-thingie" winds up using malloc as if it were sbrk (or brk
or mmap or whatever) -- i.e., infrequently.  It winds up
reimplementing what malloc does on top of that.  A good malloc doesn't
call sbrk very often -- so there's very little performance impact from
stacking.
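
To make that concrete, here is a toy version of the stacking (nothing
to do with the actual GLib or hackerlab code -- just a sketch): a
fixed-size pool that only calls malloc once per block of cells, about
as often as a real malloc calls sbrk.

    #include <stdlib.h>

    /* Toy fixed-size pool stacked on top of malloc.  malloc() is
       called once per BLOCK_CELLS allocations -- it plays the role
       that sbrk()/mmap() plays for a real malloc. */

    #define CELL_SIZE   16
    #define BLOCK_CELLS 512

    struct cell { struct cell *next; };

    static struct cell *free_list = NULL;

    static int refill (void)
    {
      char *block = malloc ((size_t) CELL_SIZE * BLOCK_CELLS);  /* the "sbrk" call */
      size_t i;

      if (!block)
        return 0;
      for (i = 0; i < BLOCK_CELLS; ++i)
        {
          struct cell *c = (struct cell *) (block + i * CELL_SIZE);
          c->next = free_list;
          free_list = c;
        }
      return 1;
    }

    void * pool_alloc (void)
    {
      struct cell *c;

      if (!free_list && !refill ())
        return NULL;
      c = free_list;
      free_list = c->next;
      return c;
    }

    void pool_free (void *p)
    {
      struct cell *c = p;

      c->next = free_list;
      free_list = c;
    }

Alloc and free are a couple of pointer moves on the fast path, which
is why a random alloc/free benchmark looks about the same as malloc:
the underlying malloc almost never gets called.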

That you _can_ stack doesn't mean you should.   You really need a
positive reason to do it.   And you need to be sure you won't be
suffering lossage.

It's very hard to predict from first principles how allocator hacks
will behave in real life.   It's not that intuitive.  So, take what I
say with large grains of salt, but:

One problem with mem_chunk-style pools is the way they're used.  They
tend to be created on a per-type basis.  Type A and type B might both
be 16-byte structures, but you'll use two different pools for them.
That means you lose the opportunity to frequently recycle a former A
region of memory as a B region of memory.  Malloc doesn't lose that
opportunity.  That's why I worry about fragmentation with pools.
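
For example (writing from memory here, so treat the GMemChunk names
and signatures as approximate), the typical per-type usage pattern
looks like this:

    #include <glib.h>

    /* Two 16-byte types, two separate pools.  A cell freed back to
       a_chunk can only ever satisfy a later A allocation; malloc
       would hand the same 16 bytes to whoever asks next. */

    struct A { int x[4]; };
    struct B { char name[16]; };

    static GMemChunk *a_chunk;
    static GMemChunk *b_chunk;

    void init_pools (void)
    {
      a_chunk = g_mem_chunk_new ("A cells", sizeof (struct A),
                                 4096, G_ALLOC_AND_FREE);
      b_chunk = g_mem_chunk_new ("B cells", sizeof (struct B),
                                 4096, G_ALLOC_AND_FREE);
    }

    void churn (void)
    {
      struct A *a;
      struct B *b;

      a = g_mem_chunk_alloc (a_chunk);
      g_mem_chunk_free (a_chunk, a);

      /* This allocation can't reuse the cell freed just above, even
         though the sizes match -- it lives in a different pool. */
      b = g_mem_chunk_alloc (b_chunk);
      g_mem_chunk_free (b_chunk, b);
    }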

The standard malloc/free interface does have a huge semi-bug.  The
semi-bug is that `free' has to figure out what size of object it's
freeing.  This is the bug that you're working around if you use
free-lists for a given type.  And it's the bug that mem_chunk is
mostly aimed at getting around.  Absent that bug, `free' could
instantly find the meta-data for a chunk being freed.  The metadata
could be minimally small.  Chunks could be returned to a general pool
very quickly, etc.  With that bug present, you have to spend either a
bit of extra space or extra time so that when `free' is handed some
address the allocator can figure out where it goes, how large it is,
etc.   mem_chunk knows what size is being freed.   In principle, it
can use alignment tricks to find the meta-data very quickly (which is
what you did when you took Owen's suggestion).
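
A sketch of that alignment trick (hypothetical layout, not the actual
GLib or hackerlab code): allocate each block on a block-size boundary
with posix_memalign, keep the block's meta-data at its start, and let
`free' recover the meta-data by masking the object's address.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096        /* must be a power of two */

    struct block_header
    {
      size_t cell_size;            /* size of every cell in this block */
      size_t cells_free;           /* ...plus whatever else you need */
    };

    void * block_new (size_t cell_size)
    {
      void *block;
      struct block_header *h;

      if (posix_memalign (&block, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return NULL;
      h = block;
      h->cell_size = cell_size;
      h->cells_free = 0;
      return block;
    }

    /* Given any object carved out of some block, find that block's
       header in O(1): round the address down to a BLOCK_SIZE boundary. */
    struct block_header * header_of (void *obj)
    {
      return (struct block_header *)
        ((uintptr_t) obj & ~((uintptr_t) BLOCK_SIZE - 1));
    }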

But: it's only a semi-bug in free.  It's a real problem, sorta, but a
large part of the thrust of malloc subsystem implementations over the
past 30 years has been to beat the performance implications of that
semi-bug into submission.  It's only in edge cases (like cons pairs in
a lisp implementation or some list structures in Rx) that malloc/free
can never, ever compete.



    > > What's the case for having this facility?

    > Well, glibc (so I've heard) imposes a 4-byte overhead on each
    > allocated piece of memory (it has to know the size). 

That's not necessarily too bad.   For small objects, a good bet
(haven't looked at the code) is that it's rounding up to the first
large-enough power-of-two size.   Some mallocs use alignment tricks
instead of a pointer-size header. 
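
For example, a size-class scheme along these lines (purely
illustrative, not a claim about what glibc actually does) turns a
13-byte request into a 16-byte cell and a 40-byte request into a
64-byte cell:

    #include <stddef.h>

    /* Round a request up to the first large-enough power-of-two size
       class.  The smallest class (8 bytes) is an assumption. */
    static size_t size_class (size_t n)
    {
      size_t s = 8;

      while (s < n)
        s <<= 1;
      return s;
    }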

Arguably, you should use _more_ than 4 bytes of overhead for small
objects, on average, to ensure that the caller's data (the value
returned from malloc) winds up aligned on a cache-line boundary (on
systems with that kind of cache).  That is: optimize for L1 cache at
the expense of the OS page cache because, after all, machines have
lots of memory and disk space for paging may as well be infinite.
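
A sketch of that trade-off (64 bytes is an assumed L1 line size, and
posix_memalign is just one portable way to ask for the alignment):

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    #define CACHE_LINE 64          /* assumed L1 cache-line size */

    /* Hand back cache-line-aligned objects, spending some extra space
       per allocation (the size is rounded up to whole lines). */
    void * cache_aligned_alloc (size_t n)
    {
      void *p;
      size_t padded = (n + CACHE_LINE - 1) & ~((size_t) CACHE_LINE - 1);

      if (posix_memalign (&p, CACHE_LINE, padded) != 0)
        return NULL;
      return p;
    }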


    > I thought we'd want to save that for object data areas, since
    > hashq values might be used *very* often (additionally, with
    > lim_malloc(), when using limits, the overhead increases by
    > another 4 bytes).

lim_malloc doesn't increase overhead by another 4 bytes if you're
using it the way it's usually used -- as a front-end to ordinary
malloc.

--------------------------------

I wish I had some conclusive, obviously true statement to wrap up
with.  I don't.  If by next year I've completely embraced a mem_chunk
interface in hackerlab you're officially not allowed to call me a
hypocrite.   Until then....

My judgement to date has been that malloc is hard to beat, almost
always.  When it isn't hard to beat, you probably want something
rather custom.  Those two end-points squeeze out stuff like
mem_chunk. 

BTW: allocators can be very fun to hack on.  If you want to play
around with tuning a malloc implementation, may I suggest playing with
the PIW subdirs of src/hackerlab?  They are derived from an old BSD
libc allocator by Mike Haertel.  They add lots of instrumentation and
debugging facilities (that's why they exist).  They aren't especially
fast or tuned.  It might be a fun place to play around.  (You have to
rename a bunch of UNPLUGGED directories to PLUGIN to turn that stuff
on.)

Of course, digging up Haertel's original allocator is also a good
starting place.   I vaguely recall adding comments that might make my
derivative easier to read but, lacking the debugging features, the
original is far simpler.   Also it's a good taste of "classical style"
for unix systems programming.

You can learn a heck 'o lot about allocation and profiling and
optimization by playing in this area.   If you want a hobby project to
work on casually for a year or two, this is one to consider.

-t





