
Re: casefile.c revision


From: Ben Pfaff
Subject: Re: casefile.c revision
Date: Wed, 01 Jun 2005 21:29:26 -0700
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

[Adding pspp-dev back to CCs]

John Darrington <address@hidden> writes:

> On Tue, May 31, 2005 at 11:43:27PM -0700, Ben Pfaff wrote:

> Like I said above, currently all cases are always the same size, so it's 
> not a problem.  If however we do away with the union, then it certainly
> will become a problem, and the model will have to be rethought, since
> case sizes will no longer be constant.  

Not sure I'm communicating well here.  I'm not envisioning making
case sizes non-constant, I'm planning to store them more
efficiently.  Currently a case takes up at least 8*num_vars +
8*str_vars bytes; I would make it take up only 8*num_vars +
total_str_var_widths bytes.  Still a constant but potentially a
smaller constant.
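
For example, taking those formulas at face value: a dictionary
with 100 variables, 20 of them strings totaling 60 bytes of
width, would shrink from at least 8*100 + 8*20 = 960 bytes per
case to 8*100 + 60 = 860.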

>      > 2. Currently, casefiles can be accessed only sequentially.
>      >    RANK needs random access.
>      
>      I'm not sure I really like this idea.  Sequential access is fast.
>      Sorting only needs sequential access.  But random access is very,
>      very slow (when your data doesn't fit in OS caches).  A cheap
>      hard disk can read or write 10 MB or more sequentially in one
>      second, but in that same second it can only do 50 to 100 seeks.
>      
>      If RANK has to seek randomly, then it's in trouble for files
>      bigger than RAM.  It's going to be slow as heck.
>      
>      Let's run the numbers.  [..snip..]
>
> Thanks for taking the time to explain this in so much detail.  

I've been involved in teaching basic operating systems courses
lately so it's second nature at the moment ;-)

> Of course sequential read/writes are only truly sequential if
> the disk is relatively unfragmented.  Otherwise they're random
> access just like before.

Well, that's certainly true.  Most modern file systems are
designed to largely avoid fragmentation if they're under, say,
80% full.  If the files are stored on disk in, say, 1 to 4 MB
chunks that are scattered around the disk, then those chunks can
be read with only slightly less efficiency than if they're completely
contiguous.  It's still a *lot* better than random access.
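
(Roughly, using the numbers above: one 10 ms seek followed by a
4 MB read at 10 MB/s, i.e. 400 ms of transfer, wastes only about
2.5% of the disk's time, whereas a seek per case wastes almost
all of it.)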

Actually I forgot about two factors that would have made some
small differences in my analysis.  First, our sort routines do an
optimization called "replacement selection" while they compose
the initial runs, which causes the initial runs to be, on
average, twice as long as the amount of data we can fit in memory
at once (if the data is already sorted, then it allows us to
avoid merging entirely).  This would significantly reduce the
amount of I/O required for each sort, maybe by 1/3 or so.
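
In case it's not familiar, here is a toy illustration of
replacement selection, with ints standing in for cases and a
linear scan standing in for the heap (a sketch only, not the
actual sort code):

#include <stdio.h>

#define M 4                     /* workspace slots ("memory") */

enum slot_state { EMPTY, CURRENT, NEXT };

int
main (void)
{
  int input[] = { 5, 11, 2, 9, 3, 14, 7, 1, 12, 8 };
  size_t n = sizeof input / sizeof *input, next_in = 0;
  int ws[M];
  enum slot_state state[M];
  int run = 1;

  /* Fill the workspace from the input. */
  for (int i = 0; i < M; i++)
    if (next_in < n)
      ws[i] = input[next_in++], state[i] = CURRENT;
    else
      state[i] = EMPTY;

  for (;;)
    {
      /* Choose the smallest value still eligible for this run. */
      int best = -1;
      for (int i = 0; i < M; i++)
        if (state[i] == CURRENT && (best < 0 || ws[i] < ws[best]))
          best = i;

      if (best < 0)
        {
          /* Current run finished: promote deferred values. */
          int promoted = 0;
          for (int i = 0; i < M; i++)
            if (state[i] == NEXT)
              state[i] = CURRENT, promoted = 1;
          if (!promoted)
            break;
          run++;
          continue;
        }

      int out = ws[best];
      printf ("run %d: %d\n", run, out);

      if (next_in < n)
        {
          /* Refill the slot; a replacement smaller than what we
             just wrote has to wait for the next run. */
          ws[best] = input[next_in++];
          state[best] = ws[best] >= out ? CURRENT : NEXT;
        }
      else
        state[best] = EMPTY;
    }

  return 0;
}

With that input the first run comes out eight values long, twice
the four-slot workspace, which is where the factor of two comes
from on average.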

Second, it's doubtful that we'd be seeking across the entire
disk.  Probably the file system would cluster the file data a
little better than that, so we'd probably get a few more than 300
seeks/s out of our expensive disk.

> Anyway if you'd prefer the two-sorts approach we can do it that way.
> Although there are some issues here (some of which are issues whichever
> way we go):
>
> 1. Obviously we'd need to store a key so that we can "unsort" the cases.

Yes.

> 2. Rank &c requires a unique sort --- but it requires a unique sort with
>    some kind of consolidation function.  By implication, we'd need to
>    store extra data to "unconsolidate".  By way of illustration:
>
> DATA LIST LIST /x * .
> BEGIN DATA
> 3
> 2
> 4
> 4
> 6
> END DATA.
>
> RANK X 
>   /TIES=CONSOLIDATE.
>
> The first sort needs to produce the sequence 2,3,4,6 and the second
> must recover the original 3,2,4,4,6 (without losing any other
> variables either).

I don't think it's a big deal.  First sort:

X ORIG_CASENUM
- ------------
2       2
3       1
4       3
4       4
6       5

Add ranks.  Consolidation is easy because identical values are
adjacent after sorting, so they all receive the same rank:

X ORIG_CASENUM RANK_X
- ------------ -------
2       2         1
3       1         2
4       3         3
4       4         3
6       5         4

Second sort:

X ORIG_CASENUM RANK_X
- ------------ -------
3       1         2
2       2         1
4       3         3
4       4         3
6       5         4

Drop ORIG_CASENUM:

X RANK_X
- -------
3    2
2    1
4    3
4    3
6    4

I believe it should be possible to combine the first and second
steps, and the third and fourth steps.  It may take some extra
work on the casefile code to do so.
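
To make the middle step concrete, here is the consolidation pass
over the sorted data, sketched with plain arrays instead of cases
(variable names as in the tables above):

#include <stdio.h>

int
main (void)
{
  /* (X, ORIG_CASENUM) pairs as they come out of the first sort. */
  double x[] = { 2, 3, 4, 4, 6 };
  int orig[] = { 2, 1, 3, 4, 5 };
  int n = 5, rank = 0;

  for (int i = 0; i < n; i++)
    {
      /* Equal X values are adjacent after sorting, so a new
         consolidated rank is needed only when X changes. */
      if (i == 0 || x[i] != x[i - 1])
        rank++;
      printf ("%g %d %d\n", x[i], orig[i], rank);
    }
  return 0;
}

Running it reproduces the middle table above.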

> 3.  RANK's BY keyword complicates things further.

Not much.  It adds sort keys.  In the "rank" step we have to
check for changes in the BY variables.  It is similar to the
handling of BREAK variables in the AGGREGATE procedure.  (The 
case_compare() or case_compare_2dict() functions will probably
prove useful here.)
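
Sketched with plain doubles standing in for the BY and ranked
variables (the real code would compare whole cases, as noted):

#include <stdio.h>

int
main (void)
{
  /* Data sorted first on the BY variable, then on X. */
  double by[] = { 1, 1, 1, 2, 2 };
  double x[]  = { 2, 4, 4, 3, 6 };
  int n = 5, rank = 0;

  for (int i = 0; i < n; i++)
    {
      if (i > 0 && by[i] != by[i - 1])
        rank = 0;               /* BY value changed: restart ranks */
      if (i == 0 || by[i] != by[i - 1] || x[i] != x[i - 1])
        rank++;                 /* new X value within the group */
      printf ("BY=%g X=%g RANK=%d\n", by[i], x[i], rank);
    }
  return 0;
}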

I think that SPLIT FILE variables are probably equivalent to BY
variables, by the way.  We should probably test that.

> My feeling is that rather than monkey with the sort algorithm we'd need
> to define some kind of property on a casereader such that it iterates
> in a fashion that skips over certain cases.

It's not obvious to me why we'd want to skip any cases.
-- 
"It was then I realized how dire my medical situation was.  Here I was,
 a network admin, unable to leave, and here was someone with a broken
 network.  And they didn't ask me to fix it.  They didn't even try to
 casually pry a hint out of me." --Ryan Tucker in the Monastery



