bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: rm -rf: avoid ext3/4 O(N^2) performance hit, further librarify remov


From: Eric Blake
Subject: Re: rm -rf: avoid ext3/4 O(N^2) performance hit, further librarify remove.c
Date: Thu, 25 Sep 2008 23:25:46 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)

Jim Meyering <jim <at> meyering.net> writes:

> i.e., this change makes rm process dir entry names in sorted
> inode order, when that's sensible.

What about 'ls' - should that also be changed to stat in inode order when -f is 
not enabled?  (An interesting thought, since it means ls would then be sorting 
twice per directory in order to run faster than with a single sort!)

>  ** New features
> 
> +  chgrp, chmod, chown, chcon, du, rm: now all display linear performance,

Is this really true, given that sorting is at best log-linear (and with qsort, 
possibly quadratic)?  True, log-linear is noticeably better than quadratic, but 
have you proven that the linear time spent stat'ing the sorted list outweighs 
the log-linear cost of sorting enough to discard the log-linear term?  I guess 
that explains the FIXME comment about detecting tmpfs to avoid the sort.

> +  INODE_SORT_DIR_ENTRIES_THRESHOLD = 10000

And you might want to play with lowering this threshold once you convert to a 
different sort algorithm, in case the time of qsort was impacting the timing 
impacts you used when selecting this threshold.

> +
> +/* Return an approximation of the maximum number of dirent entries
> +   in a directory with stat data *ST.  */
> +static size_t
> +dirent_count (struct stat const *st)
> +{
> +  return st->st_size / 16;

Hmm. On cygwin, directories show st_size of 0 (it's too time-consuming to 
compute a real value).  Is this approximation going to be valid on every file 
system where sorting benefits?  Or should we add a special case for a st_size 
of zero to request sorting, on the worst-case assumption that the directory is 
big even though the OS won't tell us how big?

> +static void
> +preprocess_dir (DIR **dirp, struct rm_options const *x)
> +{
> +#if HAVE_STRUCT_DIRENT_D_TYPE

Phooey - cygwin currently lacks d_type.  So I guess the bulk of this patch 
won't benefit cygwin anyway, whether or not NTFS is also a file system where 
sorting would help (although I don't know if cygwin's emulation of inodes atop 
NTFS will still result in something where sorted order will help or hinder disk 
seek times).

And shouldn't this condition also check for the presence d_ino, since otherwise 
the D_INO below makes every inode 0 and the qsort unstable?

> +      /* - the directory is smaller than some threshold.
> +      Estimate the number of inodes with a heuristic.
> +         There's no significant benefit to sorting if there are
> +      too few inodes.  This  */

Dangling comment.

> +  /* Read all entries, storing <d_ino, d_name> for each non-dir one.
> +     Maintain a parallel list of pointers into the primary buffer.  */
> +  while (1)
> +    {
> +      struct dirent const *dp;
> +      dp = readdir_ignoring_dot_and_dotdot (*dirp);
> +      /* no need to distinguish EOF from failure */
> +      if (dp == NULL)
> +     break;

Would it be worth using scandir to do the argument collection and sort, rather 
than writing your own loop?  There's already the proposal of porting scandir to 
gnulib.

> +  /* Iterate through those pointers, unlinking each name.  */
> +  size_t i;
> +  for (i = 0; i < n; i++)

Since you've already gone for C99, should you compress these two lines?

for (size_t i = 0;...

> +  /* Ensure each of these is NULL, in case init/allocation
> +     fails and we end up calling ds_free on all three while only
> +     one or two has been initialized.  */
> +  for (i = 0; i < 3; i++)
> +    o[i]->chunk = NULL;

Does the obstack API allow you to safely reference this field without going 
through an accessor macro/function?

-- 
Eric Blake






reply via email to

[Prev in Thread] Current Thread [Next in Thread]