Re: Du feature request - group reporting


From: Assaf Gordon
Subject: Re: Du feature request - group reporting
Date: Sat, 3 Mar 2018 11:47:42 -0700
User-agent: Mutt/1.5.24 (2015-08-30)

Hello Daniel,

On Fri, Mar 02, 2018 at 10:08:57PM -0500, Daniel Gall wrote:
> POSIX requires that
> applications that don't handle UID/GIDs greater than the originally
> specified 64k should aggregate high UID/GIDs to 65534.  I didn't think
> we wanted to allocate arrays the size of the expanded UID/GID range.

If we continue with this feature, I think IDs above 65535 must be supported -
they are supported on many common OSes. Some GNU systems already have
UIDs > 145000 and GIDs > 78000.
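
For instance, on a glibc-based system (where NSS enumeration works) you
can look for such IDs with something like:

   getent passwd | awk -F: '$3 > 65535 { print $1, $3 }'
   getent group  | awk -F: '$3 > 65535 { print $1, $3 }'

(field 3 is the numeric UID in passwd and the numeric GID in group).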

The implementation should likely not be a pre-allocated array, but rather
a hash table or a tree (both exist as gnulib modules).

> Returning to the logic behind my feature request(s) I work with
> tolerably large file systems (5-30PiB) and it is untenable to use the
> normal Unix approach of piping commands together if only due to time.

I don't think "piping" is the bottleneck (certainly not the tiny awk script).
The issue is file system access, by "du" (with your patch) or "find" (with my
example).
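
For reference, a sketch of the kind of pipeline I mean (not necessarily
identical to my earlier example):

   find . -type f -printf '%u %s\n' \
     | awk '{ sum[$1] += $2 }
            END { for (u in sum) printf "%s\t%s\n", u, sum[u] }'

(use '%b' instead of '%s' to count 512-byte blocks, which is closer to
what du reports). The awk associative array here does the same hash-style
aggregation suggested above, in a single pass over the metadata.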

Based on cursory observation, the command
   find -printf '%u %g %s %D %i\n'
performs a single stat syscall per file - that's as efficient as it gets.

You can try it on your system with:
   strace -e trace=file find -type f -printf "%u %g %s %D %i\n"
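
To count the calls instead of eyeballing the trace, strace's summary mode
helps (note the exact syscall name varies by platform - on x86-64, find
typically issues newfstatat):

   strace -c -e trace=newfstatat,lstat,stat \
       find . -type f -printf '%u %g %s %D %i\n' > /dev/null

and compare the count in the summary table with the number of files visited.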

> [...] I understand that my use case
> is not the concern of the majority of users, but still I think storage
> density is growing faster than I/O throughput / latency even in
> consumer hardware and that these features could save administrators
> and users nontrivial amounts of time for relatively little complexity
> cost in the du source.

I agree with the density-vs-latency point, but when dealing with
large-scale filesystems (many PiB, in your case) there are additional
optimizations that should be performed at the filesystem level - such as
dedicated metadata servers, large metadata caches, etc. Such optimizations
will typically be far more effective than anything a user-level program
improvement can achieve.

As for "nontrivial amounts of time" - can you provide some measurements
of 'du' with your patch vs. 'find' to demonstrate that the saving is indeed
nontrivial? That would give more weight towards accepting the patch.
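
Something along these lines would do (a sketch - GROUP_OPT below is a
placeholder for whatever option your patch adds, since I don't know its
spelling; run each command several times so caching affects both equally,
and note that in bash the 'time' keyword covers the whole pipeline):

   # GROUP_OPT is hypothetical - substitute your patch's new du option
   time du GROUP_OPT /some/large/tree > /dev/null
   time find /some/large/tree -printf '%u %s\n' \
     | awk '{ sum[$1] += $2 } END { for (u in sum) print u, sum[u] }' > /dev/null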

regards,
 - assaf


