Austin Group Defect Tracker


Viewing Issue Simple Details
ID 0000527
Category [1003.1(2008)/Issue 7] Shell and Utilities
Severity Objection
Type Clarification Requested
Date Submitted 2011-12-17 04:15
Last Update 2019-06-10 08:55
Reporter eblake View Status public  
Assigned To ajosey
Priority normal Resolution Accepted As Marked  
Status Closed  
Name Eric Blake
Organization Red Hat
User Reference ebb.du
Section du
Page Number 2611
Line Number 84170
Interp Status Approved
Final Accepted Text See Note: 0001105
Summary 0000527: du and files found via multiple command line arguments
Description The standard currently states about du:

[line 84170] Files with multiple links shall be counted and written for
only one entry. The directory entry that is selected in the report is
unspecified.

A strict interpretation of the standard applies this restriction to
all ways in which the same file can be found, even if the file is
found twice only by virtue of multiple command line arguments.
However, this strict reading renders existing du implementations
non-compliant, since they reset their hash of duplicate files when
proceeding on to additional arguments; meanwhile, GNU du implemented
this strict reading of the standard, and is in the middle of a debate
about whether this change from traditional behavior is desirable:
http://debbugs.gnu.org/10281

This shell sequence demonstrates traditional behavior (as implemented
on Solaris 10) vs. GNU behavior (the fact that the overall numbers
differ appears to stem from allocation patterns on the file system,
and what appear to be slight algorithmic differences between Solaris
and GNU about whether to favor st_blocks or st_size, and how rounding
affects matters; but that's unrelated to the point at hand about
eliding duplicate entries).

$ mkdir tmp && cd tmp
$ mkdir a b
$ printf %04098d 1 > a/a
$ ln a/a b/b
$ # both implementations detect that a/a and b/b are links, and elide b/b
$ /usr/xpg4/bin/du -ak .
5 ./a/a
6 ./a
1 ./b
9 .
$ ~/gnu/du -ak .
5 ./a/a
7 ./a
2 ./b
10 .
$ # now, compare when a and b are listed as separate arguments
$ /usr/xpg4/bin/du -ak a b
5 a/a
6 a
5 b/b
6 b
$ ~/gnu/du -ak a b
5 a/a
7 a
2 b
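
The difference between the two transcripts can be modeled with a small sketch. This is not either implementation's code: the TREE table, the inode number 101, and the 5K size are hypothetical, directory overhead is ignored (so the totals do not match the numbers above), and a real du would key its table on both device and inode.

```python
# Hypothetical model: each operand maps to the (path, inode, size_in_k)
# entries found beneath it. Directory overhead is deliberately ignored.
TREE = {
    "a": [("a/a", 101, 5)],
    "b": [("b/b", 101, 5)],  # b/b is a hard link to a/a, so same inode
}


def du_model(operands, reset_per_operand):
    """Return (path, size) lines, skipping inodes already in the table."""
    seen = set()
    lines = []
    for operand in operands:
        if reset_per_operand:
            seen = set()  # traditional behavior: forget earlier operands
        total = 0
        for path, inode, size in TREE[operand]:
            if inode in seen:
                continue  # duplicate link: neither counted nor written
            seen.add(inode)
            total += size
            lines.append((path, size))
        lines.append((operand, total))
    return lines


traditional = du_model(["a", "b"], reset_per_operand=True)
gnu_style = du_model(["a", "b"], reset_per_operand=False)
print(traditional)  # b/b is listed again under the second operand
print(gnu_style)    # b/b is elided and b contributes nothing
```

The only difference between the two runs is whether the table of seen inodes survives the move to the next operand.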

Meanwhile, the existing wording can be deemed self-conflicting, in that
both the -a and -s options mention listing all arguments given on the
command line, in contrast to the earlier statement about eliding
duplicates:

[line 84177] Regardless of the presence of the −a option, non-directories
given as file operands shall always be listed.
[line 84185] Instead of the default output, report only the total sum for
each of the specified files.

In all likelihood, the intent of the standard was to codify traditional
behavior where the hash for duplicate files is reset each time du
starts processing the next command line argument, and GNU du was
wrong for trying to take the standard too literally. However, it was
pointed out that the GNU behavior of remembering duplicates across
multiple command line arguments does have a use not possible in the
traditional implementation: if a user has multiple directories, all
of which share some hard links, then only the GNU semantics make it
possible to see how much disk space will be reclaimed by removing
one of them, by invoking 'du -s' on all of them with the directory to
be removed as the last argument. Therefore, I'm presenting two options
for solving the conflict in the standard, although my preference
would be for option 1 (the GNU implementation is willing to change
its behavior to comply with option 1 by adding an extension option
to provide its current behavior of remembering links across
multiple command line arguments, and all other implementations
already comply with option 1).
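
The 'du -s' use case described above can be sketched as follows. The directory names, inode numbers, and sizes are hypothetical, and the model assumes that every link to each file lives under one of the listed directories; it illustrates the idea and is not any implementation's algorithm.

```python
# Hypothetical data: directory -> {inode: size_in_k} for the files in it.
DIRS = {
    "proj1": {1: 40, 2: 10},
    "proj2": {1: 40, 3: 25},
    "old": {1: 40, 3: 25, 4: 8},
}


def du_s(operands, shared_table):
    """Model 'du -s': one total per operand, eliding inodes already seen."""
    seen = set()
    totals = {}
    for operand in operands:
        if not shared_table:
            seen = set()  # traditional behavior: reset for each operand
        total = 0
        for inode, size in DIRS[operand].items():
            if inode not in seen:
                seen.add(inode)
                total += size
        totals[operand] = total
    return totals


shared = du_s(["proj1", "proj2", "old"], shared_table=True)
reset = du_s(["proj1", "proj2", "old"], shared_table=False)
print(shared["old"])  # space used only by "old", i.e. freed by removing it
print(reset["old"])   # full size of "old", including shared links
```

With the shared table, the total for the last operand counts only the files not reachable from the earlier operands, which is the figure the user wants before deleting that directory.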
Desired Action Option 1 - require duplicate checking to reset for each /file/:
Change line 84170 [du DESCRIPTION] from:

Files with multiple links shall be counted and written for only one
entry.

to:

Within the file hierarchy of each individual /file/ argument, files
with multiple links shall be counted and written for only one entry.


Option 2 - leave things unspecified to permit both traditional and
GNU behavior when there are multiple /file/ arguments:
Change line 84170 [du DESCRIPTION] from:

Files with multiple links shall be counted and written for only one
entry.

to:

Files with multiple links shall be counted and written for only one
entry, although when there are multiple /file/ arguments, it is
unspecified whether a file entry encountered via two different
arguments will be counted or skipped during processing of the later
argument.
Tags tc2-2008
Attached Files

- Relationships
related to 0000539 (Applied, ajosey): tighten du regarding files encountered through multiple command line arguments

- Notes
(0001102)
eblake (manager)
2012-01-26 16:10

Paul Eggert sent this comment to the mailing list:

Eric Blake's Option 1 does not appear to be tenable, as du
traditionally preserved hashes of duplicate files across all
of its operands. 7th Edition Unix 'du' did that, and (as
Jilles Tjoelker pointed out) so do at least two current 'du'
implementations, namely, FreeBSD and GNU.

The idea behind Eric's Option 2 is better, but its wording
is unclear partly because of another issue Jilles raised:
whether a file's disk space should be counted multiple times
if the file occurs multiple times and its link count is 1.
For example:

  mkdir d
  cd d
  cp /bin/sh w
  cp w y
  ln y ../y
  ln -s w x
  ln -s y z
  du -aL

This analyzes a directory with two regular files, 'w' and
'y'. GNU and Solaris du count these files once each, with
an accurate sum of non-symlink disk usage under the current
directory. But w's link count is 1 so FreeBSD counts 'w'
twice, thus overcounting disk usage.
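
A minimal sketch of the two strategies, assuming a walk that has already resolved the symbolic links as -L would. The inode numbers and the 100K size are hypothetical (the real size of /bin/sh is not known here), and the track_single_link flag is an illustrative name, not an option of any du.

```python
# Entries met by a hypothetical 'du -aL' walk of directory d:
# (name, inode, link_count, size_in_k). x and z are symlinks that -L
# resolves to w and y respectively.
ENTRIES = [
    ("w", 11, 1, 100),
    ("x", 11, 1, 100),  # symlink to w
    ("y", 12, 2, 100),  # the second hard link is ../y
    ("z", 12, 2, 100),  # symlink to y
]


def total_usage(entries, track_single_link):
    """Sum sizes, remembering an inode only if the policy tracks it."""
    seen = set()
    total = 0
    for _name, inode, nlink, size in entries:
        if track_single_link or nlink > 1:
            if inode in seen:
                continue  # already counted through another name
            seen.add(inode)
        total += size
    return total


track_all = total_usage(ENTRIES, track_single_link=True)
multi_link_only = total_usage(ENTRIES, track_single_link=False)
print(track_all)        # w and y counted once each
print(multi_link_only)  # w counted twice, via w and via x
```

Tracking only files whose link count exceeds 1 misses 'w', because its second occurrence comes from a symbolic link rather than a hard link.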

The current POSIX wording does not say what to do for this
example, but the intent is to avoid overcounting disk usage,
and the GNU and Solaris behavior supports this intent better.
(The 7th Edition Unix behavior agrees with FreeBSD, but this
predates symbolic links so the behavior is now dubious.)

Given all the above, the standard's wording could be
improved in several different ways, all elaborations of
Option 2. Here are two possibilities:

  Option 2A - require that files be hashed among all
  operands, and that disk usage be counted at most once.

    Change line 84170 [du DESCRIPTION] from:

      Files with multiple links shall be counted and written
      for only one entry.

    to:

      A file that occurs multiple times shall be counted and
      written for only one entry, even if the occurrences
      are under different file operands.

  Option 2B - leave unspecified whether files are hashed
  among all operands, and leave unspecified whether disk
  usage is counted multiple times for files whose link
  count does not exceed 1. From the user's point of view,
  this means du's output is a reliable count of disk usage
  only if du is invoked without -L and with -x and with at
  most one operand.

    Change line 84170 [du DESCRIPTION] from:

      Files with multiple links shall be counted and written
      for only one entry.

    to:

      A file that occurs multiple times under one file
      operand and that has a link count greater than 1 shall
      be counted and written for only one entry. It is
      implementation-defined whether a file that has a link
      count no greater than 1 is counted and written just
      once, or is counted and written for each occurrence.
      It is implementation-defined whether a file that
      occurs under one file operand is counted for other
      file operands.

Option 2A is simpler and clearer, but it invalidates many
existing implementations. Option 2B modifies the standard
to describe how existing implementations actually work, but
is more complicated and more of a hassle to use reliably.

Eric raised one other issue: the description of the -a
option implies that "du A B" must always list B. This
implication is incorrect for 7th edition Unix du, GNU du,
and (I expect) FreeBSD du, so it should be fixed as well.
Here's one possible fix, which is independent of the
above-mentioned changes.

  Change line ????? [du OPTIONS] from:

    Regardless of the presence of the -a option,
    non-directories given as file operands shall always
    be listed.

  to:

    The -a option does not affect whether
    non-directories given as file operands are listed.

(Sorry, I don't know the line number here; I don't have a
PDF copy of the current standard and don't know offhand how
to get one.)
(0001103)
eblake (manager)
2012-01-26 16:11

Paul also sent these comments:

> It boils down to a decision of whether we want to standardize a useful
> behavior, and whether that behavior avoids over-counting, but possibly
> invalidating existing implementations (in which case, it is better
> targeted to Issue 8), or whether we give up and declare things
> unspecified when encountering files with link count of 1 through
> multiple locations (in which case we could make the changes in TC2 of
> Issue 7, and still make recommendations on the underlying goal of
> avoiding over-counting).
We can do both, and it makes sense to do both.
That is, we can have Issue 7 TC2 specify Option 2B
with a suggestion to implement Option 2A, and have
Issue 8 require Option 2A.

On 01/13/2012, Geoff Clare wrote:

> One problem with requiring Option 2A is that it requires du to use
> much more memory for hierarchies where there are large numbers
> of files with link count 1. This could be a problem for embedded
> systems in particular.
The extra memory shouldn't be needed in the typical case
where there is at most one file operand and where -L is not used.
In the typical case, du is within its rights to not hash
files whose link count is 1, even if Option 2A is required.
This is because in the typical case du can't encounter the
same file twice if its link count is 1.
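
The reasoning above amounts to a simple rule for when a file needs an entry in the duplicate table. The sketch below is an assumption-laden illustration: must_track and its parameters are hypothetical names, and it considers only the two conditions named in this comment (number of file operands and use of -L).

```python
def must_track(link_count, operand_count, follows_symlinks):
    """Decide whether a file needs an entry in the duplicate table."""
    if link_count > 1:
        return True  # hard links can always be met more than once
    # A single-link file can recur only via another operand or a symlink.
    return operand_count > 1 or follows_symlinks


def table_entries(link_counts, operand_count, follows_symlinks):
    """Count how many files would occupy memory in the table."""
    return sum(
        must_track(n, operand_count, follows_symlinks) for n in link_counts
    )


# Hypothetical hierarchy: 1000 single-link files and 3 multi-link files.
links = [1] * 1000 + [2] * 3
typical = table_entries(links, operand_count=1, follows_symlinks=False)
with_L = table_entries(links, operand_count=1, follows_symlinks=True)
print(typical, with_L)
```

In the typical invocation only the multi-link files consume table memory, so the extra cost of Option 2A would be paid only when several operands or -L are in use.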
(0001104)
nick (manager)
2012-01-26 16:19
edited on: 2012-01-26 16:28

Change line 84170 [du DESCRIPTION] from:

      Files with multiple links shall be counted and written
      for only one entry.

    to:

      A file that occurs multiple times under one file
      operand and that has a link count greater than 1 shall
      be counted and written for only one entry. It is
      implementation-defined whether a file that has a link
      count no greater than 1 is counted and written just
      once, or is counted and written for each occurrence.
      It is implementation-defined whether a file that
      occurs under one file operand is counted for other
      file operands.

In FUTURE DIRECTIONS, change line 84274 from "None" to
     "A future version of this standard may require that
      a file that occurs multiple times shall be counted and
      written for only one entry, even if the occurrences
      are under different file operands."



Change line 84177 [du OPTIONS] from:

    Regardless of the presence of the -a option,
    non-directories given as file operands shall always
    be listed.

  to:

    The -a option does not affect whether
    non-directories given as file operands are listed.

(0001105)
nick (manager)
2012-01-26 16:24
edited on: 2012-01-26 16:25

Interpretation response
------------------------
The standard states that files with multiple links shall be counted and written for only one entry, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
Existing practice is varied in how du counts files.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
See Note: 0001104

(0001291)
ajosey (manager)
2012-06-29 16:16

Interpretation proposed 29 June 2012 for final 45 day review
(0001354)
ajosey (manager)
2012-08-30 09:15

Interpretation approved 30 Aug 2012

- Issue History
Date Modified Username Field Change
2011-12-17 04:15 eblake New Issue
2011-12-17 04:15 eblake Status New => Under Review
2011-12-17 04:15 eblake Assigned To => ajosey
2011-12-17 04:15 eblake Name => Eric Blake
2011-12-17 04:15 eblake Organization => Red Hat
2011-12-17 04:15 eblake User Reference => ebb.du
2011-12-17 04:15 eblake Section => du
2011-12-17 04:15 eblake Page Number => 2611
2011-12-17 04:15 eblake Line Number => 84170
2011-12-17 04:15 eblake Interp Status => ---
2012-01-26 16:10 eblake Note Added: 0001102
2012-01-26 16:11 eblake Note Added: 0001103
2012-01-26 16:19 nick Note Added: 0001104
2012-01-26 16:24 nick Note Added: 0001105
2012-01-26 16:25 nick Note Edited: 0001105
2012-01-26 16:25 nick Note Edited: 0001105
2012-01-26 16:26 nick Note Edited: 0001104
2012-01-26 16:27 nick Interp Status --- => Pending
2012-01-26 16:27 nick Final Accepted Text => See Note: 0001104
2012-01-26 16:27 nick Status Under Review => Interpretation Required
2012-01-26 16:27 nick Resolution Open => Accepted As Marked
2012-01-26 16:27 nick Tag Attached: tc2-2008
2012-01-26 16:28 nick Note Edited: 0001104
2012-01-26 16:29 nick Final Accepted Text See Note: 0001104 => See Note: 0001105
2012-01-26 21:04 eblake Relationship added related to 0000539
2012-06-29 16:16 ajosey Interp Status Pending => Proposed
2012-06-29 16:16 ajosey Note Added: 0001291
2012-08-30 09:15 ajosey Interp Status Proposed => Approved
2012-08-30 09:15 ajosey Note Added: 0001354
2019-06-10 08:55 agadmin Status Interpretation Required => Closed

