Austin Group Defect Tracker

Aardvark Mark III


ID 0000293
Category [1003.1(2008)/Issue 7] System Interfaces
Severity Objection
Type Omission
Date Submitted 2010-07-22 22:05
Last Update 2012-08-08 18:16
Reporter eblake
View Status public
Assigned To ajosey
Priority normal
Resolution Accepted As Marked
Status Interpretation Required
Name Eric Blake
Organization Red Hat
User Reference ebb.readdir
Section readdir
Page Number 1744
Line Number 55677
Interp Status Approved
Final Accepted Text Note: 0000534
Summary 0000293: restricting readdir normalization of filenames
Description The standard is currently silent on whether filenames can be transformed to a different byte sequence when being stored to disk. Thus, it is unclear whether a program can expect to create a file by one name, and then use opendir(".")/readdir(dir) to find that same file by that same name. It is also unclear what happens if two different strings both map to the same canonical name, with ramifications on whether a file can be opened by either name, and whether rename(2) should be able to change the spelling on disk between two alternate strings that map to the same canonical name.

From a practicality standpoint, it is desirable to have an explicit statement in the standard that requires that if a given byte pattern is provided when a file is created (via creat(), open() with O_CREAT, link(), symlink(), mkdir(), ...), then that same byte pattern must be returned when listing the directory entry via readdir(). And since filenames are defined as byte sequences rather than characters, it should not matter whether the current locale recognizes a filename as valid characters. (Note that the implementation may still choose to canonicalize the byte sequence under the hood as it goes to disk, provided that the transformation can be undone in a 1-to-1 manner back to the original spelling when the directory entry is read.) Not only is this the principle of least surprise ("why does readdir() not list the file I just created?"), it also matches the traditional behavior of older Unix.
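
The round-trip expectation described above can be checked with a short sketch. It is illustrative only: the helper name `name_round_trips` is hypothetical, and the code assumes Python 3 on a POSIX-like system where a temporary directory can be created. Bytes paths are used throughout so that no locale decoding is involved.

```python
import os
import tempfile


def name_round_trips(name: bytes) -> bool:
    """Create a file called `name` in a fresh temporary directory and
    report whether a directory listing returns the identical bytes.

    Hypothetical helper for illustration; not part of any standard API.
    """
    with tempfile.TemporaryDirectory() as tmp:
        directory = os.fsencode(tmp)
        path = os.path.join(directory, name)
        # O_CREAT | O_EXCL: fail rather than reuse an existing entry.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
        # Given a bytes argument, os.listdir() returns raw bytes names
        # and omits dot and dot-dot.
        return name in os.listdir(directory)
```

On a file system that transforms names, a call such as `name_round_trips(b"\xc3\x81")` may return False, or the creation itself may fail with an OSError; the result depends on the file system backing the temporary directory.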

However, there are several existing file systems (although typically non-compliant for reasons other than the topic of this aardvark) that perform filename transformations, such as case-insensitive filesystems (where creat("a",mode) forms the file "A" on disk; for example, FAT without long-file-name support), case-preserving case-insensitive filesystems (where open("a") and open("A") refer to the same file, but the disk remembers the case that the file was created with; for example, NTFS under typical Windows use), and filesystems that enforce that all names be UTF-8 characters of a particular Unicode normalization (where the two-byte name in creat("\xc3\x81",mode) forms the three-byte name "A\xcc\x81" on disk, these being two different Unicode normalizations of Á; for example, HFS+).
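
The two byte sequences in the Unicode normalization example can be reproduced with Python's standard `unicodedata` module. This is only a sketch of the encoding arithmetic; the variable names are chosen for illustration, and nothing here touches a real file system.

```python
import unicodedata

# U+00C1 LATIN CAPITAL LETTER A WITH ACUTE as a single code point (NFC).
precomposed = "\u00c1"
# NFD splits it into "A" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", precomposed)

nfc_bytes = precomposed.encode("utf-8")   # two bytes
nfd_bytes = decomposed.encode("utf-8")    # three bytes

# The names differ as byte sequences even though they render identically.
names_differ_as_bytes = nfc_bytes != nfd_bytes
```

A file system that stores only the decomposed form would hand back `nfd_bytes` from readdir() even when the file was created with `nfc_bytes`.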

The standard is already clear that case-insensitivity is non-compliant (see XRAT A.4.6, line 115564). Meanwhile, by defining the existence of a portable filename set, and acknowledging that there are non-compliant filesystems, the standard implies that it could be possible for a compliant implementation to reject a string as invalid for the particular filesystem if that potential filename does not meet certain criteria (for example, it is conceivable to want a filesystem that rejects filenames with bytes outside the portable filename set, and this should still be compliant). However, there is no standardized error for informing an application that a particular byte sequence is not acceptable as a filename for a given filesystem. This proposal reuses EINVAL as a shall-fail error for this situation, although it may be worth reusing ENXIO instead, inventing a new errno value that is more specific to this particular case, or relaxing this to a may-fail error.
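
As a rough illustration of the kind of check such a restrictive filesystem might apply, the sketch below tests whether a name uses only bytes from the portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen-minus). The function name is hypothetical, and the check deliberately ignores length limits and any other portability rules.

```python
import string

# Byte values of the portable filename character set.
PORTABLE_BYTES = frozenset(
    (string.ascii_letters + string.digits + "._-").encode("ascii")
)


def uses_only_portable_characters(name: bytes) -> bool:
    """Return True if `name` is non-empty and every byte is in the
    portable filename character set.

    Hypothetical helper; it checks character membership only.
    """
    # Iterating over a bytes object yields integers, matching the set.
    return len(name) > 0 and all(byte in PORTABLE_BYTES for byte in name)
```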

The standard already mandates one case of filename canonicalization - it requires that mkdir("a/") can succeed, and that both open("a",O_RDONLY) and open("a/",O_RDONLY) will open the same sub-directory via the same directory entry. Existing implementations of readdir(), even for non-compliant filesystems, already return the normalized name of just "a" for the name of such a directory entry even if it was created via mkdir("a/"). Therefore, it makes sense that the normalization of removing the trailing slash in order for readdir() to return just a filename (which, by definition, cannot contain slash) should be explicitly standardized.
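
The trailing-slash behavior described above can be observed with a small sketch. It assumes Python 3 on a POSIX-like system; the function name is hypothetical, and the result reflects whatever the underlying file system does.

```python
import os
import tempfile


def mkdir_with_trailing_slash():
    """Create a directory via the pathname "a/" and return a pair:
    the names listed in the parent directory, and whether "a" and "a/"
    resolve to the same file.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "a")
        with_slash = plain + "/"
        os.mkdir(with_slash)
        entries = os.listdir(tmp)
        # samefile() compares device and inode numbers of both paths.
        same = os.path.samefile(plain, with_slash)
        return entries, same
```

On the implementations the description refers to, the listing is expected to contain just "a", with no trailing slash.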

This proposal is couched in terms of normative requirements that apply to all compliant filesystems, but is forced to use only non-normative text when discussing non-compliant filesystems (we can't require behavior on something that doesn't purport to obey our standard in the first place). There is a slight chance that the proposed wording will render a previously-compliant filesystem non-compliant, although it should be relatively easy to consider this as a bug in such a filesystem that is worth fixing. Meanwhile, it is a service to readers to alert them to some of the pitfalls of non-compliant filesystems, to aid in writing programs that can be portable outside a strictly compliant environment.

Not done in this aardvark, but a potentially useful addition, would be a change to <unistd.h> to add a new _PC_FILENAME* constant (or several, depending on whether we want to provide more granularity into what types of filenames are not considered canonical for a given filesystem), and updating fpathconf() to query on a directory whether filenames created in that directory are subject to particular naming convention restrictions.

Additionally, this aardvark strives to remain independent of the issue of restricting <newline> in filenames, although if both issues are resolved at once, they will probably have overlapping edits to consider.

[This aardvark was created based on an action item in the 1 July 2010 meeting]
Desired Action At line 55679 (readdir description), add a sentence:

The d_name member shall be a filename, and (if not dot or dot-dot) shall contain the same byte sequence as the last pathname component of the string used to create the directory entry. Directory entries naming a file of type directory shall not include any trailing slashes.

At line 55799 (readdir rationale), add two new paragraphs:

Compliant filesystems are required to store filenames unaltered from how they were created (via open(), link(), mkdir(), mkfifo(), etc.). The one exception to this is that it is possible to create a directory entry for a sub-directory with a string that includes a trailing slash; but by definition, a filename does not include slash, so the directory entry must not include that slash.

However, there are non-compliant filesystems where filenames are converted to a canonical representation before a directory entry is created, such that it is possible to create a file using one string, then perform opendir() and a readdir() loop and not encounter the same string, because readdir() returns the canonical form of the string instead. Such non-compliant filesystems also have the issue that more than one filename resolves to the same directory entry, with potentially confusing results. This standard cannot mandate the behavior of non-compliant filesystems, and strictly compliant applications need not worry about dealing with such filesystems, but it is a concern for developers of portable applications. Therefore, this standard recommends that filesystem implementers that want to enforce canonical filenames should consider rejecting attempts to create a directory entry with a non-canonical filename, rather than allowing the creation but recording a different name in the directory entry. However, following the example that the pathname "a/" is required to access an existing directory entry for a directory with a filename "a", it is reasonable for a filesystem to permit accessing files via a non-canonical filename, provided that the directory entry already exists.



At line 42320 (mkdir description), add a new sentence:

The directory entry in the directory that contains the new entry shall not include any trailing slashes that were present in the path argument.

At line 40114 (link description), add a new sentence:

If the implementation permits using link() on directories, any trailing slashes that are present in the path2 argument shall be omitted from the new directory entry.

At line 56947 (rename description), add a new sentence:

The directory entry in the parent directory containing new shall not include any trailing slashes that were present in the new argument.



After line 21080 (bind errors), add a new line:

[EINVAL] The last pathname component contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

After line 26547 (fattach errors), add a new line:

[EINVAL] The last pathname component of path contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

After line 29169 (fopen errors), add a new line with CX shading:

[EINVAL] The mode begins with 'w' or 'a', the file did not previously exist, and the last pathname component contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

After line 30915 (freopen errors), add a new line with CX shading:

[EINVAL] The mode begins with 'w' or 'a', the file did not previously exist, and the last pathname component contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

After line 40149 (link errors), add a new line:

[EINVAL] The last pathname component of path2 contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

After line 42336 (mkdir errors), add a new line:

[EINVAL] The last pathname component of path contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

At line 42444 (mkdtemp errors), change:

[EINVAL] The string pointed to by template does not end in "XXXXXX".

to:

[EINVAL] The string pointed to by template does not end in "XXXXXX"; or the last pathname component of template contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

At line 42532 (mkfifo errors), add a new line:

[EINVAL] The last pathname component of path contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

At line 42683 (mknod errors), change:

[EINVAL] An invalid argument exists.

to:

[EINVAL] An invalid argument exists, or the last pathname component of path contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

At line 45311 (open errors), change:

<SIO>[EINVAL] The implementation does not support synchronized I/O for this file.</SIO>

to:

[EINVAL] <SIO>The implementation does not support synchronized I/O for this file</SIO>, or O_CREAT was specified, the file did not exist, and the last pathname component of path contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.

At line 56991 (rename errors), change this line:

<CX>[EINVAL] The old pathname names an ancestor directory of the new pathname, or either pathname argument contains a final component that is dot or dot-dot.</CX>

to:

<CX>[EINVAL] The old pathname names an ancestor directory of the new pathname, or either pathname argument contains a final component that is dot or dot-dot, or the last pathname component of the new pathname contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.</CX>

At line 65088 (symlink errors), add a new line:

[EINVAL] The last pathname component of path2 contains a byte sequence outside of the portable filename set that is not acceptable to the filesystem that would contain the file.
Tags issue8
- Relationships
related to 0000291 Closed ajosey filenames need not contain characters
related to 0000251 Under Review ajosey Forbid newline, or even bytes 1 through 31 (inclusive), in filenames

-  Notes
(0000531)
eblake (manager)
2010-08-19 16:16
edited on: 2010-10-15 09:04

In light of the resolution of 0000291, the term 'filename' is not
NUL-terminated, so a new term is needed when referring to a string
that is a filename followed by a null character. An updated Desired
Action follows. This action focuses on XBD and XSH; XCU needs further
preliminary work to document how delimited words on the shell command
line correspond to null-terminated strings at the XSH library level.

[2010-Oct-15: changes related to the term 'filename' have been moved
into 0000291 so that they can be included in TC1]

Discussion on 26 August favored EILSEQ (not EINVAL) for rejecting
filenames not appropriate for a file system.


At line 55679 (XSH readdir DESCRIPTION), add a sentence:

The d_name member shall be a filename string, and (if not dot or
dot-dot) shall contain the same byte sequence as the last pathname
component of the string used to create the directory entry, plus
the terminating <NUL> byte.

At line 55799 (readdir rationale), add two new paragraphs:

Compliant file systems are required to store filenames unaltered from
how they were created (via open(), link(), mkdir(), mkfifo(),
rename(), etc.). By definition, a filename string does not include a
<slash>, even if a trailing <slash> was present in the pathname
presented to mkdir() when creating a sub-directory.

However, there are non-compliant file systems where filenames are
converted to a canonical representation before a directory entry is
created, such that it is possible to create a file using one string,
then perform opendir() and a readdir() loop and not encounter the same
string, because readdir() returns the canonical form of the string
instead. Such non-compliant file systems also have the issue that
multiple filenames can resolve to the same directory entry, with
potentially confusing results. This standard cannot mandate the
behavior of non-compliant file systems, and strictly compliant
applications need not worry about dealing with such file systems, but
it is a concern for developers of portable applications. Therefore,
this standard recommends that file system implementations that
perform canonicalization of filenames should reject attempts to create
a directory entry with a non-canonical filename using the [EILSEQ]
error. However, if a directory entry already exists, it is reasonable
for a file system to permit accessing that file via a non-canonical
filename.


After line 21066 (XSH bind errors), add a new line:

[EILSEQ] The last pathname component is not a portable filename, and
cannot be created in the target directory.

After line 26547 (XSH fattach errors), add a new line:

[EILSEQ] The last pathname component of path is not a portable
filename, and cannot be created in the target directory.

After line 29168 (XSH fopen errors), add a new line with CX shading:

[EILSEQ] The mode begins with 'w' or 'a', the file did not previously
exist, and the last pathname component is not a portable filename, and
cannot be created in the target directory.

After line 30914 (XSH freopen errors), add a new line with CX shading:

[EILSEQ] The mode begins with 'w' or 'a', the file did not previously
exist, and the last pathname component is not a portable filename, and
cannot be created in the target directory.

After line 40149 (XSH link errors), add a new line:

[EILSEQ] The last pathname component of path2 is not a portable
filename, and cannot be created in the target directory.

After line 42336 (XSH mkdir errors), add a new line:

[EILSEQ] The last pathname component of path is not a portable
filename, and cannot be created in the target directory.

After line 42443 (XSH mkdtemp errors), add a new line:

[EILSEQ] The string pointed to by template is not a portable filename,
and cannot be created in the target directory.

After line 42532 (XSH mkfifo errors), add a new line:

[EILSEQ] The last pathname component of path is not a portable
filename, and cannot be created in the target directory.

After line 42682 (XSH mknod errors), add a new line:

[EILSEQ] The last pathname component of path is not a portable
filename, and cannot be created in the target directory.

After line 45308 (XSH open errors), add a new line:

[EILSEQ] O_CREAT was specified, the file did not exist, and the last
pathname component of path is not a portable filename, and cannot be
created in the target directory.

After line 56990 (XSH rename errors), add a new line with CX shading:

[EILSEQ] The last pathname component of the new pathname is not a
portable filename, and cannot be created in the target directory.

After line 65087 (XSH symlink errors), add a new line:

[EILSEQ] The last pathname component of path2 is not a portable
filename, and cannot be created in the target directory.
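
From an application's point of view, the proposed error could be handled roughly as in the sketch below. It assumes Python 3 on a POSIX-like system; `try_create` and its return strings are hypothetical, and whether a given file system actually reports EILSEQ depends on the implementation.

```python
import errno
import os


def try_create(path: bytes) -> str:
    """Attempt to create `path` exclusively and classify the outcome.

    Returns "created", "exists", or "rejected" (the file system refused
    the name with EILSEQ). Any other error is re-raised unchanged.
    Hypothetical helper for illustration only.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return "exists"
    except OSError as exc:
        if exc.errno == errno.EILSEQ:
            # The name is not acceptable to this file system.
            return "rejected"
        raise
    os.close(fd)
    return "created"
```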

(0000534)
msbrown (manager)
2010-08-26 16:30

Interpretation response
------------------------

The standard does not speak to this issue, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
We believe that the original authors desired that the suggested response be a requirement, but there is no explicit text to that effect.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Please use the text from Note: 0000531 as suggested response.
(0001321)
pkar (reporter)
2012-08-08 15:42

I think the problem is broader.

My thoughts about this problem:

 There is a well-known difference between Unix (as an example of POSIX) and Windows: Unix differentiates between "a" and "A", while Windows file functions (like open(filename)) treat "a" and "A" as the same character.
 So, maybe POSIX should treat names with different Unicode representations as different names.
 
 On the other hand, anyone can differentiate between "a" and "A" in a directory listing, but no one can differentiate between "ą" stored as U+0105 ("latin small letter a with ogonek") and stored as "a" (U+0061, "latin small letter a") combined with U+0328 ("combining ogonek").
 You can list the files with ls -l, but you would have problems with e.g. cat filename: how do you type the filename just seen in the listing? :) (And that assumes your keyboard layout allows you to type both forms.)
 So, maybe POSIX should treat all such representations as one.
 
 But what about efficiency? Would a single open(filename) system call require the file system to try all possible combinations? And which file should be opened when the directory holds two files with different spellings? :)
 
 Next, such filenames could be used to hide messages (steganography). The bit rate is small, but still: U+0105 is "zero", and U+0061 combined with U+0328 is "one".
 So, maybe POSIX should 'canonize' filenames when a file is created, to break this method of hidden messaging?
 
 And, of course, in that case we have another question: how should it be canonized? As U+0105, or as U+0061 combined with U+0328? That's a question :)
 Maybe as U+0061 with U+0328, so that open/search could simply discard all 'combining' characters while comparing filenames?
 But in that case filenames would require more storage: e.g. "greek capital letter alpha with dasia and varia and prosgegrammeni", U+1F8B, becomes four symbols. Or "hebrew letter shin with dagesh and sin dot".
 
 The same problem applies to digraphs: should "ae" be different from, or the same as, U+00E6?
 And with ligatures: should U+FB00 ("latin small ligature ff") be treated as "ff" or as different?
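
The character examples in this note can be checked against the Unicode data shipped with Python's `unicodedata` module. The sketch below is illustrative only; the helper and variable names are hypothetical, and it says nothing about what any particular file system does.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# U+0105 decomposes canonically to "a" followed by U+0328.
ogonek_nfd = unicodedata.normalize("NFD", "\u0105")

# U+1F8B decomposes canonically to a base letter plus three combining marks.
alpha = "\u1f8b"
alpha_nfd = unicodedata.normalize("NFD", alpha)
storage_cost = (utf8_len(alpha), utf8_len(alpha_nfd))

# The ligature U+FB00 is unchanged by canonical normalization (NFC/NFD)
# and becomes "ff" only under the compatibility forms (NFKC/NFKD).
ligature_nfc = unicodedata.normalize("NFC", "\ufb00")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb00")

# U+00E6 has no decomposition, so no normalization form turns it into "ae".
ae_nfkd = unicodedata.normalize("NFKD", "\u00e6")
```

So the digraph and ligature questions fall outside canonical normalization altogether: a file system that enforces NFC or NFD would still treat U+FB00 and "ff" as different names.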
(0001322)
eblake (manager)
2012-08-08 18:16

As was discussed at the time this bug was first opened, POSIX cannot afford to mandate Unicode naming on low-level functions like open() on all implementations, since there is no existing functionality for hooking up current locale into kernel functions which operate on bytes, and since not all users operate in a UTF-8 locale. That is, before you can even correlate Unicode characters to bytes, you have to choose an encoding, but there are no hooks to pass an encoding down to the kernel (and no desire to burden kernels with iconv-like functionality). Therefore, at the kernel level, filenames will always be treated as if in a single-byte encoding.

Meanwhile, while working on this bug, it was agreed that POSIX can at least make recommendations on how a system that wants to enforce Unicode names should behave: by rejecting the creation of filenames in single-byte encoding that do not map back to a format deemed acceptable by the system enforcing such a naming scheme in whatever other encoding. The recommendation of this bug is that an implementation that wants to enforce that file names are stored in Unicode should reject attempts to create names by anything other than a canonical form (and should clearly document what that canonical form is), but may allow open (but not creation) of a non-canonical form to succeed. Since Unicode names fall outside the bounds of the portable filename character set, about the best you can get in the actual POSIX standard is a statement that the filenames you are worried about are already non-portable.

Regarding your specific examples, such an implementation would not allow the simultaneous creation of both U+0105 and U+0061 U+0328 (or rather, files with names containing UTF-8 bytes that match those Unicode sequences), since one of those two forms is not canonical (which form is non-canonical depends on whether you prefer NFD, NFC, or some other Unicode normalization; http://unicode.org/reports/tr15/ gives more advice on choosing a Unicode-mandated normalization form). Both forms could be used to open an existing file, for ease of use.
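
The recommended behavior can be modeled with a toy in-memory directory. This is a sketch under stated assumptions, not a description of any real implementation: the class `CanonicalDirectory` is hypothetical, names are assumed to be UTF-8, and the canonical form defaults to NFC purely for the sake of the example.

```python
import errno
import os
import unicodedata


class CanonicalDirectory:
    """Toy directory that stores only canonical (normalized) UTF-8 names.

    Hypothetical model: creation of a non-canonical name is rejected
    with EILSEQ, while lookup accepts any spelling of an existing entry.
    """

    def __init__(self, form: str = "NFC") -> None:
        self.form = form
        self.entries = set()  # canonical names, as bytes

    def _canonical(self, name: bytes) -> bytes:
        try:
            text = name.decode("utf-8")
        except UnicodeDecodeError:
            # Bytes that are not UTF-8 can never be canonical here.
            raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ)) from None
        return unicodedata.normalize(self.form, text).encode("utf-8")

    def create(self, name: bytes) -> None:
        if self._canonical(name) != name:
            raise OSError(errno.EILSEQ, os.strerror(errno.EILSEQ))
        if name in self.entries:
            raise FileExistsError(errno.EEXIST, os.strerror(errno.EEXIST))
        self.entries.add(name)

    def lookup(self, name: bytes) -> bytes:
        """Return the stored name that `name` resolves to."""
        key = self._canonical(name)
        if key not in self.entries:
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT))
        return key

    def readdir(self) -> list:
        return sorted(self.entries)
```

With NFC as the canonical form, creating the UTF-8 encoding of U+0105 succeeds, creating the encoding of U+0061 U+0328 fails with EILSEQ, and a lookup by either spelling resolves to the one stored entry.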

- Issue History
Date Modified Username Field Change
2010-07-22 22:05 eblake New Issue
2010-07-22 22:05 eblake Status New => Under Review
2010-07-22 22:05 eblake Assigned To => ajosey
2010-07-22 22:05 eblake Name => Eric Blake
2010-07-22 22:05 eblake Organization => Red Hat
2010-07-22 22:05 eblake User Reference => ebb.readdir
2010-07-22 22:05 eblake Section => readdir
2010-07-22 22:05 eblake Page Number => 1744
2010-07-22 22:05 eblake Line Number => 55677
2010-08-19 16:16 eblake Note Added: 0000531
2010-08-19 17:54 eblake Note Edited: 0000531
2010-08-26 15:49 eblake Note Edited: 0000531
2010-08-26 15:59 eblake Note Edited: 0000531
2010-08-26 16:14 eblake Note Edited: 0000531
2010-08-26 16:19 eblake Note Edited: 0000531
2010-08-26 16:22 eblake Note Edited: 0000531
2010-08-26 16:30 msbrown Interp Status => ---
2010-08-26 16:30 msbrown Note Added: 0000534
2010-08-26 16:30 msbrown Status Under Review => Interpretation Required
2010-08-26 16:30 msbrown Resolution Open => Accepted As Marked
2010-08-26 16:31 msbrown Interp Status --- => Pending
2010-08-26 16:31 msbrown Final Accepted Text => Note: 0000534
2010-08-27 16:09 ajosey Note Edited: 0000531
2010-09-02 15:48 eblake Note Edited: 0000531
2010-09-13 05:49 ajosey Interp Status Pending => Proposed
2010-09-24 16:22 geoffclare Tag Attached: issue8
2010-10-15 09:04 geoffclare Note Edited: 0000531
2010-11-05 15:29 ajosey Interp Status Proposed => Approved
2011-04-11 21:58 eblake Relationship added related to 0000291
2012-08-03 19:36 eblake Relationship added related to 0000251
2012-08-08 15:42 pkar Note Added: 0001321
2012-08-08 18:16 eblake Note Added: 0001322

