Austin Group Defect Tracker

Aardvark Mark III

ID 0000251
Category [1003.1(2008)/Issue 7] Base Definitions and Headers
Severity Objection
Type Enhancement Request
Date Submitted 2010-05-03 18:49
Last Update 2015-06-22 16:48
Reporter dwheeler
View Status public
Assigned To ajosey
Priority normal
Resolution Open
Status Under Review
Name David A. Wheeler
User Reference
Section XBD 3.170 Filename
Page Number 60
Line Number 1781
Interp Status ---
Final Accepted Text
Summary 0000251: Forbid newline, or even bytes 1 through 31 (inclusive), in filenames
Description Forbid bytes 1 through 31 (inclusive) in filenames.

POSIX.1-2008 (page 60, lines 1781-1786) states that a filename (aka "pathname component") may contain all characters except <slash> and the null byte, and this has historically been true. However, this excessive permissiveness has resulted in numerous security vulnerabilities and erroneous programs. It also increases the effort to write correct programs, because correctly processing filenames that include characters like newline is very difficult (even the expert POSIX developers have trouble; see 0000248). The "Unix-Haters Handbook" specifically notes the problems caused by control characters (such as newlines) in filenames (see pages 156-157), so this is not a new problem!

A key offender, of course, is the <newline> character. This is widely used as a filename separator, even though it is strictly speaking not a valid filename separator (since a filename may include it). But other control characters also cause problems. Another common problematic character is the <tab> character; files with records terminating in newline, and fields separated by tab, are extremely common, and are encouraged by some tools (e.g., by the default delimiter of "cut" and "paste"). Some terminals and terminal emulators accept control characters (range 1-31), e.g., via the escape character. Simply *displaying* filenames with bytecodes in this range can cause problems on such systems. (Granted, there is no requirement that terminals must only accept control characters in the range 1-31, but if 1-31 could not be in filenames, that would be a reasonable configuration to move to.)
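
A minimal Python sketch of the newline ambiguity, assuming a tool that stores filenames one per line. The names write_listing and read_listing are made up for this illustration and do not correspond to any standard utility.

```python
def write_listing(names):
    """Serialize filenames the 'obvious' way: one name per line."""
    return "".join(name + "\n" for name in names)


def read_listing(text):
    """Parse a one-name-per-line listing back into a list of names."""
    lines = text.split("\n")
    # Drop the empty string that follows the final line terminator.
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return lines


# Two files go in, but the second name contains a newline ...
original = ["report.txt", "evil\n/etc/passwd"]
# ... so the reader sees three names, one of which was never a file here.
recovered = read_listing(write_listing(original))
print(len(original), len(recovered))
```

The reader has no way to tell the embedded newline from a record terminator; the same holds for tab when it is used as a field separator.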

In practice, the primary use of filenames with control characters appears to be to enable security vulnerabilities and to cause errors. Other than "we've always done it that way", there seems to be little justification for them. By forbidding control characters, many programs that are currently erroneous (e.g., because they process filenames one line at a time) become correct. The number of applications this change would *fix* is far larger than the vanishingly small number of programs that non-portably *depend* on control characters being permitted in filenames.

Any program that depends on control characters in filenames is *already* not portable,
since control characters are not in the "portable character set" of filenames
(Page 77, line 2194, XBD 3.276 Portable Filename Character Set).
What's more, NTFS (and probably other filesystems) already forbid bytes 1-31 in filenames,
so this is already an extant limitation. But merely making it "not portable" is not enough;
it needs to be *forbidden* before correct programs can count on it.
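
For illustration, a small Python check against the Portable Filename Character Set (A-Z, a-z, 0-9, period, underscore, hyphen-minus). The function name is an assumption made for this sketch, and it checks the character set only.

```python
import string

# The Portable Filename Character Set: letters, digits, '.', '_' and '-'.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def uses_portable_charset(name: str) -> bool:
    """Return True if every character of a non-empty name is portable."""
    return bool(name) and all(ch in PORTABLE_CHARS for ch in name)


print(uses_portable_charset("notes-2010.txt"))
print(uses_portable_charset("tab\tseparated"))
```

No control character passes this check, which is the sense in which programs depending on them are already non-portable.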

This proposal suggests a new error, [ENAME], though other options are possible.

I'm well aware that this may be a controversial proposal. But I think most readers will understand *why* I am proposing this, and that while this is a dramatic approach, we are now in an era where POSIX systems are routinely used to manage important information worldwide. The ability to include control characters in filenames has rarely been a help, and instead has been a hindrance to the use of these systems. We've had ample time to see that this is a problem. It's time to jettison this misfeature.

Desired Action Vol. 1, Page 60, line 1782-1784, change:
 "The characters composing the name may be selected
  from the set of all character values excluding the <slash> character
  and the null byte."
to:
 "The characters composing the name may be selected
  from the set of all character values excluding the <slash> character,
  the null byte, and character values 1 through 31 inclusive.
  Attempts to create filenames containing bytes 1 through 31 must be rejected,
  and conforming implementations
  must not return such filenames to applications or users."
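
The proposed rule can be sketched as a byte-level predicate. This is only an illustration of the wording above; is_permitted_component is a hypothetical name, and treating the empty name as invalid is an assumption of the sketch.

```python
SLASH = 0x2F


def is_permitted_component(name: bytes) -> bool:
    """Check one filename (pathname component) against the proposed rule.

    Rejected: the null byte, <slash>, and byte values 1 through 31.
    Everything else, including bytes 127 and above, stays allowed.
    """
    if not name:
        return False  # assumption: an empty component is not a filename
    return all(b > 31 and b != SLASH for b in name)


print(is_permitted_component(b"caf\xc3\xa9 menu.txt"))  # UTF-8 and space
print(is_permitted_component(b"two\nlines"))
```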

Vol. 1, Page 77, append after 2199:
The set of character values 1 through 31, inclusive, are expressly forbidden.
 Attempts to create such filenames must be rejected, and conforming implementations
 must not return such filenames to applications or users.

Vol. 2, page 480,
 The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1382, in "open()", after line 45297, state:
"If open() or openat() is passed a path containing byte 1 through 31 (inclusive),
it must reply with error ENAME instead of opening the file."

Vol. 2, page 1382, in open(), before line 45319, state:
[ENAME] The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1744, lines 55680-55682: Append:
"The readdir( ) function shall not return directory entries of names containing bytes 1 through 31, inclusive."

There would need to be other changes in the interfaces, but there's no point in identifying these other changes if this proposal will be flatly rejected. Since this involves filenames, which occur all over in the spec, it may be possible to have this change described in a few places instead of many. I've identified open() and readdir() here, because many other interfaces are built from these two.
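
A rough user-space sketch of what the open() and readdir() changes would mean in practice, written in Python. Everything named here is hypothetical: NameNotPermitted stands in for the proposed [ENAME] error (which does not exist today), and checked_open and filtered_listdir are illustrative wrappers, not proposed interfaces.

```python
import os


class NameNotPermitted(OSError):
    """Hypothetical stand-in for the proposed [ENAME] error."""


def has_control_bytes(raw: bytes) -> bool:
    """True if the name contains any byte in the range 1 through 31."""
    return any(1 <= b <= 31 for b in raw)


def checked_open(path, flags, mode=0o666):
    """Like os.open(), but refuse paths containing bytes 1 through 31."""
    raw = os.fsencode(path)
    if has_control_bytes(raw):
        raise NameNotPermitted(f"name not permitted: {raw!r}")
    return os.open(raw, flags, mode)


def visible_entries(names):
    """Keep only the directory entries the proposal would let readdir() return."""
    return [n for n in names if not has_control_bytes(os.fsencode(n))]


def filtered_listdir(directory="."):
    """List a directory as bytes names, hiding entries with bytes 1 through 31."""
    return visible_entries(os.listdir(os.fsencode(directory)))
```

A wrapper like this only shows the intended behaviour; the proposal places the check in the implementation itself.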

Tags No tags attached.
Attached Files

Relationships
related to 0000293 (Interpretation Required, ajosey): restricting readdir normalization of filenames
related to 0000573 (Under Review, ajosey): Please add '+' to the portable filename character set
has duplicate 0000545 (Closed, ajosey): Escape Filenames as per Uniform Resource Identifier Encoding
related to 0000291 (Closed, ajosey): filenames need not contain characters

Notes
dwheeler (reporter)
2010-05-03 19:37

Note that if values 1..31 were forbidden in filenames,
widely-used constructs like these would become correct:

# Only correct if tab and newline cannot be in filenames:
IFS=`printf '\n\t'` # Remove 'space', so filenames with spaces work well.
for file in `find . -type f` ; do
    COMMAND "$file"
done

More rationale for this proposal can be found here: [^] [^]
msbrown (manager)
2011-03-10 16:04

Another data point:
The encodings 0x00-0x1f are "control characters" in the "standard" z/OS UNIX System Services code page IBM-1047.
Note that "line feed" is 0x25 in EBCDIC/IBM-1047, but the C language '\n' is 0x15 (EBCDIC "new line").
eblake (manager)
2011-04-11 20:28

I have retitled the bug; ongoing discussion is still debating the best
solution, but any solution is more likely to gain consensus if it only
forbids newlines (which is the main source of ambiguous output on any
utility that produces line-oriented listings of file names) than to
forbid all control characters.
dwheeler (reporter)
2011-04-11 21:17

Forbidding just newline would be better than the current state. So if that more limited goal is truly necessary to make progress, then sure!

That said, perhaps there's an alternative in the middle. As noted above, there are two other control characters that cause heartburn: tab (widely used in tab-delimited fields and as IFS whitespace) and ESC (widely used for terminal escapes). If *all* control characters can't be forbidden, can it be at least those three? That would simplify processing (tab-delimited fields work, and IFS of \n\t works nicely), and display of filenames is easier too (no ESC).

What are the critically-important use cases where control characters in filenames are REQUIRED for applications to work properly? Especially since several filesystems (such as NTFS) essentially forbid them already? I haven't seen a lot of critically-important use cases documented, and control characters have *never* been in the portable character set for filenames.
2011-07-07 15:31

This proposal is fine for applications that depend on the POSIX portable character set, but leaves others in the lurch. It would be preferable to restrict file names to printable characters, space, and NBSP.
oiaohm (reporter)
2012-02-22 06:48

Let's look at this differently: why do we have to forbid anything?
Programs feeding data to the shell cannot use \0 (NUL) as a line break.

This is a double-edged sword. We know that stdin and stdout are still open after getting a \0 char.

IFS=`printf '\0'`
for file in `find . -type f` ; do
    COMMAND "$file"
done

The command should have no issue taking a null-terminated string. Why? Because
int main (int argc, char *argv[])
That is exactly what a C argument is.

IFS becomes a problem because applications don't support accepting \0. In this case they determine string length from NUL-terminated strings instead of from when the file handle feeding them information ends.

The reason is that we don't have two of them: one IFS for applications feeding information to the shell and an IFS for the screen. IFS=\0 is good for applications feeding into other applications, but not that great for feeding to the screen. \0=\n on screen would be ideal in lots of cases. Also, on screen \n (newline) could be displayed as \n, and so on for the other control chars.

What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged? Basically, forbidding is not possible if UTF8 is to be supported in case of malfunction, because we cannot say that a file name containing 1-31 will not turn up due to file system damage. It's bad enough with \0.

The issue is the shell's handling of input and output.

The "-" issue exists as well. It is also a shell issue, but it's harder to solve.

Applications don't have a clear marking in int main (int argc, char *argv[]) of what is a directive to do something and what is data to be used.

Flags on argv contents would be handy in this regard. The simplest method I can think of is an environment variable containing them that is cleared automatically by the next exec command.

All chars bar the <slash> character and the null byte should be able to be handled in file names, because we already have two chars that cannot be in file names.

${VAR} exists; $(type){VAR} could also be very handy, i.e. with $(safe){VAR}, control chars and so on become their \ equivalents.

Basically there are tones of options without restricting filenames. Big one is improve shells. Add better options to shells.

Another is to work out how to type argv inputs, so applications can tell the difference when rm -i * just happens to find a file called -rf and deletes everything without asking.

We have to remember this wildcard char problem does not just apply to filenames.

Bash can read data from keyboard input, so your problem here does not end with file names.

read in
might just happen to send a control char into a command. I might type anything or it might be raw contents of a file.

Fixing filenames does not fix Posix Shell bad handling of chars this makes Posix shells be here be dragons.
dwheeler (reporter)
2012-02-22 19:25

Comment #1140 says: "Fixing filenames does not fix Posix Shell bad handling of chars"

This is NOT just a problem with POSIX shell. The problem is systemic. Almost all programs that deal with filenames have trouble, regardless of language, because POSIX permits filenames to include control chars. For example, it's common to store filenames in text files. The "obvious" way is by using newline for each row (tab-separated if there is more than one field). Of course, this fails. Even tools like "pax" are fundamentally broken - if they list filenames, you can't tell if the newline is part of a filename or not. Yes, you can fix this using escape sequences, or by using \0 separators, but most people don't do this because people generally have no *use* for control chars (e.g., newline, tab, ESC) in filenames.

As I said earlier, the primary use of filenames with control characters is "to enable security vulnerabilities and to cause errors". Other than "we've always done it that way", there seems to be little justification for them. Filenames already can't include "/" and \0, so their charset is ALREADY limited.

The comment asks, "What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged?" I think that's easily answered:
* If an application is trying to *create* a filename with 1-31, then a compliant POSIX system should reject the request. POSIX already *permits* this; such chars have never been in the portable character set.
* If the filename is damaged internally, then there are many options. For example, separate repair tools could be used to fix it (we already have to do this if a filename contains '/'). A system could provide them, escaped. Less good would be a system that permitted people to read and rename them, just not create them. POSIX could allow a range of options, or require one in particular.

The comment said: "Basically there are tones of options without restricting filenames. Big one is improve shells. Add better options to shells."

Actually, I agree that we should improve some POSIX utilities. I think it's important to also add \0 support to a few key utilities (find, xargs, read). But adding \0 capability to *everything*, so that full filenames can be handled, is absurd; there's a huge list of tools that process text files. It's better to ease transition (by improving a few selected tools) and create a better place to transition to.
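
As a sketch of the \0-separator alternative discussed in this note, here is a Python round trip using NUL-terminated records. The helper names are illustrative only; the point is that NUL is one of the two bytes that can never occur inside a filename, so the records stay unambiguous.

```python
def pack_names(names):
    """Serialize filenames (as bytes) into NUL-terminated records."""
    return b"".join(name + b"\0" for name in names)


def unpack_names(data):
    """Split NUL-terminated records back into the original filenames."""
    records = data.split(b"\0")
    # Drop the empty element that follows the final terminator.
    if records and records[-1] == b"":
        records = records[:-1]
    return records


names = [b"plain.txt", b"with\nnewline", b"tab\tand space.txt"]
assert unpack_names(pack_names(names)) == names
```

The cost is the one described above: every tool in the pipeline has to understand the \0 convention, whereas line-oriented text tools work unchanged if control characters cannot occur in filenames at all.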
wlerch (reporter)
2012-02-22 19:38

For the record, the only UTF-8 characters that contain the bytes 1-31 are the control characters 1-31 themselves. Any Unicode character outside of the ASCII range is encoded with bytes greater than 127.
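
This property of UTF-8 can be checked mechanically. A short Python sketch (the function names are made up for the illustration) that scans code points and reports any whose UTF-8 encoding contains a byte in the range 1-31:

```python
def utf8_has_control_byte(ch: str) -> bool:
    """True if the UTF-8 encoding of ch contains a byte in 1..31."""
    return any(1 <= b <= 31 for b in ch.encode("utf-8"))


def offenders(limit=0x110000):
    """Code points below limit whose UTF-8 form contains a byte in 1..31."""
    found = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if utf8_has_control_byte(chr(cp)):
            found.append(cp)
    return found


# Scanning the first 0x3000 code points is quick; the full range takes longer.
print(offenders(0x3000) == list(range(1, 32)))
```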
oiaohm (reporter)
2012-02-23 00:55

wlerch, in case of bit errors it still turns up: something in a UTF-8 string that should be above 127 ends up under it because one bit has flipped. Being able to see a file no matter how damaged is critical.
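
The bit-error case can be illustrated in a few lines of Python. This is a sketch with an arbitrarily chosen character (U+010A); flip_bit is a helper invented for the example.

```python
def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit of the byte at index inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 1 << bit
    return bytes(damaged)


name = "\u010a".encode("utf-8")   # U+010A encodes as the bytes C4 8A
damaged = flip_bit(name, 1, 7)    # clearing the top bit turns 8A into 0A
print(name.hex(), damaged.hex())
```

A single flipped bit in a continuation byte between 0x80 and 0x9F yields a byte between 0x00 and 0x1F, so a damaged multibyte name can end up containing a newline (or a null byte) that was never written deliberately.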

dwheeler, the big questions here are:
Why does the shell ever have to show the non-escaped form of a file name?
Why do applications ever need to see the non-escaped form of filenames from directory listings or other lookups?

"Even tools like "pax" are fundamentally broken"
You say it's fundamentally broken. That is the point: it is fundamentally broken and should be fixed, not hidden.

You should see the hell you cause to a Windows system when you write files containing \n into NTFS. The physical limit of NTFS is the same as ext2/3/4: everything bar slash and NUL can be written. It's the win32 subsystem that prevents you from using control chars. Yes, you can ruin Windows the same way.

An attacker does not have to play by the rules; if the filesystem can store it, they can do it. POSIX is not the OS kernel and really cannot dictate what file names we get.

Would it not be simpler, and more complete in coverage, to demand that all file names must be presented in a particular escaped form by default? This cures the issue. The filesystem can store whatever char it likes with no error. When we handle filenames they are always escaped.

The correct solution really is escaping. This can fix a lot of things like rm -i *
so that a file -rf is escaped and rm sees \-rf, and does not malfunction on the user by treating it as rm -i -rf <plus everything else * found> when the user runs rm -i *. There are a lot of malfunctions like this.

A little example of the space error: ls `echo *` with a file "this that". Due to no escaping, ls comes back saying that it cannot find this and that. The reason is that the shell finds breaks in the command line where spaces exist, so it thinks this and that are different arguments and passes them to ls that way, going against what the code would appear to do. This is a far more likely error for a person to hit than control chars in file names. It will be hidden until a script runs over a directory with a filename containing a space.

Escaped, this\ that returned by * causes no problems as long as it doesn't get processed out by echo. This is the reason why the % escaping of URIs (Uniform Resource Identifiers) might have to be considered: it's something normal POSIX tools are not going to mess with.
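
To make the % escaping idea concrete, here is a minimal Python sketch using the standard urllib.parse functions. The wrapper names escape_name and unescape_name are assumptions for illustration; this shows URI-style percent-encoding in general, not the exact scheme proposed in 0000545.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_name(raw: bytes) -> str:
    """Percent-encode a raw filename; safe="" also escapes '/'."""
    return quote_from_bytes(raw, safe="")


def unescape_name(escaped: str) -> bytes:
    """Recover the raw filename bytes from the percent-encoded form."""
    return unquote_to_bytes(escaped)


for raw in (b"Hello\nworld", b"this that", b"-rf"):
    escaped = escape_name(raw)
    assert unescape_name(escaped) == raw
    print(escaped)
```

Newline and space become %0A and %20, so the escaped form is a single token on a single line. Note that plain percent-encoding leaves a leading "-" untouched, so the -rf case would need an additional rule.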

You request a directory list. libc serves it up pre-escaped. This kills any chance of a control char such as newline or tab, or whatever char is going to cause your program trouble, making it through. Even space in filenames is no longer a problem by default. You escape space so that filenames are always one constant string with nothing evil in it.

libc needs to get a standard escaping function or use a pre-existing.

A person can pick any split char they like; it does not have to be \n, \t or a control char. A person could pick - or space or any other stupid splitter like a comma. The important part is that it cannot be in the file name their application sees, but that filename must be able to exist on disk. This is where escaping fits in: hide from the program what it cannot handle seeing.

Simple 100% cure: a set-escaping function, which tells libc "I want my file names escaped on all calls by this particular method", with a default that file names are escaped by something set in the standard. To see and pass raw binary filenames you have to call the set-escaping function to turn it into a no-op, because by default it is on.

This is transparent escaping. Could it possibly break a few programs? Really no, if the libc requirements are right, so that open will take a non-escaped string, notice "oops, we have a problem, those chars should not be in there", and open the non-escaped version. Also, for old programs you could include an option to force the program to run with escaping off.

\0 support is required to work without escaping. So making it mandatory option on quite a few programs would be a good idea.

dwheeler, to be truthful, the errors you are talking about come from one simple fact: programmers are forgetting to do secure programming. They don't escape inputs or filenames, so things go badly wrong. Also, because file names and program directives can currently look identical to programs, we get malfunctions.

dwheeler, you are looking too small. The problem of filenames is huge and you are only addressing a small corner case. Escaping can address almost all cases with one set of alterations to core parts.

Currently non-escaped is the default, and you have to remember to escape stuff or things go badly wrong. Security flaws appear.

With escaping as the default, the programmer cannot make the mistake of forgetting to escape, and lots of issues disappear, including non-control-char ones like - starting filenames and filenames containing spaces. That is why escaping gets on top of the problem.

Shells will have to be made to handle taking user-typed file names and performing the escape on them.

Really, what need is there for a portable charset for filenames on my own system if the tools can handle anything on my system? The portable character set should really only be for cross-OS use. If you need to use it to limit users, you have a problem with the tools' security.
oiaohm (reporter)
2012-02-25 02:25

dwheeler I have opened this new one with my solution. [^]

I think my solution is the best. I am removing a limit from POSIX as well as solving a problem once and for all. Scary enough, my change will allow filenames on disk containing the NUL byte and slash without upsetting anything long term. Applications using URIs will have to get used to the idea that they don't have to encode and decode any more, since POSIX does that bit. Applications doing raw access will have to get used to the idea that they have to encode, but will still be usable to users, just with a little horrid typing of %stuff.

The big issue is that you say that if a file name contains a particular char, it is forbidden to be opened, created or otherwise. The only party that can see what is allowed to exist on disk is the party that creates the file system.

From my point of view this one should be closed and 545 pushed forward, since 545 is a more platform-neutral solution. Yes, there will be some overhead.

Even better, 545 will most likely be doable at kernel level: alter the file-system drivers.
random832 (reporter)
2012-12-23 05:12

oiaohm, if a filename is "damaged" by bit errors, it might contain a slash or null byte - is dealing with a damaged filesystem really within the scope of POSIX? Such filenames can be fixed [moved to a different name with offending bytes removed] by a system-specific "fsck" tool.

And anyway, it's already legal for an implementation to forbid control characters in filenames, the proposed change is to make it mandatory.
oiaohm (reporter)
2013-01-07 23:29

random832, if you look at MS Windows and its forbidden chars, with | in particular, damage due to bit errors is not the only issue. Software viruses have also exploited the fact that | would cause a filename to be hidden from user view and from anti-virus software. Newer versions of Windows are reducing the number of forbidden chars.

This is why I particularly do not like the idea. On XFS and EXT3, along with other filesystems, it is valid to have filenames containing those chars.

This proposal contains no migration plan to cope with what could possibly be out there. [^]

I have come to the point of view that the shell in the standard needs to be fixed.

fopen .... I have not really found a POSIX ABI call that has an issue with these chars.

random832 the idea that filenames don't exist in the.

Also to be O F. Is the fact that under Fat file systems 0,7-10,13 dec or 0,7-A,D hex are the only ones that are not printable on screen as single chars. [^] This is what dos ANSI allows.

Yes, Microsoft's two primary file systems don't agree on what chars you can use in a file name.

One of the other resolutions to this problem might be to follow how DOS deals with it: assign 1-31 printable symbols and add a mode to printf for printing out filenames.

random832, can you promise me that no future OS will do the DOS method again and assign printable chars to control chars, so allowing file names to contain them?
oiaohm (reporter)
2013-01-07 23:47

random832, the big question here is why dosfsck should fix a file system that is not technically broken. The idea that the fsck tools can fix it is basically saying let's fix something that is not broken, instead of fixing the fault in POSIX.

If you want simple printable chars for the control codes, Unicode defines one: a simple squarish box with the number of the char inside. A lot of fonts miss providing this. Same with terminals. So printing control chars to the screen should be fairly issue-less as long as a print call exists in the API that disregards control chars and uses the replacement.

Basically we have a historic terminal design fault. clocal and other settings can be pushed to the terminal to disable particular control chars, like modem control chars, from working. So yes, another solution is a simple flag.
jrincayc (reporter)
2013-08-14 02:43

I agree with you that it would be possible to make it possible to have newlines work fine in filenames. However, from what I can see of your proposal it seems that your suggestions would require substantial changes.

Can you summarize what you want changed, and what changes would be needed in kernel and userspace?

Thank you.
dwheeler (reporter)
2013-08-14 03:23

An encoding system would make sense if there was a user need to include newlines (or tabs) in filenames. But there is no such need; the only significant use of control characters in filenames is to attack systems, and support for such characters in filenames has NEVER been required as part of the POSIX specification. It would be possible to encode such names... but to what end? It would be far simpler to prevent them.
dalias (reporter)
2013-08-14 05:41

The issue is that you have a huge base of existing systems where such malicious filenames may already exist in the filesystem, and you have implementations that deal with external media, networked filesystem, etc. which might present malicious names. Requiring an implementation to reject names containing newlines or control characters would make it impossible to access (or even delete) such files. I believe this is the motivation for encoding.
oiaohm (reporter)
2013-08-14 07:34

dwheeler bad news
--An encoding system would make sense if there was a user need to include newlines (or tabs) in filenames. But there is no such need;--

I do have a few links and files with newlines in them. It was a workaround on some of the X11 window managers' desktops that did not word-wrap file names yet would support \n in the filenames to get around the problem. So historically there has been a need for newlines in filenames at different times.

"Can you summarize what you want changed, and what changes would be needed in kernel and userspace?"
The fact of the matter is that the Linux, BSD and OS X kernels already work with these chars. A lot of programs don't have an issue with these chars.

So it's all user-space changes, like making ls not be stupid; the hello world one exposes an issue.

ln -s ~ Hello$'\n'world Yes, this will create one. This basically works on all existing POSIX systems other than NT. Most graphical file managers on Linux handle this without issue. They already do substitution. LibreOffice and OpenOffice see a file like Hello$'\n'world and it becomes Helloworld; the \n is just not displayed and you can access the document as if it's not there. KDE displays a filler char.

The big thing is there is no standard substitution defined.

Really the simplest would be to make it mandatory, when printing special chars in file names, to print them as per [^]. 1-31 in Code page 437 are in fact printable chars.

--Interpretation of code points 1–31 and 127

Code points 1–31 and 127 (00–1Fhex and 7Fhex) may be interpreted as either control or graphic characters, depending on the context.--

It is the existence of Code page 437 that brings the real nightmare. What is Code page 437? MS-DOS and FAT filesystems.

dwheeler, basically you are forbidding valid printable chars.

The missing item is really a way to print with control chars disabled.

Shell scripting does need to be altered as well, so that arrays are used to call applications, removing control char hell.

If you pick up some old DOS disks, some people did use the sub-31 chars to make their file names look fancy.

Basically it is too late to prevent them, dwheeler; filenames with sub-31 chars exist. Some from DOS, some from Unix and Linux. DOS to look fancy, Unix and Linux to work around some application bugs.

Forbidding also gives malicious programs somewhere to hide, given that the kernels support these chars.

Windows forbids creating new files containing 1-31, but handles processing 1-31 and displays a filler if they happen to be on the file system.

dwheeler, basically the complete base of your idea is off. Whatever we change, we still have to be able to handle as much as Windows. You are directly proposing not to be compatible enough.
jrincayc (reporter)
2013-08-15 02:33

dalias and oiaohm,
Yes, existing filesystems with control characters are an issue. For local non-readonly filesystems, a reasonable possibility is that the kernel can just not return them in readdir and not allow them to be created with creat and rename. The next fsck that is done can rename all the files with invalid characters in them.

What to do with invalid filenames in a networked filesystem is a more complicated issue.

Besides Code page 437, another possibility is the control pictures starting at U+2400 (Symbol for Line Feed is U+240A).
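
A small Python sketch of the control-pictures substitution, for display purposes only. The function name show_name is made up for the example; the mapping relies on the Unicode Control Pictures block, where U+2400 through U+241F correspond to control codes 0 through 31 and U+2421 is the symbol for DEL.

```python
def show_name(name: str) -> str:
    """Return a display form of name with control chars shown as pictures."""
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(0x2400 + cp))   # e.g. line feed -> U+240A
        elif cp == 0x7F:
            out.append("\u2421")           # symbol for delete
        else:
            out.append(ch)
    return "".join(out)


print(show_name("foo\nbar"))
```

The stored name is unchanged; only what is sent to the terminal differs, so the display stays on one line and no escape sequence reaches the terminal.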

I tried out Gnome's gedit, and it could read and save files with newlines, but so far as I could tell it did not have a way to create a new file with newlines in the filename. I presume that most GUI applications are the same way.

I noticed that ls on my linux system actually lists foo\nbar as foo?bar.

I agree that there do not seem to be significant uses of control codes in filenames. I think the security considerations would be more than sufficient as motivation to forbid them.
dalias (reporter)
2013-08-15 02:48

I agree that security considerations are the most important factor here, but I think it's a real possibility that more security problems will arise out of attempts to encode/escape invalid filenames than presently exist. Basically the only software vulnerable to filenames containing newlines is sloppily-written shell scripts, and it's fairly well-known that you should not run shell scripts on untrusted directory trees unless you're confident they're robust.

The other main point I'm concerned about is fragmenting the standard. While I don't particularly agree with their attitude on matters like this, I think you'll find the Linux kernel maintainers 100% unwilling to "impose policy" at the kernel level. Thus you'll either have Linux-based systems intentionally non-conforming, or you'll have userspace hacks attempting to enforce the POSIX rules, which creates an even worse security situation since malicious users can just bypass the userspace code and make syscalls directly to produce malicious filenames.
jrincayc (reporter)
2013-08-15 03:31

I think that encoding bad names is a problem. I think only three programs should ever have to deal with the bad names: the kernel, fsck, and another program called rename_invalid which goes through a directory tree and renames every file that has an invalid filename. Every other program on the system should probably not see the invalid filenames, encoded or otherwise. For the kernel and fsck, they already read the filesystem at the disk block level, so they already can see what is actually in the directory even if readdir doesn't return the names. I am not sure how to make it so that programs like a hypothetical rename_invalid would get access to the complete directory, and ordinary programs would not.
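
A rough Python sketch of what such a rename_invalid tool might do, assuming it can still see the invalid names. Replacing each byte in 1-31 with an underscore and adding a numeric suffix on collisions are choices made for this illustration only; nothing in this thread specifies the renaming policy.

```python
import os


def sanitized(name: bytes) -> bytes:
    """Replace every byte in 1..31 with an underscore (0x5F)."""
    return bytes(0x5F if 1 <= b <= 31 else b for b in name)


def plan_renames(names):
    """Map each invalid name in one directory to an unused valid name."""
    taken = set(names)
    plan = {}
    for name in sorted(names):
        target = sanitized(name)
        if target == name:
            continue  # already valid, leave untouched
        candidate, n = target, 1
        while candidate in taken:
            candidate = target + b"." + str(n).encode("ascii")
            n += 1
        taken.add(candidate)
        plan[name] = candidate
    return plan


def rename_invalid(top):
    """Walk a tree bottom-up and rename entries containing bytes 1..31."""
    for dirpath, dirnames, filenames in os.walk(os.fsencode(top), topdown=False):
        for old, new in plan_renames(dirnames + filenames).items():
            os.rename(os.path.join(dirpath, old), os.path.join(dirpath, new))
```

Walking bottom-up means a directory's contents are handled before the directory itself is renamed, so paths stay valid during the walk.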
jrincayc (reporter)
2013-08-16 02:43

I can think of two different approaches for letting programs that are designed for weird filenames handle them. The first would be at the kernel level and would add a flag to open or fcntl that switches the readdir to provide all the filenames instead of just the filenames with no forbidden characters. The second would be to modify the default meaning of various shell commands and features (such as * would not expand to include filenames with forbidden characters by default, similar to how it currently does not expand to dot files). So shell programs that are designed to handle newlines and other weirdness would need to opt in, instead of opt out.

Lastly, from what I can tell from RFC 1813 for NFS 3, a / is a perfectly valid filename character, it is the server's choice. See 4.6: """Many servers will also forbid the use of names that contain certain characters, such as the path component separator used by the server operating system. For example, the UFS file system will reject a name which contains "/", while "." and ".." are distinguished in UFS, and may not be specified as the name when creating a file system object.""" I am curious how current UNIX systems would handle a NFS server returning filenames with "/"'s in them.
oiaohm (reporter)
2013-08-16 05:03

jrincayc I do have a horible source of control chars. One of my old mp3 player is full Code page 437 with control chars being the alternative chars and you can rename to them. There are a few cameras and other things out there that are also.

"I tried out Gnome's gedit, and it could read and save files with newlines, but so far as I could tell it did not have a way to create a new file with newlines in the filename. I presume that most GUI applications are the same way."
Yes, it's fine to create a rule that they cannot be created easily.

Open up charmap: the only two characters under 31 that you cannot insert into a file-name graphically are Null and Newline. Yes, copy and paste from a charmap tool or a document somewhere into a rename dialog and horrible things can happen. So currently, yes, you are free to create new files containing everything bar /, Null and newline using graphical programs.

Yes, some people may have already created these files by mistake.

I have also had control chars appear in filenames that originally had none, due to bad sectors on filesystems.

On most POSIX systems, most programs don't call the kernel readdir directly. They call the libc version. All the shells I know don't make the syscall directly. The kernel syscalls could be left untouched; if you are mad enough to be calling the kernel directly, you should be ready to cope with whatever it throws up.

jrincayc the NFS solution for odd, strange, bad, evil characters is a "character translation file". You define something to substitute for the likes of /, . and .. being real file names, so you would remap them to some odd Unicode character you don't use. Yes, it is horrible, but it works around the problem.

jrincayc I would have no issue with an opt-in limitation on shell scripts, other than the plain fact of how long it will be before they say "hey, spaces in file-names are giving us trouble, let's make that a forbidden char as well".

Something to become aware of very quickly is that most bash scripts that have issues with control chars also have issues with file-names with spaces.

"Yes, existing filesystems with control characters are an issue. For local non-readonly filesystems, a reasonable possibility is that the kernel can just not return them in readdir and not allow them to be created with creat and rename. The next fsck that is done can rename all the files with invalid characters in them."
This could be really bad. Suppose someone ships a firmware file on a FAT partition on a device that requires a control char in its name, and you go and nuke it: in one move you have just bricked a device. Do not presume device makers are always sane.

So the answer has to be: cope. We can place a rule on create, and we can add a warning to fsck that such files have been created, so the user can confirm it was intentional. We cannot place a rule that alters existing names, because we don't know what we will run into in future, or whether any device will be mad enough to use control chars in file-names as a key requirement.

Basically, programs must cope somehow with the existence of strange chars in file-names. The two major options are to substitute them or to ignore their existence.

dwheeler, jrincayc and dalias, you are forgetting: if you take away the system calls, what stops a malicious program that has breached deep from interfacing directly with the block devices and doing exactly what it pleases? The answer is nothing. This is why I see no reason to mess with kernels. If the fix is not workable in userspace, it's not workable, full stop.

Remember the source of the strange char in filenames might be a USB key that has come from a completely different OS outside your control, like a photocopier that can scan to USB and whose RAM has gone a bit sus. So should the user not see their files? This is why I am very sure the library functions should be provided to do the filtering, but programs should opt in to being filtered. gedit and most graphical file managers on Linux don't need to be filtered; they already cope with strange chars. The programs that cope need to be left alone. Programs that cannot cope need to have solutions made.

A shell that cannot handle strange chars also needs to abort if it happens to be passed a list of arguments containing them. jrincayc, opt-in for shell scripts most likely has to be done at interpreter start. Basically, shell scripts are horrible at handling input, and equally horrible at doing predictable and safe string transformations.

Yes, C defines valid program arguments as any char bar Null. So it's not just filesystems that can send shell programs off the deep end.

The POSIX shell was not designed to cope. There is a lack of protection around shell scripts from malicious input. It all comes back to shell scripts being horrible at handling strings.

In a lot of ways it might be simpler to plan the end of life of the POSIX shell.
a brouwer (reporter)
2013-08-16 10:46

jrincayc wonders:
> I am curious how current UNIX systems would handle a NFS server returning filenames with "/"'s in them.

Kernel code might test, but it is not absolutely necessary. The / plays a role in path resolution. If you have a CD-ROM with a filename like "a/b" with embedded slash then the corresponding file cannot be opened for reading because path resolution will interpret the string "a/b" not as a filename but as a pathname.
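To illustrate the point about path resolution, here is a small sketch in C that splits a string on slashes the way resolution conceptually does. The function count_components is a hypothetical helper, not any kernel's actual code; it only shows that a name with an embedded slash can never reach the filesystem as a single component.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the non-empty components of a path, splitting on '/' roughly
 * as path resolution would. Repeated and trailing slashes are ignored. */
size_t count_components(const char *path)
{
    assert(path != NULL);
    size_t count = 0;
    int in_component = 0;

    for (; *path != '\0'; path++) {
        if (*path == '/') {
            in_component = 0;      /* separator ends the current component */
        } else if (!in_component) {
            in_component = 1;      /* first byte of a new component */
            count++;
        }
    }
    return count;
}
```

With this helper, "a/b" is seen as two components (a directory a and an entry b), so an on-disk entry literally named "a/b" is unreachable through open().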

jrincayc considers:
> add a flag to open or fcntl

This mess was started because it was thought that security could be improved. Now the simpler a system, the better for security. Adding complications never makes a system more secure. Programmers that have to write secure setuid software have to consider all possibilities. If you introduce new failure modes for programs (-EBADNAME) then that is yet another detail. No doubt the first effect would be that security was broken because some software was written without considering the possibility that a program might fail for this reason.
dwheeler (reporter)
2013-08-17 15:03

I agree that being *simple* is *critically* *important*. The fundamental problem with approaches like filename encoding, open flags, and fcntl, is that they create new complications. Most developers will ignore them, and this problem will continue to be a source of vulnerabilities.

There is nothing simple about the current situation. It's difficult to handle filenames with control chars correctly, and the "standard" approaches for filename handling are actually security vulnerabilities. E.G., everyone "pipes a list of filenames to {program}" - yet this is actually wrong, since filenames can include newlines. This is NOT just a shell problem.

Newlines have *never* been in the *portable* character set, so *portable* programs (including security-related ones) must already assume that (for example) creating a file might fail if it has a newline in the name. So adding a limitation changes nothing if you've written your program portably. And while some badly-written program might break, there are probably thousands of other incorrect programs that would suddenly work correctly. A fair trade.

If we need to back off this proposal further to gain consensus, we could simply forbid *creating* filenames with newlines (and TAB and ESC ideally). That would be adequate for many servers/clouds/etc. where untrusted users cannot directly manipulate a filesystem that would later be mounted. If we have to back off even further, an alternative could be an "implementations SHOULD forbid creating filenames with newline/tab/esc" instead of "MUST". But if it's not mandatory, that creates its own problems... it means that portable applications must be able to handle implementations that fail to prevent the problem. In that case, I think the POSIX spec *REALLY* needs enhancements to handle such filenames (e.g., byte-0-termination options for find, xargs, and shell read), so that developers can have standard tools for dealing with these filenames.
dalias (reporter)
2013-08-17 16:13

There is nothing simple about imposing a policy that will be unpopular with implementors and result in real-world systems behaving in ways that deviate from the standard. That's even worse than the situation we have now.

As it stands, implementations are free to reject any characters they like outside the portable character set. This allows a "hardened" implementation to reject newline and even to advertise this rejection as an additional hardening feature. I would be happy with adding text in appropriate places remarking that historical implementations generally treat filenames as abstract byte sequences (minus \0 and /) but that for the sake of preventing security issues and other unwanted behavior in applications which are processing information in unsafe ways, implementors may want to place restrictions on the characters that can be used in filenames, such as forbidding newlines.

My view is that, in matters of security, the only time that imposing a new requirement on implementations actually makes security easier for portable applications is when all historical implementations already have the desired property, and you're just writing it down for the first time. If there are historical implementations that lack the property, and wrongly assuming they have it would result in a vulnerable application, then all security-conscious developers will be faced with the complexity of supporting both.

One further issue I just realized with any sort of escaping is symbolic links: what do you do with link content? If the escaping form is filesystem-type-specific (for instance, only used on network-mounted devices) and not used on the local system that forbids all unwanted characters, you would run into situations where the proper translation of the symbolic link contents depend on where it points to. This is of course unacceptable since symbolic link contents are pure strings and aren't necessarily even being used to access files they point to...

As for the topic of enhancing find, etc., the fundamental problem is that find and xargs don't speak the same language. xargs reads a sequence of shell-escaped words, whereas find outputs raw filenames. If find had an option to output properly escaped results, using it with xargs would be trivial, and using it with read would at least be possible (albeit a bit more work). This can actually be achieved already with the -exec option, and appropriate shell script passed to exec to do the escaping, but it's painful.
dwheeler (reporter)
2013-08-18 02:01

Clearly, some people believe that newlines should be forbidden in the spec, while others do not. It's not clear to me what the committee will decide.

However... if newlines (and other nasty characters) are not forbidden by the standard, it should at least be easier to *portably* handle such filenames. Currently it's extremely difficult to write programs that handle them correctly... so nearly all programs incorrectly handle filenames. In addition, the fact that "find" and "xargs" aren't easily made compatible in the standard is an old problem; it's past time to fix that.

Although not currently in POSIX, there's a widely-used solution: lists of filenames terminated by \0. Adding support for them for a few utilities (e.g., find, xargs, shell "read", and probably pax) would make it much easier to handle dangerous filenames safely. There's no need for *every* text-handling tool to handle \0-termination; adding this capability to just a few tools is enough (e.g., this would allow the easy creation of encoding tools to handle other cases). Proposals to do so for find, xargs, and read are here: [^] [^] [^]
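As a rough sketch of what consuming a \0-terminated filename list looks like from C, the following uses getdelim() from POSIX.1-2008 with '\0' as the delimiter. The function names for_each_nul_record and count_if_newline, and the callback shape, are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Example visitor: counts names that contain a newline. ctx is a long *. */
void count_if_newline(const char *name, void *ctx)
{
    if (strchr(name, '\n') != NULL)
        (*(long *)ctx)++;
}

/* Calls visit() once per NUL-terminated record read from in.
 * Returns the number of records, or -1 on a read error. */
long for_each_nul_record(FILE *in, void (*visit)(const char *name, void *ctx), void *ctx)
{
    assert(in != NULL);
    char *buf = NULL;
    size_t cap = 0;
    long count = 0;

    /* getdelim() stops at each '\0', so embedded newlines stay inside the name. */
    while (getdelim(&buf, &cap, '\0', in) != -1) {
        if (visit != NULL)
            visit(buf, ctx);
        count++;
    }

    int failed = ferror(in);
    free(buf);
    return failed ? -1 : count;
}
```

A name such as "a\nb" arrives as one record here, whereas a line-oriented reader would split it in two.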

It would probably be wise to also clearly define an error code when opening is rejected specifically because of the name. That way, hardened systems could be consistent on the error they return.
oiaohm (reporter)
2013-08-18 05:22

dwheeler there is a problem. Bad programming methods for file-names.

Are you going to ban , and = as well. There is a long list of chars in file names that can cause issue.

"everyone "pipes a list of filenames to {program}" - yet this is actually wrong, since filenames can include newlines. This is NOT just a shell problem."

Can this be handled by masking? dwheeler, the answer is yes. Remember DOS, Linux and Apple have different end-of-line conventions.

Reality is that pipe of filenames to a program is mostly a shell problem. That is why graphical programs have a habit of not taking piped input. Take VLC as an example: each file name is one argument after the program, and if you need to add more it uses IPC that can send null terminated strings.

DOS—uses two characters at the end of the line: <cr><lf>, in that order.
Macintosh—uses one character at the end of the line: <cr>
Unix—uses one character at the end of the line: <lf>

So an old Macintosh file with lf in its file-name will pipe perfectly. Can the NFS solution of masking be used? A file on an NFS share containing lf or cr may not be a problem if NFS is masking that to another character.

That is the thing dwheeler 243 244 and 245 should be implemented or equal. The reality is dangerous file-names exist and we have to handle them.

The other sad fact to remember with pipe using \n is the [^] example. Just because you forbid it in the file-system does not mean a human or machine error will not still find a way to enter it.

dwheeler yes I agree better handling is required.

dwheeler, this is my problem: there is a clear divide between the terminal and the graphical side when it comes to this problem. The majority of graphical applications don't have issues. They are not using pipes. They are using better IPC.

This issue has a lot in common with SQL injection. In a lot of ways pipe should be deprecated. Pipe never included the idea of data blocks.

dwheeler, it is not too hard to write a blacklist of features not to use in order to be safe. Pipe is one of those features that should be blacklisted in administration tools.

Programs should also not be processing stdin, stderr and stdout to find out what is going on, full stop.

Parsing is insanely hard. In fact it is simpler to use something like dbus for security than pipe. The passed data is formatted.

Fewer and fewer graphical programs are using the command line or pipe or any of the trouble-causing bits. Instead they are using libraries and dbus to talk to other programs using better IPC.

dwheeler this is why I have such a big problem. Like \0 filename chars are possible to be handled by dbus and other methods without issues.

dwheeler, this becomes my serious question: if we think it's too hard to do safe parsing of stdin, stdout, stderr and pipes, should we not consider discontinuing their default existence? Remember syslog, which some Linux distributions are doing away with, has the same issue with tainted input: if an application sends an error to syslog with a \n in it, this can appear as two syslog entries, possibly with a fake program name. Something like the journal and other syslog replacements would see this as one block, a string sent by the program.

Mishandling of \n is not only happening with filenames. The same goes for all those other control chars.

The replacement should be something far more formal, something that has defined blocks.

dwheeler, I know my idea is most likely more shocking. The writing is on the wall: PHP scripts are using less exec and pipe, and more library options.

There needs to be a clear split between data the application has received, which might contain tainted line breaks, and the application's own intended ends of transmission.

dwheeler, is the problem not that you cannot tell what is what? A messaging protocol sending sized blocks of data as filenames really would not have to care what chars were in the filename. This could appear as a formal null terminated array at the other end.

POSIX was never designed to be safe. Text documents are really not a good way of listing file-names either. There are many ways of packing null terminated strings for file storage.

Yes, a historic issue: most text editors believe nul is end of file, so they included no simple way to insert nul.

dwheeler, can you see it now? It's not really forbidding chars that is required. Improve the tools so nul can be used in user-space as the delimiter it is in kernel space.

Basically, why does the shell process by end-of-line char in the first place when it is not user input? This historic huge mistake keeps coming back and biting us.
oiaohm (reporter)
2013-08-18 06:10

Think about this. If you could create a shell script that was based around nul as the separator, a lot of problems would disappear.

A single nul replaces the space between items; a double nul replaces end of line. This cures spaces in filenames and end of line in filenames. Most of those other char issues are cured as well.

Basically not tab or comma separated but nul separated. Pipe can be cleaned up by the null separator being the default.

If you don't want to encode, the other option is to stop using chars in the parsing of shell scripts that can be in file-names. This also has a knock-on effect.
Changing the IFS was included in POSIX shell requirements. So changing spaces to nul should not be a problem if programs are well behaved.

It is less of a change to make secure shell scripts not use chars that can exist on file systems. Nul is universally guaranteed not to exist in any file name or directory name on any file-system in existence. The problem remains that a lot of existing programs were built wrongly to be portable. They presume the end of line is \n when in fact this should have been looked up from an environment variable or equivalent.

Just as a file made for MS-DOS should use <cr><lf> as end of line, changing end of line and IFS to nul nul and nul respectively can fix so many problems. It will also expose lots of programs that are not portably coded. Programs using execv to call the command line will not be affected by the change, but programs using "ls -l" string execution would bust. In fact busting them would be a good thing. Those are some of the worst exploits: "ls -l <userstring>".

dwheeler, I bet you have also forgotten that in bash ; is also end of line. The user string "joke;rm -rf ~" is not going to be particularly nice.

The other option is nul filling. That is also quite simple: nul<space>, nul<newline>, nul;. Those chars have to have a lead-in nul before them for the shell to parse them. The same char without a nul is not to be parsed.

Nul could be quite useful for hardening scripts against user input attacks as well as filename issues. This makes taint protecting input simple: delete all nuls and taint protection is basically done.

I am not deleting the usage of chars. Alter the shell to harden it.

Let's see if we can use what is already forbidden to fix the issues. Nul is universal. It's one of the few pretty much forbidden items. Yes, it would mean changing editors to support programmers controlling nul locations.
jrincayc (reporter)
2013-08-18 13:12

oiaohm, for a "character translation file", which as I understand it is basically a translation from one character to another (ie. '\n' -> '_','?' -> '_' and so forth), what happens if the translated filename already is in the directory? For example, what happens if we have a directory with $'hi\nthere' and hi_there and the translation is '\n' -> '_'? Then we translate into a directory with two files named hi_there and hi_there, which is a new problem.
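The collision can be reproduced with a few lines of C. This is only a sketch assuming the single mapping '\n' -> '_' from the example above; translate_name and names_collide are hypothetical helpers, not part of any NFS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-way translation: replace each newline with '_' in place. */
void translate_name(char *name)
{
    assert(name != NULL);
    for (; *name != '\0'; name++) {
        if (*name == '\n')
            *name = '_';
    }
}

/* Returns 1 if two distinct names become identical after translation. */
int names_collide(const char *a, const char *b)
{
    char ta[256], tb[256];

    if (strlen(a) >= sizeof ta || strlen(b) >= sizeof tb)
        return 0; /* sketch only: over-long names are not compared */
    strcpy(ta, a);
    strcpy(tb, b);
    translate_name(ta);
    translate_name(tb);
    return strcmp(a, b) != 0 && strcmp(ta, tb) == 0;
}
```

Any translation that maps into characters that are themselves legal in names has this problem; avoiding it needs an escape character that is itself escaped.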
dwheeler (reporter)
2013-08-18 19:09

Comment: "Are you going to ban , and = as well. There is a long list of chars in file names that can cause issue." Reply: No, I did not propose that. Control chars (esp. newlines and tabs) cause FAR more problems than "," or "=". Let's focus on the problems at hand!

Comment: "Reality is that pipe of filenames to a program is mostly a shell problem." Reply: No, many GUI programs also need to store and exchange filenames. And even if they use a "better" IPC, they often have to talk with other programs which use different IPC conventions. I do *NOT* think that trying to ban pipes, and simple storage, is practical... or even desirable. You propose "systems like dbus"... but many POSIX systems do not *have* dbus. Instead, we have many incompatible nonstandard systems. Having a simple *standard* way to safely store and exchange filename information is valuable.

I'm fully aware that different systems have different newline conventions, but we are discussing POSIX - which specifically says newline starts a new line. And I'm well aware that ";" separates shell commands. But handling filenames with ";" and "=" is easy; filenames with newline are harder to *correctly* handle.

It would be possible to interpret "NUL" as field separator and "NUL NUL" as row separator, but then it'd be impossible to distinguish between an empty field and a new row. Also, that is not a common convention. In contrast, simple NUL-termination *is* widely supported and used.
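The ambiguity is easy to demonstrate. The sketch below assumes one plausible reading of the proposed convention (fields joined by a single NUL, each row ended by NUL NUL); encode_row is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends one row to out: fields joined by a single NUL, row ended by
 * NUL NUL. Returns the number of bytes written. The caller must supply
 * a large enough buffer. */
size_t encode_row(char *out, const char *const *fields, size_t nfields)
{
    assert(out != NULL && fields != NULL);
    size_t pos = 0;

    for (size_t i = 0; i < nfields; i++) {
        size_t len = strlen(fields[i]);
        memcpy(out + pos, fields[i], len);
        pos += len;
        if (i + 1 < nfields)
            out[pos++] = '\0'; /* field separator */
    }
    out[pos++] = '\0';         /* row separator, first NUL */
    out[pos++] = '\0';         /* row separator, second NUL */
    return pos;
}
```

Encoding the single row ("a", "", "b") and encoding the two rows ("a") and ("b") both produce the six bytes a NUL NUL b NUL NUL, so a reader cannot tell them apart. Plain NUL termination of each name has no such case.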

Comment: "That is the thing dwheeler 243 244 and 245 should be implemented or equal. The reality is dangerous file-names exist and we have to handle them." Reply: Thanks!! Here we completely agree.

Even if filenames-with-newlines were banned, and implementors all agreed, it would take a long time to get implementations to follow suit, and there would not be enough standards-based support for the transition. Thus, I propose that the POSIX community at *least* add basic support for \0 termination (e.g., 243, 244, and 245), and get started whether or not newlines (or more) are banned.
geoffclare (manager)
2013-08-19 09:25

The notes recently added to this bug seem to be mostly rehashing
discussions that occurred on the mailing list over two years ago. In
particular, Note: 0001726 says "Clearly, some people believe that
newlines should be forbidden in the spec, while others do not. It's
not clear to me what the committee will decide."

The core team made a decision on this over two years ago. There was
a thread on the mailing list in April 2011 in which we started to
flesh out the details. However, we came to the realisation that
there was a lot of overlap with other changes being made in TC1 and
decided to postpone finalisation of the changes.

The basic idea of the direction the core team decided to take is:

* Attempts to create files with names containing newlines fail.

* In "normal use", readdir(), readdir_r(), and utilities that list
filenames in a "one per line" format all report an error if they
encounter a filename containing a newline. (This brings these files
to the attention of the user.)

* New features should be added that provide administrators and power
users with a way to deal with filenames containing newlines (i.e. to
find them and then either delete or rename the files). The mechanism
chosen for this is a new file descriptor flag which readdir() can
query and certain utilities (such as find) can turn on when told,
via new options, to do so.
a brouwer (reporter)
2013-08-19 10:48

geoff: rehash - yes; decision - ach

It is ironic to see that what was intended to be a security enhancement has now been decided to diminish security on all POSIX systems. I hope Linux will not follow. (Did anyone ask?)

> readdir() reports an error ...

So it is proposed to audit all software in the world and fix the readdir use.
Until now:
  while (1) {
    dp = readdir(d);
    if (dp == NULL && (errno == EAGAIN || errno == EINTR))
      continue;   /* transient failure: retry */
    if (dp == NULL)
      break;      /* end of directory, or a hard error */
    .. do something with the entry dp ..
  }
Now some code has to be added. Of course this doesn't happen, and a hacker gets an entirely new tool: filenames with newline. Earlier only useful to attack sloppy shell scripts. Now useful to attack all systems software. One can make files appear and disappear from the sight of readdir by putting a file with newline in its name earlier in the directory. Very interesting.
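For comparison, a sketch of a loop that does tell "end of directory" apart from "readdir failed", using the standard idiom of clearing errno before each call. The function name count_entries is invented for the example; whether a newline in a name would be reported through errno is the proposal under discussion, not current behaviour.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Counts the entries in a directory. Returns -1 with errno set if
 * opendir() or readdir() fails, so a partial listing is never mistaken
 * for a complete one. */
long count_entries(const char *path)
{
    assert(path != NULL);
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    long count = 0;
    for (;;) {
        errno = 0; /* readdir() returns NULL both at the end and on error */
        struct dirent *dp = readdir(d);
        if (dp == NULL) {
            if (errno == 0)
                break;          /* genuine end of directory */
            int saved = errno;  /* closedir() may overwrite errno */
            closedir(d);
            errno = saved;
            return -1;          /* error: the listing is incomplete */
        }
        count++;
    }
    closedir(d);
    return count;
}
```

Code that skips the errno check treats an error in the middle of a directory as the end of the listing, which is exactly how later entries would "disappear".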
jrincayc (reporter)
2013-08-19 12:34

a brouwer, For what it is worth, OpenBSD, NetBSD and Linux all currently allow filenames with newlines to be created by unprivileged users. (Probably other POSIX systems as well, but those are the ones I tried.) As I understand it, only users with the privilege to directly edit the directory blocks could create filenames with newlines once "Attempts to create files with names containing newlines fail" so only an attacker that already has elevated privileges could use this to attack software. (If you can directly edit the blocks in a directory, you probably can directly edit the blocks in a suid program.) NFS mounted filesystems might be an exception however.
oiaohm (reporter)
2013-08-19 14:55

"Comment: "Reality is that pipe of filenames to a program is mostly a shell problem." Reply: No, many GUI programs also need to store and exchange of filenames."
The reality is it's a shell problem; GUI programs don't have the problem. The specs define that all transfers of filenames will be url encoded unless in a nul terminated string. Graphical applications are not using raw \n terminated names. If you want \n terminated, it's url encoded with graphical. So graphical programs on POSIX systems don't store or exchange anything else if they are sticking to the common spec, and so don't have the problem. url encoding in fact does not have a problem with filenames containing strange chars.

This is where it gets problematic. Shell and graphical are in dispute. Graphical solved this problem years ago with a simple choice to encode all filenames in a uniform way for transfers between programs. Calling graphical tools with find and other POSIX tools should not be a problem, but the reality is url encoding is not an option there.

You find the upper graphical toolkits like GTK and Qt doing URL encoding wrapping. It's good quality URL encoding wrapping.

geoffclare, the big problem I have is: why on POSIX systems is the graphical and shell handling so different? Make shell follow graphical and the problem disappears. Yes, following the graphical model, find and other items would only return URL encoded names unless in a special mode where \n is not used and nul is used to terminate.

geoffclare, as jrincayc stated, no kernel has taken the 2011 alteration on board. With no adoption, it's not that successful.

The reality is the graphical programs don't have a problem: they can record file names with \n in them and they can access files with \n in them. Exactly why can we not have readdir, open and so on just support url encoding out of the box?

Control chars are simply an issue for those not using graphical libraries.

geoffclare, did you guys bother looking at what the graphical side could cope with? Graphical applications don't have the issue for many reasons: nul terminated strings are used when reading from the libc, and url encoding is used internally.

Graphical life could be made simpler if url encoding could be supported by normal C functions.

I have mentioned url encoding before and I was basically told don't be stupid. The command line supporting url encoding as an option would come into line with graphical and also cure the problem.

jrincayc, remember if your system is highly infected the malware will be using any method it can find. One of the current ones is monitoring accesses to directories and disconnecting and reconnecting files from the directory. This is not exactly the most dependable.

Really, using \n terminators to split filenames has always been wrong. Fixing \n does not fix the other problem chars. Hardening of shell scripts also needs to be considered.

To conform with graphical applications, it's url encoded or nul terminated; no other form of file-name can exist. Why? Because both of these options are fully cross platform.

Since graphical does not have a problem, it is highly unlikely that banning moves are going to be accepted by the kernels. All alterations to address this problem have to be restricted to user-space, libc and up. So yes, unprivileged users. If any idea has a requirement for privilege, you can pretty much write it off being accepted. The list of applications that can support these strange filenames is long.

--* New features should be added that provide administrators and power
users with a way to deal with filenames containing newlines (i.e. to
find them and then either delete or rename the files). The mechanism
chosen for this is a new file descriptor flag which readdir() can
query and certain utilities (such as find) can turn on when told,
via new options, to do so.--
What about on read only media. Yes no matter what you think that case does exist.

We are past the point of being able to say "hey, we forbid". We cannot have one section of the world with no issues and another section with issues, and then say we will limit the other one.

The reality is this is fully shell related. Graphical has it fixed. Now the choice is simple: do we follow the graphical or do we do something different?

Sanity for simpler coding would say follow the Graphical.
oiaohm (reporter)
2013-08-19 22:40

geoffclare, the read only media requirement is for forensics after a security breach.

So welcome to the devil in the details. This is why renaming, or running at higher privilege to access what some people call unsafe file-names, is not really an option. Normal userspace has to be able to handle them. A special mode in normal user-space could be tolerable. Forensics work you want to do with the minimum privilege possible.

There is a set of needs with very limited ways of addressing them. url encoding can in fact handle filenames with nul and / or \. Yes, this is the problem: the graphical design can handle every char in existence, yet the shell design is attempting to say "hey, we cannot".
random832 (reporter)
2013-08-19 22:57

oiaohm, the thing is - if url encoding is forced at the open/readdir/etc level, then the files don't _really_, from the point of view of any program written to the existing API, contain control characters, they contain percent sign and hex. And when your gui programs try to apply url-encoding, they'll end up presenting the filenames with %25xx because they don't know the kernel has already applied url-encoding.
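The double-encoding effect can be sketched in C as follows. The function percent_encode is a simplified, hypothetical encoder (it escapes only control bytes, '%', and bytes of 127 and above), standing in for whatever a kernel or toolkit would really do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Percent-encodes control bytes, '%', and bytes >= 127.
 * Returns 0 on success, or -1 if out is too small. */
int percent_encode(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    size_t pos = 0;

    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; p++) {
        int must_escape = (*p < 32 || *p == '%' || *p >= 127);
        size_t need = must_escape ? 3 : 1;

        if (pos + need + 1 > outsize)  /* keep room for the final NUL */
            return -1;
        if (must_escape) {
            snprintf(out + pos, 4, "%%%02X", (unsigned)*p);
            pos += 3;
        } else {
            out[pos++] = (char)*p;
        }
    }
    if (pos + 1 > outsize)
        return -1;
    out[pos] = '\0';
    return 0;
}
```

Encoding "a\nb" once gives "a%0Ab". If a program that does not know the name is already encoded applies the same step again, the '%' itself is escaped and the result is "a%250Ab", which is the %25xx form described above.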
dalias (reporter)
2013-08-19 23:21

oiaohm, read-only media also includes CD-ROM, DVD-ROM, NFS, CIFS (SMB), etc. Definitely not just forensics.

geoffclare, I don't believe the "fragmenting the standard" issue was sufficiently discussed before. Personally, I would love to have both newlines and EILSEQ banned from filenames, if such a requirement would be universally adopted. The problem is that some 90% or more of the deployed machines to which this standard is relevant (even though they lack certification) are based on Linux, and there is zero chance of getting Linus to accept anything that conflicts with the holy gospel of "filenames are abstract byte sequences" into the official Linux kernel. So we would be left with one of the following situations:

1. Completely ignoring the POSIX requirements, and reducing the relevance of POSIX, as it fails to reflect actual real-world systems.

2. Implementing the POSIX requirements as a Linux Security Module. This would allow systems to be configured in a conforming way (note that this option is already available as a conforming enhancement even without POSIX making any new requirements or allowances), but in practice they rarely would be, leading to application developers having to continue to support the possibility that filenames contain newlines or risk having vulnerabilities.

3. Implementing the POSIX requirements in userspace. This would just be a mess, as malicious users could make system calls directly to create filenames containing newlines, which would either be exposed to applications (if readdir failed to filter them) or which would be completely hidden from all users, including administrators, and could thereby prevent "rm -rf" from working.

Anyway, my opposition to the proposed change is not based on any belief that filenames should be allowed to contain newlines, but on a (well-justified, I believe) concern that victory over newlines would be a pyrrhic victory, leaving us with a much worse security situation, and much more fragmentation, than what we have now...
shware_systems (reporter)
2013-08-19 23:48

"What about on read only media. Yes no matter what you think that case does exist."
Not really, IMO... From the perspective of the standard and majority of implementations that object is no longer "read only media", it is "non-compliant hardware", as in "coaster" or "paperweight".

Two things the current discussion has brought up for me...
1) Is the standard explicit enough that portable applications can expect the file system, and file names, to adhere to the limitations of the "C" locale, or are file systems supposed to be compatible with the global implementation-defined "" locale?

2) The point raised about hardware failures and single bit errors corrupting file names stored in DIR sectors on a disk points out to me that most implementations rely on the ECC code used by the hardware at the sector level to indicate a portion of a directory may have been corrupted. To assist in error recovery attempts of this type, I'm wondering if it's plausible that each dirent should have a chksum field to localize which entry may be the one corrupted, when multiple entries fit into a single sector? I realize it wouldn't be backwards compatible with current file systems, but as an explicit implementation option it would be able to coexist with those partitions, I believe. For those file systems that didn't maintain them the relevant field added to dirent as returned by readdir() would be zeroed, and those that did a bit in the statvfs struct for the partition could indicate the support. I don't think additional fields a file system might use to make that reliable would need to be reflected in dirent.

Food for thought, anyways.
random832 (reporter)
2013-08-20 02:40

Since people are using Linux (and Linus Torvalds personally) as a bogeyman against this ever being implemented, does anyone have any links to posts by him (or other Linux kernel people or important developers of other systems) actually commenting on the issue?
dalias (reporter)
2013-08-20 03:23

Not exactly the same issue, but close: here is the classic email thread where Linus Torvalds put forth the "filenames are abstract byte sequences" viewpoint against having any UTF-8 policy in the kernel: [^]

At one point he mentions '\n' as an analogous case that applications must deal with escaping, just like invalid UTF-8 sequences.

Admittedly that is rather old, and his position may have evolved since then; I'd welcome further research on the matter. If I felt confident that we could get real-world implementations to adopt a policy banning \n in filenames, I would not be against it as long as the technical difficulties can be overcome.
shware_systems (reporter)
2013-08-20 06:38

As far as evolving goes: while his view that a file name is more of a locale-less blob than not still holds for many interfaces, the standard C interfaces for collecting user input and comparing it to those blobs now rely on LC_CTYPE and LC_COLLATE to give semantic context to the bits in his byte-sized char containers, which by extension of usage now include file names. I think the only way for the kernel to avoid this is if it's compiled with a freestanding C implementation and custom libraries, which I don't believe is the case. I haven't gotten a snapshot in a while, so I'm not sure what's current.

His argument that file systems should use UTF-8 rather than some locale-specific single-byte encoding has foundered, IMO, on the simple economic fact that no company I know of markets a UTF- or Unicode-specific keyboard to enter those file names with, and from what I've seen few implementations are going to spend the extra time in the device drivers converting from the 7- or 8-bit code set a keyboard does support to UTF-8.
dalias (reporter)
2013-08-20 07:02

shware_systems, please stay on-topic. I cited that thread not to discuss UTF-8 (which, by the way, all mainstream Linux distributions have used by default since c. 2007, and which does not require a "Unicode specific keyboard" to use) but as a look into Linus's position on filenames as raw byte sequences, which random832 requested.
dwheeler (reporter)
2013-08-20 14:52

It would be fairly easy to add to a kernel (e.g., the Linux kernel) the ability to configure which bytes/characters are allowed (or not) at the beginning, end, and middle of a filename, and whether or not UTF-8-ness of filename is enforced. I suggest only doing this check at creation time. Then the kernel provides a mechanism to enforce policy, while the actual *policy* is under control of the administrator.

I looked over Linus' (old) comments, and he clearly didn't like the idea that ALL systems would forbid non-UTF-8 files. But if this were run-time-configurable, it might be just fine.
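
As a rough illustration of the mechanism/policy split described above, here is a minimal userspace sketch in Python. It is not kernel code and not part of any proposal text; the `NamePolicy` class, its field names, and the sample policy values are all hypothetical, chosen only to show a creation-time check with separate rules for the first, last, and any byte of a filename, plus an optional UTF-8 check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamePolicy:
    """Hypothetical administrator-set policy; every field is an assumption."""
    forbidden_anywhere: frozenset = frozenset(range(1, 32))  # bytes 1..31
    forbidden_first: frozenset = frozenset(b"-")             # e.g. leading dash
    forbidden_last: frozenset = frozenset(b" ")              # e.g. trailing space
    require_utf8: bool = False


def creation_allowed(name: bytes, policy: NamePolicy) -> bool:
    """Return True if a new file may be created with this single path component."""
    if not name or b"/" in name or b"\0" in name:
        return False  # never valid as a filename, regardless of policy
    if any(b in policy.forbidden_anywhere for b in name):
        return False
    if name[0] in policy.forbidden_first or name[-1] in policy.forbidden_last:
        return False
    if policy.require_utf8:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True
```

Since the check would run only when a name is created, names that already exist and violate the policy would stay accessible; an administrator who wants no restrictions could pass empty sets.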
oiaohm (reporter)
2013-08-21 12:23

"oiaohm, the thing is - if url encoding is forced at the open/readdir/etc level, then the files don't _really_, from the point of view of any program written to the existing API, contain control characters, they contain percent sign and hex. And when your gui programs try to apply url-encoding, they'll end up presenting the filenames with %25xx because they don't know the kernel has already applied url-encoding."
In fact you do know when URL encoding has been applied: file:// is at the start. Yes, if you have "file://something" [^] as a real file name, then to get some graphical programs to use that file name as an argument you have to encode it; this is the only issue graphical programs currently have with strange names. Arguments passed to programs are NUL-terminated.

There is no reason why open/readdir/etc. could not take URL encoding as a flag. Following the freedesktop guidelines, open/readdir accept a NUL-terminated string, so the freedesktop requirement is met, since the name is NUL-terminated or URL-encoded. Note the "or". Something can accept both; if it does, it is ideally meant to have a flag to say which it is accepting.

Remember that the freedesktop guidelines set two requirements. First, NUL-terminated strings, which have zero problems with the strange chars. Second, URL encoding for cases where you want to use something like \n to make a list of files. So, following the freedesktop guidelines, all filenames on the shell should be URL-encoded, since you are using \n, space and other things as separators.
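
The double-encoding concern quoted at the top of this note, and the point that percent-encoding removes the separator ambiguity, can both be seen in a small Python sketch. It uses only the standard `urllib.parse` module; the sample filename is made up for illustration, and nothing here reflects an actual kernel or freedesktop interface.

```python
from urllib.parse import quote, unquote_to_bytes

raw = b"two\nlines%"   # hypothetical filename with a newline and a percent sign

once = quote(raw)      # one URL-encoding layer:  'two%0Alines%25'
twice = quote(once)    # a second layer on top:   'two%250Alines%2525'

# A single layer round-trips exactly and contains no raw newline,
# so a newline-separated list of encoded names is unambiguous.
assert unquote_to_bytes(once) == raw
assert "\n" not in once

# A second layer turns every '%' into '%25'; decoding once no longer
# yields the original name.
assert unquote_to_bytes(twice) != raw
```

So whichever layer applies the encoding, the other side has to know that it has been applied, which is what the flag idea above is meant to address.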
oiaohm (reporter)
2013-08-21 12:25

"It would be fairly easy to add to a kernel (e.g., the Linux kernel) the ability to configure which bytes/characters are allowed (or not) at the beginning, end, and middle of a filename, and whether or not UTF-8-ness of filename is enforced. I suggest only doing this check at creation time. Then the kernel provides a mechanism to enforce policy, while the actual *policy* is under control of the administrator."

dwheeler, no. Read the [^] again.

Linus Torvalds
"The kernel is _agnostic_ in what it does. As it should be. It doesn't
really care AT ALL what you feed it, as long as it is a byte-stream."

This is the key line you are missing, dwheeler. To fix this problem you are not allowed to use kernel space if you want it to work on the Linux kernel, end of story. The Linux kernel is the base of the dominant POSIX operating systems out there.

Next problem, dwheeler: you want to perform filtering. There is a reason why the Linux kernel keeps it simple: it avoids transforming as much as possible. So for file systems that support UTF-8, like the core Linux file systems (the ext file systems, xfs, btrfs and others), only the bare minimum of checks is performed. Performing the checks you are talking about increases code size in kernel space and increases the places for bugs in string handling.

dwheeler, think Denial of Service attack. An administrator makes a typo and forbids some char that is required. Is it possible, by forbidding chars that can be written to the file system, to leave a system you cannot log into? The answer is yes.

The userspace locale setting can already forbid creating file names containing particular chars on Linux. Anyone running particular backup programs from cron finds this out, because the backup will not create particular files: the default there is not UTF-8 but ASCII, i.e. 0-127. This prohibition does not stop files with invalid chars from being displayed.

dwheeler, so you are proposing double filtering, and so a double set of bug locations. "Is it the kernel or the userspace locale refusing to create a file?" is what your idea will cause. This is another reason why your filtering idea is so dead. An administrator can edit the locale tables as well, to forbid chars without touching the kernel today.

Yes, run-time-configurable file name char limiting exists today on Linux, in userspace. This is what you all have to get: kernel space is off limits for solving this problem.
oiaohm (reporter)
2013-08-21 12:49

"Not really, IMO... From the perspective of the standard and majority of implementations that object is no longer "read only media", it is "non-compliant hardware", as in "coaster" or "paperweight"."
Sorry, no. Do you know what write blockers are? For every type of storage media out there, there exists a matching write blocker. [^]
Every storage item can be in read-only mode. Some have a switch to do it; some have a block of hardware you attach. Sorry, that is the reality.

shware_systems, sometimes you should only raise things if you know them.
"2) The point raised about hardware failures and single bit errors corrupting file names stored in DIR sectors on a disk points out to me that most implementations rely on the ECC code used by the hardware at the sector level to indicate a portion of a directory may have been corrupted."

At what point do hard drives (the same for spinning and SSD) stop being able to correct with ECC? When they run out of spare sectors for SMART to replace with. Yes, even the newest hard drives can end up in a failed state, spitting up bit errors. Hard drives stop correcting once they are out of replacement sectors. At this point the corrupted sector might be flagged as such, but it will be passed to the OS in its damaged state.

There is a requirement to handle whatever comes when the crap hits the fan, whether hardware failure or having been attacked. Write blockers are used when maintaining evidence or performing data recovery. Why when performing data recovery? To make sure you don't destroy more data.

So how are you going to feel when a hard drive is starting to pack it in, and you could not copy off that very important business file because the OS would not allow you to, over some stupid "I don't allow X chars" rule, and then the drive dies completely?

"1) Is the standard explicit enough that portable applications can expect the file system, and file names, to adhere to the limitations of the "C" locale, or are file systems supposed to be compatible with the global implementation-defined "" locale?"

Linus Torvalds
"The kernel is _agnostic_ in what it does. As it should be. It doesn't really care AT ALL what you feed it, as long as it is a byte-stream."

This answers it. Native file systems in a system like Linux have no locale, just UTF-8 bytestream filenames. Linux was not the first to do this; it is common behaviour for most of the Unix OSes in existence. Applications are expected to cope with this fact. POSIX is failing to meet what is required. The default BSD and Solaris file systems also have no locale. The dominant kernels of the POSIX world have no-locale file systems.

Agnostic is how the Unix, BSD and Linux worlds of file systems work. There are debates from BSD that state the same thing as Linus. Agnostic is defined by the major parties who will implement the POSIX spec. Windows has a very poor POSIX implementation, so it really does not need to be shown special treatment.
steffen (reporter)
2013-08-21 13:12

I've just read Don Cragun's note `884' as of 2011-07-06 in [1]:

  The current plan is to add a set of byte values (based on single-byte characters in the C Locale) that will not be allowed in newly created filenames using 0000251 as the bug to make the changes. If consensus is reached on a resolution for bug 251, the plan is to reject and close bugs 243, 244, and 245. These three bugs will remain open until bug 251 is resolved

  [1] <> [^]

and just want to point out that IMHO this is a weird, non-backward-compatible plan, just as useful as trying to add a URL encoding layer or any other plan that imposes loss of backward compatibility and adds restrictions which require multiple parties to agree on the steps taken.

POSIX standardized tools in common use have already taken the necessary steps to overcome this problem on a per-utility basis via options like -print0, -z (BSD sort(1)), and many shells (mksh, ksh, bash) have adopted the -d option, as in this example from the mksh(1) manual:
  find . -type f -print0 | while IFS= read -d '' […]

This has been common practice all over the Unix community for many years.
(I must however admit that I have also written programs and tools that can be fooled by using control characters.)
Thank you.
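
To make the benefit of NUL-delimited output concrete, here is a small Python sketch (not taken from any of the utilities mentioned) that parses the same list of names twice, once newline-delimited and once NUL-delimited as `find -print0` would emit it. The helper names and sample filenames are hypothetical.

```python
def split_newline(stream: bytes) -> list:
    """Parse a newline-terminated name list (ambiguous if a name contains a newline)."""
    return [n for n in stream.split(b"\n") if n]


def split_nul(stream: bytes) -> list:
    """Parse a NUL-terminated name list; NUL cannot occur inside a filename."""
    return [n for n in stream.split(b"\0") if n]


names = [b"./plain.txt", b"./evil\nname.txt"]   # second name embeds a newline

as_lines = b"".join(n + b"\n" for n in names)
as_nul = b"".join(n + b"\0" for n in names)

print(split_newline(as_lines))  # three entries: the second name is torn in two
print(split_nul(as_nul))        # two entries, identical to `names`
```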
oiaohm (reporter)
2013-08-21 14:02

steffen, locale blocking remains outside kernel space. Locale blocking can also be overridden if required.

Adding URL encoding would be about alignment between shell and graphical. Yes, graphical follows freedesktop with NUL and URL, and the POSIX way was the broken separator.

For example, find being able to URL-encode for -exec would be useful for the bug cases that can affect graphical: file://something [^] issues, where the graphical program can get confused over whether that is a file or a URL pointing to a file.

Yes, a different bug. Graphical has a completely different issue. Of course it is solvable on both ends: one, the graphical application adds a flag to say that the argument is not URL-encoded; two, the shell adds a means to send URL-encoded names simply. The graphical and shell interlink is not great.

steffen, it's not just backwards compatibility; it's better compatibility going forwards as well that has to be considered.

Yes, URL encoding is more a solution that helps compatibility between the shell and graphical. At least there is an existing population of applications that can take advantage of URL encoding and that are not existing security risks.

URL encoding does not require kernel agreement. Any plan wanting kernel agreement is pretty much up the creek.

Basically, I am willing to tolerate some loss of backwards compatibility if the result going forwards does not increase the risk of data loss and makes graphical and shell more friendly with each other.

Shell and graphical are in conflict over how to address the issue. The graphical solution of URL-encoded and NUL-terminated names at least works without altering kernel space or restricting the chars that can be in file names (this is backwards-compatible access to data). Applications can be replaced; data not always.

steffen, I would not rule out breaking shell scripting to fix it.
oiaohm (reporter)
2013-08-30 04:44

Something I failed to mention: please look closer at what is happening inside Linux.

More and more init system solutions do not use the shell. Graphical programs use something like execv to call other applications if they have to, so avoiding the shell interpreter.

There are plans in the Linux kernel to kick text console processing out to userspace, to the point that it is optional for it to even be on the system.

This is the problem. This is talking about fixing an issue that could perfectly well fix itself, because the shell is given up on completely. So it is adding a restriction to the standard to fix something that ceases to exist.
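
As a small illustration of calling another program without a shell in between, the following Python sketch passes a filename containing a newline directly as an argument vector entry, which is what an execv-style call does. The filename is made up, and the child is simply the same Python interpreter echoing its argument back; a POSIX system is assumed. This is only a sketch of the idea, not a claim about how any init system or desktop is implemented.

```python
import subprocess
import sys

odd_name = "report\n2013.txt"   # hypothetical filename containing a newline

# The argument list goes straight to the exec call; no shell parses it,
# so no word splitting or quoting rules apply to odd_name.
child = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", odd_name],
    capture_output=True,
    check=True,
)

# What the child saw as its single argument, echoed back unchanged.
received = child.stdout.decode()
```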
safinaskar (reporter)
2014-12-05 14:45

At least, please add the -print0 option to find, and similar options to other utilities. This is my proposal: [^]
mirabilos (reporter)
2015-06-22 16:48

Changing something fundamental like this is bad. I see three causes of trouble here:

① data interchange with systems that do not implement this yet (think mounting foreign-OS filesystems, network, but even just tarballs) – this is problematic, but didn’t prompt me to comment yet…

② malware hiding in files with “forbidden” names, which other readers pointed out, is also a strong issue, one which I found while re-reading this report…

③ but the addition of new interfaces to access such files will be the cause of big and utter pain; another LWN reader commented on this, and this prompted me to re-read this and actually comment… before I was just thinking “oh no, but, meh, we can just ignore this”.

Honestly, I fully agree with the need to escape some kinds of filenames. Widely deployed, and easily later-added-to-existing-systems, measures like “print -0” and “read -d ''” exist. Actually, make that…

④ a further concern is that code doing 'foo `ls`' (instead of “xargs -0 foo --”) becoming “correct”, which will lead to more people writing code like that, which will lead to more bad (lacking enough proper escaping to be generally useful) code… I mean, ok, you can say “just implement POSIX”, but there’s enough existing systems around that that is… a rather unkind attitude to have, so I urge you to not do that.

Escape mechanisms for bad Unicode sequences in full Unicode APIs exist. I know of two:

⒈ MirBSD’s maps invalid Unicode into the range EF80‥EFFF when converting from 8-bit to Unicode (we use 16-bit Unicode, but this also works with 32-bit Unicode), and the other way back. On-disc “valid Unicode EF80” will be treated as invalid and converted to three codepoints (EFEE EFEB EF80), and be reconstructed from it – it’s in the PUA, so applications shouldn’t have been using it. The range used is actually coordinated with and allocated by the CSUR (ConScript Unicode Registry, the largest PUA users around). MirBSD uses this because there is only one locale which uses only one charset/encoding, which is UTF-8 (well CESU-8, but it’s the same for 16-bit Unicode systems) but 8-bit transparent, so we need to be able to round-trip arbitrary octets through wchar_t.

⒉ Python 3 maps them to lone low surrogates (U+DC80‥U+DCFF), i.e. ones which are (in the UCS-2/UTF-16 case) not preceded by high surrogates (and in the UCS-4 case, surrogates are invalid anyway). Py3k uses this because its string data type is actually Unicode (of an OS-dependent width) and is used by all APIs, but is required to be able to pass things like OS filenames (which may be encoded in something not UTF-8) through to e.g. the open function.
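
The Python 3 behaviour can be observed directly with the `surrogateescape` error handler, which is what `os.fsdecode()` and `os.fsencode()` apply to filenames on POSIX systems. A short sketch, using a made-up byte string that is not valid UTF-8:

```python
raw = b"caf\xe9.txt"   # hypothetical Latin-1 encoded name; 0xE9 is invalid as UTF-8 here

text = raw.decode("utf-8", errors="surrogateescape")

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9 ...
assert text == "caf\udce9.txt"

# ... and encoding with the same handler restores the original bytes.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

A plain `text.encode("utf-8")` without the handler would raise `UnicodeEncodeError`, so the escaped name cannot silently leak out as valid UTF-8.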

Dealing with NUL is also perfectly documented, and all those \n cases are just people using bad shell scripting and the standard not providing officialness stamps to the existing, widely deployed (over one *billion* Android devices come with mksh as the system – and only – shell installed; my contact at Google said in 2013? 2014? that they’re working on the second billion), perfectly usable, easily retrofitted, measures to fix them (see the other three issues about read -d, find -print0, xargs -0).

On that note, please do *not* standardise on GNU sort (-z for NUL-terminated), but on BSD’s:

mirbsd$ printf 'foo\0bar\0' | sort -R '' | hd
00000000 62 61 72 00 66 6F 6F 00 - |bar.foo.|
mirbsd$ printf 'foo/bar/' | sort -R / | hd
00000000 62 61 72 2F 66 6F 6F 2F - |bar/foo/|

… is much more flexible than GNU’s…

       -z, --zero-terminated
              line delimiter is NUL, not newline

… for the same reason the shell read built-in command has -d for the delimiter (first octet of $OPTARG is used) instead of -0.
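
The flexibility argument can be sketched in Python: a record sort that takes the separator octet as a parameter covers both the NUL case and the `/` case shown above, whereas a NUL-only switch covers just the first. The function name is hypothetical and this is not how either sort implementation is written.

```python
def sort_records(data: bytes, sep: bytes) -> bytes:
    """Sort sep-terminated records bytewise and re-terminate each with sep."""
    if len(sep) != 1:
        raise ValueError("separator must be exactly one octet")
    records = data.split(sep)
    if records and records[-1] == b"":
        records.pop()            # drop the empty piece after the final terminator
    return b"".join(r + sep for r in sorted(records))


print(sort_records(b"foo\0bar\0", b"\0"))  # b'bar\x00foo\x00'
print(sort_records(b"foo/bar/", b"/"))      # b'bar/foo/'
```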

- Issue History
Date Modified Username Field Change
2010-05-03 18:49 dwheeler New Issue
2010-05-03 18:49 dwheeler Status New => Under Review
2010-05-03 18:49 dwheeler Assigned To => ajosey
2010-05-03 18:49 dwheeler Name => David A. Wheeler
2010-05-03 18:49 dwheeler Section => XBD 3.170 Filename
2010-05-03 18:49 dwheeler Page Number => 60
2010-05-03 18:49 dwheeler Line Number => 1781
2010-05-03 19:37 dwheeler Note Added: 0000412
2011-03-10 16:04 msbrown Note Added: 0000689
2011-04-11 20:28 eblake Interp Status => ---
2011-04-11 20:28 eblake Note Added: 0000739
2011-04-11 20:28 eblake Summary Forbid bytes 1 through 31 (inclusive) in filenames => Forbid newline, or even bytes 1 through 31 (inclusive), in filenames
2011-04-11 21:17 dwheeler Note Added: 0000740
2011-07-07 15:31 user27 Note Added: 0000887
2011-12-17 21:43 dwheeler Issue Monitored: dwheeler
2012-02-22 06:48 oiaohm Note Added: 0001140
2012-02-22 19:25 dwheeler Note Added: 0001141
2012-02-22 19:38 wlerch Note Added: 0001142
2012-02-23 00:55 oiaohm Note Added: 0001143
2012-02-25 02:25 oiaohm Note Added: 0001147
2012-02-25 02:40 eblake Relationship added related to 0000545
2012-03-29 23:10 Don Cragun Relationship replaced has duplicate 0000545
2012-08-03 19:35 eblake Relationship added related to 0000291
2012-08-03 19:36 eblake Relationship added related to 0000293
2012-08-03 19:36 eblake Relationship added related to 0000573
2012-12-23 05:12 random832 Note Added: 0001437
2013-01-07 23:29 oiaohm Note Added: 0001438
2013-01-07 23:47 oiaohm Note Added: 0001439
2013-08-14 02:43 jrincayc Note Added: 0001711
2013-08-14 03:23 dwheeler Note Added: 0001712
2013-08-14 05:41 dalias Note Added: 0001713
2013-08-14 07:34 oiaohm Note Added: 0001714
2013-08-15 02:33 jrincayc Note Added: 0001715
2013-08-15 02:48 dalias Note Added: 0001716
2013-08-15 03:31 jrincayc Note Added: 0001717
2013-08-16 02:43 jrincayc Note Added: 0001721
2013-08-16 05:03 oiaohm Note Added: 0001722
2013-08-16 10:46 a brouwer Note Added: 0001723
2013-08-17 15:03 dwheeler Note Added: 0001724
2013-08-17 16:13 dalias Note Added: 0001725
2013-08-18 02:01 dwheeler Note Added: 0001726
2013-08-18 05:22 oiaohm Note Added: 0001727
2013-08-18 06:10 oiaohm Note Added: 0001728
2013-08-18 13:12 jrincayc Note Added: 0001730
2013-08-18 19:09 dwheeler Note Added: 0001731
2013-08-19 09:25 geoffclare Note Added: 0001732
2013-08-19 10:48 a brouwer Note Added: 0001733
2013-08-19 12:34 jrincayc Note Added: 0001734
2013-08-19 14:55 oiaohm Note Added: 0001735
2013-08-19 22:40 oiaohm Note Added: 0001736
2013-08-19 22:57 random832 Note Added: 0001737
2013-08-19 23:21 dalias Note Added: 0001738
2013-08-19 23:48 shware_systems Note Added: 0001739
2013-08-20 02:40 random832 Note Added: 0001740
2013-08-20 03:23 dalias Note Added: 0001741
2013-08-20 06:38 shware_systems Note Added: 0001742
2013-08-20 07:02 dalias Note Added: 0001743
2013-08-20 14:52 dwheeler Note Added: 0001744
2013-08-21 12:23 oiaohm Note Added: 0001745
2013-08-21 12:25 oiaohm Note Added: 0001746
2013-08-21 12:49 oiaohm Note Added: 0001747
2013-08-21 13:12 steffen Note Added: 0001748
2013-08-21 14:02 oiaohm Note Added: 0001749
2013-08-30 04:44 oiaohm Note Added: 0001779
2014-12-05 14:45 safinaskar Note Added: 0002480
2015-06-22 16:48 mirabilos Note Added: 0002730
