| ID | Category | Severity | Type | Date Submitted | Last Update | |||||||
| 0000251 | [1003.1(2008)/Issue 7] Base Definitions and Headers | Objection | Enhancement Request | 2010-05-03 18:49 | 2013-01-07 23:47 | |||||||
| Reporter | dwheeler | View Status | public | |||||||||
| Assigned To | ajosey | |||||||||||
| Priority | normal | Resolution | Open | |||||||||
| Status | Under Review | |||||||||||
| Name | David A. Wheeler | |||||||||||
| Organization | ||||||||||||
| User Reference | ||||||||||||
| Section | XBD 3.170 Filename | |||||||||||
| Page Number | 60 | |||||||||||
| Line Number | 1781 | |||||||||||
| Interp Status | --- | |||||||||||
| Final Accepted Text | ||||||||||||
| Summary | 0000251: Forbid newline, or even bytes 1 through 31 (inclusive), in filenames | |||||||||||
| Description |
Forbid bytes 1 through 31 (inclusive) in filenames.

POSIX.1-2008 page 60 lines 1781-1786 states that filenames (aka "pathname component") may contain all characters except <slash> and the null byte, and this has historically been true. However, this excessive permissiveness has resulted in numerous security vulnerabilities and erroneous programs. It also increases the effort to write correct programs, because correctly processing filenames that include characters like newline is very difficult (even the expert POSIX developers have trouble; see 0000248). The "Unix-Haters Handbook" specifically notes the problems caused by control characters (such as newlines) in filenames (see page 156-157), so this is not a new problem!

A key offender, of course, is the <newline> character. This is widely used as a filename separator, even though it is strictly speaking not a valid filename separator (since a filename may include it). But other control characters also cause problems. Another common problematic character is the <tab> character; files with records terminating in newline, and fields separated by tab, are extremely common, and are encouraged by some tools (e.g., by the default delimiter of "cut" and "paste").

Some terminals and terminal emulators accept control characters (range 1-31), e.g., via the escape character. Simply *displaying* filenames with bytecodes in this range can cause problems on such systems. (Granted, there is no requirement that terminals must only accept control characters in the range 1-31, but if 1-31 could not be in filenames, that would be a reasonable configuration to move to.)

In practice, the primary use of filenames with control characters appears to be to enable security vulnerabilities and to cause errors. Other than "we've always done it that way", there seems to be little justification for them.

By forbidding control characters, many programs that are currently erroneous (e.g., because they process filenames one line at a time) become correct. The number of applications this change would *fix* is far larger than the vanishingly small number of programs that non-portably *depend* on control characters being permitted in filenames. Any program that depends on control characters in filenames is *already* not portable, since control characters are not in the "portable character set" of filenames (Page 77, line 2194, XBD 3.276 Portable Filename Character Set). What's more, NTFS (and probably other filesystems) already forbid bytes 1-31 in filenames, so this is already an extant limitation. But merely making it "not portable" is not enough; it needs to be *forbidden* before correct programs can count on it.

This proposal suggests a new error, [ENAME], though other options are possible.

I'm well aware that this may be a controversial proposal. But I think most readers will understand *why* I am proposing this, and that while this is a dramatic approach, we are now in an era where POSIX systems are routinely used to manage important information worldwide. The ability to include control characters in filenames has rarely been a help, and instead has been a hindrance to the use of these systems. We've had ample time to see that this is a problem. It's time to jettison this misfeature. |
|||||||||||
| Desired Action |
Vol. 1, Page 60, line 1782-1784: Change:
"The characters composing the name may be selected from the set of all character values excluding the <slash> character and the null byte."
to:
"The characters composing the name may be selected from the set of all character values excluding the <slash> character, the null byte, and character values 1 through 31 inclusive. Attempts to create filenames containing bytes 1 through 31 must be rejected, and conforming implementations must not return such filenames to applications or users."

Vol. 1, Page 77, append after 2199:
The set of character values 1 through 31, inclusive, are expressly forbidden. Attempts to create such filenames must be rejected, and conforming implementations must not return such filenames to applications or users.

Vol. 2, page 480:
[ENAME] The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1382, in "open()", after line 45297, state:
"If open() or openat() is passed a path containing byte 1 through 31 (inclusive), it must reply with error ENAME instead of opening the file."

Vol. 2, page 1382, in open(), before line 45319, state:
[ENAME] The path name is not a permitted name, e.g., it contains bytes 1 through 31 (inclusive).

Vol. 2, page 1744, lines 55680-55682: Append:
"The readdir( ) function shall not return directory entries of names containing bytes 1 through 31, inclusive."

There would need to be other changes in the interfaces, but there's no point in identifying these other changes if this proposal will be flatly rejected. Since this involves filenames, which occur all over in the spec, it may be possible to have this change described in a few places instead of many. I've identified open() and readdir() here, because many other interfaces are built from these two. |
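The rule proposed above can be approximated at user level today. The following is only a sketch under stated assumptions: `has_ctrl_bytes` and `create_checked` are hypothetical helper names, not POSIX interfaces, and the refusal message merely stands in for the proposed [ENAME] error, which does not exist in any current standard.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the proposed rule; not a POSIX interface.

# has_ctrl_bytes NAME: succeed (exit 0) if NAME contains any byte in 1..31.
has_ctrl_bytes() {
    # Delete bytes 001-037 (octal) and see whether anything was removed.
    # LC_ALL=C makes tr operate on bytes rather than locale characters.
    # The trailing "x" protects trailing newlines from command substitution.
    stripped=$(printf '%sx' "$1" | LC_ALL=C tr -d '\001-\037')
    [ "$stripped" != "${1}x" ]
}

# create_checked NAME: create an empty file unless NAME is forbidden.
create_checked() {
    if has_ctrl_bytes "$1"; then
        printf 'refused: name contains a byte in 1..31\n' >&2
        return 1
    fi
    : > "$1"
}
```

A script could call `create_checked "$name"` wherever it would otherwise create a file from untrusted input. This only guards the script's own file creation; names created by other programs are unaffected, which is why the proposal asks for enforcement in open() and readdir().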
|||||||||||
Notes |
|
|
(0000412) dwheeler (reporter) 2010-05-03 19:37 |
Note that if values 1..31 were forbidden in filenames, widely-used constructs like these would become correct:

    # Only correct if tab and newline cannot be in filenames:
    IFS=`printf '\n\t'` # Remove 'space', so filenames with spaces work well.
    for file in `find . -type f` ; do
      COMMAND "$file"
    done

More rationale for this proposal can be found here:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
http://www.dwheeler.com/essays/filenames-in-shell.html |
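To see why the loop in this note is only correct under the proposed restriction, the following sketch counts the files in a directory two ways: by splitting the output of find on newline and tab as the note does, and by letting find hand each name to a child shell as one argument. The function names are illustrative only.

```shell
#!/bin/sh
# Sketch: compare the line-oriented loop from the note with a form that
# never parses find's output. Function names are illustrative only.

# count_by_lines DIR: split find's output on newline and tab, as in the note.
# The body runs in a subshell so IFS and "set -f" do not leak to the caller.
count_by_lines() (
    IFS=$(printf '\n\t')
    set -f                      # no pathname expansion on the split words
    n=0
    for file in $(find "$1" -type f); do
        n=$((n + 1))
    done
    echo "$n"
)

# count_by_exec DIR: find passes each name as one argument; nothing is split.
count_by_exec() {
    find "$1" -type f -exec sh -c 'for f do echo x; done' sh {} + |
        wc -l | tr -d ' '
}
```

In a directory holding one ordinary file and one file whose name contains a newline, `count_by_lines` reports 3 while `count_by_exec` reports 2. If newline and tab could not occur in filenames, both would always agree.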
|
(0000689) msbrown (manager) 2011-03-10 16:04 |
Another data point: The encodings 0x00-0x1f are "control characters" in the "standard" z/OS UNIX System Services code page IBM-1047. Note that "line feed" is 0x25 in EBCDIC/IBM-1047, but the C language '\n' is 0x15 (EBCDIC "new line"). |
|
(0000739) eblake (manager) 2011-04-11 20:28 |
I have retitled the bug; ongoing discussion is still debating the best solution, but any solution is more likely to gain consensus if it only forbids newlines (which is the main source of ambiguous output on any utility that produces line-oriented listings of file names) than to forbid all control characters. |
|
(0000740) dwheeler (reporter) 2011-04-11 21:17 |
Forbidding just newline would be better than the current state. So if that more limited goal is truly necessary to make progress, then sure!

That said, perhaps there's an alternative in the middle. As noted above, there are two other control characters that cause heartburn: tab (widely used in tab-delimited fields and as IFS whitespace) and ESC (widely used for terminal escapes). If *all* control characters can't be forbidden, can it be at least those three? That would simplify processing (tab-delimited fields work, and IFS of \n\t works nicely), and display of filenames is easier too (no ESC).

What are the critically-important use cases where control characters in filenames are REQUIRED for applications to work properly? Especially since several filesystems (such as NTFS) essentially forbid them already? I haven't seen a lot of critically-important use cases documented, and control characters have *never* been in the portable character set for filenames. |
|
(0000887) seedsg (reporter) 2011-07-07 15:31 |
This proposal is fine for applications that depend on the POSIX portable character set, but leaves others in the lurch. It would be preferable to restrict file names to printable characters, space, and NBSP. |
|
(0001140) oiaohm (reporter) 2012-02-22 06:48 |
Let's look at this differently: why do we have to forbid anything? Programs feeding data to the shell cannot use \0 (NUL) as a line break. This is a double-edged sword. We know that stdin and stdout are still open after getting a \0 char.

    IFS=`printf '\0'`
    for file in `find . -type f` ; do
      COMMAND "$file"
    done

The command should have no issue taking a null-terminated string. Why? Because int main (int argc, char *argv[]) is exactly what a C argument is. IFS becomes a problem because applications don't support accepting \0: they determine string length from the NUL terminator instead of from the point where the file handle feeding them information ends. Is there a reason we don't have two of them: one IFS for applications feeding information to the shell, and one IFS for the screen? IFS=\0 is good for applications feeding into other applications, but not that great for feeding to the screen. \0=\n on screen would be ideal in lots of cases. Also, on screen, \n (newline) could be displayed as \n, and so on for the other control chars.

What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged? Basically, to support UTF8 in case of malfunction this is not possible, because we cannot say that a file name displaying 31 might not turn up due to file system damage. It's bad enough with \0.

The issue is the shell's handling of input and output. The - issue exists as well. It is also a shell issue, but it's harder to solve. Applications don't have a clear marking in int main (int argc, char *argv[]) of what is a directive to do something and what is data to be used. Flags on argv contents would be handy in this regard. The simplest method I can think of is an environment variable containing them that is cleared automatically by the next exec command.

All chars bar the <slash> character and the null byte should be able to be handled in file names, because we have two chars that cannot be in files.

${VAR} exists; $(type){VAR} could also be very handy, i.e. with $(safe){VAR} control chars and so on become their \ equivalents.

Basically there are tons of options without restricting filenames. The big one is to improve shells: add better options to shells. Another is to work out how to type argv inputs, so applications can know the difference when rm -i * just happens to find a file called -rf and deletes everything without asking.

We have to remember that this wildcard char problem does not just apply to filenames. Bash can read data from keyboard input, so your problem here does not end with file names.

    read in
    COMMAND $in

might just happen to send a control char into a command. I might type anything, or it might be the raw contents of a file. Fixing filenames does not fix Posix Shell bad handling of chars; this makes POSIX shells "here be dragons". |
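The `-rf` case mentioned in this comment has two conventional mitigations in scripts today: prefix the glob with `./`, or end option parsing with `--`. The sketch below only demonstrates what the glob hands to a command; `first_arg_plain` and `first_arg_prefixed` are illustrative names, not existing utilities.

```shell
#!/bin/sh
# Sketch: how a file literally named "-rf" reaches a command through a glob.
# first_arg_plain / first_arg_prefixed are illustrative names only.

# first_arg_plain DIR: first word produced by the bare glob *.
first_arg_plain() (
    cd "$1" || exit 1
    set -- *
    printf '%s\n' "$1"
)

# first_arg_prefixed DIR: first word produced by ./*, which can never
# begin with "-" and so cannot be mistaken for an option.
first_arg_prefixed() (
    cd "$1" || exit 1
    set -- ./*
    printf '%s\n' "$1"
)
```

In a directory whose only entry is a file named `-rf`, the first function prints `-rf` and the second prints `./-rf`. Writing `rm -i ./*` or `rm -i -- *` therefore avoids the accident described above, although it relies on every script author remembering to do so.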
|
(0001141) dwheeler (reporter) 2012-02-22 19:25 |
Comment #1140 says: "Fixing filenames does not fix Posix Shell bad handling of chars"

This is NOT just a problem with POSIX shell. The problem is systemic. Almost all programs that deal with filenames have trouble, regardless of language, because POSIX permits filenames to include control chars. For example, it's common to store filenames in text files. The "obvious" way is by using newline for each row (tab-separated if there is more than one field). Of course, this fails. Even tools like "pax" are fundamentally broken - if they list filenames, you can't tell if the newline is part of a filename or not. Yes, you can fix this using escape sequences, or by using \0 separators, but most people don't do this because people generally have no *use* for control chars (e.g., newline, tab, ESC) in filenames.

As I said earlier, the primary use of filenames with control characters is "to enable security vulnerabilities and to cause errors". Other than "we've always done it that way", there seems to be little justification for them. Filenames already can't include "/" and \0, so their charset is ALREADY limited.

The comment asks, "What happens when we have a UTF8 char that contains 1-31 as one of its bytes and it gets damaged?" I think that's easily answered:
* If an application is trying to *create* a filename with 1-31, then a compliant POSIX system should reject the request. POSIX already *permits* this; such chars have never been in the portable character set.
* If the filename is damaged internally, then there are many options. For example, separate repair tools could be used to fix it (we already have to do this if a filename contains '/'). A system could provide them, escaped. Less good would be a system that permitted people to read and rename them, just not create them. POSIX could allow a range of options, or require one in particular.

The comment said: "Basically there are tons of options without restricting filenames. Big one is improve shells. Add better options to shells."

Actually, I agree that we should improve some POSIX utilities. I think it's important to also add \0 support to a few key utilities (find, xargs, read). But adding \0 capability to *everything*, so that full filenames can be handled, is absurd; there's a huge list of tools that process text files. It's better to ease transition (by improving a few selected tools) and create a better place to transition to. |
|
(0001142) wlerch (reporter) 2012-02-22 19:38 |
For the record, the only UTF-8 characters that contain the bytes 1-31 are the control characters 1-31 themselves. Any Unicode character outside of the ASCII range is encoded with bytes greater than 127. |
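This property follows from the UTF-8 encoding itself: lead bytes of multibyte sequences are 0xC2 or higher and continuation bytes lie in 0x80-0xBF. A quick way to check it on sample text is to dump the byte values and look at the smallest one. `min_byte` is an illustrative helper name, and the two sample characters are an arbitrary choice.

```shell
#!/bin/sh
# Sketch: list the byte values of a UTF-8 string and find the smallest one.
# min_byte is an illustrative helper name.

# min_byte: read stdin, print the smallest byte value seen (decimal).
min_byte() {
    od -An -v -tu1 | tr -s ' ' '\n' |
        awk 'NF { if (m == "" || $1 + 0 < m) m = $1 + 0 } END { print m }'
}

# U+00E9 (e with acute) is C3 A9 and U+20AC (euro sign) is E2 82 AC in UTF-8.
printf '\303\251\342\202\254' | min_byte    # prints 130 (0x82)
```

Any result below 32 for a filename would mean that the name really contains one of the ASCII control characters, not a fragment of some other character.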
|
(0001143) oiaohm (reporter) 2012-02-23 00:55 |
wlerch, in case of bit errors it still turns up: something in a UTF-8 string that should be above 127 ends up below it because one bit has flipped. Being able to see a file no matter how damaged is critical.

dwheeler, the big questions here are: why does the shell ever have to show the non-escaped form of a file name? Why do applications ever need to see the non-escaped form of filenames from directory listings or other lookups?

dwheeler: "Even tools like "pax" are fundamentally broken". You say it's fundamentally broken. That is the point: it is fundamentally broken and should be fixed, not hidden.

You should see the hell you cause on a Windows system when you write files containing \n into NTFS. The physical limit of NTFS is the same as ext2/3/4: everything bar slash and NUL can be written. It's the win32 subsystem that prevents you from using control chars. Yes, you can ruin Windows the same way. An attacker does not have to play by the rules; if the filesystem can store it, they can do it. POSIX is not the OS kernel, so it really cannot set what file names we actually get.

Would it not be simpler, and more complete in coverage, to demand that all file names must be in a particular escaped form by default? This cures the issue. The filesystem can store whatever char it likes, with no error. When we handle filenames they are always escaped.

The correct solution really is escaping. This can fix a lot of things, like rm -i *: a file -rf is escaped so that rm sees \-rf and does not malfunction on the user by thinking it was given rm -i -rf <plus everything else * found> when the user runs rm -i *. There are a lot of malfunctions like this.

A little example of the space error: ls `echo *` with a file "this that". Due to no escaping, ls comes back saying that it cannot find this and that. The reason is that the shell finds breaks in the command line where spaces exist, so it thinks this and that are different arguments and passes them to ls that way, going against what the code would appear to do. This is a far more likely error for a person to hit than control chars in file names. It will be hidden until a script runs over a directory with a filename containing a space. With escaping, this\ that is returned by *, and as long as it doesn't get processed out by echo there are no problems.

This is the reason why the % escaping of URIs (Uniform Resource Identifiers) might have to be considered. It's something normal POSIX tools are not going to mess with. You request a directory list; libc serves it up pre-escaped. This just kills any chance of a control char, newline, or tab, or whatever char is going to cause your program trouble, making it through. Even space in filenames is no longer a problem by default. Why? You escape space so that filenames are always one constant string with nothing evil in it. libc needs to get a standard escaping function or use a pre-existing one.

A person can pick any split char they like; it does not have to be \n, \t, or a control char. A person could pick - or space or any other stupid splitter like a comma. The important part is that it cannot be in the file name the application sees, but that filename must be able to exist on disc. This is where escaping fits in: hide from the program what it cannot handle seeing.

A simple 100% cure: a set-escaping function, that is, a function that tells libc "I want my file names escaped on all calls by this particular method", with a default that file names are escaped by something set in the standard. To see and take raw binary filenames you have to call the set-escaping function to turn it into a no-op, because by default it is on. This is transparent escaping. Could it possibly break a few programs? Really no, if the libc requirements are right, so that open will take a non-escaped string, notice "oops, we have a problem, those chars should not be in there", and open the non-escaped version. Also, for old programs you could include an option to force the program to run with escaping off.

\0 support is required to work without escaping. So making it a mandatory option on quite a few programs would be a good idea.

dwheeler, to be truthful, the errors you are talking about come from one simple fact: programmers are forgetting to do secure programming. They don't escape inputs or filenames, so things go badly wrong. Also, because file names and program directives can currently look identical to programs, we get malfunctions.

dwheeler, you are looking too small. The problem of filenames is huge; you are only addressing a small corner case. Escaping can address almost all cases with one set of alterations to core parts. Currently non-escaped is the default and you have to remember to escape stuff or things go badly wrong. Security flaws appear.

With escaping by default, the programmer cannot make the mistake of forgetting to escape, and lots of issues disappear, including non-control-char ones like - starting filenames and filenames containing spaces. That is why escaping gets on top of the problem. Shells will have to be made to handle taking file names as the user typed them and performing the escape on them.

Really, what need is there for a portable charset for filenames on my own system if the tools can handle anything on my system? The portable character set should really only be for cross-OS use. If you need to use it to limit the user, you have a problem with the tools' security. |
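As a rough illustration of the URI-style escaping this comment argues for, the sketch below percent-encodes the bytes that most often break line- or word-oriented processing. `pct_escape` is a hypothetical helper, not an existing libc or POSIX interface, and the choice of which bytes to escape (0-32, `%`, and 127 upward) is an assumption made for this example only.

```shell
#!/bin/sh
# Sketch of the escaping idea: percent-encode bytes that commonly break
# line- or word-oriented processing. pct_escape is a hypothetical helper;
# the set of escaped bytes is an assumption, not taken from any standard.

# pct_escape: read raw bytes on stdin, write the escaped form on stdout.
pct_escape() {
    od -An -v -tu1 | tr -s ' ' '\n' | while read -r b; do
        [ -n "$b" ] || continue
        if [ "$b" -le 32 ] || [ "$b" -eq 37 ] || [ "$b" -ge 127 ]; then
            # control chars, space, '%', DEL and non-ASCII bytes
            printf '%%%02X' "$b"
        else
            # emit the byte unchanged via its octal escape
            printf "\\$(printf '%03o' "$b")"
        fi
    done
    echo
}
```

For example, `printf 'this that' | pct_escape` prints `this%20that`, and a name containing a newline comes out on a single line with `%0A` in its place. A leading `-` is left alone in this sketch, so it does not address the option-confusion problem on its own.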
|
(0001147) oiaohm (reporter) 2012-02-25 02:25 |
dwheeler, I have opened this new one with my solution: http://austingroupbugs.net/view.php?id=545

I think my solution is the best. I am removing a limit from POSIX as well as solving a problem once and for all. Scarily enough, my change will allow filenames on disc containing the NULL byte and slash without upsetting anything long term. Applications using URIs will have to get used to the idea that they don't have to encode and decode any more, since POSIX does that bit. Applications working with raw names will have to get used to the idea that they have to encode, but they will still be usable by users, just with a little horrid typing of %stuff.

The big issue is that you say that if a file name contains a particular char it is forbidden to be opened, created, or otherwise used. The only party that can see what is allowed to exist on disc is the party that creates the file system.

From my point of view this one should be closed and 545 pushed forward, since 545 is a more platform-neutral solution. Yes, there will be some overhead. Even better, 545 will most likely be doable at kernel level: alter the file-system drivers. |
|
(0001437) random832 (reporter) 2012-12-23 05:12 |
oiaohm, if a filename is "damaged" by bit errors, it might contain a slash or null byte - is dealing with a damaged filesystem really within the scope of POSIX? Such filenames can be fixed [moved to a different name with offending bytes removed] by a system-specific "fsck" tool. And anyway, it's already legal for an implementation to forbid control characters in filenames; the proposed change is to make it mandatory. |
|
(0001438) oiaohm (reporter) 2013-01-07 23:29 |
random832, if you look at MS Windows and its forbidden chars, with the | in particular, damage due to bit errors is not the only issue. Software viruses have also exploited the fact that | would cause a filename to be hidden from user view and from anti-virus software. Newer versions of Windows are reducing the number of forbidden chars. This is why I particularly do not like the idea.

On XFS and EXT3, along with other filesystems, it is valid to have filenames containing those chars. This proposal contains no migration plan to cope with what could possibly be out there.

http://austingroupbugs.net/view.php?id=545
I have come to the point of view that the shell in the standard needs to be fixed. fopen .... I have not really found a POSIX ABI call with an issue with these chars.

random832 the idea that filenames don't exist in the. Also to be O F. Is the fact that under Fat file systems 0,7-10,13 dec or 0,7-A,D hex are the only ones that are not printable on screen as single chars.
http://www.jimprice.com/jim-asc.shtml
This is what DOS ANSI allows. Yes, Microsoft's two primary file systems don't agree on what chars you can use in a file name.

One of the other resolutions to this problem might be to follow how DOS deals with it: assign printable symbols to 1-31 and add a mode to printf for processing out filenames.

random832, can you promise me that no future OS will do the DOS method again and assign printable chars to control chars, thereby allowing file names to contain them? |
|
(0001439) oiaohm (reporter) 2013-01-07 23:47 |
random832, the big question here is why dosfsck should fix a file system that is not technically broken. The idea that the fsck tools can fix it is basically saying "let's fix something that is not broken" instead of fixing the fault in POSIX.

If you want simple printable chars for the control codes, Unicode defines one: a simple squarish box with the number of the char inside. A lot of fonts fail to provide this. Same with terminals. So printing control chars to screen should be fairly issue-free as long as a print function exists in the API that disregards control chars and uses the replacement.

Basically we have a historic terminal design fault. clocal and other settings can be pushed to the terminal to disable particular control chars, such as modem control chars, from working. So yes, another solution is a simple flag. |