0001250: sh input file restrictions are too strict

ID	Project	Category	View Status	Date Submitted	Last Update

0001250	1003.1(2016/18)/Issue7+TC2	Shell and Utilities	public	2019-05-16 19:22	2024-06-11 09:08

Reporter	eblake	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted

Name	Eric Blake
Organization	Red Hat
User Reference	ebb.sh.input
Section	XSH sh
Page Number	3228
Line Number	108358
Interp Status	---
Final Accepted Text


Summary	0001250: sh input file restrictions are too strict
Description	The standard has a well-defined definition of a text file:

    3.403 Text File
    A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.

which, after chasing the definition of character, really translates to four requirements, where the last one is locale-specific:
- limited line lengths
- no NUL characters (for that matter, no NUL bytes, since no other multibyte character can contain NUL)
- if non-empty, must have trailing <newline>
- no encoding errors (a file that is text in Latin-1 might be binary in UTF-8)

The standard already exempts one of the four requirements when describing valid input files for the 'sh' utility, stating that the shell must handle arbitrary line lengths. However, the remaining three requirements are still too strict:

- a non-empty text file requires a trailing <newline>, however, most shells happily treat EOF as an acceptable token terminator:

    $ printf 'echo hi' > ./script
    $ chmod +x ./script
    $ sh ./script && echo $?
    hi
    0
    $ sh -c ./script && echo $?
    hi
    0

So, this file is not a text file (no trailing newline), yet the shell handled it just fine, and our requirement is probably too strict

- there are existing 'self-extracting scripts' in the wild, which are basically the concatenation of two separate files: a shell script in the first half, and a binary payload in the second half. While the standard already documents the uuencode utility as a way to create a self-extracting payload that consists only of portable characters, the translation from binary data to encoded payload inflates the file size, so various authors have distributed files that instead directly embed a binary payload for minimal size. (In fact, while writing this up, I did a google search for "self-extracting script", with the first hit https://www.linuxjournal.com/node/1005818 being a guide on creating such files).

Stated in another way, any shell script that terminates (whether by syntax error, 'exit', 'exec' of a replacement process, etc) mid-file does not care whether the rest of the file would qualify as a text file, because the shell does not parse that far ahead (although precisely defining how a script terminates may be akin to attempting to solve the Halting Problem). For a concise example:

    $ printf 'echo hi && exit\n\0\n' > ./script
    $ chmod +x ./script
    $ sh ./script && echo $?
    hi
    0
    $ sh -c ./script && echo $?
    hi
    0

So, this file is not a text file (it contains a NUL byte), yet the shell handled it just fine, and our requirement is probably too strict

- a shell script might contain encoding errors in one locale, but still manage reasonable execution (here, using GNU grep 3.1 which has different behavior for encoding errors to demonstrate the issue):

    $ printf 'echo \x85 >/dev/null\necho hi\n' > ./script
    $ LC_ALL=C grep . ./script | LC_ALL=C grep -v echo
    $ LC_ALL=en_US.UTF-8 grep . ./script | LC_ALL=C grep -v echo
    Binary file ./script matches
    $ chmod +x ./script
    $ LC_ALL=en_US.UTF-8 sh ./script && echo $?
    hi
    0
    $ LC_ALL=en_US.UTF-8 sh -c ./script && echo $?
    hi
    0

So, this file is sometimes not a text file (it contains encoding errors if sh is invoked in a UTF-8 locale), yet the shell handled it just fine, and our requirement is probably too strict (although we could also lean towards just requiring the user to invoke the script using a locale that does not cause encoding errors)

In short, requiring a 'text file' with one exception is inappropriate, and we are better off coming up with a different description of what forms a valid input file to the shell.
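Two of the text-file requirements listed in the description (trailing <newline>, no NUL bytes) can be checked mechanically. The following is a minimal sketch only: check_text_file is a hypothetical helper, not a standard utility, and it deliberately ignores the line-length and encoding requirements.

```shell
# check_text_file is a hypothetical helper, not part of any standard.
# It reports two of the text-file violations discussed in the description.
check_text_file() {
    f=$1
    # A non-empty text file must end with a <newline> (byte 0x0a).
    if [ -s "$f" ]; then
        last=$(tail -c 1 "$f" | od -An -tx1 | tr -d ' \n')
        [ "$last" = 0a ] || echo "no trailing newline"
    fi
    # A text file must not contain NUL: deleting NULs must not shrink it.
    total=$(wc -c < "$f")
    without_nul=$(tr -d '\000' < "$f" | wc -c)
    [ "$total" -eq "$without_nul" ] || echo "contains NUL"
    return 0
}

# The two example scripts from the description, written to a scratch directory.
demo_dir=$(mktemp -d)
printf 'echo hi' > "$demo_dir/no_newline"
printf 'echo hi && exit\n\0\n' > "$demo_dir/with_nul"
check_text_file "$demo_dir/no_newline"   # prints: no trailing newline
check_text_file "$demo_dir/with_nul"     # prints: contains NUL
rm -rf "$demo_dir"
```

Both files are rejected as text files by this check, although the description shows shells running them without complaint.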
Desired Action	On page 3228 line 108358 (sh INPUT FILES) change:

    The input file shall be a text file, except that line lengths shall be unlimited.

to:

    The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar (XREF to XSH 2.10.2 Shell Grammar Rules) shall consist of characters and shall not contain the NUL character. The shell shall not enforce any line length limits.

On page 3242 line 108994 (sh RATIONALE) add a new paragraph:

    Earlier versions of this standard required that input files to the shell be text files except that line lengths were unlimited. However, that was overly restrictive in relation to the fact that shells can parse a script without a trailing newline, and in relation to a common practice of concatenating a shell script ending with an 'exit' or 'exec $command' with a binary data payload to form a single-file self-extracting archive.
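The self-extracting archive practice named in the proposed rationale can be sketched as follows. This is an illustration under stated assumptions, not text from the standard: make_selfextract and all file names are hypothetical, the payload is a few arbitrary bytes rather than a real archive, and the shell portion ends with 'exit' on a line of its own so that no binary data shares a line with anything the shell has to parse.

```shell
# make_selfextract is a hypothetical helper; the file names are illustrative.
# Usage: make_selfextract PAYLOAD OUTPUT
make_selfextract() {
    {
        # Shell portion: exactly three lines of text, each newline-terminated.
        printf '%s\n' '#!/bin/sh' \
            'tail -n +4 "$0" > "${1:-payload.out}"' \
            'exit 0'
        # Binary portion: appended verbatim, so it starts at line 4.
        cat "$1"
    } > "$2"
}

demo_dir=$(mktemp -d)
# Payload with a NUL byte and a byte (0x85) that is an encoding error in UTF-8.
printf 'binary\0payload\205\n' > "$demo_dir/payload.bin"
make_selfextract "$demo_dir/payload.bin" "$demo_dir/selfextract"

# Run the combined file as "sh file"; it copies its own tail to the named output.
sh "$demo_dir/selfextract" "$demo_dir/payload.out"
cmp -s "$demo_dir/payload.bin" "$demo_dir/payload.out" && echo "payload recovered intact"
rm -rf "$demo_dir"
```

The combined file is not a text file, yet only its first three lines ever need to satisfy the shell grammar, which is the situation the proposed wording is meant to permit.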
Tags	tc3-2008

kre 2019-05-16 22:46 reporter bugnote:0004396	I fully agree with deleting the trailing newline requirement (and so allowing a file that is not a "text file" from that point of view). Shells already deal with "scripts" without newlines as input to "sh -c" and eval all the time - no-one would ever propose requiring a trailing newline on those, and if the shell can handle reaching EOF as correctly terminating the script in those cases, it can for files as well.

Or alternatively, if "sh file" works, then (provided the file is not so big as to exceed the arg length limit) so should

    sh -c "$(cat file)"

And if the latter works, so should the former. For the latter it is clearly immaterial whether the file ends in a \n as the $() would remove it.

For \0 chars (or bytes) my impression is that most shells simply ignore them (the way the byte was always meant to be used - it appears on paper tape input where the tape has been scrolled forward to check its contents, and then repositioned back - but not to the exact next position, leaving a few \0 bytes between the last valid char and this one). However, I also agree that we should not require that behaviour, and that scripts should not contain nul chars if they expect to be processed correctly.

The "the initial portion of the file intended to be parsed according to..." is a bit vague however. Some shells have been known to test the first 512 bytes of the script for anything that looks "binary" (whatever their interpretation of that is). That is, a script that is (entirely)

    exec gzip -d
    <gzip'd data starts here>

is not necessarily going to work. Further you cannot really expect the shell will stop parsing at the exact point where it will stop executing, even where that stop is not conditional. That is,

    exit ; <<>> foo (any binary gibberish)

is not going to work, as shells parse (at least) the complete line before executing any of it.

There is (or perhaps should be) a difference between what is expected to work when run as "sh file" (one might argue there that the shell should not test the input at all, but simply execute it) and what is expected to be run for "./file" from inside the shell (where file has execute permission, but the exec fails with an ENOEXEC error). That latter case is probably the only one where the shell should be doing any analysis of the content of the file, so that random commands that are intended for some other system architecture (ARM binaries in an x86 system for example) won't be run as a sh script (unless the user actually says it should be done).

So for the "INPUT FILES" section I'd simply remove all restrictions on what the file can contain ... if it doesn't survive tokenisation and the grammar (which, for example, a file with \0's in it would not, if the shell does not simply ignore them) then there will be errors and the script will fail. (Not part of the standard, but this also allows anything to be in a #!/bin/sh file, as those are run as "sh file" effectively - whatever their content).

Whatever restrictions on input format we want to be able to test for would be better placed in XCU 2.9.1.1.1.e.i.b (page 2367, around line 75586-9). That's where it makes sense to verify the file format before arbitrarily deciding to run it as "sh file" when the user said just "file" and it was found via a PATH search (or the user gave a path name). [Aside: that section number assumes that the resolution of issue 1227 does not fix that nonsense numbering...]

There is also a relationship here with the changes being made (not yet finished) for issue whatever it is (I have temporarily lost its number...) which is specifying just how much data the shell should parse before it starts executing any of it. In particular, for current purposes, whether the file argument to the "." (dot) special built-in is an "input file" in the sense being considered here.

At the very least there probably needs to be some wording somewhere saying that it is not true that any file that can be run as "sh file" is suitable to be run as ". file" (assuming the same file is located in both cases - eg: for a fully qualified path name). (We know the reverse is not true, as the file for ". file" can contain a "return" command, which would be undefined if the script was run as "sh file").
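Two points from this note can be tried directly: that "sh file" and sh -c "$(cat file)" agree for a script with no trailing newline, and that a file using "return" is usable with "." even though running it as "sh file" would be undefined. The sketch below is only an illustration; the temporary file and the variable names are assumptions, and the undefined "sh file" case is deliberately not run.

```shell
script=$(mktemp)

# A script with no trailing newline: EOF terminates the last token.
printf 'echo hi' > "$script"
from_file=$(sh "$script")
from_c=$(sh -c "$(cat "$script")")
[ "$from_file" = "$from_c" ] && echo "same output: $from_file"

# A file that is fine for "." but not portable as "sh file": it uses return.
# mktemp yields an absolute path here, so "." does not search PATH.
printf 'echo sourced\nreturn 0\necho not reached\n' > "$script"
from_dot=$( . "$script" )
echo "dot output: $from_dot"

rm -f "$script"
```

The command substitution around "." keeps the sourced file from affecting the calling shell's own state.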

eblake 2019-05-17 00:00 manager bugnote:0004397	"Further you cannot really expect the shell will stop parsing at the exact point where it will stop executing, even where that stop is not conditional." Actually, you can. Line 108346 states "When the shell is using standard input and it invokes a command that also uses standard input, the shell shall ensure that the standard input file pointer points directly after the command it has read when the command begins execution. It shall not read ahead in such a manner that any characters intended to be read by the invoked command are consumed by the shell (whether interpreted by the shell or not) or that characters that are not read by the invoked command are not seen by the shell."

kre 2019-05-17 02:05 reporter bugnote:0004398	Yes, I know about the requirement mentioned in 0001250:0004397, but that does not really apply here, as the shell is not reading standard input; it is reading the file from where its commands originate. That requirement applies when one does "sh < file" or, more commonly, when the standard input is the terminal.

But more than that, even there, shells read entire lines (or to EOF when there is no \n) before executing anything, always. That is, if you type (in an interactive shell, so the shell is reading standard input)

    read foo ; echo "$foo"

and you can find a single shell which does not wait for the next line of input and assign that to "foo", but rather assigns the string 'echo "$foo"' to foo, I would be amazed. I have certainly never seen a shell which acts like that, and I doubt anyone else has either, even though the "shell is using standard input" and "it invokes a command that also uses standard input". The place where it ensures "the standard input file pointer points" is the start of the next line, as it has already read the 'echo "$foo"' part (tty input, for a normal shell, not doing something like command line editing, is delivered to the process a line at a time - shells included - and there is no way to seek backwards and get the input again).

It has always been this way, and shells reading files do things the same way - they "parse ahead" of the actual command being executed.

Oh yes, now I remember, this is in the bug related to alias processing, 0001055, which is specifying exactly how much the shell should or must parse ahead before it executes a command, as aliases affect parsing of commands after the changes made become visible, so we need to know just when a change made by an alias command becomes available to be used.
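The "read foo" example from this note can be approximated non-interactively by feeding two lines to a shell on its standard input. This is a sketch under an assumption the note does not make: the input comes from a pipe rather than a terminal. A shell that behaves as the note describes should report the second line as the value of foo; a shell that consumes its input in larger blocks may report an empty value instead, which is why standard error is discarded and no single result is asserted here.

```shell
# The inner sh reads its commands from standard input (a pipe in this sketch).
# Line 1 is parsed in full before "read" runs, so "read" sees line 2 at best.
observed=$(printf '%s\n' 'read foo ; echo "got: $foo"' 'next line' | sh 2>/dev/null)
echo "$observed"
```

In neither case does foo receive the text 'echo "got: $foo"', since the shell has already consumed the whole first line before executing any of it.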

Date Modified	Username	Field	Change
2019-05-16 19:22	eblake	New Issue
2019-05-16 19:22	eblake	Name	=> Eric Blake
2019-05-16 19:22	eblake	Organization	=> Red Hat
2019-05-16 19:22	eblake	User Reference	=> ebb.sh.input
2019-05-16 19:22	eblake	Section	=> XSH sh
2019-05-16 19:22	eblake	Page Number	=> 3228
2019-05-16 19:22	eblake	Line Number	=> 108358
2019-05-16 19:22	eblake	Interp Status	=> ---
2019-05-16 19:23	eblake	Relationship added	related to 0001226
2019-05-16 22:46	kre	Note Added: 0004396
2019-05-17 00:00	eblake	Note Added: 0004397
2019-05-17 02:05	kre	Note Added: 0004398
2019-06-27 16:23	Don Cragun	Status	New => Resolved
2019-06-27 16:23	Don Cragun	Resolution	Open => Accepted
2019-06-27 16:23	Don Cragun	Tag Attached: tc3-2008
2019-11-19 16:21	geoffclare	Status	Resolved => Applied
2024-06-11 09:08	agadmin	Status	Applied => Closed
