Austin Group Defect Tracker

ID: 0001250
Category: [1003.1(2016/18)/Issue7+TC2] Shell and Utilities
Severity: Objection
Type: Error
Date Submitted: 2019-05-16 19:22
Last Update: 2024-06-11 09:08
Reporter: eblake
View Status: public
Assigned To:
Priority: normal
Resolution: Accepted
Status: Closed
Name: Eric Blake
Organization: Red Hat
User Reference: ebb.sh.input
Section: XSH sh
Page Number: 3228
Line Number: 108358
Interp Status: ---
Final Accepted Text:
Summary: 0001250: sh input file restrictions are too strict
Description: The standard has a well-defined definition of a text file:
3.403 Text File
A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.


which, after chasing the definition of character, really translates to four requirements, where the last one is locale-specific:
- limited line lengths
- no NUL characters (for that matter, no NUL bytes, since no other multibyte character can contain NUL)
- if non-empty, must have trailing <newline>
- no encoding errors (a file that is text in Latin-1 might be binary in UTF-8)

The standard already exempts one of the four requirements when describing valid input files for the 'sh' utility, stating that the shell must handle arbitrary line lengths. However, the remaining three requirements are still too strict:

- a non-empty text file requires a trailing <newline>; however, most shells happily treat EOF as an acceptable token terminator:
$ printf 'echo hi' > ./script
$ chmod +x ./script
$ sh ./script && echo $?
hi
0
$ sh -c ./script && echo $?
hi
0
So, this file is not a text file (no trailing newline), yet the shell handled it just fine, and our requirement is probably too strict.

- there are existing 'self-extracting scripts' in the wild, which are basically the concatenation of two separate files: a shell script in the first half, and a binary payload in the second half. While the standard already documents the uuencode utility as a way to create a self-extracting payload that consists only of portable characters, the translation from binary data to encoded payload inflates the file size, so various authors have distributed files that instead directly embed a binary payload for minimal size. (In fact, while writing this up, I did a Google search for "self-extracting script", with the first hit https://www.linuxjournal.com/node/1005818 [^] being a guide on creating such files.) Stated another way, any shell script that terminates (whether by syntax error, 'exit', 'exec' of a replacement process, etc.) mid-file does not care whether the rest of the file would qualify as a text file, because the shell does not parse that far ahead (although precisely defining how a script terminates may be akin to attempting to solve the Halting Problem). For a concise example:
$ printf 'echo hi && exit\n\0\n' > ./script
$ chmod +x ./script
$ sh ./script && echo $?
hi
0
$ sh -c ./script && echo $?
hi
0
So, this file is not a text file (it contains a NUL byte), yet the shell handled it just fine, and our requirement is probably too strict.

- a shell script might contain encoding errors in one locale, but still manage reasonable execution (here, GNU grep 3.1, which behaves differently in the presence of encoding errors, is used to demonstrate the issue):
$ printf 'echo \x85 >/dev/null\necho hi\n' > ./script
$ LC_ALL=C grep . ./script | LC_ALL=C grep -v echo
$ LC_ALL=en_US.UTF-8 grep . ./script | LC_ALL=C grep -v echo
Binary file ./script matches
$ chmod +x ./script
$ LC_ALL=en_US.UTF-8 sh ./script && echo $?
hi
0
$ LC_ALL=en_US.UTF-8 sh -c ./script && echo $?
hi
0
So, this file is sometimes not a text file (it contains encoding errors when sh is invoked in a UTF-8 locale), yet the shell handled it just fine, and our requirement is probably too strict (although we could also lean towards simply requiring the user to invoke the script in a locale that does not cause encoding errors).

In short, requiring a 'text file' with one exception is inappropriate, and we are better off coming up with a different description of what forms a valid input file to the shell.
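The self-extracting pattern from the second point above can be sketched concretely. This is an illustrative sketch only: the file names (selfx, payload.bin, expected.bin), the marker comment, and the payload bytes are invented for the example, and the fixed line number passed to tail relies on the script being constructed exactly as shown.

```shell
# Build a script whose shell portion ends with 'exit', followed
# directly by a binary payload on line 4 (by construction).
printf 'tail -n +4 "$0" > payload.bin\nexit 0\n# --- binary payload follows ---\n' > selfx
printf '\0\1\2BINARY-PAYLOAD' >> selfx
printf '\0\1\2BINARY-PAYLOAD' > expected.bin   # reference copy, for comparison only
chmod +x selfx

# The shell executes the tail and the exit, and never reads line 4,
# so the NUL bytes in the payload are never seen by the parser.
sh ./selfx
cmp -s payload.bin expected.bin && echo 'payload extracted intact'
```

Real self-extractors typically locate the payload by searching for a marker line with sed or awk rather than hard-coding a line number; tail with a fixed offset just keeps the sketch minimal.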
Desired Action: On page 3228 line 108358 (sh INPUT FILES) change:
The input file shall be a text file, except that line lengths shall be unlimited.
to:
The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar (XREF to XSH 2.10.2 Shell Grammar Rules) shall consist of characters and shall not contain the NUL character. The shell shall not enforce any line length limits.

On page 3242 line 108994 (sh RATIONALE) add a new paragraph:
Earlier versions of this standard required that input files to the shell be text files except that line lengths were unlimited. However, that was overly restrictive in relation to the fact that shells can parse a script without a trailing newline, and in relation to a common practice of concatenating a shell script ending with an 'exit' or 'exec $command' with a binary data payload to form a single-file self-extracting archive.
Tags: tc3-2008
Attached Files

- Relationships
related to 0001226 (Closed): shell can not test if a file is text

- Notes
(0004396)
kre (reporter)
2019-05-16 22:46

I fully agree with deleting the trailing newline requirement (and so no
longer requiring, from that point of view, that the file be a "text file").
Shells already deal with "scripts" without newlines as input to "sh -c"
and eval all the time - no-one would ever propose requiring a trailing newline
on those, and if the shell can handle reaching EOF as correctly terminating
the script in those cases, it can for files as well.

Or alternatively, if "sh file" works, then (provided the file is not
so big as to exceed the arg length limit) so should
     sh -c "$(cat file)"
And if the latter works, so should the former. For the latter it is
clearly immaterial whether the file ends in a \n as the $() would remove it.
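That equivalence is easy to check directly; a minimal sketch (the file name f is arbitrary):

```shell
printf 'echo hi' > f            # deliberately no trailing newline
a=$(sh f)
b=$(sh -c "$(cat f)")           # $() strips any trailing newline anyway
[ "$a" = "$b" ] && echo "outputs match: $a"
# prints: outputs match: hi
```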

For \0 chars (or bytes) my impression is that most shells simply ignore them
(the way the byte was always meant to be used - it appears on paper tape
input where the tape has been scrolled forward to check its contents, and then
repositioned back - but not to the exact next position, leaving a few \0
bytes between the last valid char and the next one). However, I also agree
that we should not require that behaviour, and that scripts should not contain
nul chars if they expect to be processed correctly.

The "the initial portion of the file intended to be parsed according to..."
is a bit vague however. Some shells have been known to test the first 512
bytes of the script for anything that looks "binary" (whatever their
interpretation of that is).

That is, a script that is (entirely)

exec gzip -d
<gzip'd data starts here>

is not necessarily going to work.


Further you cannot really expect the shell will stop parsing at the
exact point where it will stop executing, even where that stop is not
conditional.

That is,

       exit ; <<>> foo (any binary gibberish)

is not going to work, as shells parse (at least) the complete line before
executing any of it.
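That line-at-a-time behaviour can be demonstrated with a small sketch (the file names are illustrative, and the exact exit status and diagnostic of the syntax-error case may vary between shells):

```shell
# Garbage AFTER the line containing 'exit' is never parsed...
printf 'echo ran && exit\n<<>> not shell syntax\n' > after-exit
sh after-exit                      # prints "ran"; line 2 is never read

# ...but garbage on the SAME line makes the whole line fail to parse,
# so 'exit' never executes and the shell reports a syntax error.
printf 'exit ; <<>> not shell syntax\n' > same-line
sh same-line 2>/dev/null || echo 'same-line failed to parse'
```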

There is (or perhaps should be) a difference between what is expected to
work when run as "sh file" (one might argue there that the shell should not
test the input at all, but simply execute it) and what is expected to be
run for "./file" from inside the shell (where file has execute permission,
but the exec fails with an ENOEXEC error).

The latter case is probably the only one where the shell should be doing
any analysis of the content of the file, so that attempts to run random
commands that are intended for some other system architecture (ARM binaries
on an x86 system, for example) won't end up running them as a sh script
(unless the user actually says it should be done).

So for the "INPUT FILES" section I'd simply remove all restrictions on what
the file can contain ... if it doesn't survive tokenisation and the grammar
(which, for example, a file with \0's in it would not, if the shell does not
simply ignore them) then there will be errors and the script will fail.
(Not part of the standard, but this also allows anything to be in a #!/bin/sh
file, as those are run as "sh file" effectively - whatever their content.)

Whatever restrictions on input format we want to be able to test for would
be better placed in XCU 2.9.1.1.1.e.i.b (page 2367, around line 75586-9).
That's where it makes sense to verify the file format before arbitrarily
deciding to run it as "sh file" when the user said just "file" and it was
found via a PATH search (or the user gave a path name). [Aside: that
section number assumes that the resolution of issue 1227 does not fix
that nonsense numbering...]

There is also a relationship here with the changes being made (not yet
finished) for issue whatever it is (I have temporarily lost its number...)
which is specifying just how much data the shell should parse before it
starts executing any of it. In particular, for current purposes, whether
the file argument to the "." (dot) special built-in is an "input file" in
the sense being considered here. At the very least there probably needs
to be some wording somewhere saying that it is not true that any file that
can be run as "sh file" is suitable to be run as ". file" (assuming the
same file is located in both cases - eg: for a fully qualified path name).
(We know the reverse is not true, as the file for ". file" can contain a
"return" command, which would be undefined if the script was run as "sh file".)
(0004397)
eblake (manager)
2019-05-17 00:00

"Further you cannot really expect the shell will stop parsing at the
exact point where it will stop executing, even where that stop is not
conditional." Actually, you can. Line 108346 states "When the shell is using standard input and it invokes a command that also uses standard input,
the shell shall ensure that the standard input file pointer points directly after the command it has
read when the command begins execution. It shall not read ahead in such a manner that any
characters intended to be read by the invoked command are consumed by the shell (whether
interpreted by the shell or not) or that characters that are not read by the invoked command are
not seen by the shell."
(0004398)
kre (reporter)
2019-05-17 02:05

Yes, I know about the requirement mentioned in Note: 0004397
but that does not really apply here, as the shell is not
reading standard input, it is reading the file from where
its commands originate.

That requirement applies when one does "sh < file" or more
commonly when the standard input is the terminal.

But more than that, even there, shells read entire lines
(or to EOF when there is no \n) before executing anything,
always.

That is, if you type (in an interactive shell, so the
shell is reading standard input)

     read foo ; echo "$foo"

and you can find a single shell which does not wait for
the next line of input and assign that to "foo", but rather
assigns the string 'echo "$foo"' to foo, I would be amazed.
I have certainly never seen a shell which acts like that,
and I doubt anyone else has either, even though that
"shell is using standard input" and "it invokes a command
that also uses standard input", the command after which
it ensures "the standard input file pointer points" is
to the start of the next line, as it has already read the
'echo "$foo"' part (tty input, for a normal shell, not
doing something like command line editing, is delivered to
the process a line at a time - shells included - and there
is no way to seek backwards and get the input again.)

It has always been this way, and shells reading files
do things the same way - they "parse ahead" of the actual
command being executed. Oh yes, now I remember, this is
in the bug related to alias processing 0001055 which is
specifying exactly how much the shell should or must
parse ahead before it executes a command, as aliases
affect parsing of commands after the changes made become
visible, so we need to know just when a change made by
an alias command becomes available to be used.

- Issue History
Date Modified Username Field Change
2019-05-16 19:22 eblake New Issue
2019-05-16 19:22 eblake Name => Eric Blake
2019-05-16 19:22 eblake Organization => Red Hat
2019-05-16 19:22 eblake User Reference => ebb.sh.input
2019-05-16 19:22 eblake Section => XSH sh
2019-05-16 19:22 eblake Page Number => 3228
2019-05-16 19:22 eblake Line Number => 108358
2019-05-16 19:22 eblake Interp Status => ---
2019-05-16 19:23 eblake Relationship added related to 0001226
2019-05-16 22:46 kre Note Added: 0004396
2019-05-17 00:00 eblake Note Added: 0004397
2019-05-17 02:05 kre Note Added: 0004398
2019-06-27 16:23 Don Cragun Status New => Resolved
2019-06-27 16:23 Don Cragun Resolution Open => Accepted
2019-06-27 16:23 Don Cragun Tag Attached: tc3-2008
2019-11-19 16:21 geoffclare Status Resolved => Applied
2024-06-11 09:08 agadmin Status Applied => Closed

