0001036: Errors/Omissions in specification of here document redirection

ID	Project	Category	View Status	Date Submitted	Last Update

0001036	1003.1(2013)/Issue7+TC1	Shell and Utilities	public	2016-03-22 03:18	2024-06-11 08:57

Reporter	kre	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted As Marked

Name	Robert Elz
Organization
User Reference
Section	2.7.4
Page Number	2335-2336
Line Number	74235-74256
Interp Status	---
Final Accepted Text	0001036:0005561


Summary	0001036: Errors/Omissions in specification of here document redirection
Description	Aside from the question of just which newline is the "next" newline, which has been canvassed elsewhere (without any resolution that I can see), there are several problems with the specification of here documents.
First, given that the here doc is processed after encountering a newline (which newline is the other issue), here docs must be largely processed as a side effect of lexical processing, as newlines (other than those that happen to be literal) no longer exist in the scanned form of the shell input: they have served as token delimiters, and are not otherwise relevant. This would suggest that the here document is processed during lexical analysis - and nothing in the specification contradicts that. The spec does say that (given an unquoted delimiter word) the text is subject to various expansions. It does not say that those expansions should not be performed while reading the here doc text, however I believe that is (or should be) the intent - that is, if the here doc is never used because it is attached to a command that is never executed (on the "wrong" side of an && or similar) then the expansions in the here doc should not be performed. It could be I am missing something, but I cannot see any text that says that the expansions in the here doc should be evaluated in the context of the command that is about to use the data, immediately before it is used (in the appropriate sequence of all applicable redirect operations).
Second, the text says ...
    If any character in word is quoted, the delimiter shall be...
and in the following paragraph ...
    If no characters in word are quoted, all lines of the ...
but I do not believe that is what is intended, and it is not what is actually implemented in any shell I can find. Consider ...
    cat << ""EOF
    lines of text
    EOF
The delimiter there is the string EOF in which none of the characters were quoted. True it was preceded by a quoted null string, but that contains no quoted characters. Hence no characters of the delimiter word were quoted, and according to the spec, "lines of text" should be subject to the various expansions. No-one implements it that way; what matters is not whether any characters in the word are quoted, but whether any quote characters were encountered while scanning word.
Third, in cases where expansions are done, nothing makes it explicit that the end delimiter cannot be found as the result of an expansion. That is, in the following, there is one here document that happens to contain the string echo foo <<EOF and not two here documents
    end=EOF
    cat <<EOF
    lines of text
    $end
    echo foo <<EOF
    another line
    EOF
Of course, if the first question above is resolved to make it clear that expansions do not happen when the document is being read, this would be a moot point, as $end expanding to EOF would not be known while the here doc is being read, which I believe is the correct interpretation.
Fourth, I am totally confused by the relationship between double quoting and backtick command expansions. Section 2.2.3 appears to say that if backticks appear inside double quotes, then the double-quote interpretation continues through the command expansion (if that were not true, it would not be possible for a double quoted string to start before a ` command substitution, and end inside it, as a " inside the `...` would be the start of a new string, not the end of the previous one (the same as it is in $( ) command substitutions)). The relevance of this to here documents is illustrated by the following ...
    echo "` cat << EOF
    X = $(( 1 + 2 ))
    EOF
    `"
If things are as I have postulated, then the EOF is quoted (by the double quotes that surround the command substitution) and hence the here document should not be expanded, and echo should (eventually) output
    X = $(( 1 + 2 ))
and not
    X = 3
but again, I do not believe this is in accordance with what any shell does. This again may be an artifact of the 2nd point above, and if the text is changed so that only quote characters encountered while scanning the delimiter word cause the expansion to be suppressed, and not whether "characters are quoted", then this issue will go away.
Fifth, and more minor I think, when the delimiter is not quoted, the text states that backslashes work the way they do in double quoted strings, and references section 2.2.3 for the details. There we are informed that inside double quotes, \ is only special (only a quote character) when the following character is one of \ " ` $ and newline (so for example "\n" is a two character string). But then (back to 2.7.4) the text goes on to say that inside the here document " is not special. The problem is that it is not clear whether \ continues to act as a quote character when followed by this non-special " or not (ie: is \" in a here document, with an unquoted delimiter word, one character, or two?) I believe two is correct.
Sixth, and perhaps most important of all, there is no discussion of what is expected to happen when the input string ends before the here document delimiter is encountered. Most important, because unlike the previous issues where I believe all shells (all I could find) actually agree on what should be done, and the text just needs to be more clear, for this one, there is a difference of opinion. Some shells treat end of file as equivalent to the delimiter, and go ahead and execute whatever command the here document was attached to with as much input as they managed to gather (one issues a warning when it does this, but does it anyway; most that adopt this behaviour do it silently). Other shells consider this to be a redirect error, suppress execution of the command, and set $? to indicate failure. Personally I believe that the latter is the best approach, as it avoids situations where the shell eats the entire rest of the script as the here document because of some error or other (the one that happens to me from time to time is that I cut & paste a script, or script segment, and the tabs that had been present get converted to spaces, and then the <<-EOF does not stop on space space..EOF where it would have with tab EOF.)
Desired Action	Change the words "If any character in word is quoted" to "If any quoting character is encountered while scanning word", and "If no characters in word are quoted" to "If no quoting characters are encountered while scanning word". At the end of the paragraph that currently starts "If no characters in word are quoted" add an extra sentence along the lines of ... "The expansions listed are performed in the context of the command about to be executed to which the here document contents are to be input, and at the appropriate time in the sequence of all redirect operators applying to that command - if that command is never executed the here document shall not be expanded." Where it talks about <backslash> quoting in here documents with unquoted delimiter words, add some text to make it clear that even though a \ in "" quotes a ", a \ in a here document does not, and the sequence \" is 2 chars. Finally, add (somewhere) words to the effect "If the terminating line of the here document is not located before the shell exhausts its input, the behaviour is undefined, implementers are encouraged to treat this as a redirect error, but applications should not rely upon this."
Tags	tc3-2008
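The second point of the description can be tried with a small sketch like the one below. The function and variable names (unquoted, null_quoted, name) are hypothetical and exist only for this illustration. The result noted in the comments is what the report says the tested shells do; the original Bourne shell is reported later in this thread to expand the second body instead.

```shell
name=world

# Unquoted delimiter: the body undergoes parameter expansion.
unquoted() {
cat <<EOF
hello $name
EOF
}

# Delimiter preceded by an empty quoted string, as in the report.
# Quote characters were seen while scanning the word, so shells that
# follow the ksh88 behaviour leave the body unexpanded.
null_quoted() {
cat <<""EOF
hello $name
EOF
}

expanded=$(unquoted)
literal=$(null_quoted)
printf '%s\n' "$expanded" "$literal"
```

Wrapping each here document in a function keeps the command substitutions trivial, so the sketch only exercises the delimiter quoting rule and not the separate question of here documents inside $( ).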

kre 2016-03-22 06:19 reporter bugnote:0003097	When I look more closely, I think you can forget the fifth issue in the list (and the third of the desired actions), I missed the phrase "when considered special" in 2.2.3 (the description of \ inside "..."). But I think you can replace that one with a request that the "<newline>" in "begins after the next <newline>" (2nd paragraph in 2.7.4) should be changed to "NEWLINE token" (as is done in the description of token recognition in section 2.3 (2nd paragraph)). The difference is that in
    nl='
    '
there is a <newline> but no NEWLINE token. I think it is clear that in a sequence like
    cat <<EOF; nl='
    '
    line 1
    line 2
    EOF
the here document starts at "line 1", not at the line containing just the ' character, even though that is the line after the next <newline>. Similarly, there is a <newline> in the \ sequence (\ followed immediately by newline), that is also not a NEWLINE token, and a here document would not start after that either. This still leaves the question of NEWLINE tokens embedded in command substitutions and similar, where the << operator was outside.
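The sequence in this note can be run as written. The sketch below wraps it in a hypothetical function (show) so the output can be captured; the expectation in the comment follows the note's claim that the body starts after the next NEWLINE token rather than after the newline inside the quoted string.

```shell
show() {
# The newline inside the single quotes is a <newline> character but not
# a NEWLINE token, so the here-document body is expected to start at
# "line 1".
cat <<EOF; nl='
'
line 1
line 2
EOF
}

out=$(show)
printf '%s\n' "$out"
```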

joerg 2016-03-22 19:52 reporter bugnote:0003098 Last edited: 2016-03-23 11:15	I did not yet check the other cases but you are mistaken with respect to
    cat << ""EOF
    $$
    EOF
as it expands to the process id in the Bourne Shell, and even ksh88 and ksh93 document exactly this behavior. From the ksh93 documentation:
    If any character of word is quoted, then no interpretation is placed upon the characters of the document. Otherwise, parameter expansion, command substitution, and arithmetic substitution occur,
so it seems that you discovered a ksh bug. Check the original Bourne Shell at: http://schilytools.sourceforge.net/bosh.html to verify that your example causes parameter substitution. Note that the documentation from the Bourne Shell, ksh88, ksh93, bash, mksh and zsh clearly mentions that no expansion occurs when any of the characters from word is quoted. The dash man page is not written clearly and thus does not help. Looking at the ksh88 source, it is obvious that the deviating behavior from ksh is an unintended side-effect of the rewritten field splitting code that is used to strip off the quoting from "word".

geoffclare 2016-03-23 09:29 reporter bugnote:0003099	We are already fixing the "any character in word" problem: TC2 changes it to "any part of word". See 0000583

joerg 2016-03-23 10:57 reporter bugnote:0003100	Geoff, it seems that this was a mistake as from what I can say, the ksh behavior was changed unintentionally and the new text is in conflict with both the documentation and the behavior of the Bourne Shell.

geoffclare 2016-03-23 11:30 reporter bugnote:0003101	No it is not a mistake. That is how ksh88 behaves, and the POSIX shell was based on ksh88 not Bourne.

kre 2016-03-24 00:04 reporter bugnote:0003103	Thanks for the pointer to issue 583 - I actually did a search (I looked for references to 2.7.4) and the search did not produce that one... But that resolution is fine for that point, though you might want to amend the language just a little more to handle the case where the whole redirection (including the delimiter word) is in a quoted environment (making it clear that quotes need to be explicit in "word" itself to count as quoting existing for this purpose). The suggestion (in 583) that "if quote removal changes the word, it was quoted" seems about right to me, kre

kre 2016-03-24 00:18 reporter bugnote:0003104	To make the first point in the issue more clear (or one aspect of it anyway), consider the following ...
    unset X
    cat <<EOF
    ${X=2}
    EOF
    echo "${X-1}"
No question but that the output from cat is "2" (a line containing 2), but what value does the echo line print, 1 or 2 ? That all depends upon the context in which the here doc is evaluated. If it is in the context of the shell running the script, then the answer is 2. On the other hand if it is in the context being established to run cat, then 1 would be the answer. My testing shows shells (I have to test) about equally divided on this issue, but I would have expected that 1 makes most sense, given that the here document is processed at the correct point in the sequence of redirections. kre
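A sketch for checking which context a given shell uses. It sends the cat output to a temporary file so that no command substitution (and therefore no extra subshell) surrounds the here document. The variable names tmp, cat_output and after are made up for this illustration, and the use of mktemp is an assumption about the environment. Per the note above, either 1 or 2 may be reported for X.

```shell
unset X
tmp=$(mktemp) || exit 1

# The body assigns X as a side effect of parameter expansion.
cat <<EOF >"$tmp"
${X=2}
EOF

cat_output=$(cat "$tmp")
rm -f "$tmp"

# 2 if the expansion ran in the context of the invoking shell,
# 1 if it ran in the context set up for cat.
after=${X-1}

printf 'cat printed: %s\n' "$cat_output"
printf 'X afterwards: %s\n' "$after"
```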

kre 2016-03-24 00:28 reporter bugnote:0003105	For the sixth point, consider this example
    cat > File1 <<EOF 2>File2
    lines of text but no line containing "EOF" and the script ends right here.
There are several possibilities here. One is that the here doc with no end delimiter is a lexical or parser level syntax error, and nothing else is done with the command at all (a non-interactive shell would exit). That is the one I prefer... Another is that the faulty here document is discovered during redirect processing, after File1 is created, before File2 is created, and things stop in that state, again with a syntax error, but later in the processing. Third is that it isn't an error at all, cat is run, the data present is sent to File1, File2 is created, but empty, as there are no errors. Exit status would be 0. This solution (though seemingly quite common) I do not like at all, as this almost always indicates some kind of user error, and giving either half the intended data, or just as likely, more than intended when the end delimiter is entered in an incorrect way, is just as bad as being unable to open a file in a normal '<' redirection, and so just going ahead and substituting some other file (/dev/null maybe) instead. Not sane. kre

chet_ramey 2016-04-04 19:52 reporter bugnote:0003124	I'm interested in what the group would like to do about the EOF-as here-document-delimiter issue. As kre says, just about every shell I looked at allows EOF to delimit a here document (mksh is the notable exception). It's clearly existing practice, but the standard is silent. Does this render all these shells non-conformant? The other interesting case is whether or not a shell allows an instance of the delimiter immediately followed by the end of a command substitution (`)' or ``') to delimit a here-document. For example, what should the following output? x=$(cat <<EOF a b EOF) echo "$x" echo after There are varying behaviors. Shells allowing delimiter+right paren to delimit here document in $(...): ksh93, bash, mksh, zsh, posh Shells that do not: dash, BSD(s) sh Shells allowing delimiter+backquote to delimit here document in `...`: ksh93, SVR4.2 sh, bash, mksh, zsh, posh, dash, BSD(s) sh Even in this there is varying behavior: dash uses EOF (in the form of the end of the `...` command substitution) as the delimiter and includes the delimiter word as part of the here document. Is it worthwhile to add text saying the behavior is unspecified if the shell encounters end-of-file before finding the here-document delimiter? What about the command substitution case?

jilles 2016-04-04 21:48 reporter bugnote:0003125	Given that $(case x in x) : ;; esac) is a single valid command substitution, I would expect the following two to be as well: $(cat <<EOF && EOF) EOF :) $(if :; then cat <<EOF EOF) EOF fi) If that is accepted but: $(cat <<EOF x EOF) is also to be a single valid command substitution, detecting whether the end marker with closing parenthesis is an end marker is rather complicated. There is no such issue with: `cat <<EOF x EOF`

joerg 2016-04-05 12:20 reporter bugnote:0003126	`cat <<EOF x EOF` is parsed in a different way than $(cat <<EOF x EOF) The first one is parsed on a lexical basis, i.e. the next unescaped "`" is searched for, before any instance in the shell tries to understand the here document. In the second form, there is a need to recursively call the parser in order to understand where the end of the $() command is. This is caused by the fact that the number of opening and closing parentheses in a command is not always equal. For this reason, with $(), the here document is read in during the lexical scan already, because the lexical scan calls a recursive parser. Note that this is a POSIXLY correct command: echo $(if cat <<EOF 1 2 3 EOF) EOF then echo a fi) but it is not accepted by ksh93 because ksh93 implements a funny recognition of "EOF)".

kre 2016-04-05 13:58 reporter bugnote:0003127	Re note 3124: Is it worthwhile to add text saying the behavior is unspecified if the shell encounters end-of-file before finding the here-document delimiter? I would say yes - along with an admonition on applications to always supply the end delimiter. The NetBSD shell is another which (now) complains about here docs without an end delim (treats it as a syntax error), which I think is far and away the best thing for shells to do - and I would encourage all of you who are implementers to do that. Since we made that change, we have (as far as I know) encountered exactly 1 script which was working only because of the "eof terminates a here doc" behaviour - and that one was almost certainly an accident (it was one of many scripts, several others of which also had here docs that ran to the end of the script, and only that one was missing the end delimiter). So, I would not be too worried that you will be breaking large numbers of scripts that are relying on EOF delimiting here docs, the users don't seem to know about this, or find it simple enough to add the string just before EOF... On the other hand, accidentally getting the here doc end delimiter incorrect is easy to do - say a space amongst the leading tabs that are to be stripped, or a simple typo - having that silently cause the rest of the script to be treated as here doc content, silently, and simply carrying on working, is not friendly. On: What about the command substitution case? That one should simply be regarded as incorrect, the spec is already clear on what makes an end delimiter, and a ) following the string is not it, the newline before the ) is required. (And yes, I agree, the `` case is quite different, in all ways, even though it seems initially to be just a different, more difficult to nest, equivalent.)
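For comparison with the example in note 3124, a form that does not depend on the divergent "EOF)" handling keeps the delimiter on a line of its own and puts the closing parenthesis on the following line. This is only a sketch of that workaround:

```shell
# The terminating line contains only the delimiter; the ")" that closes
# the command substitution comes after the newline that follows it.
x=$(cat <<EOF
a
b
EOF
)
echo "$x"
echo after
```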

kre 2016-04-07 00:46 reporter bugnote:0003134	Since it seems agreed that there is no consistency on what happens if a here doc is not terminated, and that applications neither need to, or ever should, rely upon any particular behaviour there (and in practice, do not seem to), I suggest that the following wording be added to the end of the normative text in section 2.7.4 (just before the informative example). [Sorry, I do not have page or line numbers - someone else will need to add those.] The effect of failing to detect the here-document delimiter before the shell exhausts its input stream is unspecified. Applications shall ensure the delimiter is present. And perhaps in a rationale section somewhere... Traditional shell behaviour has been to treat "end of file" as being equivalent to the delimiter of a here document, terminating the here document, usually without any indication, and continuing as if the delimiter had been recognised. This can cause problems where the delimiter had been intended to occur much earlier in the script, but was incorrectly entered - a mistake which for many other errors would have resulted in a syntax error, and an aborted script, instead simply generates incorrect results. Because of this some shell implementations have changed to reporting an undelimited here document as a syntax error. Other implementations are encouraged to do the same. or maybe something less wordy with similar effect... The other issues still need resolution.
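Since the outcome is unspecified, a script can only probe what the local shell does. The sketch below is one way to do that; it assumes an sh on PATH, and the names script, output and verdict are hypothetical. It makes no claim about which result is correct.

```shell
# A command string whose here-document has no terminating line.
script='cat <<EOF
some text'

# Run it in a separate shell so a syntax error cannot abort this script.
if output=$(sh -c "$script" 2>/dev/null); then
    verdict="accepted (status 0), output: $output"
else
    verdict="rejected (status $?)"
fi

printf 'unterminated here-document was %s\n' "$verdict"
```

Diagnostics are discarded here because some shells are reported to warn and carry on; dropping the 2>/dev/null shows whatever message the shell under test emits.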

kre 2016-04-07 01:05 reporter bugnote:0003135	Another of my original issues, in which I believe (hope) there is no dissent (as in, I believe all shells act this way, it is just not, yet, written in the standard), also add to the normative text of section 2.7.4, in the paragraph that begins: If no characters in word are quoted, ... add the following sentence after the initial sentence of the paragraph (again, sorry, page/line numbers not available to me): This expansion happens after the here document delimiter has been recognised and the here document extracted from the input stream, and thus the end delimiter for the here document cannot be generated as a by-product of the expansion. And to make things even clearer (I believe this is how shells behave, but am less certain that this is universal), at the end of that same paragraph add When an unquoted backslash is followed by a newline, line joining occurs, and the backslash newline combination is removed. This occurs while the here document is being scanned, the end delimiter will not be recognised immediately after a newline that has been deleted in this way.
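Both sentences proposed in this note describe behaviour that can be observed with a short sketch. The function name doc and the variable names are hypothetical. The expected result in the comments is what the note believes shells do, not a guarantee for every implementation.

```shell
end=EOF

doc() {
# "$end" expands to EOF only after the body has been extracted, so the
# second line does not terminate the here-document. The backslash at the
# end of "con\" joins that line with the next one.
cat <<EOF
first $end
$end
con\
tinued
EOF
}

out=$(doc)
printf '%s\n' "$out"
```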

stephane 2017-02-25 14:12 reporter bugnote:0003572	About your last sixth point, I'm not sure I see a problem. It's currently unspecified so applications have to make sure their delimiter is provided and implementations can do what they want when it's not, allowing warning or error message, or ignore the problem, all of which are valid approaches to me. There's also the case of
    eval "cat << EOF"
    xxx
    EOF
(and the same with the "." command) that may need to be covered.

kre 2017-03-03 15:57 reporter bugnote:0003592	Re note 3572 ... I agree that the behaviour is unspecified in the literal sense (in that the spec says nothing about it at all), I do not agree that is adequate however - if unspecified behaviour is what is expected in this case (and I'd certainly accept that as an outcome for that point), it ought to be explicitly unspecified, not just literally. kre

kre 2017-05-11 13:46 reporter bugnote:0003690	When this issue reaches the head of the queue, it might be worthwhile spending a minute or two on the subject of \newline continuation lines in here docs, and their effect on tab suppression, and end-string recognition. For the first of those, given cat <<-EOF \ X EOF (where the white space is supposed to represent a tab character, but is spaces here, as I cannot seem to input a tab in the form...) what exactly is expected to be written to stdout? That is, is the tab before X a leading tab, or not? For the second, the following script (with the same caveat about spaces and tabs...) EOF() { printf 'EOF executed as a command\n'; } cat <<-EOF \ EOF EOF executes differently in different shells, in some it simply says "EOF" (where the second EOF is the end-string) and in others it says "EOF executed as a command" where the first EOF is the end string. (Here similar things happen without the tab stripping if the \ line contains only a \ character).

joerg 2017-05-11 14:01 reporter bugnote:0003691 Last edited: 2017-05-11 14:05	Re: 0001036:0003690 With your first example, all shells except ksh93 seem to print just a X With your second example, the historic Bourne Shell, ksh88, bash, bosh, mksh print "EOF executed as a command" While ksh93, dash, yash, zsh are non-compliant. You discovered an interesting aspect.

kre 2017-06-10 00:58 reporter bugnote:0003756	There is another case worth considering ... cat <<'EOF REALLY' Hello EOF REALLY Is this supposed to work, or not? I see nothing in the text that prohibits a \n as one of the characters of the end delimiter (obviously, like spaces, and other operator type characters, it can only occur in a quoted delimiter) My tests show that yash simply forbids it. Nothing else I tested does, though when the here doc delimiter contains a \n, none of bash, zsh, mksh, or bosh seem to recognise anything as the delimiter. The ash derived shells (dash, freebsd netbsd) and ksh93 all just say "Hello" when given the command above (the two line end delimiter is handled just fine .. I didn't test more than two, but I am fairly confident that at least the FreeBSD and NetBSD shells would handle as many embedded \n's as you want to give - probably the others as well.)

geoffclare 2021-12-17 15:01 reporter bugnote:0005561 Last edited: 2021-12-17 15:03	I suggest the following changes are made to resolve this bug and the related bugs 0001043 and 0001521. On 2018 edition page 2347 line 74748 section 2.3, change: When an io_here token has been recognized by the grammar (see [xref to 2.10]), one or more of the subsequent lines immediately following the next NEWLINE token form the body of one or more here-documents and shall be parsed according to the rules of [xref to 2.7.4]. to: When an io_here token has been recognized by the grammar (see [xref to 2.10]), one or more of the subsequent lines immediately following the next NEWLINE token form the body of a here-document and shall be parsed according to the rules of [xref to 2.7.4]. Any non-NEWLINE tokens (including more io_here tokens) that are recognized while searching for the next NEWLINE token shall be saved for processing after the here-document has been parsed. If a saved token is an io_here token, the corresponding here-document shall start on the line immediately following the line containing the trailing delimiter of the previous here-document. If any saved token includes a <newline> character, the behavior is unspecified. On 2018 edition page 2348 line 74774 section 2.3, after: While processing the characters, if instances of expansions or quoting are found nested within the substitution, the shell shall recursively process them in the manner specified for the construct that is found. append: For "$(" and '`' only, if instances of io_here tokens are found nested within the substitution, they shall be parsed according to the rules of [xref to 2.7.4]; if the terminating ')' or '`' of the substitution occurs before the NEWLINE token marking the start of the here-document, the behavior is unspecified. 
On 2018 edition page 2362 line 75355 section 2.7.4, change: The here-document shall be treated as a single word that begins after the next <newline> and continues until there is a line containing only the delimiter and a <newline>, with no <blank> characters in between. Then the next here-document starts, if there is one. to: The here-document shall be treated as a single word that begins after the next NEWLINE token and continues until there is a line containing only the delimiter and a <newline>, with no <blank> characters in between. Then the next here-document starts, if there is one. For the purposes of locating this terminating line, the end of a command_string operand (see [xref to sh]) shall be treated as a <newline> character, and the end of the commands string in <tt>$(commands)</tt> and <tt>`commands`</tt> may be treated as a <newline>. If the end of input is reached without finding the terminating line, the shell should, but need not, treat this as a redirection error. On 2018 edition page 2362 line 75363 section 2.7.4, change: It is unspecified whether the file descriptor is opened as a regular file, a special file, or a pipe. to: It is unspecified whether the file descriptor is opened as a regular file or some other type of file. On 2018 edition page 2362 line 75366 section 2.7.4, change: If any part of word is quoted, the delimiter shall be formed by performing quote removal on word, and the here-document lines shall not be expanded. Otherwise, the delimiter shall be the word itself. If no part of word is quoted, all lines of the here-document shall be expanded for parameter expansion, command substitution, and arithmetic expansion. In this case, the <backslash> in the input behaves ... to: If any part of word is quoted, not counting double-quotes outside a command substitution if the here-document is inside one, the delimiter shall be formed by performing quote removal on word, and the here-document lines shall not be expanded. 
Otherwise: The delimiter shall be the word itself. The removal of <backslash><newline> for line continuation (see [xref to 2.2.1]) shall be performed during the search for the trailing delimiter. (As a consequence, the trailing delimiter is not recognized immediately after a <newline> that was removed by line continuation.) It is unspecified whether the line containing the trailing delimiter is itself subject to this line continuation. All lines of the here-document shall be expanded, when the redirection operator is evaluated but after the trailing delimiter for the here-document has been located, for parameter expansion, command substitution, and arithmetic expansion. If the redirection operator is never evaluated (because the command it is part of is not executed), the here-document shall be read without performing any expansions. Any <backslash> characters in the input shall behave ... (As per 0001036:0003097, no special mention of \" needs to be made here because 2.2.3 says "that would otherwise be considered special".) On 2018 edition page 2362 line 75374 section 2.7.4, change: all leading <tab> characters shall be stripped from input lines and the line containing the trailing delimiter. to: all leading <tab> characters shall be stripped from input lines after <backslash><newline> line continuation (when it applies) has been performed, and from the line containing the trailing delimiter. Stripping of leading <tab> characters shall occur as the here-document is read from the shell input (and consequently does not affect any <tab> characters that result from expansions). Cross-volume changes to XRAT ... On 2018 edition page 3736 line 128241 section C.2.7.4, insert a new paragraph: Historical shell behavior was to treat the end of input as being equivalent to the delimiter of a here-document, terminating the here-document, usually without any indication, and continuing as if the delimiter had been recognized. 
This can cause problems where the delimiter had been intended to occur much earlier in the script, but was incorrectly entered - a mistake which for many other errors would have resulted in a syntax error, and an aborted script, instead simply generates incorrect results. Because of this some shell implementations have changed to reporting an undelimited here-document as a syntax error. Other implementations are encouraged to do the same.
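Two consequences of the proposed 2.7.4 wording can be sketched as follows. The variable names Y, never_run and in_quotes are hypothetical. The values in the comments are what the proposed text implies, and they match what the reporter says existing shells do for the double-quote case.

```shell
unset Y

# The command after && is never executed, so its redirection is never
# evaluated and the ${Y=...} assignment in the body should not happen.
false && cat <<EOF
${Y=assigned}
EOF
never_run=${Y-unset}

# Double quotes outside the command substitution do not count as quoting
# of the delimiter inside it, so the body is still expanded (X = 3).
in_quotes="`cat <<EOF
X = $(( 1 + 2 ))
EOF
`"

printf '%s\n' "$never_run" "$in_quotes"
```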

shware_systems 2021-12-17 16:32 reporter bugnote:0005564	Re: 5561 I think it needs to be more explicit that line joining, especially if tokenizing is done incrementally, still has precedence over <newline> detection for finding the start of the initial io_here body. While the model is 'do line joining before tokenization of command lines', this requires more resources for potential read-aheads than otherwise.

kre 2021-12-18 13:10 reporter bugnote:0005568	Re 0001036:0005564 That is not an issue, a newline used in line joining is never a NEWLINE token, and cannot possibly cause here doc reading to start (everyone agrees this is how it should be, and though the standard was not clear while it was missing "token", after the proposed change it will be clear with respect to that issue). What resources are consumed is an implementation issue, not relevant here, unless there were implementations that found themselves unable to implement this correctly, of which I know of none. Re 0001036:0005561 in general that looks OK - there are a few places where the wording seems a bit strange to me, though not incorrect, and a couple of places where the result is not what I would call ideal - but I can see that it is the best that can be achieved under current circumstances. I will add another note on the wording issues when I get time. In the meantime, there is one issue that can easily be fixed:
    If any saved token includes a <newline> character, the behavior is unspecified.
That is unnecessary. There's no difference of opinion on this issue, a newline character which is not a newline token (eg: a newline in a quoted string) never causes here document data reading - that is agreed by all. Again, it might not have been clear in the standard before, but it is in the implementations. Those words might have been intended to apply to the following issue, but they really do not (or certainly not unambiguously). There's one omission which needs to be handled, which is the status of:
    cat <<\EOF; ls $( find . -name 'foo' -type d -print |
        sed '/^foo*$/d' )
Where does the here document data go in that case? The first newline token encountered will be the one after the pipe inside the command substitution. 
This one must be seen by the lexer (not necessarily the grammar, but it is the lexer that reads here document data, and removes it from the input stream, just as it removes line-joinings) as part of scanning that command substitution. If the lexer isn't parsing the command substitution correctly when it is first read (whatever is done with the results of that parse), which includes noting newline tokens as they appear, the end of the command substitution cannot be correctly located in all cases. Our implementation expects to read the here doc immediately after the pipe, that being the next newline token encountered by the lexer after an io_here operator was encountered. Other implementations do not agree with that, and don't consider any newline tokens observed while parsing the command substitution as candidates for here documents for io_here operators that appeared outside the command substitution (and before it, obviously). This one is going to need very explicit language to deal with. Clearly applications can be written to avoid this kind of thing simply by ensuring a newline token appears before the command substitution starts, where that is possible (here, simply by replacing the ';' with a newline). In cases where it isn't possible, the io_here and the command substitution must be part of the same simple command, so the io_here (and any other redirections which follow it) can simply be moved after the command substitution. However, if an application is written as above, or even worse
    cat <<\EOF ; ls $( sed <<\DONE
Whatever text follows that follows a newline token, and is not immediately relevant; there is a newline token - not visible explicitly in this note, but assume it is there - at the end of that line (after "DONE").
There is no question in that case but that the here doc data for sed has to come after this newline token, before any future tokens belonging to the command substitution can be encountered. The question is: does the here document data for cat come before it, or after it, or somewhere else entirely? The general principle is that here doc data is read in the order in which the io_here tokens appeared on the line, left to right. "Line" is a very little used concept in the shell grammar, but it is used in this context - not command, not anything else, but "line". To me this implies that the here doc data for cat must precede the here doc data for sed. But again, not everyone agrees. Sorting this one out might not be easy. For anyone interested, Chet Ramey and I have been having (since Nov 17) a somewhat detailed discussion of these issues on the bug-bash list (in a message thread with the Subject: Unclosed quotes on heredoc mode (or Re:...)). Ignore the first few messages, they just provided the motivation for our discussion/debate. Chet's message from Nov 17 is the first that is relevant. I'm sure someone here can explain how to find the archives of that list, but that someone is not me.
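
The two rewrites suggested in the note above can be sketched as follows. This is only an illustration under stated assumptions: the function names portable_form and moved_redirection are invented here, and printf/tr/cat stand in for the find/sed/ls commands of the original example so that the sketch does not depend on the file system. A later note in this thread points out that such reordering is not always possible when the problem arises in a subsequent redirection.

```shell
#!/bin/sh
# Form on which shells are reported to disagree (kept as a comment on purpose):
#   cat <<\EOF; echo $( printf 'a\nb\n' |
#       tr a-z A-Z )

# Rewrite 1: replace the ';' with a newline, so the NEWLINE token that starts
# here-document reading appears before the command substitution begins.
portable_form() {
    cat <<\EOF
here-doc text
EOF
    echo $( printf 'a\nb\n' |
        tr a-z A-Z )
}

# Rewrite 2: when the io_here and the command substitution belong to the same
# simple command, move the redirection after the command substitution.
moved_redirection() {
    cat $( printf '%s\n' - |
        cat ) <<\EOF
here-doc text
EOF
}

portable_form
moved_redirection
```

In both rewrites the io_here operator is followed directly by the line end, so no newline inside a command substitution can be a candidate for starting the here-document body.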

geoffclare 2021-12-20 10:23 reporter bugnote:0005569	Re 0001036:0005568 The text you say is unnecessary: If any saved token includes a <newline> character, the behavior is unspecified. is needed for things like this: cat << \EOF > foo$( echo bar) string EOF Some shells write string to the file foobar, but dash reports a syntax error, and ksh93 gets a memory fault (with the versions I tried) so it is anyone's guess what it is trying to do. The example you give as an "omission" is another case where a saved token includes a <newline>, and so is also covered by the text you say is unnecessary. If it is true that all shells handle a quoted <newline> the same, then the text could be changed to say: If any saved token includes an unquoted <newline> character, the behavior is unspecified.

kre 2021-12-20 12:30 reporter bugnote:0005570	Re 0001036:0005569 You're right that the two cases (the one you showed in your note, and the one in my note) are essentially the same - they're both where a newline appears (as a token, not as just a newline character) in a command substitution which starts on the same line as (but after) an io_here operator. That's why I said: Those words might have been intended to apply to the following issue, as I recognised that might have been what was intended. My point was that it is hard to justify that, as before the token containing the newline character is seen in this case, the lexer must have seen that character (in the cases where it matters) as a newline token, that's the only way the end of the command substitution can be correctly found, and that has to have happened before the token containing that newline character can be formed. Hence, that newline character is the next NEWLINE token referenced in the second line of your proposed text to go at line 74748 - that's what I meant by the "(or certainly not unambiguously)" comment. If that one is the next newline TOKEN, then the issue raised (newline in a token) doesn't arise, as the here document text will have been read before that token is formed. However, I know not all shells treat things this way, and this is why I claimed that this case (both yours and mine) needs more explanation of what is intended to work, and what isn't. However, your example does serve to demonstrate that I was wrong in thinking that users can always simply rearrange their input to avoid problems with io_here and newline tokens which some shells will interpret as the significant one, and others will not. I had not considered the case where the problem occurs in a subsequent redirection operator, and those cannot always simply be arbitrarily reordered (the order could be swapped in your example, but that's not always true) - which makes this a rather more serious problem than I considered it.
The question of whether the newline in question is quoted or not gets even murkier - no issue in the examples (yours or mine), it isn't, but the examples might use "$(....)" instead of an unquoted command substitution, which changes nothing at all with respect to a newline token appearing within that command substitution, but certainly makes it quoted as far as the token being formed is concerned. The case where there is no issue is when a (obviously must be quoted) newline character appears in a token in a situation where that newline character is never considered as a token of its own, something like cat <<EOF >'strange filename' LINE EOF which I believe all shells treat the same way - that case need not be made unspecified or undefined or anything else other than "works as expected".
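
A minimal sketch of the "works as expected" case, assuming the newline sits inside the single-quoted file name (the flattened example above does not show where the line break was). The scratch directory name and the variables demo_dir and content are invented for this illustration, and the result is only what the note says is believed to be common to all shells.

```shell
#!/bin/sh
# Scratch directory; the name is an assumption made for this sketch.
demo_dir=${TMPDIR:-/tmp}/heredoc_demo.$$
mkdir "$demo_dir" || exit 1

(
    cd "$demo_dir" || exit 1
    # The newline inside the single quotes is part of the file name token,
    # never a NEWLINE token, so here-document reading starts only after the
    # line that closes the quote.
    cat <<EOF >'strange
filename'
LINE
EOF
)

# Read the file back through the same two-line name, then clean up.
content=$(cat "$demo_dir/strange
filename")
rm -rf "$demo_dir"
printf '%s\n' "$content"
```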

geoffclare 2021-12-20 15:09 reporter bugnote:0005571
> My point was that it is hard to justify that, as before the token containing
> the newline character is seen in this case, the lexer must have seen that
> character (in the cases where it matters) as a newline token, that's the only
> way the end of the command substitution can be correctly found, and that has
> to have happened before the token containing that newline character can be formed.
>
> Hence, that newline character is the next NEWLINE token referenced in the
> second line of your proposed text

No, according to the standard (XCU 2.3) everything from the $( to the closing ) is part of a single token. In my example the token starts with foo$( and ends with ). The shell finds the closing ) by recursively parsing the command substitution, but it does not use the results of that parsing for anything except finding the ).

> the examples might use "$(....)" instead of an unquoted command substitution,
> which changes nothing at all with respect to a newline token appearing
> within that command substitution, but certainly makes it quoted as far as
> the token being formed is concerned.

Characters within "$(...)" are not quoted by the outer quotes. Otherwise: echo "$(echo foo | cat)" would output: foo | cat instead of just foo.
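
The last point can be checked directly. The following is a small sketch, with the variable names inner and literal invented for the illustration: the pipe inside "$(...)" still acts as a pipe operator, whereas a pipe character that really is inside double quotes is ordinary text.

```shell
#!/bin/sh
# The '|' inside the command substitution is not quoted by the outer double
# quotes, so "echo foo" is piped through cat.
inner=$(echo "$(echo foo | cat)")

# For contrast: here the '|' is inside the double quotes itself and is
# passed to echo as literal text.
literal=$(echo "foo | cat")

printf '%s\n' "$inner" "$literal"
```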

Date Modified	Username	Field	Change
2016-03-22 03:18	kre	New Issue
2016-03-22 03:18	kre	Name	=> Robert Elz
2016-03-22 03:18	kre	Section	=> 2.7.4
2016-03-22 03:18	kre	Page Number	=> unknown
2016-03-22 03:18	kre	Line Number	=> unknown
2016-03-22 06:19	kre	Note Added: 0003097
2016-03-22 19:52	joerg	Note Added: 0003098
2016-03-23 09:29	geoffclare	Note Added: 0003099
2016-03-23 09:30	geoffclare	Relationship added	related to 0000583
2016-03-23 10:55	joerg	Note Edited: 0003098
2016-03-23 10:57	joerg	Note Added: 0003100
2016-03-23 11:13	joerg	Note Edited: 0003098
2016-03-23 11:15	joerg	Note Edited: 0003098
2016-03-23 11:30	geoffclare	Note Added: 0003101
2016-03-24 00:04	kre	Note Added: 0003103
2016-03-24 00:18	kre	Note Added: 0003104
2016-03-24 00:28	kre	Note Added: 0003105
2016-04-04 19:52	chet_ramey	Note Added: 0003124
2016-04-04 21:48	jilles	Note Added: 0003125
2016-04-05 12:20	joerg	Note Added: 0003126
2016-04-05 13:58	kre	Note Added: 0003127
2016-04-07 00:46	kre	Note Added: 0003134
2016-04-07 01:05	kre	Note Added: 0003135
2016-04-07 05:42	~~Don Cragun~~	Page Number	unknown => 2335-2336
2016-04-07 05:42	~~Don Cragun~~	Line Number	unknown => 74235-74256
2016-04-07 05:42	~~Don Cragun~~	Interp Status	=> ---
2017-02-25 14:12	stephane	Note Added: 0003572
2017-03-02 16:18	nick	Relationship added	related to 0001037
2017-03-03 15:57	kre	Note Added: 0003592
2017-03-23 16:11	nick	Relationship added	related to 0001043
2017-05-11 13:46	kre	Note Added: 0003690
2017-05-11 14:01	joerg	Note Added: 0003691
2017-05-11 14:05	joerg	Note Edited: 0003691
2017-06-10 00:58	kre	Note Added: 0003756
2021-09-10 08:47	geoffclare	Relationship added	related to 0001521
2021-12-17 15:01	geoffclare	Note Added: 0005561
2021-12-17 15:03	geoffclare	Note Edited: 0005561
2021-12-17 16:32	shware_systems	Note Added: 0005564
2021-12-18 13:10	kre	Note Added: 0005568
2021-12-20 10:23	geoffclare	Note Added: 0005569
2021-12-20 12:30	kre	Note Added: 0005570
2021-12-20 15:09	geoffclare	Note Added: 0005571
2022-01-06 17:21	geoffclare	Final Accepted Text	=> 0001036:0005561
2022-01-06 17:21	geoffclare	Status	New => Resolved
2022-01-06 17:21	geoffclare	Resolution	Open => Accepted As Marked
2022-01-06 17:21	geoffclare	Tag Attached: tc3-2008
2022-01-06 17:24	~~Don Cragun~~	Relationship replaced	has duplicate 0001043
2022-01-06 17:26	~~Don Cragun~~	Relationship replaced	has duplicate 0001521
2022-02-04 09:56	geoffclare	Status	Resolved => Applied
2024-06-11 08:57	agadmin	Status	Applied => Closed

Relationships

related to	0000583	Closed	ajosey	1003.1(2008)/Issue 7	When is any character in the delimiter word quoted?
has duplicate	0001043	Closed		1003.1(2013)/Issue7+TC1	Which newline starts collection of here document data?
has duplicate	0001521	Closed		1003.1(2016/18)/Issue7+TC2	here document processing is underspecified
related to	0001037	Closed		1003.1(2013)/Issue7+TC1	The grammar for here documents misses the data body and the final EOF condition