Austin Group Defect Tracker

Aardvark Mark IV


ID Category Severity Type Date Submitted Last Update
0001036 [1003.1(2013)/Issue7+TC1] Shell and Utilities Objection Error 2016-03-22 03:18 2024-06-11 08:57
Reporter kre View Status public  
Assigned To
Priority normal Resolution Accepted As Marked  
Status Closed  
Name Robert Elz
Organization
User Reference
Section 2.7.4
Page Number 2335-2336
Line Number 74235-74256
Interp Status ---
Final Accepted Text Note: 0005561
Summary 0001036: Errors/Omissions in specification of here document redirection
Description Aside from the question of just which newline is the "next" newline, that
has been canvassed (without resolution I can see) elsewhere, there are
several problems with the specification of here documents.

First, given that the here doc is processed after encountering a newline
(which newline is the other issue) they must be largely processed as a side
effect of lexical processing, as newlines (other than those that happen to be literal) no longer exist in the scanned form of the shell input; they have
served as token delimiters, and are not otherwise relevant. This would suggest
that the here document is processed during lexical analysis - and nothing in the
specification contradicts that. The spec does say that (given an unquoted
delimiter word) the text is subject to various expansions. It does not say that those expansions should not be performed while reading the here doc text; however, I believe that is (or should be) the intent - that is, if the here doc
is never used because it is attached to a command that is never executed (on the "wrong" side of an && or similar) then the expansions in the here doc should not be performed. It could be I am missing something, but I cannot see any text that says that the expansions in the here doc should be evaluated in the context of the command that is about to use the data, immediately before it is used (in the appropriate sequence of all applicable redirect operations).
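
A minimal sketch of the first point, assuming a shell that defers here document expansion until the redirection is actually performed (the behaviour argued for above). The variable name probe is made up for this illustration; ${probe=expanded} assigns a value only if the here document body really is expanded.

```shell
unset probe
# cat sits on the "wrong" side of &&, so it is never executed.
false && cat <<EOF
${probe=expanded}
EOF
# Had the body been expanded while it was being read, probe would now be set.
result="${probe-not expanded}"
printf '%s\n' "$result"
```

On shells that expand the body only when the redirection is evaluated, this reports "not expanded".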

Second, the text says ...

      If any character in word is quoted, the delimiter shall be...

and in the following paragraph ...

      If no characters in word are quoted, all lines of the ...

but I do not believe that is what is intended, and is not what is actually
implemented in any shell I can find. Consider ...

      cat << ""EOF
      lines of text
      EOF

The delimiter there is the string EOF in which none of the characters were quoted. True, it was preceded by a quoted null string, but that contains no quoted characters. Hence no characters of the delimiter word were quoted, and according to the spec, "lines of text" should be subject to the various expansions. No-one implements it that way; the test is not whether any characters in the word are quoted, but whether any quote characters were encountered while scanning word.
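
The following sketch makes that observable by putting something expandable in the body. It assumes a shell that treats any quoting seen while scanning the delimiter word as suppressing expansion (the notes below report that the original Bourne shell does not behave this way).

```shell
# The delimiter word is ""EOF: an empty quoted string followed by unquoted EOF.
out=$(cat << ""EOF
$HOME is left alone
EOF
)
printf '%s\n' "$out"
```

If the shell counts the empty "" as quoting, the text $HOME appears literally in the output rather than a directory name.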

Third, in cases where expansions are done, nothing makes it explicit that the end delimiter cannot be found as the result of an expansion. That is, in the
following, there is one here document that happens to contain the string
         echo foo << EOF
and not two here documents

         end=EOF
         cat <<EOF
         lines of text
         $end
         echo foo <<EOF
         another line
         EOF

Of course, if the first question above is resolved to make it clear that expansions do not happen when the document is being read, this would be a moot point, as $end expanding to EOF would not be known while the here doc is being read, which I believe is the correct interpretation.
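
A reduced version of the example above, as a sketch that can be run directly. The body lines are shortened for illustration; the point is only that a line which expands to EOF does not end the here document.

```shell
end=EOF
out=$(cat <<EOF
lines of text
$end
still part of the same here document
EOF
)
printf '%s\n' "$out"
```

The output has three lines, the second being the text EOF produced by expanding $end, which shows that the delimiter search looked at the unexpanded input.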

Fourth, I am totally confused by the relationship between double quoting and
backtick command expansions. Section 2.2.3 appears to say that if backticks appear inside double quotes, then the double-quote interpretation continues through the command expansion (if that were not true, it would not be possible for a double quoted string to start before a ` command substitution, and end inside it, as a " inside the `...` would be the start of a new string, not the end of the previous one, the same as it is in $( ) command substitutions).
The relevance of this to here documents is illustrated by the following ...

        echo "` cat << EOF
        X = $(( 1 + 2 ))
        EOF
        `"

If things are as I have postulated, then the EOF is quoted (by the double
quotes that surround the command substitution) and hence the here document
should not be expanded, and echo should (eventually) output

       X = $(( 1 + 2 ))

and not

       X = 3

but again, I do not believe this is in accordance with what any shell does.
This again may be an artifact of the 2nd point above, and if the text is changed so that only quote characters encountered while scanning the delimiter word cause the expansion to be suppressed, and not whether "characters are quoted" then this issue will go away.
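
A sketch for checking what a given shell does here, capturing the result in variables (out1 and out2 are names chosen for this illustration) instead of echoing it. It covers both the backtick form from the example and the $( ) form.

```shell
# Backtick form, mirroring the example above: EOF itself carries no quoting.
out1="`cat <<EOF
X = $(( 1 + 2 ))
EOF
`"
# The same check with the $( ) form of command substitution.
out2="$(cat <<EOF
X = $(( 1 + 2 ))
EOF
)"
printf '%s\n%s\n' "$out1" "$out2"
```

In shells that ignore the enclosing double quotes when deciding whether the delimiter is quoted, both variables hold "X = 3".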

Fifth, and more minor I think, when the delimiter is not quoted,
the text states that backslashes work the way they do in double quoted strings, and references section 2.2.3 for the details. There we are informed that inside double quotes, \ is only special (only a quote character) when the following character is one of \ " ` $ and newline (so for example "\n" is a
two character string). But then (back to 2.7.4) the text goes on to say that
inside the here document " is not special. The problem is that it is not
clear whether \ continues to act as a quote character when followed by this non-special " or not (ie: is \" in a here document, with an unquoted delimiter word, one character, or two?) I believe two is correct.

Sixth, and perhaps most important of all, there is no discussion of what is expected to happen when the input string ends before the here document delimiter is encountered. Most important, because unlike the previous issues where I believe all shells (all I could find) actually agree on what should be done, and the text just needs to be more clear, for this one, there is a difference of opinion. Some shells treat end of file as equivalent to the
delimiter, and go ahead and execute whatever command the here document was attached to with as much input as they managed to gather (one issues a warning when it does this, but does it anyway, most that adopt this behaviour do it silently.) Other shells consider this to be a redirect error, suppress execution of the command, and set $? to indicate failure. Personally I believe that the latter is the best approach, as it avoids situations where the
shell eats the entire rest of the script as the here document because of some error or other (the one that happens to me from time to time is that I cut & paste a script, or script segment, and the tabs that had been present get converted to spaces, and then the <<-EOF does not stop on space space..EOF
where it would have with tab EOF.)
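
The tab versus space difference behind that cut & paste accident can be sketched as follows. The script text is built in a variable (script and tab are names invented here) so that the tab characters survive copying. Only the well-formed case is run, since the unterminated case is exactly the behaviour in dispute.

```shell
tab=$(printf '\t')
# One body line indented with a tab, one with spaces; delimiter indented with a tab.
script="cat <<-EOF
${tab}tabbed
    spaced
${tab}EOF
"
out=$(eval "$script")
printf '%s\n' "$out"
```

The leading tab is stripped from "tabbed" and from the delimiter line, while the four spaces before "spaced" are kept. A delimiter line indented with spaces would therefore not be recognised.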
Desired Action Change the words "If any character in word is quoted" to "If any quoting character is encountered while scanning word", and "If no characters in word are quoted" to "If no quoting characters are encountered while scanning word".

At the end of the paragraph that currently starts "If no characters in word are quoted" add an extra sentence along the lines of ... "The expansions listed are performed in the context of the command about to be executed to which the here document contents are to be input, and at the appropriate time in the sequence of all redirect operators applying to that command - if that command is never executed the here document shall not be expanded."

Where it talks about <backslash> quoting in here documents with unquoted delimiter words, add some text to make it clear that even though a \ in ""
quotes a ", a \ in a here document does not, and the sequence \" is 2 chars.

Finally, add (somewhere) words to the effect "If the terminating line of the here document is not located before the shell exhausts its input, the behaviour is undefined, implementers are encouraged to treat this as a redirect error,
but applications should not rely upon this."
Tags tc3-2008
Attached Files

- Relationships
related to 0000583 Closed ajosey 1003.1(2008)/Issue 7 When is any character in the delimiter word quoted?
has duplicate 0001043 Closed 1003.1(2013)/Issue7+TC1 Which newline starts collection of here document data?
has duplicate 0001521 Closed 1003.1(2016/18)/Issue7+TC2 here document processing is underspecified
related to 0001037 Closed 1003.1(2013)/Issue7+TC1 The grammar for here documents misses the data body and the final EOF condition

-  Notes
(0003097)
kre (reporter)
2016-03-22 06:19

When I look more closely, I think you can forget the fifth issue in the
list (and the third of the desired actions), I missed the phrase "when
considered special" in 2.2.3 (the description of \ inside "...").

But I think you can replace that one with a request that the "<newline>" in
"begins after the next <newline>" (2nd paragraph in 2.7.4) should be
changed to "NEWLINE token" (as is done in the description of token recognition
in section 2.3 (2nd paragraph). The difference is that in

        nl='
        '

there is a <newline> but no NEWLINE token. I think it is clear that in
a sequence like

        cat <<EOF; nl='
        '
        line 1
        line 2
        EOF

The here document starts at "line 1" not at the line containing just the '
character, even though that is the line after the next <newline>

Similarly, there is a <newline> in the \
sequence (\ followed immediately by newline), that is also not a NEWLINE
token, and a here document would not start after that either.
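
The quoted-newline case above can be tried with this sketch. Wrapping the commands in a function (demo is a name chosen for illustration) is only a convenience for capturing the output.

```shell
demo() {
cat <<EOF; nl='
'
line 1
line 2
EOF
}
out=$(demo)
printf '%s\n' "$out"
```

The here document consists of "line 1" and "line 2"; the line holding only the closing ' is consumed as part of the assignment to nl, not as here document data.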

This still leaves the question of NEWLINE tokens embedded in command substitutions and similar, where the << operator was outside.
(0003098)
joerg (reporter)
2016-03-22 19:52
edited on: 2016-03-23 11:15

I did not yet check the other cases but you are mistaken with respect to

   cat << ""EOF
   $$
   EOF

as it expands to the process id of the Bourne Shell and even ksh88 and ksh93 document exactly this behavior.

From the ksh93 documentation:

   If any character of word is quoted,
   then no interpretation is placed upon the
   characters of the document. Otherwise, parameter
   expansion, command substitution, and
   arithmetic substitution occur,

so it seems that you discovered a ksh bug.

Check the original Bourne Shell at:

http://schilytools.sourceforge.net/bosh.html

to verify that your example causes parameter substitution.

Note that the documentation from the Bourne Shell, ksh88, ksh93,
bash, mksh and zsh clearly mention that no expansion occurs when
any of the characters from word is quoted. The dash man page is
not written clearly and thus does not help.

Looking at the ksh88 source, it is obvious that the deviating
behavior from ksh is an unintended side-effect of the rewritten
field splitting code that is used to strip off the quoting from
"word".

(0003099)
geoffclare (manager)
2016-03-23 09:29

We are already fixing the "any character in word" problem: TC2 changes it to "any part of word". See 0000583
(0003100)
joerg (reporter)
2016-03-23 10:57

Geoff, it seems that this was a mistake as from what I can say,
the ksh behavior was changed unintentionally and the new text
is in conflict with both the documentation and the behavior
of the Bourne Shell.
(0003101)
geoffclare (manager)
2016-03-23 11:30

No it is not a mistake. That is how ksh88 behaves, and the POSIX shell was based on ksh88 not Bourne.
(0003103)
kre (reporter)
2016-03-24 00:04

Thanks for the pointer to issue 583 - I actually did a search (I looked
for references to 2.7.4) and the search did not produce that one...
But that resolution is fine for that point, though you might want to
amend the language just a little more to handle the case where the
whole redirection (including the delimiter word) is in a quoted environment
(making it clear that quotes need to be explicit in "word" itself to count
as quoting existing for this purpose). The suggestion (in 583) that
"if quote removal changes the word, it was quoted" seems about right to me,

kre
(0003104)
kre (reporter)
2016-03-24 00:18

To make the first point in the issue more clear (or one aspect of it
anyway), consider the following ...

unset X
cat <<EOF
${X=2}
EOF
echo "${X-1}"

No question but that the output from cat is "2" (a line containing 2),
but what value does the echo line print, 1 or 2 ? That all depends upon the
context in which the here doc is evaluated. If it is in the context of the
shell running the script, then the answer is 2. On the other hand if it is
in the context being established to run cat, then 1 would be the answer.

My testing shows shells (I have to test) about equally divided on this issue,
but I would have expected that 1 makes most sense, given that the here document
is processed at the correct point in the sequence of redirections.

kre
(0003105)
kre (reporter)
2016-03-24 00:28

For the sixth point, consider this example

cat > File1 <<EOF 2>File2
lines of text
but no line containing "EOF"
and the script ends right here.

There are several possibilities here, one is that the here doc with no
end delimiter is a lexical or parser level syntax error, and nothing else
is done with the command at all (a non-interactive shell would exit).
That is the one I prefer...

Another is that the faulty here document is discovered during redirect
processing, after File1 is created, before File2 is created, and things
stop in that state, again with a syntax error, but later in the processing.

Third is that it isn't an error at all, cat is run, the data present is
sent to File1, File2 is created, but empty, as there are no errors. Exit
status would be 0. This solution (though seemingly quite common) I do
not like at all, as this almost always indicates some kind of user error,
and giving either half the intended data, or just as likely, more than
intended when the end delimiter is entered in an incorrect way, is just as
bad as being unable to open a file in a normal '<' redirection, so just
going ahead and substituting some other file (/dev/null maybe) instead.
Not sane.

kre
(0003124)
chet_ramey (reporter)
2016-04-04 19:52

I'm interested in what the group would like to do about the EOF-as
here-document-delimiter issue.

As kre says, just about every shell I looked at allows EOF to delimit a
here document (mksh is the notable exception). It's clearly existing
practice, but the standard is silent. Does this render all these shells
non-conformant?

The other interesting case is whether or not a shell allows an instance of
the delimiter immediately followed by the end of a command substitution
(`)' or ``') to delimit a here-document. For example, what should the
following output?

x=$(cat <<EOF
a
b
EOF)
echo "$x"
echo after

There are varying behaviors.

Shells allowing delimiter+right paren to delimit here document in $(...):
ksh93, bash, mksh, zsh, posh
Shells that do not: dash, BSD(s) sh

Shells allowing delimiter+backquote to delimit here document in `...`:
ksh93, SVR4.2 sh, bash, mksh, zsh, posh, dash, BSD(s) sh

Even in this there is varying behavior: dash uses EOF (in the form of the
end of the `...` command substitution) as the delimiter and includes the
delimiter word as part of the here document.

Is it worthwhile to add text saying the behavior is unspecified if the
shell encounters end-of-file before finding the here-document delimiter?
What about the command substitution case?
(0003125)
jilles (reporter)
2016-04-04 21:48

Given that $(case x in x) : ;; esac) is a single valid command substitution, I would expect the following two to be as well:

$(cat <<EOF &&
EOF)
EOF
:)

$(if :; then cat <<EOF
EOF)
EOF
fi)

If that is accepted but:

$(cat <<EOF
x
EOF)

is also to be a single valid command substitution, detecting whether the end marker with closing parenthesis is an end marker is rather complicated.

There is no such issue with:

`cat <<EOF
x
EOF`
(0003126)
joerg (reporter)
2016-04-05 12:20

`cat <<EOF
x
EOF`

is parsed in a different way than

$(cat <<EOF
x
EOF)

The first one is parsed on a lexical base, i.e. the next unescaped "`"
is searched for, before any instance in the shell tries to understand
the here document.

In the second form, there is a need to recursively call the parser in
order to understand where the end of the $() command is. This is caused
by the fact that the number of opening and closing parentheses in a
command is not always equal. For this reason, with $(), the here document
is read in during the lexical scan already, because the lexical scan calls
a recursive parser.
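
As a small sketch of why counting parentheses is not enough, the case pattern below contributes an unmatched ")" inside the substitution, in the same way as the case example in the previous note. It assumes a shell whose $( ) scanning invokes the real parser.

```shell
# The ")" after the pattern x does not close the command substitution.
out=$(case x in x) echo matched ;; esac)
printf '%s\n' "$out"
```

A shell that simply looked for the first unquoted ")" would cut the substitution off after "case x in x".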

Note that this is a POSIXLY correct command:

echo $(if cat <<EOF
1
2
3
EOF)
EOF
then
echo a
fi)

but it is not accepted by ksh93 because ksh93 implements a funny
recognition of "EOF)".
(0003127)
kre (reporter)
2016-04-05 13:58

Re note 3124:
    Is it worthwhile to add text saying the behavior is unspecified if the
    shell encounters end-of-file before finding the here-document delimiter?

I would say yes - along with an admonition on applications to always supply
the end delimiter.

The NetBSD shell is another which (now) complains about here docs without an
end delim (treats it as a syntax error), which I think is far and away the
best thing for shells to do - and I would encourage all of you who are
implementers to do that. Since we made that change, we have (as far as I
know) encountered exactly 1 script which was working only because of the
"eof terminates a here doc" behaviour - and that one was almost certainly an
accident (it was one of many scripts, several others of which also had here
docs that ran to the end of the script, and only that one was missing the
end delimiter). So, I would not be too worried that you will be breaking
large numbers of scripts that are relying on EOF delimiting here docs, the
users don't seem to know about this, or find it simple enough to add the
string just before EOF...

On the other hand, accidentally getting the here doc end delimiter incorrect
is easy to do - say a space amongst the leading tabs that are to be stripped,
or a simple typo - having that silently cause the rest of the script to be
treated as here doc content, silently, and simply carrying on working, is
not friendly.

On:
      What about the command substitution case?

That one should simply be regarded as incorrect, the spec is already clear
on what makes an end delimiter, and a ) following the string is not it, the
newline before the ) is required. (And yes, I agree, the `` case is
quite different, in all ways, even though it seems initially to be just a
different, more difficult to nest, equivalent.)
(0003134)
kre (reporter)
2016-04-07 00:46

Since it seems agreed that there is no consistency on what happens if
a here doc is not terminated, and that applications neither need to, nor
ever should, rely upon any particular behaviour there (and in practice,
do not seem to), I suggest that the following wording be added to the end
of the normative text in section 2.7.4 (just before the informative example).
[Sorry, I do not have page or line numbers - someone else will need to
add those.]

    The effect of failing to detect the here-document delimiter before the
    shell exhausts its input stream is unspecified. Applications shall
    ensure the delimiter is present.

And perhaps in a rationale section somewhere...

    Traditional shell behaviour has been to treat "end of file" as being
    equivalent to the delimiter of a here document, terminating the here
    document, usually without any indication, and continuing as if the
    delimiter had been recognised. This can cause problems where the
    delimiter had been intended to occur much earlier in the script, but
    was incorrectly entered - a mistake which for many other errors would
    have resulted in a syntax error, and an aborted script, instead simply
    generates incorrect results. Because of this some shell implementations
    have changed to reporting an undelimited here document as a syntax error.
    Other implementations are encouraged to do the same.

or maybe something less wordy with similar effect...

The other issues still need resolution.
(0003135)
kre (reporter)
2016-04-07 01:05

Another of my original issues, in which I believe (hope) there is no
dissent (as in, I believe all shells act this way, it is just not, yet,
written in the standard), also add to the normative text of section 2.7.4,
in the paragraph that begins:

    If no characters in word are quoted, ...

add the following sentence after the initial sentence of the paragraph
(again, sorry, page/line numbers not available to me):

    This expansion happens after the here document delimiter has been
    recognised and the here document extracted from the input stream,
    and thus the end delimiter for the here document cannot be generated
    as a by-product of the expansion.

And to make things even clearer (I believe this is how shells behave, but
am less certain that this is universal), at the end of that same paragraph
add

    When an unquoted backslash is followed by a newline, line joining occurs,
    and the backslash newline combination is removed. This occurs while the
    here document is being scanned, the end delimiter will not be recognised
    immediately after a newline that has been deleted in this way.
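
A sketch of the line joining case, chosen so that the joined line is not itself equal to the delimiter (the case where the backslash line is otherwise empty is the one on which shells are later reported to differ).

```shell
out=$(cat <<EOF
first \
EOF
second
EOF
)
printf '%s\n' "$out"
```

The backslash newline pair is removed, so the first EOF becomes part of the line "first EOF" and the here document ends only at the final EOF.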
(0003572)
stephane (reporter)
2017-02-25 14:12

About your sixth point, I'm not sure I see a problem. It's currently unspecified so applications have to make sure their delimiter is provided and implementations can do what they want when it's not, allowing a warning or error message, or ignoring the problem, all of which are valid approaches to me.

There's also the case of

eval "cat << EOF"
xxx
EOF

(and the same with the "." command) that may need to be covered.
(0003592)
kre (reporter)
2017-03-03 15:57

Re note 3572 ... I agree that the behaviour is unspecified in the literal
sense (in that the spec says nothing about it at all), I do not agree that
is adequate however - if unspecified behaviour is what is expected in this
case (and I'd certainly accept that as an outcome for that point), it ought
to be explicitly unspecified, not just literally.

kre
(0003690)
kre (reporter)
2017-05-11 13:46

When this issue reaches the head of the queue, it might be worthwhile
spending a minute or two on the subject of \newline continuation lines
in here docs, and their effect on tab suppression, and end-string recognition.

For the first of those, given

cat <<-EOF
        \
        X
EOF

(where the white space is supposed to represent a tab character, but is
spaces here, as I cannot seem to input a tab in the form...) what exactly
is expected to be written to stdout? That is, is the tab before X a
leading tab, or not?

For the second, the following script (with the same caveat about spaces and
tabs...)

EOF() { printf 'EOF executed as a command\n'; }
cat <<-EOF
        \
EOF
EOF

executes differently in different shells, in some it simply says "EOF"
(where the second EOF is the end-string) and in others it says
"EOF executed as a command" where the first EOF is the end string.
(Here similar things happen without the tab stripping if the \ line
contains only a \ character).
(0003691)
joerg (reporter)
2017-05-11 14:01
edited on: 2017-05-11 14:05

Re: Note: 0003690

With your first example, all shells except ksh93 seem to print just a
X

With your second example, the historic Bourne Shell, ksh88, bash,
bosh, mksh print

"EOF executed as a command"

While ksh93, dash, yash, zsh are non-compliant.

You discovered an interesting aspect.

(0003756)
kre (reporter)
2017-06-10 00:58

There is another case worth considering ...

cat <<'EOF
REALLY'
Hello
EOF
REALLY

Is this supposed to work, or not? I see nothing in the text that prohibits
a \n as one of the characters of the end delimiter (obviously, like spaces,
and other operator type characters, it can only occur in a quoted delimiter)

My tests show that yash simply forbids it. Nothing else I tested does, though
when the here doc delimiter contains a \n, none of bash, zsh, mksh, or bosh
seem to recognise anything as the delimiter. The ash derived shells (dash,
freebsd netbsd) and ksh93 all just say "Hello" when given the command above
(the two line end delimiter is handled just fine .. I didn't test more than
two, but I am fairly confident that at least the FreeBSD and NetBSD shells
would handle as many embedded \n's as you want to give - probably the others
as well.)
(0005561)
geoffclare (manager)
2021-12-17 15:01
edited on: 2021-12-17 15:03

I suggest the following changes are made to resolve this bug and the related bugs 0001043 and 0001521.

On 2018 edition page 2347 line 74748 section 2.3, change:
When an io_here token has been recognized by the grammar (see [xref to 2.10]), one or more of the subsequent lines immediately following the next NEWLINE token form the body of one or more here-documents and shall be parsed according to the rules of [xref to 2.7.4].
to:
When an io_here token has been recognized by the grammar (see [xref to 2.10]), one or more of the subsequent lines immediately following the next NEWLINE token form the body of a here-document and shall be parsed according to the rules of [xref to 2.7.4]. Any non-NEWLINE tokens (including more io_here tokens) that are recognized while searching for the next NEWLINE token shall be saved for processing after the here-document has been parsed. If a saved token is an io_here token, the corresponding here-document shall start on the line immediately following the line containing the trailing delimiter of the previous here-document. If any saved token includes a <newline> character, the behavior is unspecified.
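
As an illustration of the saved io_here rule in the replacement text above (a sketch, not part of the proposed wording), two here documents requested on one line are read one after the other. The function name demo and the delimiters A and B are arbitrary.

```shell
demo() {
cat <<A; cat <<B
body of A
A
body of B
B
}
out=$(demo)
printf '%s\n' "$out"
```

The second here document starts on the line immediately after the one containing the delimiter A.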

On 2018 edition page 2348 line 74774 section 2.3, after:
While processing the characters, if instances of expansions or quoting are found nested within the substitution, the shell shall recursively process them in the manner specified for the construct that is found.
append:
For "$(" and '`' only, if instances of io_here tokens are found nested within the substitution, they shall be parsed according to the rules of [xref to 2.7.4]; if the terminating ')' or '`' of the substitution occurs before the NEWLINE token marking the start of the here-document, the behavior is unspecified.

On 2018 edition page 2362 line 75355 section 2.7.4, change:
The here-document shall be treated as a single word that begins after the next <newline> and continues until there is a line containing only the delimiter and a <newline>, with no <blank> characters in between. Then the next here-document starts, if there is one.
to:
The here-document shall be treated as a single word that begins after the next NEWLINE token and continues until there is a line containing only the delimiter and a <newline>, with no <blank> characters in between. Then the next here-document starts, if there is one. For the purposes of locating this terminating line, the end of a command_string operand (see [xref to sh]) shall be treated as a <newline> character, and the end of the commands string in <tt>$(commands)</tt> and <tt>`commands`</tt> may be treated as a <newline>. If the end of input is reached without finding the terminating line, the shell should, but need not, treat this as a redirection error.

On 2018 edition page 2362 line 75363 section 2.7.4, change:
It is unspecified whether the file descriptor is opened as a regular file, a special file, or a pipe.
to:
It is unspecified whether the file descriptor is opened as a regular file or some other type of file.

On 2018 edition page 2362 line 75366 section 2.7.4, change:
If any part of word is quoted, the delimiter shall be formed by performing quote removal on word, and the here-document lines shall not be expanded. Otherwise, the delimiter shall be the word itself.

If no part of word is quoted, all lines of the here-document shall be expanded for parameter expansion, command substitution, and arithmetic expansion. In this case, the <backslash> in the input behaves ...
to:
If any part of word is quoted, not counting double-quotes outside a command substitution if the here-document is inside one, the delimiter shall be formed by performing quote removal on word, and the here-document lines shall not be expanded. Otherwise:
  • The delimiter shall be the word itself.
  • The removal of <backslash><newline> for line continuation (see [xref to 2.2.1]) shall be performed during the search for the trailing delimiter. (As a consequence, the trailing delimiter is not recognized immediately after a <newline> that was removed by line continuation.) It is unspecified whether the line containing the trailing delimiter is itself subject to this line continuation.
  • All lines of the here-document shall be expanded, when the redirection operator is evaluated but after the trailing delimiter for the here-document has been located, for parameter expansion, command substitution, and arithmetic expansion. If the redirection operator is never evaluated (because the command it is part of is not executed), the here-document shall be read without performing any expansions.
  • Any <backslash> characters in the input shall behave ...

(As per Note: 0003097, no special mention of \" needs to be made here because 2.2.3 says "that would otherwise be considered special".)

On 2018 edition page 2362 line 75374 section 2.7.4, change:
all leading <tab> characters shall be stripped from input lines and the line containing the trailing delimiter.
to:
all leading <tab> characters shall be stripped from input lines after <backslash><newline> line continuation (when it applies) has been performed, and from the line containing the trailing delimiter. Stripping of leading <tab> characters shall occur as the here-document is read from the shell input (and consequently does not affect any <tab> characters that result from expansions).
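
A sketch of the last sentence of that change (not part of the proposed wording): a tab that arrives through expansion is not stripped, because stripping happens as the here document is read. The names tab, v and script are invented for this illustration, and the script is built in a variable so the tabs are unambiguous.

```shell
tab=$(printf '\t')
v="${tab}from expansion"
# \$v keeps the expansion literal in the script text, so it happens in the here document.
script="cat <<-EOF
${tab}literal
\$v
${tab}EOF
"
out=$(eval "$script")
printf '%s\n' "$out"
```

The tab typed before "literal" is removed, while the tab contained in the value of v survives into the output.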

Cross-volume changes to XRAT ...

On 2018 edition page 3736 line 128241 section C.2.7.4, insert a new paragraph:
Historical shell behavior was to treat the end of input as being equivalent to the delimiter of a here-document, terminating the here-document, usually without any indication, and continuing as if the delimiter had been recognized. This can cause problems where the delimiter had been intended to occur much earlier in the script, but was incorrectly entered - a mistake which for many other errors would have resulted in a syntax error, and an aborted script, instead simply generates incorrect results. Because of this some shell implementations have changed to reporting an undelimited here-document as a syntax error. Other implementations are encouraged to do the same.


(0005564)
shware_systems (reporter)
2021-12-17 16:32

Re: 5561
I think it needs to be more explicit line joining, especially if tokenizing is done incrementally, still has precedence over <newline> detection for finding the start of the initial io_here body. While the model is 'do line joining before tokenization of command lines' this requires more resources for potential read ahead's than otherwise.
(0005568)
kre (reporter)
2021-12-18 13:10

Re Note: 0005564

That is not an issue, a newline used in line joining is never a NEWLINE
token, and cannot possibly cause here doc reading to start. Everyone
agrees this is how it should be, and though the standard was not clear
while it was missing "token", after the proposed change it will be
clear with respect to that issue.

What resources are consumed is an implementation issue, not relevant here,
unless there were implementations that found themselves unable to implement
this correctly, of which I know of none.

Re Note: 0005561 in general that looks OK - there are a few places where the
wording seems a bit strange to me, though not incorrect, and a couple of
places where the result is not what I would call ideal - but I can see that
it is the best that can be achieved under current circumstances.

I will add another note on the wording issues when I get time.

In the meantime, there is one issue that can easily be fixed:

      If any saved token includes a <newline> character, the
      behavior is unspecified.

That is unnecessary. There's no difference of opinion on this issue,
a newline character which is not a newline token (eg: a newline in a quoted
string) never causes here document data reading - that is agreed by all.
Again, it might not have been clear in the standard before, but it is in
the implementations.

Those words might have been intended to apply to the following issue,
but they really do not (or certainly not unambiguously).

There's one omission which needs to be handled, which is the status of:

      cat <<\EOF; ls $( find . -name '*foo*' -type d -print |
                          sed '/^foo*$/d' )

Where does the here document data go in that case? The first newline token
encountered will be the one after the pipe inside the command substitution.
This one must be seen by the lexer (not necessarily the grammar, but it is
the lexer that reads here document data, and removes it from the input stream,
just as it removes line-joinings) as part of scanning that command
substitution. If it isn't parsing it correctly, when first read (whatever
is done with the results of that parse), which includes noting newline tokens
as they appear, the end of the command substitution cannot be correctly
located in all cases.

Our implementation expects to read the here doc immediately after the pipe,
that being the next newline token encountered by the lexer after an io_here
operator was encountered. Other implementations do not agree with that,
and don't consider any newline tokens observed while parsing the command
substitution as candidates for here documents for io_here operators that
appeared outside the command substitution (and before it, obviously).

This one is going to need very explicit language to deal with.

Clearly applications can be written to avoid this kind of thing simply by
ensuring a newline token appears before the command substitution starts,
where that is possible (here, simply by replacing the ';' with a newline).
In cases where it isn't possible, the io_here and the command substitution
must be part of the same simple command, so the io_here (and any other
redirections which follow it) can simply be moved after the command substitution.
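
The second workaround (moving the io_here after the command substitution) might look like the sketch below. The helper capture and the variables body and arg are hypothetical, and printf piped into tr merely stands in for the find and sed pipeline of the original example.

```shell
# capture is a hypothetical helper: it stores the first line of its
# standard input in "body" and its first argument in "arg".
capture() { IFS= read -r body; arg=$1; }

# The command substitution spans two lines, but the io_here operator
# comes after it, so the first NEWLINE token following "<<\EOF" is the
# end of the second line and the location of the body is unambiguous.
capture "$( printf 'x\ny\n' |
            tr -d '\n' )" <<\EOF
here-document text
EOF

printf 'arg=%s\nbody=%s\n' "$arg" "$body"
```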

However, if an application is written as above, or even worse

      cat <<\EOF ; ls $( sed <<\DONE

Whatever text follows that follows a newline token, and is not immediately
relevant. There is a newline token (not visible explicitly in this
note, but assume it is there) at the end of that line (after "DONE").

There is no question in that case but that the here doc data for sed
has to come after this newline token, before any future tokens belonging
to the command substitution can be encountered. The question is does the
here document data for cat come before it, or after it, or somewhere else
entirely.

The general principle is that here doc data is read in the order in
which the io_here tokens appeared on the line, left to right. "Line"
is a very little used concept in the shell grammar, but it is used in
this context - not command, not anything else, but "line".

To me this implies that the here doc data for cat must precede the here
doc data for sed. But again, not everyone agrees. Sorting this one
out might not be easy.
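
For the uncontroversial case, where both io_here operators sit on the same line outside any command substitution, the left-to-right ordering can be checked with a sketch like this one. The variable names one and two and the delimiters ONE and TWO are illustrative only. It says nothing about the disputed case where the second operator is inside a command substitution.

```shell
# Two io_here operators on one line, both outside any command
# substitution: the bodies follow in the same left-to-right order.
IFS= read -r one <<ONE; IFS= read -r two <<TWO
text for the first operator
ONE
text for the second operator
TWO

printf 'one=%s\ntwo=%s\n' "$one" "$two"
```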

For anyone interested, Chet Ramey and I have been having (since Nov 17)
a somewhat detailed discussion of these issues on the bug-bash list (in
a message thread with the Subject: Unclosed quotes on heredoc mode
(or Re:...)). Ignore the first few messages, they just provided the motivation
for our discussion/debate. Chet's message from Nov 17 is the first that
is relevant. I'm sure someone here can explain how to find the archives
of that list, but that someone is not me.
(0005569)
geoffclare (manager)
2021-12-20 10:23

Re Note: 0005568

The text you say is unnecessary:

      If any saved token includes a <newline> character, the behavior is unspecified.

is needed for things like this:

      cat << \EOF > foo$(
      echo bar)
      string
      EOF

Some shells write string to the file foobar, but dash reports a syntax error, and ksh93 gets a memory fault (with the versions I tried) so it is anyone's guess what it is trying to do.

The example you give as an "omission" is another case where a saved token includes a <newline>, and so is also covered by the text you say is unnecessary.

If it is true that all shells handle a quoted <newline> the same, then the text could be changed to say:

      If any saved token includes an unquoted <newline> character, the behavior is unspecified.
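
An application that wants to stay clear of the unspecified case can evaluate the command substitution on earlier lines, so that the line carrying the io_here operator has no token with an embedded <newline>. A sketch, in which tmpdir, name and content are illustrative names and mktemp -d is assumed to be available:

```shell
# tmpdir and name are illustrative; mktemp -d is assumed to be available.
tmpdir=$(mktemp -d) || exit 1

# Evaluate the multi-line command substitution on its own, earlier lines.
name=foo$(
echo bar)

# No token on this line contains a <newline>.
cat << \EOF > "$tmpdir/$name"
string
EOF

content=$(cat "$tmpdir/$name")
printf 'file=%s content=%s\n' "$name" "$content"
rm -rf "$tmpdir"
```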
(0005570)
kre (reporter)
2021-12-20 12:30

Re Note: 0005569

You're right that the two cases (the one you showed in your note, and the
one in my note) are essentially the same - they're both where a newline
appears (as a token, not as just a newline character) in a command substitution
which starts on the same line as (but after) an io_here operator.

That's why I said:

   Those words might have been intended to apply to the following issue,

as I recognised that might have been what was intended.

My point was that it is hard to justify that, as before the token containing
the newline character is seen in this case, the lexer must have seen that
character (in the cases where it matters) as a newline token, that's the only
way the end of the command substitution can be correctly found, and that has
to have happened before the token containing that newline character can be formed.

Hence, that newline character is the next NEWLINE token referenced in the
second line of your proposed text to go at line 74748 - that's what I meant by
the "(or certainly not unambiguously)" comment. If that one is the next
newline TOKEN, then the issue raised (newline in a token) doesn't arise, as
the here document text will have been read before that token is formed.

However, I know not all shells treat things this way, and this is why I
claimed that this case (both yours and mine) needs more explanation of what
is intended to work, and what isn't.

However, your example does serve to demonstrate that I was wrong in thinking
that users can always simply rearrange their input to avoid problems with
io_here and newline tokens which some shells will interpret as the significant
one, and others will not. I had not considered the case where the problem
occurs in a subsequent redirection operator, and those cannot always simply
be arbitrarily reordered (the order could be swapped in your example, but
that's not always true) - which makes this a rather more serious problem than
I considered it.

The question of whether the newline in question is quoted or not gets even
murkier - no issue in the examples (yours or mine), it isn't, but the
examples might use "$(....)" instead of an unquoted command substitution,
which changes nothing at all with respect to a newline token appearing
within that command substitution, but certainly makes it quoted as far as
the token being formed is concerned.

The case where there is no issue is when a newline character (which
obviously must be quoted) appears in a token in a situation where that
newline character is never considered as a token of its own, something like

    cat <<EOF >'strange
    filename'
    LINE
    EOF

which I believe all shells treat the same way - that case need not be made
unspecified or undefined or anything else other than "works as expected".
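
A variant of that example which can be run without creating an oddly named file, as a sketch only. The helper capture and the variables body and arg are hypothetical, and the quoted word is passed as an argument instead of being used as a redirection target.

```shell
# capture is a hypothetical helper: it stores the first line of its
# standard input in "body" and its first argument in "arg".
capture() { IFS= read -r body; arg=$1; }

# The <newline> inside the single-quoted word is part of that word; the
# first NEWLINE token after "<<EOF" is the one that ends the second line.
capture <<EOF 'strange
argument'
LINE
EOF

printf 'arg=%s\nbody=%s\n' "$arg" "$body"
```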
(0005571)
geoffclare (manager)
2021-12-20 15:09

> My point was that it is hard to justify that, as before the token containing
> the newline character is seen in this case, the lexer must have seen that
> character (in the cases where it matters) as a newline token, that's the only
> way the end of the command substitution can be correctly found, and that has
> to have happened before the token containing that newline character can be formed.
>
> Hence, that newline character is the next NEWLINE token referenced in the
> second line of your proposed text

No, according to the standard (XCU 2.3) everything from the $( to the closing ) is part of a single token. In my example the token starts with foo$( and ends with ). The shell finds the closing ) by recursively parsing the command substitution, but it does not use the results of that parsing for anything except finding the ).

> the examples might use "$(....)" instead of an unquoted command substitution,
> which changes nothing at all with respect to a newline token appearing
> within that command substitution, but certainly makes it quoted as far as
> the token being formed is concerned.

Characters within "$(...)" are not quoted by the outer quotes. Otherwise:

      echo "$(echo foo | cat)"

would output:

      foo | cat

instead of just:

      foo
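
This can be checked directly. In the sketch below the variable names piped and quoted are illustrative; the second command shows, for contrast, what is produced when the same characters really are quoted inside the command substitution.

```shell
# Inside "$(...)" the command is parsed afresh, so | is still a pipe.
piped=$(echo "$(echo foo | cat)")

# For contrast: here the | is quoted within the command substitution
# itself, so it reaches echo as a literal character.
quoted=$(echo "$(echo 'foo | cat')")

printf 'piped=%s\nquoted=%s\n' "$piped" "$quoted"
```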

- Issue History
Date Modified Username Field Change
2016-03-22 03:18 kre New Issue
2016-03-22 03:18 kre Name => Robert Elz
2016-03-22 03:18 kre Section => 2.7.4
2016-03-22 03:18 kre Page Number => unknown
2016-03-22 03:18 kre Line Number => unknown
2016-03-22 06:19 kre Note Added: 0003097
2016-03-22 19:52 joerg Note Added: 0003098
2016-03-23 09:29 geoffclare Note Added: 0003099
2016-03-23 09:30 geoffclare Relationship added related to 0000583
2016-03-23 10:55 joerg Note Edited: 0003098
2016-03-23 10:57 joerg Note Added: 0003100
2016-03-23 11:13 joerg Note Edited: 0003098
2016-03-23 11:15 joerg Note Edited: 0003098
2016-03-23 11:30 geoffclare Note Added: 0003101
2016-03-24 00:04 kre Note Added: 0003103
2016-03-24 00:18 kre Note Added: 0003104
2016-03-24 00:28 kre Note Added: 0003105
2016-04-04 19:52 chet_ramey Note Added: 0003124
2016-04-04 21:48 jilles Note Added: 0003125
2016-04-05 12:20 joerg Note Added: 0003126
2016-04-05 13:58 kre Note Added: 0003127
2016-04-07 00:46 kre Note Added: 0003134
2016-04-07 01:05 kre Note Added: 0003135
2016-04-07 05:42 Don Cragun Page Number unknown => 2335-2336
2016-04-07 05:42 Don Cragun Line Number unknown => 74235-74256
2016-04-07 05:42 Don Cragun Interp Status => ---
2017-02-25 14:12 stephane Note Added: 0003572
2017-03-02 16:18 nick Relationship added related to 0001037
2017-03-03 15:57 kre Note Added: 0003592
2017-03-23 16:11 nick Relationship added related to 0001043
2017-05-11 13:46 kre Note Added: 0003690
2017-05-11 14:01 joerg Note Added: 0003691
2017-05-11 14:05 joerg Note Edited: 0003691
2017-06-10 00:58 kre Note Added: 0003756
2021-09-10 08:47 geoffclare Relationship added related to 0001521
2021-12-17 15:01 geoffclare Note Added: 0005561
2021-12-17 15:03 geoffclare Note Edited: 0005561
2021-12-17 16:32 shware_systems Note Added: 0005564
2021-12-17 16:33 shware_systems Note Added: 0005565
2021-12-17 16:33 shware_systems Note Deleted: 0005565
2021-12-18 13:10 kre Note Added: 0005568
2021-12-20 10:23 geoffclare Note Added: 0005569
2021-12-20 12:30 kre Note Added: 0005570
2021-12-20 15:09 geoffclare Note Added: 0005571
2022-01-06 17:21 geoffclare Final Accepted Text => Note: 0005561
2022-01-06 17:21 geoffclare Status New => Resolved
2022-01-06 17:21 geoffclare Resolution Open => Accepted As Marked
2022-01-06 17:21 geoffclare Tag Attached: tc3-2008
2022-01-06 17:24 Don Cragun Relationship replaced has duplicate 0001043
2022-01-06 17:26 Don Cragun Relationship replaced has duplicate 0001521
2022-02-04 09:56 geoffclare Status Resolved => Applied
2024-06-11 08:57 agadmin Status Applied => Closed

