0001084: rule 3, 4, 5 do not say that a token is started, if needed

ID	Project	Category	View Status	Date Submitted	Last Update

0001084	1003.1(2016/18)/Issue7+TC2	Shell and Utilities	public	2016-10-12 08:56	2018-05-10 15:52

Reporter	Mark_Galeck	Assigned To
Priority	normal	Severity	Editorial	Type	Error
Status	Resolved	Resolution	Duplicate

Name	Mark Galeck
Organization
User Reference
Section	2.3 Token Recognition
Page Number	2347-2348
Line Number	74761-74780
Interp Status	---
Final Accepted Text


Summary	0001084: rule 3, 4, 5 do not say that a token is started, if needed
Description	It says in line 74749 "break its input into tokens by applying the first applicable rule", but if rule 3, 4, or 5 is the first applicable and a token is intended to start at the current character, then that token is never started.
Desired Action	Add to the end of rule 3: "In addition to this rule, continue to the next applicable rule for the current character, to decide if the current character starts a new token or is discarded." Add to the ends of rule 4 and 5: "If there is no current token, then the current character starts a token".
Tags	No tags attached.

shware_systems 2016-10-12 23:29 reporter bugnote:0003408	This is handled by Rule 10, as the rule of last resort.

Mark_Galeck 2016-10-13 01:44 reporter bugnote:0003409	No it's not. That is my whole point. I repeat, line 74749 says "applying the _first_ applicable rule". If rule 3, 4, 5 applies, that is it, there is no "rule of last resort".

shware_systems 2016-10-14 22:16 reporter bugnote:0003416	As I read it, Rules 3, 4, and 5 basically apply once a token is started, which isn't the case in the circumstance cited. So Rule 10, as the first fully applicable rule, establishes that a word token is expected; then Rule 3, 4, or 5 applies with the same current char, with the token being started as possibly a word.
They may force a previous token to terminate but do not force the start of a token, because a sequence like "" amounts to empty text that will subsequently be discarded anyway during quote removal as not materially part of the token, and an implementation employing limited look-ahead can safely pre-evaluate and skip over sequences like that as an optimization. On some platforms such optimizations aren't practical, for architectural reasons.
Granted, as stated this isn't immediately intuitive, but it has to cover recursive evaluation of tokens inside tokens too, and potential side effects of alias processing. A precondition, charclasses, postcondition matrix would have a lot more rows than the rules, however, especially if the additional rules incorporated by reference to 2.6 are added for the various substitution types.
Rule 10 is the last resort because the other rules handle, as higher priority evaluations that may terminate preceding tokens as side effects, whether the char starts an operator token, is text that should be ignored without starting a token, or is an interruption of the recognition context of a word token that already has non-discardable parts. Rule 1 is first, I believe, because it has the additional side effect of completing the top level productions of the grammar, as highest priority. In other words, if the char can't be one of those, then by default it must be the potential start of a word token.
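
The quote-removal point in the note above can be checked with a small shell sketch. It assumes a POSIX-style `sh`; the variable names are illustrative and do not come from the report. It shows that empty quotes next to other text leave no trace in the resulting word, while empty quotes standing alone still yield one (empty) field.

```shell
#!/bin/sh
# Empty quotes adjacent to other text vanish during quote removal.
set -- ""foo
joined_count=$#
joined_value=$1

# Empty quotes on their own still produce one (empty) field.
set -- ""
alone_count=$#
alone_value=$1

printf '%s\n' "$joined_count:$joined_value" "$alone_count:$alone_value"
```

The second case suggests that a tokenizer cannot always skip "" without starting a token, since the empty word is still passed on as an argument.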

Mark_Galeck 2016-10-15 01:31 reporter bugnote:0003417	Before I reply, I want to clarify something. It is not my intention to argue on any of my reports. I will accept any reply and resolution, even if I disagree personally, so long as:
1. The reply answers questions that were left unanswered in the current standard, as explained in the report. This includes pointing to a place in the standard, if I missed it.
2. The reply correctly states the facts that are in the standard and does not contradict the standard.
So, you don't have to spend your valuable time arguing your points with me. I will take them as granted so long as they satisfy 1 and 2 above. All you have to do is satisfy 1 and 2 and that is it.
This reply IMHO does not satisfy condition 2. I will explain why:
> Rule 3, 4, 5 basically apply once a token is started
No, that is impossible; the whole thing would not work if that were the case.
Rule 3 must apply in a case like
    >foobar
When the current character is "f", Rule 3 must apply. If not, and Rule 10 applied as you say, then the previous token would not be delimited, because Rule 10 does not say to delimit it.
Rule 4 must apply in a case like
    'foo'bar'baz'
If it did not and only Rule 10 applied, then Rule 10 would start the word, and after that we have a current token, so when the second ' is seen, Rule 4 would apply and according to that rule the text 'bar' would be quoted, which is false.
Rule 5 must apply in a case like
    $$#
If Rule 10 applied at the first $, then Rule 5 would apply at the second $, identifying '$#' as a parameter expansion. Elsewhere in the standard it is explained that this would yield the string '0', and then we would have the simple command '$0'. But that is not the intent of the standard and that is not how existing shells behave: they yield a simple command such as '7689#'.
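
The three inputs cited in the note above can be tried directly. This is a minimal sketch, assuming a POSIX-style `sh` and a `mktemp` utility that supports `-d`; the variable names and the scratch directory are illustrative only. It records what a shell produces for each input, without taking a position on which rule the standard intends to apply.

```shell
#!/bin/sh
# Scratch directory so the redirection target "foobar" does not clobber anything.
tmpdir=$(mktemp -d) || exit 1

# Case 1: the 'f' right after '>' ends the operator and begins the word "foobar".
redirect_result=$( cd "$tmpdir" && echo hello >foobar && cat foobar )

# Case 2: quoted and unquoted pieces are joined into a single word.
quote_result=$(printf '%s\n' 'foo'bar'baz')

# Case 3: "$$" expands to the shell's process ID and '#' stays literal.
pid_result=$(printf '%s\n' $$#)

printf '%s\n' "$redirect_result" "$quote_result" "$pid_result"
rm -rf "$tmpdir"
```

The last line printed is a process ID followed by `#`, which matches the '7689#' form described in the note.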

shware_systems 2016-10-25 13:04 reporter bugnote:0003456	I am explaining how the standard can be interpreted so that what's there does match existing behaviors, and not arguing, per your point 1, that something is questionable just looking at the normative text. I'm more right than not here, whether that's easy to believe or not, as someone that has been involved in the more recent changes to that text. I leave it as my opinion because it's easy enough to overlook nuances intended by the original authors of those sections, some of whom (if not all) are still involved with the list, so I do not speak for them.
Note the 'same current char' and 'may terminate preceding tokens' clauses I used. The basic loop isn't:
    while (curchar++ != EOL) { apply a single rule };
it's more like:
    if (not empty input) do { apply rules, maybe recursively, and at EOL apply the grammar to see if maybe io_here bodies need to be processed, and then possibly reapply rules due to detected alias expansions, again potentially with recursion } until (curchar == EOI);
It is the individual rules that say when curchar++ may be executed, possibly as part of a sub-loop specific to that rule or to do look-ahead checking, such as for on-the-fly line joining when curchar == '\'. The standard also leaves open that curchar may be referencing an input buffer where line joining has been preprocessed as much as practical.
Rule 10 does not say to use the current char as the first character of a new token and access a new char as the current char, just to use it as the first char. Rule 3 does apply to terminate, per above, the '>' token of '>foobar', but it does not necessarily start a new token. Rule 3 applies for '>#foobar' also, as terminating '>', but Rule 9 is what determines how the rest of that line is classified, with '#' as the current char, as beginning a comment.
The delimiting newline Rule 9 says to look for and move past can also be overridden by Rule 1 if the comment is on the last line of a file that isn't terminated by an EOL. It also applies if the last character is a NUL, if the source is a C string as a sh -c argument or system() interface call, or a Ctrl-Z if that's the interactive EOF control character, as variants that are all symbolically EOI.
For 'foo'bar'baz' the first ' gets classified by Rule 10, and then by Rule 4 as the beginning of a single quoted string. A new current char, the 'f', is then accessed according to that rule. For the last case, Rule 10 starts the token, $$ is a valid special parameter (the shell's numeric PID after evaluation) by Rule 5 and 2.5.2, and '#' by Rule 10 followed by Rule 9 again begins a comment.
No, it's not straightforward, but it is essentially correct as is in describing how various implementations process most scripts.
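
The comment cases discussed in the note above can be observed with a short sketch. It assumes an `sh` on the PATH that accepts `-c`; the variable names are illustrative. It compares a '#' in the middle of a word with a '#' where a new token could start, and a comment that is ended by the end of a `sh -c` string instead of a newline.

```shell
#!/bin/sh
# '#' in the middle of a word is kept as part of that word.
mid_word=$(sh -c 'echo a#b')

# '#' where a new token could start begins a comment, so '#b' is discarded.
word_start=$(sh -c 'echo a #b')

# The -c string has no trailing newline; the comment simply ends at end of input.
no_newline=$(sh -c 'echo one # comment with no trailing newline')

printf '%s\n' "$mid_word" "$word_start" "$no_newline"
```

In commonly used shells this prints `a#b`, `a`, and `one`. The first result is the same behavior as the `$$#` case from the earlier note, where the '#' stays attached to the word.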

geoffclare 2018-05-10 15:52 manager bugnote:0004028	This is being resolved as a duplicate of 0001085 because we believe the resolution of 1085 addresses this problem as well.

Date Modified	Username	Field	Change
2016-10-12 08:56	Mark_Galeck	New Issue
2016-10-12 08:56	Mark_Galeck	Name	=> Mark Galeck
2016-10-12 08:56	Mark_Galeck	Section	=> 2.3 Token Recognition
2016-10-12 08:56	Mark_Galeck	Page Number	=> 2347-2348
2016-10-12 08:56	Mark_Galeck	Line Number	=> 74761-74780
2016-10-12 23:29	shware_systems	Note Added: 0003408
2016-10-13 01:44	Mark_Galeck	Note Added: 0003409
2016-10-14 22:16	shware_systems	Note Added: 0003416
2016-10-15 01:31	Mark_Galeck	Note Added: 0003417
2016-10-25 13:04	shware_systems	Note Added: 0003456
2016-10-28 08:20	geoffclare	Relationship added	related to 0001100
2018-05-10 15:52	geoffclare	Interp Status	=> ---
2018-05-10 15:52	geoffclare	Note Added: 0004028
2018-05-10 15:52	geoffclare	Status	New => Resolved
2018-05-10 15:52	geoffclare	Resolution	Open => Duplicate
2018-05-10 15:52	geoffclare	Relationship added	duplicate of 0001085

Relationships

duplicate of	0001085	Closed		"token shall be from the current position in the input" is incorrect
related to	0001100	Closed		Rewrite of Section 2.10 Shell Grammar, of the Shell Standard, to fix previous reports, fix new issues, and improve presentation.