ID | Category | Severity | Type | Date Submitted | Last Update | ||
0001649 | [Issue 8 drafts] Shell and Utilities | Objection | Error | 2023-03-31 01:55 | 2024-06-11 09:12 | ||
Reporter | kre | View Status | public | ||||
Assigned To | |||||||
Priority | normal | Resolution | Accepted As Marked | ||||
Status | Closed | Product Version | Draft 3 | ||||
Name | Robert Elz | ||||||
Organization | |||||||
User Reference | |||||||
Section | XCU 2.6.5 | ||||||
Page Number | 2476 | ||||||
Line Number | 80478 - 80504 | ||||||
Final Accepted Text | See Note: 0006488. | ||||||
Summary | 0001649: Field splitting is woefully under specified, and in places, simply wrong | ||||||
Description |
I didn't really believe this when it was pointed out on a mailing list, but nowhere in XCU 2.6.5 (Field Splitting) does it say what happens when the expansion being split is not empty but contains no IFS characters.

Further, or perhaps as a more general case of the above, it also doesn't say what happens to any characters that follow the last IFS character in the expansion being split - nothing gets delimited in that case, as there is no delimiter to accomplish that.

Note that in XCU 2.6.5 bullet point 1, the example contains a trailing <space> which the text says "shall be ignored" - but is still present in the example, and so could be read as delimiting the "bar" field (before being ignored). The text there is ambiguous, as it doesn't specify the ordering of the "shall be ignored" wrt the "shall delimit a field". We could delimit the field first, and then ignore the trailing IFS white space.

It also doesn't explicitly say what happens to the results - it is clear that multiple fields can result, and that zero fields can result, but it doesn't say how this applies to a case like prefix${VAR}suffix in the various cases.

It turns out there are a number of other errors, or differences from the way that shells actually behave, in the text as well.

Lastly, as best I can tell, the operation "delimited" isn't actually defined anywhere. It is used in this section (also in XCU 2.3 (Token Recognition) and perhaps other places) but I cannot locate a definition. One might take the second paragraph of XCU 2.6.5 (lines 80481-80485) as a kind of definition, at least for the purposes of this section, but it isn't really explicit that it intends to be that.

I don't think there is any disagreement about what should happen in any of these cases (except perhaps one), which is perhaps why the text which says what that is, is missing. We all simply know what happens, and assume that.

There's another issue - shells do not agree as to what constitutes IFS white space. 
XBD 3.412 defines "white space" (in the POSIX locale) to include carriage return, vertical tab, and form feed, and the standard (XBD 2.9.5 bullet point 3) says "any of the white space characters in the IFS value". bash, yash and ksh93 allow those "extra" three white space characters to be IFS white space - other shells (including mksh and bosh) do not; they consider only space/tab/newline as candidates for being IFS white space. I would presume (but have no easy way to test) that if a locale defined some other characters as being space characters (as in "isspace() returns true"), then bash, yash, and ksh93 would treat those chars as white space as well. Something needs to be done to handle that difference (not that I have ever seen any real code using any of \r \v or \f as an IFS character).

Lastly, while I am here, one other (not directly related) issue that should be cleared up ...

"The shell shall treat a byte sequence forming any of the characters in the IFS value" doesn't say what character encoding is intended to be used for this purpose. Clearly it is set by LC_CTYPE (I hope) - but which version of that env var? The value that LC_CTYPE had when IFS was assigned a value (which would mean LC_CTYPE=C (or POSIX, or unset), I assume, for the default $' \t\n' for IFS)? Or are we supposed to use the value of LC_CTYPE at the time IFS is being used, which would mean that changing LC_CTYPE might have the side effect of altering the meanings of the field separators (terminators) to be used by field splitting?

[Aside: in the previous paragraph, I use "LC_CTYPE" as a shorthand to refer to the locale's character encoding settings, however that is communicated to the shell - via LANG LC_CTYPE LC_ALL or some other way - this issue isn't about locale definitions/uses so none of those differences are relevant here.]

To help make sure that the rules that we end up specifying match what is actually implemented, I wrote a little test script. 
It and its (normal) output are reproduced below.

All (Bourne compatible) shells, with two exceptions, produce identical output. One of the exceptions is the ancient implementation of pdksh which masquerades as /bin/ksh on NetBSD - that thing is full of bugs, and the (wrong) output here is just one of many examples. That one can simply be ignored. The other is mksh, which produces different results for the SCSCS and S5 tests (its output will be shown below).

Note here that "all shells" means those I have to test (which excludes ksh88, and the original Bourne Shell and its very close relatives - but includes bosh) but does include zsh (as long as it is run in "--emulate sh" mode - in its default mode it is "different").

========================================== the test script (also attached)

    argc() {
        printf '%s:\t' "$1"; shift
        printf "%2d args:" "$#"
        printf " <%s>" "$@"
        printf '\n'
    }

    # you will probably need to edit the following line, it will not survive mantis
    IFS=' ,'    # IFS=$' \t,' except not all shells have $'' yet

    SPACE=' '
    FOO=foo
    TWO='one two'
    TWOS=' one two '
    C=','
    C2=',,'
    CSC=', ,'
    SCSCS=' , , '
    S1='one,two'
    S2='one , two'
    S3=',one,two'
    S4='one,two,'
    S5=' ,one ,two, '

    argc foo    foo${FOO}foo
    argc sep    foo${SPACE}foo
    argc two    foo${TWO}foo
    argc twos   foo${TWOS}foo
    argc 2two   foo${TWO}${TWO}foo
    argc 2twos  foo${TWOS}${TWOS}foo
    argc comma  foo${C}foo
    argc C2     foo${C2}foo
    argc CSC    foo${CSC}foo
    argc SCSCS  foo${SCSCS}foo
    argc S1     foo${S1}foo
    argc S2     foo${S2}foo
    argc S3     foo${S3}foo
    argc S4     foo${S4}foo
    argc S5     foo${S5}foo

========================================== the test script ends, results follow

    foo:     1 args: <foofoofoo>
    sep:     2 args: <foo> <foo>
    two:     2 args: <fooone> <twofoo>
    twos:    4 args: <foo> <one> <two> <foo>
    2two:    3 args: <fooone> <twoone> <twofoo>
    2twos:   6 args: <foo> <one> <two> <one> <two> <foo>
    comma:   2 args: <foo> <foo>
    C2:      3 args: <foo> <> <foo>
    CSC:     3 args: <foo> <> <foo>
    SCSCS:   3 args: <foo> <> <foo>
    S1:      2 args: <fooone> <twofoo>
    S2:      2 args: <fooone> <twofoo>
    S3:      3 args: <foo> <one> <twofoo>
    S4:      3 args: <fooone> <two> <foo>
    S5:      4 args: <foo> <one> <two> <foo>

========================================== results end

For SCSCS mksh produces:

    SCSCS:   4 args: <foo> <> <> <foo>

For S5 mksh produces:

    S5:      5 args: <foo> <> <one> <two> <foo>

That's using version MIRBSD KSH R59 2020/05/16 (aka R59b). I know that R59c exists, but it hasn't been upgraded in NetBSD's pkgsrc yet... Further, the changelog doesn't indicate any changes in this area, either in R59c or in the so far unreleased version as it is now (or not that I noticed).

These differences are both examples of the same omission from the standard, I believe:

    Each occurrence in the input of a byte sequence that forms an IFS character that is not IFS white space, along with any adjacent IFS white space, shall delimit a field, as described previously.

In the case in S5, the input starts " ," where the " " is IFS white space, and the ',' is an IFS character that is not IFS white space, so that should delimit a field. That's what mksh does. But it isn't what anything else does, and isn't (I believe) the intended behaviour.

The "as described previously" (which would be clearer if it said precisely where it was previously described) appears to refer to:

    If no fields are delimited, for example if the input is empty or consists entirely of IFS white space, the result shall be zero fields (rather than an empty field).

which is, I think, intending to say that if nothing has accumulated when a field is delimited, and no field has been delimited previously, then nothing results, rather than an empty field. But it doesn't say that exactly. If that isn't the "as described previously" then I have no idea what is.

The results show other discrepancies (in all shells) from what the standard seems to require. Eg: the "sep" test contains an expansion that is entirely IFS white space, that IFS white space is at the beginning (and end) of the input, so according to 2.6.5 3.a that IFS white space should be ignored. 
To me "ignored" should mean "treated as if it is missing". If that were true the result of the "sep" test would be a single field "foofoo" (just as it is "foofoofoo" in the foo test where the input contains no IFS characters at all). But it isn't, it is two fields - the expansion contributes no data to the result, but it does serve to separate the prefix and suffix in the prefix${VAR}suffix case - the "twos" "S3" and "S4" tests show that any leading/trailing white space in the input serve to separate any field produced by the expansion being field split from the prefix or suffix (as applicable). This whole section needs (yet another) complete rewrite. When doing that bullet point 1 should simply be dropped, it says nothing that bullet point 3 doesn't also say (except the example, which ought to be expanded to more than one example, after the normative text). |
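The contrast between the "foo" and "sep" tests can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX-style sh; `count_fields` is a hypothetical helper introduced only for this illustration and is not part of the attached test script.

```shell
# count_fields (hypothetical helper) prints how many arguments it received.
count_fields() { printf '%d\n' "$#"; }

IFS=' ,'
SPACE=' '
FOO=foo

# The expansion contains no IFS characters: prefix, value and suffix
# stay joined in a single field.
count_fields foo${FOO}foo      # prints 1

# The expansion is entirely IFS white space: it contributes no data,
# but it still separates the prefix from the suffix.
count_fields foo${SPACE}foo    # prints 2
```

If the white space were "ignored" in the sense of "treated as if it is missing", the second call would print 1 as well.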
||||||
Desired Action |
I am working on new wording, which I will append as a note when I have something suitable. In the meantime, everyone else, feel free to make suggestions. This is all a mess - and it should be simple. |
||||||
Tags | applied_after_i8d3, issue8 | ||||||
Attached Files |
ifs (675 bytes) 2023-03-31 01:55
IFS-test (2,701 bytes) 2023-09-07 15:06
POSIX-bug-1649-impl.sh (7,776 bytes) 2023-09-07 15:07
Expected-Results (2,209 bytes) 2023-09-07 15:09
Revised-bug-1649-suggestion (6,506 bytes) 2023-09-11 03:59 |
Notes | |
(0006412) Don Cragun (manager) 2023-07-31 16:13 |
We looked at this issue again during the 2023-07-31 conference call. We are still waiting for a note with suggested changes from kre. |
(0006459) kre (reporter) 2023-09-07 14:14 |
I am about to attach a series of notes - making them separate rather than one huge combined note might make it easier to follow. When I am done, I will append one more note, which will just be an index to the previous notes, so you can more easily jump to whatever might be of interest. The following note will be my proposed replacement for section 2.6.5. As foreshadowed in the Description above "This whole section needs (yet another) complete rewrite." that is exactly what I have done. As always with my text, the words in the next note are not intended to remain untouched - by all means remove the redundant noise I tend to include, make it more "standards like" (correct terminology, formatting, etc), and whatever else. But I believe that what is here is the simplest most concise description I can give of how all this really operates (or should). As always of course, all that is required is an "as if" behaviour, the shell I maintain certainly doesn't do things in the slow baroque way described here, but the results are the same. Notes following will show my test program (rather bigger than the earlier one and easier to follow, as the text names show exactly what the input was, see the relevant note) and some discussion of what doesn't work, and as best I can tell why, in the one (two if one counts zsh - I am not counting at all, or even testing any more, the ancient broken PD ksh that is NetBSD's /bin/ksh) shell which doesn't produce the same results. Some of that might be argued to be following an interpretation of the standard as it has been, but other differences seem to simply be broken. |
(0006460) kre (reporter) 2023-09-07 14:15 |
2.6.5 Field Splitting After parameter expansion (Section 2.6.2), command substitution (Section 2.6.3), and arithmetic expansion (Section 2.6.4), and if the shell variable IFS is set, and its value is not empty, or if IFS is unset, the shell shall scan each field containing results of expansions and substitutions that did not occur in double-quotes for field splitting and multiple fields can result. If IFS is set with an empty string as its value, no field splitting occurs, however if an input field which contained expansions or substitutions is entirely empty, it shall be removed. Note that this occurs before quote removal, any input field that contains any quoting characters can never be empty at this point. Fields which contain no unquoted results from any of the expansions or substitutions shall not be affected by field splitting, and shall remain unaltered. In the following description, it is assumed that there is in the field at least one unquoted expansion or substitution result present, this assumption will not be restated. Field splitting only considers altering those parts of the field. For the purposes of this section, the term "IFS white space" shall mean any of the characters <space> <tab> or <newline> which are present in the value of the IFS variable. It is implementation defined whether other White Space characters which appear in the value of IFS are also considered as "IFS white space". If the IFS variable is unset, then for the purposes of this section, but without altering the value of the variable, its value shall be considered to contain the three characters <space> <tab> and <newline>, all of which are IFS white space characters. 
The three characters above defined as IFS white space characters are always IFS white space when they occur in the value of IFS. For other locale specific White Space characters allowed by the implementation, it is unspecified whether the character is considered as IFS white space if it is White Space at the time it is assigned to the IFS variable, or if it is White Space at the time field splitting occurs (the locale may have altered between those events).

The shell shall treat non-empty sequences formed from the characters in the IFS value, in any order, provided that no sequence contains more than one character that is not IFS white space, as delimiters, and use the delimiters as field terminators to split the results of parameter expansion, command substitution, and arithmetic expansion into separate fields, as described below. Note that these delimiters terminate a field; they do not, of themselves, cause a new one to start. Subsequent non IFS white space characters in the field are required for a new field to begin.

If results of the algorithm are that no fields are delimited, that is, if the input is wholly empty or consists entirely of IFS white space, the result shall be zero fields (rather than an empty field).

For the purposes of this section, when a field is said to be delimited, then the characters from the input from after the delimiter which terminated the previous field, or from the beginning of the input if this is the first, up to the point immediately preceding the delimiter, shall be considered as a candidate for forming an output field. If that candidate is not empty, or if both the previous output field (if any), and the current field being delimited were delimited by a delimiter that includes a character which is not IFS white space, then the candidate shall become an output field. 
If the current candidate output field is empty, and either it, or the previous output field were delimited by only IFS white space, then the empty candidate shall be removed, and shall not form an output field.

Each expansion, or substitution shall be processed in order as follows, examining characters in the input field, left to right, and beginning with an empty candidate field:

While the input is not empty...

When instructed to perform the next iteration, start again here, with the input as modified.

Consider the first remaining character of the input. If it is:

a. A character that did not result from an unquoted expansion or substitution:
   Append this character to the candidate, and remove it from the input. Next iteration of the loop.

b. A character in the input that is not a character in IFS:
   Append this character to the candidate, and remove it from the input. Next iteration.

c. An IFS white space character:
   Remove that character from the input, advance to the next input character, and repeat this step.

d. Another IFS character, not IFS white space:
   Remove that character from the input, but note it was observed. Remove any immediately following IFS white space characters from the input.

At this point, if the candidate is not empty, or if a non IFS white space character was seen at step d, then the candidate becomes an output field. In either case, empty the candidate, and perform the next iteration.

When the input is empty, if the candidate is not empty, it becomes an output field.

The set of output fields so produced, which may be none, replaces the input field. |
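A few inputs can be checked against the rules proposed in the note above by letting the shell do the splitting. This is a minimal sketch, assuming a POSIX-style sh that behaves like the majority of shells described in this report; `show` is a hypothetical helper (not the report's args() function), and the variable names follow the W/S/C naming used in the later notes.

```shell
# show (hypothetical helper) prints the field count, then each field in
# square brackets; nothing is printed after the colon for zero fields.
show() {
    printf '%d:' "$#"
    test "$#" -gt 0 && printf '[%s]' "$@"
    printf '\n'
}

IFS=' ,'            # one IFS white space character, one other IFS character
WCCW='abc,,def'     # two adjacent non white space delimiters
WSCSW='abc , def'   # one delimiter: a comma with adjacent IFS white space
CW=',abc'           # a leading comma delimits an empty first field
S=' '               # only IFS white space

show $WCCW     # 3:[abc][][def]
show $WSCSW    # 2:[abc][def]
show $CW       # 2:[][abc]
show $S        # 0:
```

The second case illustrates the "no more than one character that is not IFS white space" rule: the white space on either side of the comma belongs to the same delimiter, so no empty field appears.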
(0006462) kre (reporter) 2023-09-07 14:30 edited on: 2023-09-07 14:43 |
This is an extract from my test program for the shells, I won't include all of it here, as most of the end is simply lots and lots of weird test cases. In all of them, the test case names tell you what the input was, each character of the name represents something in the input. 'p' is a prefix - characters that are part of the field, that are not part of an unquoted expansion or substitution. 'q' is the suffix, the same as 'p' but following the interesting bit. There is always an interesting bit, fields that contain nothing to split at all are boring, so we have none of those. Because no field splitting happens in the prefix or suffix, in the tests those contain no chars that could be confused with IFS characters. This simplifies things for my sh strawman implementation of the algorithm in the previous note (which will appear below). 'W' is a word - a sequence of chars which contain no IFS chars. In the tests the first word is always "abc" and the second "def" (no tests currently have more than two, but you can guess what the third would be...) S and C are the IFS chars. In these tests IFS is always ' ,' (space and comma). One IFS white space character, and one other. This simplifies the testing, without limiting the generality of the algorithm (my strawman implementation only handles those, a real implementation would simply treat any IFS white space the way I treat space, and any other IFS character the way I treat ','). That's it... The order of the characters in the test name shows the order in which those elements appear in the input field, to be split. In this test we use the shell running the test to perform the field splitting. The NetBSD and FreeBSD shells, bash, bosh, dash, ksh93, yash, all produce identical results for all the tests. zsh seems to think a terminating non-whitespace IFS character starts a new field (in --emulate sh mode, otherwise the results are nothing like this at all). mksh will be discussed in a later note. 
Here is the test code, and some of the initial tests. I will attempt to upload the entire thing as an attached file, later.

    args()
    {
        name=$1; shift
        printf '%s:\t%d:\t' "$name" "$#"
        printf '[%s]' "$@"
        printf '\n'
    }

    IFS=' ,'

    S=' '
    C=','
    W='abc'
    SW=' abc'
    WS='abc '
    # many more omitted here

    # The tests:

    args W $W
    args SW $SW
    args WS $WS
    args SWS $SWS
    args CW $CW
    args WC $WC

[and many more]. The next note will show the complete set of results. |
(0006464) kre (reporter) 2023-09-07 14:45 |
W: 1: [abc]
SW: 1: [abc]
WS: 1: [abc]
SWS: 1: [abc]
CW: 2: [][abc]
WC: 1: [abc]
CWC: 2: [][abc]
WSW: 2: [abc][def]
WSSW: 2: [abd][def]
WCW: 2: [abc][def]
WCCW: 3: [abc][][def]
WSCW: 2: [abc][def]
WCSW: 2: [abc][def]
WSCSW: 2: [abc][def]
WSCSCSW: 3: [abc][][def]
WSCSCSWS: 3: [abc][][def]
WSCSCSWC: 3: [abc][][def]
S: 0: []
C: 1: []
SS: 0: []
SSS: 0: []
CC: 2: [][]
CCC: 3: [][][]
SC: 1: []
CS: 1: []
SCCS: 2: [][]
CSSC: 2: [][]
SCWSCWCS: 3: [][abc][def]
SSCSSCSSCSS: 3: [][][]
pW: 1: [pabc]
pSW: 2: [p][abc]
pWS: 1: [pabc]
pSWS: 2: [p][abc]
pCW: 2: [p][abc]
pWC: 1: [pabc]
pCWC: 2: [p][abc]
pWSW: 2: [pabc][def]
pWSSW: 2: [pabd][def]
pWCW: 2: [pabc][def]
pWCCW: 3: [pabc][][def]
pWSCW: 2: [pabc][def]
pWCSW: 2: [pabc][def]
pWSCSW: 2: [pabc][def]
pWSCSCSW: 3: [pabc][][def]
pWSCSCSWS: 3: [pabc][][def]
pWSCSCSWC: 3: [pabc][][def]
pS: 1: [p]
pC: 1: [p]
pSS: 1: [p]
pSSS: 1: [p]
pSC: 1: [p]
pCS: 1: [p]
pCSSC: 2: [p][]
pSSS: 1: [p]
pCCC: 3: [p][][]
pSCCS: 2: [p][]
pSCWSCWCS: 3: [p][abc][def]
pSSCSSCSSCSS: 3: [p][][]
Wq: 1: [abcq]
SWq: 1: [abcq]
WSq: 2: [abc][q]
SWSq: 2: [abc][q]
CWq: 2: [][abcq]
WCq: 2: [abc][q]
CWCq: 3: [][abc][q]
WSWq: 2: [abc][defq]
WSSWq: 2: [abd][defq]
WCWq: 2: [abc][defq]
WCCWq: 3: [abc][][defq]
WSCWq: 2: [abc][defq]
WCSWq: 2: [abc][defq]
WSCSWq: 2: [abc][defq]
WSCSCSWq: 3: [abc][][defq]
WSCSCSWSq: 4: [abc][][def][q]
WSCSCSWCq: 4: [abc][][def][q]
Sq: 1: [q]
Cq: 2: [][q]
SSq: 1: [q]
SSSq: 1: [q]
SCq: 2: [][q]
CSq: 2: [][q]
CSSCq: 3: [][][q]
SSSq: 1: [q]
CCCq: 4: [][][][q]
SCCSq: 3: [][][q]
SCWSCWCSq: 4: [][abc][def][q]
SSCSSCSSCSSq: 4: [][][][q]
pWq: 1: [pabcq]
pSWq: 2: [p][abcq]
pWSq: 2: [pabc][q]
pSWSq: 3: [p][abc][q]
pCWq: 2: [p][abcq]
pWCq: 2: [pabc][q]
pCWCq: 3: [p][abc][q]
pWSWq: 2: [pabc][defq]
pWSSWq: 2: [pabd][defq]
pWCWq: 2: [pabc][defq]
pWCCWq: 3: [pabc][][defq]
pWSCWq: 2: [pabc][defq]
pWCSWq: 2: [pabc][defq]
pWSCSWq: 2: [pabc][defq]
pWSCSCSWq: 3: [pabc][][defq]
pWSCSCSWSq: 4: [pabc][][def][q]
pWSCSCSWCq: 4: [pabc][][def][q]
pSq: 2: [p][q]
pCq: 2: [p][q]
pSSq: 2: [p][q]
pSSSq: 2: [p][q]
pSCq: 2: [p][q]
pCSq: 2: [p][q]
pCSSCq: 3: [p][][q]
pSSSq: 2: [p][q]
pCCCq: 4: [p][][][q]
pSCCSq: 3: [p][][q]
pSCWSCWCSq: 4: [p][abc][def][q]
pSSCSSCSSCSSq: 4: [p][][][q] |
(0006465) kre (reporter) 2023-09-07 14:54 |
Apologies for the mess with the original version of the results. Those reading this via the mailing list will note that I was using angle brackets as the field delimiter characters, to show what is in each field, and totally forgot that mantis would interpret those, so I have just done a quick switch to use square brackets, which works for the note, but is much uglier to look at.

Anyway, here is the (now using []) shell strawman implementation of the algorithm in Note: 0006460. Again, this is truncated, most of the actual test cases are omitted, though you can deduce what they are from the results in Note: 0006464.

This test does (or did, before I fiddled just now with the "args()" function which prints the results) produce identical results to the version run by the shell, in Note: 0006464 - so I am not going to include those again.

I have run this with every reasonable shell I have (not pdksh, and not zsh, as I don't really understand its differences). All of them (mksh included) produce the same results here, so I believe the code is portable enough.

    # This is a dummy implementation of the proposed field splitting
    # algorithm (written in sh, so hopefully sh people can follow it)
    # to demonstrate that the algorithm as presented generates the
    # expected output (that generated by almost every shell).

    # This code knows that in the tests IFS=' ,' (space and comma)
    # and rather than handling that generically, which would be possible,
    # but messy, simply builds those two characters (literally) into the
    # implementation (space, as an IFS white space char, and comma as an
    # IFS char that is not white space).

    # Similarly the code "knows" that if there is a prefix in the field
    # (chars not to be treated as generated by an expansion, and hence
    # exempt from splitting) that will be simply a single 'p' always, and
    # similarly a suffix will be 'q' - because of that we do not need to
    # have any method to indicate what part of the field is to be subject
    # to field splitting

    # In the following comments that start '##' are text lifted directly
    # from my proposed section 2.6.5 ("Field Splitting") text, which might
    # allow readers to match this algorithm with what is described there.

    # The results from this test match exactly the results from all shells
    # considered to operate correctly (the same output routine is used, and
    # the results compared with diff - with zero differences).

    S=' '
    C=','

    field_split()
    {
        ARG=$1      # the field that needs to be split
        set --      # the set of output fields, initially empty

        # IFS is defined (IFS=' ,') and not empty, IFS white space is ' '
        # We simply know that!

        # C is our candidate field,
        # CD indicates the delimiter that terminated the candidate field
        #   ' ' indicates the delimiter was IFS white space alone
        #   ',' indicates the delimiter was a ',' (perhaps with white space)
        #   ''  indicates there has been no delimiter
        C=
        CD=

        ## Each expansion, or substitution shall be processed in order
        ## as follows [...]

        ## While the input is not empty...
        while test -n "${ARG}"
        do
            ## Consider the first remaining character of the input.
            ## If it is:
            ## a. A character that did not result from an unquoted
            ##    expansion or substitution:
            ## b. A character in the input that is not a character in IFS:

            # since we know exactly what the IFS chars are, and that
            # chars that did not result from an expansion (etc) are not
            # IFS chars (our test cases ensure that) we don't need to
            # treat those two differently, just skip forward until we
            # get to an IFS char, or we run out, appending the non-IFS
            # chars to the candidate and removing them from the input.

            # here we only care about the current first char in ${ARG}
            while case "${ARG}" in
                  '')     break 2         # the end of the input, done
                          ;;
                  [\ ,]*) false           # delimiter located, exit loop
                          ;;
                  *)      TAIL=${ARG#?}           # something else
                          C=${C}${ARG%"${TAIL}"}  # appended to candidate
                          ARG=${TAIL}             # removed from input
                          ;;
                  esac
            do
                :
            done

            # Now we are at the start of a delimiter in ARG, and the
            # candidate field is C

            # which kind of delimiter do we have?

            ## c. An IFS white space character:

            # assume the delim will be just IFS white space (case 'c')
            CD=' '
            # and then skip any of that we find (repeating 'c' over & over)
            while case "${ARG}" in ' '*) ARG=${ARG#* };; *) false;; esac
            do :; done

            ## d. Another IFS character, not IFS white space:

            # Next if we have a non white space IFS char,
            # then it is the other kind of delimiter (case 'd' in the algo)
            case "${ARG}" in
            ,*) CD=, ; ARG=${ARG#,}     # Remember we saw it, then remove
                # and skip any following IFS white space
                while case "${ARG}" in ' '*) ARG=${ARG#* };; *) false;; esac
                do :; done
                ;;
            esac

            # now a field has been delimited so we are subject to:

            ## At this point, if the candidate is not empty, or if a
            ## non IFS white space character was seen at step d, then
            ## the candidate becomes an output field.
            ## In either case, empty the candidate, and perform the
            ## next iteration.

            if test -n "${C}"   # candidate is not empty (or...) => output
            then
                ## if the candidate is not empty
                ## then the candidate becomes an output field.
                set -- "$@" "'${C}'"

            # otherwise the candidate is empty, if it was delimited
            # by only IFS white space, then candidate is dropped
            elif test "${CD}" != ' '
            then
                ## or if a non IFS white space character was seen
                ## then the candidate becomes an output field.
                set -- "$@" "''"        # no need for $C, it is ""
            fi

            ## In either case, empty the candidate, and perform
            ## the next iteration.
            CD=
            C=
        done

        ## When the input is empty, if the candidate is not empty, it
        ## becomes an output field.
        if test -n "${C}"
        then
            # not an empty field after last delim, so it is included
            set -- "$@" "'${C}'"
        fi

        # return the split field, as a list of quoted words (to become fields)
        printf %s "$*"
    }

    args()
    {
        name=$1; shift
        printf '%s:\t%d:\t' "$name" "$#"
        printf '[%s]' "$@"
        printf '\n'
    }

    tst()
    {
        N=$1
        eval set -- $(field_split "$2")
        args "$N" "$@"
    }

    W='abc'
    SW=' abc'
    WS='abc '
    SWS=' abc '
    CW=',abc'
    WC='abc,'
    CWC=',abc,'
    WSW='abc def'
    WSSW='abd def'
    # and many more definitions like that
    # followed by the actual test invocations

    tst W "$W"
    tst SW "$SW"
    tst WS "$WS"
    tst SWS "$SWS"
    tst CW "$CW"
    tst WC "$WC"
    tst CWC "$CWC"
    tst WSW "$WSW"
    tst WSSW "$WSSW"
    tst WCW "$WCW"
    tst WCCW "$WCCW"
    tst WSCW "$WSCW"
    tst WCSW "$WCSW"
    tst WSCSW "$WSCSW"
    tst WSCSCSW "$WSCSCSW"
    # and many more. |
(0006466) kre (reporter) 2023-09-07 15:01 |
Here, for the record, are the differences between the intended results (produced by almost all shells) and those of zsh

    diff /tmp/good /tmp/zsh
    6,7c6,7
    < WC: 1: [abc]
    < CWC: 2: [][abc]
    ---
    > WC: 2: [abc][]
    > CWC: 3: [][abc][]
    17c17
    < WSCSCSWC: 3: [abc][][def]
    ---
    > WSCSCSWC: 4: [abc][][def][]
    19c19
    < C: 1: []
    ---
    > C: 2: [][]
    22,29c22,29
    < CC: 2: [][]
    < CCC: 3: [][][]
    < SC: 1: []
    < CS: 1: []
    < SCCS: 2: [][]
    < CSSC: 2: [][]
    < SCWSCWCS: 3: [][abc][def]
    < SSCSSCSSCSS: 3: [][][]
    ---
    > CC: 3: [][][]
    > CCC: 4: [][][][]
    > SC: 2: [][]
    > CS: 2: [][]
    > SCCS: 3: [][][]
    > CSSC: 3: [][][]
    > SCWSCWCS: 4: [][abc][def][]
    > SSCSSCSSCSS: 4: [][][][]
    35,36c35,36
    < pWC: 1: [pabc]
    < pCWC: 2: [p][abc]
    ---
    > pWC: 2: [pabc][]
    > pCWC: 3: [p][abc][]
    46c46
    < pWSCSCSWC: 3: [pabc][][def]
    ---
    > pWSCSCSWC: 4: [pabc][][def][]
    48c48
    < pC: 1: [p]
    ---
    > pC: 2: [p][]
    51,53c51,53
    < pSC: 1: [p]
    < pCS: 1: [p]
    < pCSSC: 2: [p][]
    ---
    > pSC: 2: [p][]
    > pCS: 2: [p][]
    > pCSSC: 3: [p][][]
    55,58c55,58
    < pCCC: 3: [p][][]
    < pSCCS: 2: [p][]
    < pSCWSCWCS: 3: [p][abc][def]
    < pSSCSSCSSCSS: 3: [p][][]
    ---
    > pCCC: 4: [p][][][]
    > pSCCS: 3: [p][][]
    > pSCWSCWCS: 4: [p][abc][def][]
    > pSSCSSCSSCSS: 4: [p][][][]

I believe that all of these differences occur when there is a trailing 'C' in the input, with no other following non IFS-whitespace (but any amount of that surrounding the C). zsh is appending an additional empty field. I believe that is the only difference, but haven't examined all of those in detail. |
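The trailing-delimiter case behind the zsh differences can be probed with a few lines. This is a minimal sketch, assuming a POSIX-style sh from the majority group in this report; `show` is a hypothetical helper, and the expected counts in the comments are the "intended" results listed in Note: 0006464 (zsh in "--emulate sh" mode is reported above to add one more empty field in each case).

```shell
# show (hypothetical helper) prints the field count, then each field in
# square brackets.
show() {
    printf '%d:' "$#"
    test "$#" -gt 0 && printf '[%s]' "$@"
    printf '\n'
}

IFS=' ,'
WC='abc,'     # a word followed by a trailing comma
CWC=',abc,'   # commas on both sides of a word
C=','         # a lone comma

# A delimiter terminates the field before it; it does not start a new one.
show $WC      # 1:[abc]
show $CWC     # 2:[][abc]
show $C       # 1:[]
```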
(0006467) kre (reporter) 2023-09-07 15:03 |
And the differences produced by mksh:

    51c51
    < pSC: 1: [p]
    ---
    > pSC: 2: [p][]
    56,58c56,58
    < pSCCS: 2: [p][]
    < pSCWSCWCS: 3: [p][abc][def]
    < pSSCSSCSSCSS: 3: [p][][]
    ---
    > pSCCS: 3: [p][][]
    > pSCWSCWCS: 4: [p][][abc][def]
    > pSSCSSCSSCSS: 4: [p][][][]
    109c109
    < pSCq: 2: [p][q]
    ---
    > pSCq: 3: [p][][q]
    114,116c114,116
    < pSCCSq: 3: [p][][q]
    < pSCWSCWCSq: 4: [p][abc][def][q]
    < pSSCSSCSSCSSq: 4: [p][][][q]
    ---
    > pSCCSq: 4: [p][][][q]
    > pSCWSCWCSq: 5: [p][][abc][def][q]
    > pSSCSSCSSCSSq: 5: [p][][][][q]

That's a much smaller set, but they're much harder to characterise, so for now, I won't attempt that. |
(0006468) kre (reporter) 2023-09-07 15:15 |
I have just attached 3 new files (to join the original "ifs" which was the test used when this report was originally filed). IFS-test is the test to run with a shell, just $SHELL IFS-test >results That is what is abbreviated in Note: 0006462 but I uploaded the version which retains the angle bracket delimiters, those are much nicer to read the results. POSIX-bug-1649-impl.sh is the complete strawman implementation and test, as in Note: 0006465 - again the angle bracket version. Expected-Results is what you should expect to see from those, just the same as what is in Note: 0006464 but with angle brackets instead of square ones. |
(0006469) kre (reporter) 2023-09-07 15:21 edited on: 2023-09-11 03:39 |
Note: 0006459 - a brief introduction to the proposed solution
Note: 0006460 - a very much draft replacement for XCU 2.6.5
Note: 0006478 - the first revision of the text in Note: 0006460 (a still very much draft replacement for XCU 2.6.5)
Note: 0006462 - a description of the shell test (IFS-test), and its useful code
Note: 0006464 - the expected results (the square bracket version)
Note: 0006465 - an intro to, and the essential code of, a strawman implementation (as sh code) of the algorithm in Note: 0006460
Note: 0006466 - the differences (one difference, repeated) from zsh
Note: 0006467 - the differences in the output from mksh (perhaps someone can explain them)
Note: 0006468 - a note about the attached files that have been uploaded
Note: 0006469 - this one, which is the last of this set. |
(0006471) kre (reporter) 2023-09-07 16:12 |
Oops, I just realised that the args() function I managed to leave in those tests, is missing a test, the middle of the 3 printf statements (the one that contains the problematic brackets, so I won't include all of it) should look like: test $# -gt 0 && printf ... (with the args to printf being the same as they are now). This applies to both test programs (that function was just copied from one to the other). That explains why when the results say 0 fields, it prints one anyway (the 0 is correct), the empty field output is nonsense. Sorry about that! This makes no difference when comparing the results (as the same issue affected everything), but if you're trying to look at an affected result, it can be confusing (it was to me when I saw it first). I know I had that extra "test" in there in some earlier version, I must have lost that somewhere. |
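For anyone reading along without the attachments, here is a minimal sketch of what the corrected function might look like. Only the `test $# -gt 0 &&` guard is taken from the note above; the printf formats and the angle-bracket layout are assumptions, since the real function is not reproduced here.

```sh
# Hypothetical reconstruction of args(): print the number of fields,
# then each field in angle brackets. Only the "test $# -gt 0 &&" guard
# comes from the note; the printf formats are assumed.
args() {
    printf '%d:' "$#"
    # Without the guard, printf would still run its format once with an
    # empty operand, so a bogus "<>" would be shown for zero fields.
    test $# -gt 0 && printf ' <%s>' "$@"
    printf '\n'
}

args              # zero fields: only the count is shown
args a '' 'b c'   # three fields, the second one empty
```

With the guard in place, a zero-field result shows only the count, which is the behaviour the note says was intended.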
(0006472) geoffclare (manager) 2023-09-07 16:40 |
The problem description includes mention of character encoding and LC_CTYPE, but the proposed wording in Note: 0006460 does not address this. Instead it describes everything in terms of characters, effectively reversing the changes made by bug 0001560. Without those changes it is unclear how field splitting works on command substitutions (or pathname expansions) where the expansion results contain byte sequences that do not form characters. Changes along the lines of those in bug 1560 need to be made to the proposed new wording. The character encoding issue could perhaps be partially addressed by extending the requirement that <period>, <slash>, <newline>, and <carriage-return> are invariant across all locales (see XBD 6.1) to include at least <space> and <tab> as well. How the handling of IFS characters with non-invariant encodings should be specified (i.e. "which LC_CTYPE") will depend on what shells currently do. (Please note that after posting this I will be unavailable for three weeks, so will not be able to respond to any follow up comments in a timely manner.) |
(0006473) kre (reporter) 2023-09-07 17:06 edited on: 2023-09-07 17:09 |
Re Note: 0006472 The following section in Note: 0006460 was intended to cover as much as I considered a problem with LC_CTYPE (though it doesn't mention that by name): It is implementation defined whether other White Space characters which appear in the value of IFS are also considered as "IFS white space". (and a little bit more explanation a bit later in the same paragraph). It also explicitly says that space tab and newline (if included in IFS obviously) are always IFS white space characters, regardless of locale. I had no intent to change anything related to whatever bug:1560 addresses, one I didn't know existed (I suppose I saw it once, but have forgotten...) I will take a look at that, but not tonight, and see if I can work out what to do to reconcile the two - but I did note that I knew the wording would need changes, handling that kind of thing is (partly) what I meant. I wanted to (hopefully finally, and forever, if I got it right - and everyone who knows an implementation, please do check to verify that) get the algorithm that is used actually specified correctly, finally. Beyond that was for some other time (or someone else) - but I will take a look, later. |
(0006477) kre (reporter) 2023-09-11 03:35 |
I am about to add a new note (to follow this one) with revised text from that in Note: 0006460 to (hopefully) address some of the concerns in Note: 0006472 The 7th paragraph (not counting the section title as a paragraph) is intended to contain all that I know that is relevant about issues relating to LC_CTYPE, and what happens if it changes during sh execution. Note that I know essentially zero about how to handle multi-byte characters (that is, how such things are intended to be processed by shells) and consequently this might be inadequate. But someone else who does understand (if there is anyone, since this all appears to be black magic to me) will need to provide better or additional wording. I have made revisions to the wording to hopefully make it fall more into line with what bug:1560 was attempting to achieve in this area (though I am not at all convinced that what it changed would have been nearly enough to make a coherent result even with the largely wishy-washy specification of field splitting that we currently have (up to and including Draft 3 of Issue 8)). I also made some corrections to issues that were largely the result of my being in a hurry (and being rather late in my day) to get something posted here before last Thursday's Austin Group meeting, just in case this was to be considered again then. There was text that remained there which was from earlier ideas, which had been half deleted, but not completely, and poor descriptions, like considering the fields as a set (which would be an unordered collection) and even worse describing it as a "set which may be none" (I don't even know what that means) rather than "may be an empty set". I have also simplified things somewhat, to make it easier to read, though the actual algorithm remains unchanged - at least when only considering cases where all characters are one byte, which is all I really know anything about. 
I have tried to make the algorithm handle the case where that isn't true, with multi-byte characters, but I really have no idea whether what I wrote makes any sense or not (I didn't remember bug:1560 as it is about that kind of issue, and I simply switch off whenever those issues are being discussed, I know I know nothing of the issues involved, so have nothing to contribute). But it is clear from that, and from the definition of IFS in XCU 2.5.3, that IFS is supposed to contain characters, and input fields just byte sequences, and that somehow field splitting is supposed to reconcile that. Since nothing (no shell) I'm aware of actually attempts to do that (at least as far as I know, I'm not really capable of testing it) I have no idea whether what I have written here is anything like what is intended to happen, or simply fantasy. Someone who understands what should happen (even better, if there's a shell which actually does it, who understands what does happen) needs to look very carefully and make corrections as appropriate (to the wording, or the algorithm). As well as including it in the following note, I will upload the same text as an attached file, "Revised-bug-1649-suggestion", to make it easier for someone else to download and correct (rather than needing to cut and paste from one of these notes, or from an e-mail message). Note that I have not altered the tests (though my local copy has one small section removed, to bring it in line with the slightly simplified algorithm specified now - none of the test results alter - the previous text was the way it was just because it had always been specified like that, there really was no good reason for that). The ## comments in the prototype impl (in sh) which contain lines from the spec are still from the Note: 0006460 version, not this new one - but as the basic algorithm is unaltered, it should be easy enough to find them there, and correlate with the newer version. |
(0006478) kre (reporter) 2023-09-11 03:36 |
2.6.5 Field Splitting After parameter expansion (Section 2.6.2), command substitution (Section 2.6.3), and arithmetic expansion (Section 2.6.4), and if the shell variable IFS [xref XCU 2.5.3] is set and its value is not empty, or if the IFS variable is unset, the shell shall scan each field containing results of expansions and substitutions that did not occur in double-quotes for field splitting; zero, one or multiple fields can result. For the remainder of this section, any reference to the results of an expansion, or results of expansions, shall be interpreted to mean the results from one or more unquoted variable or arithmetic expansions, or unquoted command substitutions. If the IFS variable is set and has an empty string as its value, no field splitting occurs. However if an input field which contained the results of an expansion is entirely empty, it shall be removed. Note that this occurs before quote removal, any input field that contains any quoting characters can never be empty at this point. After the removal of any such fields from the input, the possibly modified input field list becomes the output. Each input field is considered in sequence, first to last, with the results of the algorithm described in this section causing output fields to be generated, which remain in the same order as the input fields from which they originated. Fields which contain no results from expansions shall not be affected by field splitting, and shall remain unaltered, simply moving from the list of input fields to be next in the list of output fields. In the remainder of this description, it is assumed that there is present in the field at least one expansion result, this assumption will not be restated. Field splitting only ever alters those parts of the field. 
For the purposes of this section, the term "IFS white space" shall mean any of the single byte characters <space>, <tab> or <newline> from the Portable Character Set [xref XBD 6.1] which are present in the value of the IFS variable, and perhaps other White Space characters. It is implementation defined whether other White Space characters which appear in the value of IFS are also considered as "IFS white space". The three characters above specified as IFS white space characters are always IFS white space, when they occur in the value of IFS, regardless of whether they are White Space characters in any relevant locale. For other locale specific White Space characters allowed by the implementation it is unspecified whether the character is considered as IFS white space if it is White Space at the time it is assigned to the IFS variable, or if it is White Space at the time field splitting occurs (the locale may have altered between those events). If the IFS variable is unset, then for the purposes of this section, but without altering the value of the variable, its value shall be considered to contain the three single byte characters <space>, <tab> and <newline> from the Portable Character Set, all of which are IFS white space characters. The shell shall use the byte sequences that form the characters in the value of the IFS variable as delimiters. Each of the characters <space> <tab> and <newline> which appears in the value of IFS shall be a single byte delimiter. The shell shall use these delimiters as field terminators to split the results of expansions, along with other adjacent bytes, into separate fields, as described below. Note that these delimiters terminate a field, they do not, of themselves, cause a new field to start, subsequent data bytes that are not from the results of an expansion, or that do not form IFS whitespace characters are required for a new field to begin. 
Note that the shell processes arbitrary bytes from the input fields, there is no requirement that those bytes form valid characters.
If results of the algorithm are that no fields are delimited, that is, if the input field is wholly empty or consists entirely of IFS white space, the result shall be zero fields (rather than an empty field).
For the purposes of this section, when a field is said to be delimited, the candidate field, as generated below, shall become an output field. When the algorithm transforms a candidate into an output field it shall be appended to the current list of output fields.
Each field containing the results from an expansion shall be processed in order, intermixed with fields not containing the results of expansions, processed as described above, as if as follows, examining bytes in the input field, left to right:
  Begin with an empty candidate field.
  While the input is not empty...
    When instructed to perform the next iteration of the loop, start again with the test here, with the input as modified below.
    Consider the leading remaining byte or byte sequence of the input. No such byte sequence shall contain data such that some bytes in the sequence resulted from an expansion, and others did not.
    If the byte or sequence of bytes is:
      a. A byte (or sequence of bytes) in the input that did not result from an expansion: Append this byte (or sequence) to the candidate, and remove it from the input. Next iteration of the loop.
      b. A byte sequence in the input which resulted from an expansion that does not form a character in IFS: Append the first byte of the sequence to the candidate, and remove that byte from the input. Next iteration of the loop.
      c. A byte sequence in the input which resulted from an expansion that forms an IFS white space character: Remove that byte sequence from the input, consider the new leading input byte sequence, and repeat this step.
      d. A byte sequence in the input that resulted from an expansion that forms an IFS character, which is not IFS white space: Remove that byte sequence from the input, but note it was observed.
    At this point, if the candidate is not empty, or if a sequence of bytes representing an IFS character that is not IFS white space was seen at step d, then a field is said to have been delimited, and the candidate becomes an output field.
    Empty (clear) the candidate, and perform the next iteration of the loop.
  When the input is empty, if, and only if, the candidate is not empty, it becomes an output field.
The ordered list of output fields so produced, which may be empty, replaces the list of input fields. |
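To make the loop in this draft easier to follow, here is a rough sh sketch of it. This is not the strawman implementation attached to this bug. The function name `split_model` and its `sm_*` variables are invented for illustration. It assumes the whole input field came from one unquoted expansion (so step a never fires), that every character is a single byte, and that only <space>, <tab> and <newline> count as IFS white space. It also reads steps c and d as flowing one into the other, so IFS white space followed by one other IFS character forms a single delimiter.

```sh
# split_model VALUE IFSVALUE
# Illustrative model only: prints each output field as <field> on one line.
split_model() {
    sm_in=$1 sm_ifs=$2 sm_cand='' sm_out=''
    sm_tab=$(printf '\t')
    sm_nl=$(printf '\nx'); sm_nl=${sm_nl%x}
    while [ -n "$sm_in" ]; do
        sm_c=${sm_in%"${sm_in#?}"}            # leading byte of the input
        case $sm_ifs in
        *"$sm_c"*) ;;                         # an IFS character, handled below
        *)  sm_cand=$sm_cand$sm_c             # step b: append to the candidate
            sm_in=${sm_in#?}
            continue ;;                       # next iteration of the loop
        esac
        sm_seen=no
        while [ -n "$sm_in" ]; do
            sm_c=${sm_in%"${sm_in#?}"}
            case $sm_ifs in *"$sm_c"*) ;; *) break ;; esac
            case $sm_c in
            " "|"$sm_tab"|"$sm_nl")           # step c: drop IFS white space
                sm_in=${sm_in#?} ;;
            *)                                # step d: other IFS character
                sm_in=${sm_in#?}
                sm_seen=yes
                break ;;
            esac
        done
        # "At this point": a field is delimited if the candidate is not
        # empty, or if a non white space IFS character was seen at step d.
        if [ -n "$sm_cand" ] || [ "$sm_seen" = yes ]; then
            sm_out="$sm_out<$sm_cand>"
        fi
        sm_cand=''
    done
    # Input exhausted: a non-empty candidate becomes the final field.
    [ -n "$sm_cand" ] && sm_out="$sm_out<$sm_cand>"
    printf '%s\n' "$sm_out"
}

split_model 'a::b:' ':'     # prints <a><><b>
split_model '  a  b  ' ' '  # prints <a><b>
split_model ':' ':'         # prints <>
```

The first call shows the terminator behaviour: the trailing ':' ends the "b" field and does not start an empty one. Under the same assumptions, a value consisting only of IFS white space yields no fields at all, so the function prints an empty line.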
(0006479) kre (reporter) 2023-09-11 03:59 |
Now, that's done, something I have no idea about whatsoever (as in one byte character locales the answers to these are irrelevant). Do the results of each expansion or substitution get field split separately, or is the field constructed from all the substitutions and expansions first, then those parts which resulted from expansions (whichever expansion the resulting bytes came from) interpreted for field splitting? That is: assume we have a locale with 2 byte characters (if you like assume we're using UTF-8 and a char set which is encoded as a 2 byte sequence, or just some other char set with 2 byte encoding). Then assume we have X=$'\xXX' Y=$'\xYY' where neither XX nor YY as a byte is a valid character, but where the two byte sequence XXYY is a valid character in the encoding. Let's call that character 'C' (obviously it won't be, think more like alpha or a-umlaut or something encoded as 2 bytes in UTF-8, or anything you like in some other encoding). Now if we have IFS='...C...' (what else is in IFS is irrelevant, except that as $X and $Y are not characters, things would be unspecified if they were there, alone, so assume they are not.) If a field contains foo${X}${Y}bar what is the result supposed to be? If the results from the 2 expansions are considered separately, then nothing will change, as there are no IFS characters in either of those (there cannot be, as they are not characters and IFS can't contain anything which isn't). On the other hand, if we simply consider the result to be foo$'\xXX\xYY'bar with the part shown there inside the $'...' being the (unquoted) results of the expansions, that is fooCbar and treat that for field splitting, then C, the results of the (2) expansions is in IFS, so the result will be 2 fields, "foo" and "bar". Which is supposed to happen? 
My algorithm simply finesses this point (says nothing about it at all) which isn't good, but I have no idea what answer we should be producing, I have no idea if any shell even allows or attempts to deal with this kind of thing, so .... Please be clear in any reply whether you are just saying what you believe should happen (and in that case, please say why), or if you are indicating what some implementation actually does with this (in that case, say which, and ideally show a real worked example). If we cannot produce some consensus on this, I'll simply add text to the (draft proposed) spec saying that whether the byte sequences considered in the loop must all come from the results of a single expansion, or can be from the results of several expansions, is unspecified. |
(0006480) kre (reporter) 2023-09-11 04:18 edited on: 2023-09-11 04:20 |
And finally, for today, my first impression of an answer to the question (my question) in Note: 0006479 is that individual expansions should be field split separately, and never combined (so we never treat the example field as being fooCbar until after field splitting has happened). I suspect that's the best solution, as one can always do NEW="${X}${Y}" (where the quotes aren't really needed, as no field splitting or pathname expansions happen there anyway) and then foo${NEW}bar if one is deliberately trying to compose characters, which then might be a field splitting delimiter. On the other hand, if we just allow adjacent expansions to run together, then people would be continually needing to write foo${X}''${Y}bar just in case (as it would in this example) the results of the two substitutions might just accidentally take the trailing bytes from $X and the leading bytes from $Y and manage to concoct a character encoding from them which just happens to be in IFS (all accidental). While sticking the '' in that example would work (though is very ugly and cumbersome) in this example to solve that, if the expansion were just ${X}${Y} it wouldn't be safe, as if both $X and $Y happen to be empty ('') the results of that should be nothing (field deleted), whereas the results of ${X}''${Y} (after quote removal) would be an empty field. Without inventing a new mechanism, I'm unaware of any way of solving that one. If we were to agree that this is correct (and by some chance it happened that some shell actually works this way) then I would change the current wording in the draft in Note: 0006478 from: No such byte sequence shall contain data such that some bytes in the sequence resulted from an expansion, and others did not. to No such byte sequence shall contain data such that some bytes in the sequence resulted from an expansion, and others did not, or which contains bytes resulting from the results of more than one expansion. |
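The difference between ${X}${Y} and ${X}''${Y} when both variables are empty can be checked directly. A small sketch; `count_fields` is just a helper invented here to report how many fields survive.

```sh
# count_fields: report how many arguments (fields) it was given.
count_fields() { echo "$#"; }

X='' Y=''
count_fields ${X}${Y}      # prints 0: the wholly empty field is removed
count_fields ${X}''${Y}    # prints 1: the quotes keep an empty field alive
```

This is why inserting '' between adjacent expansions is not a general workaround: it changes the field count whenever both expansions happen to be empty.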
(0006483) Don Cragun (manager) 2023-09-21 16:48 edited on: 2023-09-25 09:47 |
Replace XCU section 2.6.5 on issue 8 draft 3 P2476, L80478-80504 with:After parameter expansion (Section 2.6.2), command substitution |
(0006485) kre (reporter) 2023-09-21 19:28 |
Aside from the obviously unintended formatting directive (ol type=...) that appears where there almost certainly should be just a blank line, Note: 0006483 looks OK to me. I'm slightly surprised by the decision regarding the issue in Note: 0006479 but that one was always so far outside my area of competence that I can only trust that the correct decision was made. Upon reflection, I'd probably add an xref to XBD 3.412 just after the (first) use of "White Space" (soon after the xref to XBD 6.1) but that's not all that important. It would also be better, if it can be made to happen with the formatting tools available, to (at least) eliminate the blank line separating the "While the input is not empty..." and "When instructed..." paragraphs, so the "this test" reference in the latter doesn't look to be referring to nothing, and also if possible, indent that "When instructed..." bit further, so it doesn't align itself with the steps to perform that follow, it isn't one of those, it is just a "how to interpret what follows" directive, so I think would be better if visually different from what follows. |
(0006487) Don Cragun (manager) 2023-09-25 14:21 |
As noted in Note: 0006485 there are some parts of this resolution that need to be looked at again. There are also several other editorial issues that should be fixed. |
(0006488) Don Cragun (manager) 2023-09-25 15:43 edited on: 2023-10-02 17:49 |
Replace XCU section 2.6.5 on issue 8 draft 3 P2476, L80478-80504 with: After parameter expansion (Section 2.6.2), command substitution (Section 2.6.3), and arithmetic expansion (Section 2.6.4), and if the shell variable IFS [xref XCU 2.5.3] is set and its value is not empty, or if the IFS variable is unset, the shell shall scan each field containing results of expansions and substitutions that did not occur in double-quotes for field splitting; zero, one or multiple fields can result. |
(0006490) kre (reporter) 2023-09-26 13:14 |
Re: Note: 0006488 Just a couple of minor formatting issues remain, which I expect can be fixed when all of this is converted to *roff for inclusion in the draft. First, there's an unnecessary (unintended I'm sure) line break in the 9th paragraph (I think 9th) in the phrase "subsequent data bytes" where the line break after "subsequent" shouldn't be there. Second, the whole bunch of paragraphs from "Begin with an empty ..." down to and including "Once the input is empty ..." should be indented slightly, retaining the relative (but absolute is immaterial) indentations already existing. That is, that whole group of paras is the object of the previous "Each field containing the results from an expansion shall be processed in order, [...] from beginning to end:" and is describing the processing of each field in turn. On the other hand, the final paragraph ("The ordered list...") applies after all fields have been processed, it isn't part of that "for each field". Currently that distinction isn't obvious - even worse, the excess vertical space before "Once the input is empty" might suggest that it applies only after all fields are processed, which isn't the case, a non-empty candidate becomes a field of its own if it exists after each field has been processed. Eg: given default IFS, and S=' ' (a space), then a${S}b c${S}d as 2 input fields makes 4 output fields, a b c d, both "b" and "d" are what is in the candidate when the input (field) is empty after performing the previous steps. |
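The closing example in the note above is easy to reproduce. A minimal sketch; `show` is a helper invented here that prints each field in angle brackets.

```sh
# show: print every field it receives as <field>, all on one line.
show() { printf '<%s>' "$@"; echo; }

unset IFS          # an unset IFS behaves like <space><tab><newline>
S=' '
show a${S}b c${S}d   # prints <a><b><c><d>
```

Each of the two input fields is processed on its own: "b" and "d" are what is left in the candidate when their input field runs out, so each becomes an output field.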
(0006491) Don Cragun (manager) 2023-09-29 02:28 |
The comments in Note: 0006490 were discussed during the 2023-09-28 conference call. We agreed with the suggestions and I agreed to update Note: 0006488 after the call concluded. I fixed the unintended line break in the 9th paragraph (and also one in the 1st paragraph). I got rid of several extraneous empty lines. I think I have fixed the indentation problems. Please let me know if something else needs to be done. |
(0006492) kre (reporter) 2023-09-29 08:59 |
I see no further issues with the edited Note: 0006488. Thanks. |
(0006495) larryv (reporter) 2023-09-29 20:42 |
Some nitpicking on Note: 0006488:
|
(0006496) stephane (reporter) 2023-10-01 16:00 |
Re: Note: 0006488
> The shell shall use the byte sequences that form the characters
> in the value of the IFS variable as delimiters. Each of the
> characters <space> <tab> and <newline> which appears in the value
> of IFS shall be a single byte delimiter. The shell shall use
> these delimiters as field terminators to split the results of
> expansions, along with other adjacent bytes, into separate fields,
> as described below. Note that these delimiters terminate a field,
> they do not, of themselves, cause a new field to start, subsequent
> data bytes that are not from the results of an expansion, or that
> do not form IFS white-space characters are required for a new
> field to begin.
The wording in the last sentence would imply that:
var='var ' sh -fc ' printf "[%s]\n" $var"" '
should output only
[var]
Instead of
[var]
[]
set -o noglob
IFS=:
for dir in $PATH''; do
  dir=${dir:-.}
  ...
done
Is a well known idiom to loop over the components of $PATH without discarding the empty component in PATH=/bin:/usr/bin: for instance. |
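A runnable sketch of the $PATH idiom mentioned above, wrapped in a function so it can be tried on any colon-separated value. The name `list_path_dirs` is invented here; the body runs in a subshell so the IFS and noglob changes do not leak into the caller.

```sh
# list_path_dirs VALUE: print one component per line, "." for empty ones.
list_path_dirs() (
    p=$1
    set -o noglob          # no pathname expansion on the split results
    IFS=:
    # The trailing '' is not the result of an expansion, so it starts a
    # final field even when the value ends with a ':' terminator.
    for dir in $p''; do
        printf '%s\n' "${dir:-.}"
    done
)

list_path_dirs '/bin:/usr/bin:'
```

For the value /bin:/usr/bin: this lists three components, with the final empty one reported as ".". Without the trailing '' the empty component after the last ':' would be lost.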
(0006497) kre (reporter) 2023-10-01 22:22 |
Re: Note: 0006495 - thank you, yes, all of those changes look good to me, obviously at least the final one (no idea how that slipped through all of the re-reading that has been happening - to make it easier, that's in the paragraph which begins "For the purposes" which is about the 12th para I think). For the rest, those changes, or something similar, look to be an improvement. One of the reasons that I never expect wording I write to survive intact, is that I tend to write long run-on sentences, which can go on forever sometimes, even though I know that's not good writing. Corrections by those better at this than I am are always welcome. Re Note: 0006496 No, there is no implication anything like you are suggesting. That "last sentence" you mean includes (requoting it here) subsequent data bytes that are not from the results of an expansion, or that do not form IFS white-space characters are required for a new field to begin. which is the part of it I believe you're focussing on. Let me slightly restate your example, inspired by the later PATH related example, just to avoid there being spaces (any white space) here - just because that is hard to display cleanly in text like this (and the PATH example shows that the issue isn't in any way related to the IFS white space delimiter your original example used). var='var:' sh -fc 'IFS=:; printf "[%s]\n" $var"" ' After variable expansion (there are no arith expansions or command subs to muddy the waters here) we have 3 initial fields in that printf command. The first two are just the "printf" word, and the format string, neither of which contain any expansion results, and so aren't touched by field splitting (not that the note was suggesting anything different, just for completeness here). 
So we have just the third field, which after variable expansion will contain the following byte sequence (into which I have inserted white space just for this note, to make things more obvious - this is why I needed to change the example from the original - just for easier visualisation in text like this): v a r : " " in which the first 4 bytes are the results of the variable expansion, and the final 2 are not. Now we do field splitting, the ':' in this case is an IFS char (in the original a space was used, which is also IFS white space, but in this example that makes no difference); when we see that, the candidate contains "var" so is not empty, and becomes an output field. Now we're at the point where the sentence in question is thought to apply. The part of it that reads "subsequent data bytes that are not from the results of an expansion", however, does apply: there are still those two " (double quote) characters present, which are not from the results of an expansion, and hence satisfy that condition, which then allow a new candidate field to start. That one has no delimiter, so we reach the end of the input with the candidate containing two double quote characters, which is not empty, hence forms another output field - so now after field splitting we have (in this example) 4 fields printf "[%s]\n" var "" Subsequently (after pathname expansion, which is disabled in this example, so does nothing, not that there are any unquoted glob magic chars to invoke it anyway) we will do quote removal, which never alters the number of fields, it simply removes the quoting chars, still leaving the 4 fields, with all the " characters removed (so the final field will be a null string). 
That whole sentence though is redundant with the actual algorithm, which shows how that works - it was added merely as added reinforcement that if the input had not had those two trailing quote characters, then there would not be an extra field produced, just 3; which in the case you gave was never in doubt, IFS white space is always handled like that, but in the case with a ':' delimiter, many people assume there should be an empty field produced as the IFS are "field separators" implied by the 'S' in the var name - the whole sentence can (and probably should) be removed if it is thought to potentially cause more confusion than it removes (but this example isn't really a case of that I think - the problem here was that the quote characters weren't considered as being in the input, as if they'd already been replaced by a null string). The sentence earlier in the proposed spec (as it currently appears in Note: 0006488): Note that this occurs before quote removal, any input field that contains any quoting characters can never be empty at this point. was another sentence I feared might be considered redundant, as other text in the standard makes that clear already - it should not really need to be restated here - but I have found that some people find it difficult to comprehend this part of the abstract algorithms in the standard - particularly when real implementations rarely act quite like this, so I included it as an additional reinforcement of this issue. True, it is in a paragraph that's primarily concerned with what to do when IFS='' applies - but whether or not quotes have already been removed from the input fields cannot (and does not) differ in that case, so I considered that just making sure that any reader has in mind that quote characters (in the abstract implementation) are still present, unaltered, in the fields in this one place would be sufficient. 
I'll leave it for others to decide whether more, or less, or adjusted, forms of these two redundant sentences should remain in the final version. |
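The walk-through above can be confirmed with a short sketch using the same var='var:' value. `show` is a helper invented here that prints each field in square brackets; the subshells keep the IFS change local.

```sh
# show: print every field it receives as [field], all on one line.
show() { printf '[%s]' "$@"; echo; }

var='var:'
( IFS=:; show $var )      # prints [var]    the ':' only terminates "var"
( IFS=:; show $var"" )    # prints [var][]  the quotes start a second field
```

The only difference between the two calls is the pair of quote characters, which are still present in the field when field splitting runs.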
(0006498) stephane (reporter) 2023-10-02 06:27 |
OK, I see your PoV now. Then the thing I would object to is the use of "*data* byte". For me, and I'd expect most people reading the text, those quotes in $var"" are syntax, not data. (I've myself always disliked the use of "quote removal" in the POSIX specification as that's a detail of implementation, while the spec should focus on specifying the language, though I acknowledge it's probably too late to change that). Just a comment on: > but in the case with a ':' delimiter, many people assume there should be an empty field produced as the IFS are "field separators" implied by the 'S' in the var name Note that it used to be the case in most shells, including pdksh, yash, zsh (still the case in zsh) and I believe ash or at least some variant after they switched from Bourne-like to POSIX-like behaviour and they later begrudgingly changed to align with Korn/POSIX. In the Bourne shell, there was no such question as all characters were treated as Korn's IFS-whitespace characters, that is ::foo:::bar::: was split into foo and bar only. Korn in effect changed IFS from being an Internal Field Separator to being an Expansion Field Terminator, without ever documenting it so I'm not even sure the change (from separator to terminator) was intentional. That gets in the way for "read" where one can't use that "appending ''" work-around. Where IFS=: read -r header rest fails to read the "rest" properly if it contains one and only one ":" and it's at the end of the line. |
(0006499) kre (reporter) 2023-10-02 12:25 |
Re Note: 0006498 those quote characters are data bytes as far as field splitting is concerned. The abstract model used by the standard has that as a feature throughout. I will admit to also not much liking it when I first saw it - and I suspect it makes some things far more difficult to describe than they ought be. But it is very clear that in the model used throughout, quoting characters are retained in the input until quote removal removes them. The shell also needs to keep track of which quote chars were in the original input (any resulting from expansions are just data) and which characters were quoted, or result from a quoted expansion, which you'd think would be enough, but just doing that is not the model. To field splitting they're irrelevant, except if they're quoting characters they can't have resulted from an expansion. They're just data. If you have any suggestions for wording changes to make any of this clearer please make them here, but don't bother attempting to alter the parsing model that the standard presumes, that's far too big a change to even contemplate at the minute. As far as the history of separators vs terminators, that's all in the past and no longer relevant to anything - all shells (except zsh in --emulate sh mode, in its native mode, zsh doesn't appear to do field splitting at all) agree on how this works now - it isn't going to alter, so speculating upon how we got here isn't serving any useful purpose. Even mksh, which seems to be the one shell (other than the ancient broken pdksh in NetBSD) where field splitting is broken, mostly seems to agree on this. Thanks for mentioning read though - I have known for a while that the description of how field splitting is done for read also needs corrections, but I never actually remembered to do anything about it when I was in a position to send a defect report for it. I will do that now, it should be treated as related to this bug, and the two processed together. 
If anyone knows of anywhere else in the standard, where the field splitting algorithm is used (I don't mean places where field splitting is said to apply, or not apply, those should all be fine as is) in a way similar to how read uses it, please let us know, so if more corrections/updates are needed, those places can also be corrected now. |
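The two behaviours described in the note above (quote characters produced by an expansion are just data to field splitting, and a non-white-space IFS character acts as a field terminator rather than a separator) can be checked with a short script. This is only a minimal sketch and not part of the accepted text; the variable names are made up for illustration, and it assumes a shell whose field splitting matches the behaviour the note says current shells agree on.

```shell
#!/bin/sh
# Illustration only; all variable names here are hypothetical.
unset IFS        # use the default IFS: <space><tab><newline>
set -f           # disable pathname expansion so only field splitting is observed

# 1. Quote characters that result from an expansion are ordinary data:
#    they do not protect the embedded space from field splitting.
var='"a b" c'
set -- $var
quote_count=$#
quote_first=$1
quote_second=$2

# 2. A non-white-space IFS character terminates a field.
IFS=:
var='a:b:'
set -- $var
term_count=$#    # the trailing ':' terminates "b"; no extra empty field follows

var='a:b::'
set -- $var
empty_count=$#   # the second trailing ':' terminates an empty third field

unset IFS
set +f
printf '%s\n' "$quote_count" "$quote_first" "$quote_second" "$term_count" "$empty_count"
```

In shells that behave as described, this should report three fields for the first case (`"a`, `b"` and `c`, with the quote characters kept as data), two fields for `a:b:` and three for `a:b::`.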
(0006508) Don Cragun (manager) 2023-10-02 17:57 |
The comments in Note: 0006495, Note: 0006496, Note: 0006497, Note: 0006498, and Note: 0006499 were discussed during the 2023-10-02 conference call. All of the changes suggested in Note: 0006495 were accepted. We also accepted the suggested change from "data bytes" to "bytes". All of these changes have been updated in Note: 0006488. This bug remains resolved with the changes now specified in Note: 0006488. |
(0006525) geoffclare (manager) 2023-10-10 09:31 |
When applying this bug I made some minor editorial wording changes, such as use of "shall" instead of present tense in some places. |
Issue History | |||
Date Modified | Username | Field | Change |
2023-03-31 01:55 | kre | New Issue | |
2023-03-31 01:55 | kre | File Added: ifs | |
2023-03-31 01:55 | kre | Name | => Robert Elz |
2023-03-31 01:55 | kre | Section | => XCU 2.6.5 |
2023-03-31 01:55 | kre | Page Number | => 2476 |
2023-03-31 01:55 | kre | Line Number | => 80478 - 80504 |
2023-07-31 16:13 | Don Cragun | Note Added: 0006412 | |
2023-09-07 14:14 | kre | Note Added: 0006459 | |
2023-09-07 14:15 | kre | Note Added: 0006460 | |
2023-09-07 14:30 | kre | Note Added: 0006462 | |
2023-09-07 14:32 | kre | Note Added: 0006463 | |
2023-09-07 14:41 | kre | Note Deleted: 0006463 | |
2023-09-07 14:43 | kre | Note Edited: 0006462 | |
2023-09-07 14:45 | kre | Note Added: 0006464 | |
2023-09-07 14:54 | kre | Note Added: 0006465 | |
2023-09-07 15:01 | kre | Note Added: 0006466 | |
2023-09-07 15:03 | kre | Note Added: 0006467 | |
2023-09-07 15:06 | kre | File Added: IFS-test | |
2023-09-07 15:07 | kre | File Added: POSIX-bug-1649-impl.sh | |
2023-09-07 15:09 | kre | File Added: Expected-Results | |
2023-09-07 15:15 | kre | Note Added: 0006468 | |
2023-09-07 15:21 | kre | Note Added: 0006469 | |
2023-09-07 15:22 | kre | Note Edited: 0006469 | |
2023-09-07 16:12 | kre | Note Added: 0006471 | |
2023-09-07 16:40 | geoffclare | Note Added: 0006472 | |
2023-09-07 16:41 | geoffclare | Relationship added | related to 0001560 |
2023-09-07 17:06 | kre | Note Added: 0006473 | |
2023-09-07 17:09 | kre | Note Edited: 0006473 | |
2023-09-11 03:35 | kre | Note Added: 0006477 | |
2023-09-11 03:36 | kre | Note Added: 0006478 | |
2023-09-11 03:39 | kre | Note Edited: 0006469 | |
2023-09-11 03:59 | kre | Note Added: 0006479 | |
2023-09-11 03:59 | kre | File Added: Revised-bug-1649-suggestion | |
2023-09-11 04:18 | kre | Note Added: 0006480 | |
2023-09-11 04:20 | kre | Note Edited: 0006480 | |
2023-09-21 15:51 | Don Cragun | Note Added: 0006482 | |
2023-09-21 15:58 | Don Cragun | Note Deleted: 0006482 | |
2023-09-21 16:48 | Don Cragun | Note Added: 0006483 | |
2023-09-21 16:52 | Don Cragun | Note Edited: 0006483 | |
2023-09-21 17:01 | Don Cragun | Note Edited: 0006483 | |
2023-09-21 17:03 | Don Cragun | Final Accepted Text | => See Note: 0006483. |
2023-09-21 17:03 | Don Cragun | Status | New => Resolved |
2023-09-21 17:03 | Don Cragun | Resolution | Open => Accepted As Marked |
2023-09-21 17:04 | Don Cragun | Tag Attached: issue8 | |
2023-09-21 19:28 | kre | Note Added: 0006485 | |
2023-09-25 09:43 | Don Cragun | Note Edited: 0006483 | |
2023-09-25 09:47 | Don Cragun | Note Edited: 0006483 | |
2023-09-25 14:21 | Don Cragun | Note Added: 0006487 | |
2023-09-25 14:21 | Don Cragun | Resolution | Accepted As Marked => Reopened |
2023-09-25 15:43 | Don Cragun | Note Added: 0006488 | |
2023-09-25 15:46 | Don Cragun | Final Accepted Text | See Note: 0006483. => See Note: 0006488. |
2023-09-25 15:46 | Don Cragun | Resolution | Reopened => Accepted As Marked |
2023-09-26 13:14 | kre | Note Added: 0006490 | |
2023-09-29 01:50 | Don Cragun | Note Edited: 0006488 | |
2023-09-29 01:58 | Don Cragun | Note Edited: 0006488 | |
2023-09-29 02:04 | Don Cragun | Note Edited: 0006488 | |
2023-09-29 02:16 | Don Cragun | Note Edited: 0006488 | |
2023-09-29 02:28 | Don Cragun | Note Added: 0006491 | |
2023-09-29 08:59 | kre | Note Added: 0006492 | |
2023-09-29 20:42 | larryv | Note Added: 0006495 | |
2023-10-01 16:00 | stephane | Note Added: 0006496 | |
2023-10-01 22:22 | kre | Note Added: 0006497 | |
2023-10-02 06:27 | stephane | Note Added: 0006498 | |
2023-10-02 12:25 | kre | Note Added: 0006499 | |
2023-10-02 14:17 | kre | Note Added: 0006501 | |
2023-10-02 14:19 | kre | Note Deleted: 0006501 | |
2023-10-02 16:21 | geoffclare | Relationship added | related to 0001778 |
2023-10-02 17:49 | Don Cragun | Note Edited: 0006488 | |
2023-10-02 17:57 | Don Cragun | Note Added: 0006508 | |
2023-10-10 09:30 | geoffclare | Status | Resolved => Applied |
2023-10-10 09:31 | geoffclare | Note Added: 0006525 | |
2023-10-10 09:32 | geoffclare | Tag Attached: applied_after_i8d3 | |
2024-06-11 09:12 | agadmin | Status | Applied => Closed |