0001778: The read utility needs field splitting updates/corrections (and a little more)

Notes
(0006500) kre (reporter) 2023-10-02 14:00 edited on: 2023-10-02 14:44	Replace the paragraph beginning The terminating logical line delimiter at line 11869, and continuing down past the three bullet points to line 11878 (inclusive) with the following: The terminating logical line delimiter (if any) shall be removed from the input. Then if the shell IFS variable [xref XCU 2.5.3] is set, and its value is an empty string, the resulting data shall be assigned to the first variable var named upon the command line, with any other variables named being set to the empty string. No other processing takes place. If IFS is unset, or is set to any non-empty value, then a modified version of the field splitting algorithm specified in [xref XCU 2.6.5] shall be applied, with the modifications as follows: The input to the algorithm shall be the result obtained from the read, and shall be considered as a single initial field, all of which resulted from an expansion. The loop over the contents of that field shall cease when either the input is empty, or when n output fields have been generated, where n is one less than the number of variable name arguments to the read command. Any remaining input in the original field being processed shall be returned to the read command, unmodified, except that any leading or trailing IFS white space, as defined in XCU 2.6.5, shall be removed. The number of output fields generated by the field splitting algorithm shall be called m. The first m (which must be less than or equal to n) variables named upon the read command line are assigned the values from the ordered list of output fields, in the order they were generated, starting with the first variable named, and continuing sequentially until m variables have had values assigned. If m = n then the remaining unsplit input returned from the modified field splitting algorithm shall be assigned to the n+1'st (final) variable named upon the command line. 
Any variables to which no value has yet been assigned shall be assigned an empty string. Note that in the case where just one variable name was given on the read command line, the algorithm in XCU 2.6.5 ceases after producing zero (m) output fields, and simply returns the input field with any leading and trailing IFS white space removed. No output fields exist to assign to variables, so the remaining modified input is assigned to the first (and only) variable named. In addition to this, somewhere we should probably also add either: If any variable name is specified more than once as an argument to read the results are unspecified. or, as all shells seem to implement it this way, even though it doesn't seem even vaguely useful to me: If any variable name is specified more than once as an argument to read then the values are assigned in order as specified above, with any later assignment replacing the earlier, including when an empty string is assigned because no field data was available.
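A minimal sketch of two of the cases described in this note: IFS set to the empty string, and a single variable with a non-empty IFS. The variable names and the sample input are made up for illustration, and a here-document is used only so that read runs in the current shell rather than in a pipeline subshell.

```shell
# Sample input with leading, embedded and trailing blanks (illustrative only).
input='  alpha   beta  '

# IFS set to the empty string: no splitting and no trimming; the whole
# line goes to the first variable and the other variable becomes empty.
IFS= read -r first second <<EOF
$input
EOF
printf 'first=[%s] second=[%s]\n' "$first" "$second"

# One variable, IFS non-empty: zero output fields are produced and the
# input is assigned with leading and trailing IFS white space removed.
# Embedded blanks are left untouched.
IFS=' ' read -r only <<EOF
$input
EOF
printf 'only=[%s]\n' "$only"
```

The IFS assignments are prefixes to read, so they apply only to that command and leave the calling shell's IFS alone.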

(0006502) kre (reporter) 2023-10-02 14:20 edited on: 2023-10-02 14:34	Re Note: 0006498 (a note added to 0001649), where it says: Where IFS=: read -r header rest fails to read the "rest" properly if it contains one and only one ":" and it's at the end of the line. I'm afraid I don't understand the perceived problem. What is meant there? If I have the following script (which is a bit more general than it needs to be for this, as it was a modified version of another script I used for testing just how shells process read(1)):
    I=${IFS-$' \t\n'}
    case "$1" in
    -i*) I=${1#-i}; shift;;
    -I) I=unset; shift;;
    esac
    printf %s\\n "$*" | {
        case "$I" in
        unset) unset IFS;;
        *) IFS="$I";;
        esac
        read -r var rest && printf %s\\n var="[${var-unset}] rest=[${rest-unset}]"
    }
and I call it /tmp/r2 and run it as
    /usr/pkg/bin/bash /tmp/r2 -i: line:
then every shell I tested produces:
    var=[line] rest=[]
as the result. I'm not sure what difference you'd be expecting there. Here "every shell" includes zsh, even when not in --emulate sh mode, and NetBSD's crappy ancient pdksh implementation. That is, apart from ksh88, which I don't have, just about every Bourne shell derivative which is still in active use.
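A reduced form of the same check, as a sketch only: it drops the option handling of the script above and feeds the sample line through a here-document so that the variables remain visible in the current shell. The names header and rest are taken from the report being discussed.

```shell
# One ":" only, at the end of the line, as in the reported case.
IFS=: read -r header rest <<'EOF'
line:
EOF
printf 'header=[%s] rest=[%s]\n' "$header" "$rest"
```

With the shells listed in the note, this is expected to show header set to line and rest empty.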

(0006503) kre (reporter) 2023-10-02 14:30	Oops, I submitted this as if it applied to Issue 7 TC1 (of all things), it is meant to be (and the page and line numbers are from) Issue 8, draft 3. Could someone who is able please move it where it belongs? Apologies, I should have noticed the missing "which draft" field on the form, but didn't!

(0006507) geoffclare (manager) 2023-10-02 16:20	Re Note: 0006503 I have changed this to be an Issue 8 draft 3 bug, as requested.

(0006509) geoffclare (manager) 2023-10-03 09:23	Re the last part of Note: 0006500, when there are duplicate variable names passed to read the standard already clearly requires the "assigned in order" behaviour. I don't see the need to add anything about it, except perhaps as rationale.

(0006510) kre (reporter) 2023-10-03 11:59	Re Note: 0006509 Could you quote the text which you believe clearly requires that, as I see nothing at all about the order in which the assignments take place. The current version does say: the first field shall be assigned to the first variable var, the second field to the second variable var, and so on. but that doesn't say anything about the order in which those assignments are to occur. Just which field goes to which variable. Further it also says: If there are fewer fields than there are var operands, the remaining vars shall be set to empty strings. but that says nothing about the order in which that occurs relative to the other assignments. One reasonable implementation might be to simply start by setting all the vars to empty strings, and then update those for which there is data from the read result. Nothing here says that cannot be done. While I'm here, the current text also doesn't say what to do if there's an error making one of those assignments - certainly read exits with a status > 1 in that case, that's clear, but is read expected to go ahead and assign to the other variables, or may it abort processing as soon as the error is detected? That's unstated as well as best I can tell. As a hint: not all shells do the same thing here, there are actually 3 different behaviours I have detected. Further not all shells do the >1 exit status either, some exit 1 in this case (for some the shell (or subshell) exits (treating this as a variable assignment error and doing as specified by table 2.8.1 I presume).

(0006511) kre (reporter) 2023-10-03 12:11	Further re Note: 0006509 I assumed in Note: 0006510 that your "already clearly states" meant what is in Draft 3 (and earlier) already, rather than what is in my proposed (but needs wordsmithing) text in Note: 0006500. In the latter I do specify left to right (or first to last) assignment of the variables to which data is to be applied - as that's what shells do, so it seemed reasonable - but there is a potential problem for the duplicate var name case, it also says: Any variables to which no value has yet been assigned shall be assigned an empty string. which would imply that the empty string assignments happen after the others, but (even though I wrote that) is unclear as to what "variables to which no value has yet been assigned" means in the case we have
    read a b a <<\!
    x y
    !
clearly a=x and b=y happens, but now are we supposed to also do a='' for the duplicate use of a (as that's what shells do), or is a in that case a variable to which a value has already been assigned? This kind of thing needs to be bulletproof.
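The duplicate-name case can be tried directly. This is only a sketch of the behaviour the note attributes to existing shells; IFS is set explicitly so the result does not depend on the caller's IFS.

```shell
# "a" is named twice: it first receives x, then, as the last operand
# with no data left, it is assigned the empty string.
IFS=' ' read -r a b a <<'EOF'
x y
EOF
printf 'a=[%s] b=[%s]\n' "$a" "$b"
```

In shells that assign in operand order, a ends up empty and b holds y, which is the "later assignment replaces the earlier" outcome described above.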

(0006512) geoffclare (manager) 2023-10-03 13:42	> Could you quote the text which you believe clearly requires that Utility Description Defaults, para 3 under OPERANDS (in draft 3 it's P2449 L79351): Unless otherwise stated, the standard utilities that accept operands shall process those operands in the order specified in the command line.

(0006513) kre (reporter) 2023-10-04 14:42 edited on: 2023-10-05 15:41	Re Note: 0006512 .. OK, but that still leaves open the question of what it means to "process those operands" in the case of read (1). For utilities like ls, cp, mkdir, ... where there is a basic operation to perform, and then that operation is performed upon each operand in turn, the "process" is easy to understand, and doing everything in sequence is natural. Since read's operation isn't (really) to process each operand in turn (that is, it is required to read the input up to the end delimiter before doing anything else, not to read one field at a time for each var named) - that simply setting up the vars named ready to be assigned to, once data is available, is all the processing that is required. Then the actual order in which the assignments get made would not be specified. And yes, I know that's something of a stretch, which is why I did not write the text to allow that, Note: 0006500 requires that the fields be assigned in order, and probably should be updated to ensure that all variable names (args on the command line) for which no data is available do get assigned the empty string, even if the same var name was previously the object of an assignment. But there's another more specific (related) requirement in Note: 0006500 than what is in the current text - my new version requires that the field splitting complete before any variable assignments are made, whereas while the current text suggests that, it isn't actually required. This makes a difference if a variable assigned to can alter the way the field splitting works, for which the obvious example is IFS - but (some of) the locale variables can have an effect as well, as they can affect how IFS is interpreted. 
What's more, when we do this we have some wild behaviours amongst shells, a majority (all ash derived shells, yash, ksh93) behave as described in Note: 0006500 but several (older ksh's, bosh, and zsh) assign the variables as each field is produced, such that altering IFS alters how the next field is delimited, and bash seems to simply disable (further) field splitting if IFS is altered, and the remainder of the input is assigned to the next variable. Or at least that's the general behaviour, things get more strange if IFS white space is assigned to IFS during the process. Given:
    IFS=: ; printf %s\\n "$input" | read -r IFS var rest
then with input=',:abc:def,ghi' I see as the final values of IFS var and rest
    IFS=[,] var=[abc] rest=[def,ghi]      (shells where vars are assigned late)
    IFS=[,] var=[abc:def,ghi] rest=[]     (bash)
    IFS=[,] var=[abc:def] rest=[ghi]      (shells which do field by field assignments)
With input=':abc:def,ghi' only the final set changes (and in a consistent way)
    IFS=[] var=[abc:def,ghi] rest=[]
With input=' : abc:def,ghi '
    IFS=[ ] var=[ abc] rest=[def,ghi ]    (shells which field split first)
    IFS=[ ] var=[ abc:def,ghi ] rest=[]   (bash)
Both of those are consistent with the first set of results, just with different input data
    IFS=[ ] var=[abc:def,ghi] rest=[]
    IFS=[ ] var=[] rest=[abc:def,ghi]
    IFS=[ ] var=[] rest=[ abc:def,ghi]
from the shells which do immediate updates to vars when fields are split (but only the first of those makes any real sense to me at all). What we ought to do about this I am unwilling to say - keep Note: 0006500 as it is, or make it unspecified whether vars are assigned immediately each output field is produced by field splitting, either of those would be possible. Mandating immediate assignment would be unacceptable I believe.

(0006514) kre (reporter) 2023-10-04 17:36	There's another issue with read which I decided to do nothing about in the Note: 0006500 text - partly to see if anyone would notice it, and partly because I have no idea what (precisely) should be done about it. Text in (draft 3 and earlier) of read (@line 111859... in draft 3) says: By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of either <newline> or [...] The "or" stuff at the end is new, but \<newline> (or its new alternative) isn't the issue here, it is the: preserve the literal value of the following character, which is a meaningless thing to say, when we're considering field splitting, for two reasons: First, the data being split is supposed to be (subsequent to 0001560) a byte string, not characters -- and that's reinforced in read, at line 111905... where it says: If the -d delim option is specified and delim is the null string, the standard input shall contain zero or more bytes (which need not form valid characters). but the use of <escape> as affecting the following 'character' does not depend upon the -d setting. (Also, behaviour is unspecified if the -d (delim) arg is not a single byte character, or empty which also suggests that random encoded chars in the input are not really intended to be handled, or we'd need to allow applications to specify the delim to be any arbitrary character as well, not make it unspecified if they do). And while this issue is why I am unwilling to propose complete text to fix this problem, as I have no idea what is actually supposed to happen in cases like this, and no good way to test what shells do, this isn't the issue which is most significant. Second, "literal value" has no meaning with field splitting - doesn't as it is proposed to be processed in 0001649 and didn't as field splitting was specified before. 
For this, there's no question what is actually intended, the idea is that the escaped character (or byte) can never be an IFS character, it is always to be treated as field data, and so assigned to a variable. But there isn't anything which actually says this - field splitting, as used in the rest of the shell has no need for anything like this, what it deals with is quoted, or unquoted, expansion results. Anything quoted by a \ can't be either of those, that's all shell quoting, and is handled by the tokenisation of the input, and then quote removal. What we have with read (and all shells implement it as intended I believe) is something similar with the issue with \ in patterns - in patterns written directly in the shell, there never was an issue, as the \ is either a quoting character, or a (must have been quoted) ordinary character - and the normal rules for interpreting quoting applied, and things just worked. But when a \ appears in the result of an unquoted expansion then we need to handle it specially, even though normal pattern processing wouldn't normally be concerned with it at all. All this resulted in more text to handle \ in patterns, when it does appear. Since we're treating the data being obtained by read (after line joining happens in this case, since this is only relevant when -r is not given) as if it were the result of an unquoted expansion, we have a similar issue with field splitting, we want the \ in there to be magic - but normal field splitting has no need for any magic like that, so doesn't specify it. Here, I believe, there is (aside from dealing with the first issue above, which I will return to a little more below) a relatively simple solution available. Where Note: 0006500 says (bullet point 1) The input to the algorithm shall be the result obtained from the read, and shall be considered as a single initial field, all of which resulted from an expansion. 
(which should probably have said "unquoted expansion") - we'd just change that to be something like The input to the algorithm shall be the result obtained from the read, and shall be considered as a single initial field, all of which resulted from expansions, any data bytes specified above to be treated literally, and the preceding escape character shall be treated as if they were the results of a quoted expansion, all other data shall be treated as if it were (perhaps several) results from unquoted expansions. Then field splitting will leave the \X sequences alone, and split only when it finds IFS delimiters in the parts designated as unquoted. But we still need to deal with the first issue, and this one is not at all simple (but I guess lots more unspecified can be thrown around if needed). The problem is that the whole point of \ escaping is to hide a character (which is, I suppose, why the current text says "character" and not byte) which might be in IFS, but which is not intended to be a field delimiter. It is fairly clear from the draft 3 text (and this was not altered in 0001649) that IFS contains characters. If IFS contains a locale specific encoded non-ascii character (let's say, a greek lower case alpha, in UTF-8 encoding) then that's going to (usually) be 2 or more data bytes (2 in this case I believe). If we don't want a specific greek alpha that appears in the input data to be treated as a field delimiter, we need to escape that character (\alpha). This means we need to be treating the data after the \ as a character (which is what the text currently demands) but which then leaves open the question of what happens when the data read has a \ followed by bytes which do not properly form a character. I have no idea what shells do with this, so I can't suggest any text to handle this issue. (Most shells probably simply don't deal). 
Note however, that if the suggested (type of) fix for this issue suggested just above is adopted, then it becomes irrelevant that field splitting deals with strings of arbitrary bytes, and we're trying somehow to deal with characters in those bytes here - as by specifying the escaped text as the result of a quoted expansion, field splitting will simply leave that section alone, as bytes to go in a field, which is exactly what we want - the field splitting algorithm doesn't need to know whether those bytes form chars or not. So, what's left is to work out what the "data bytes specified above to be treated literally" (in the proposed text in this note) actually is to be in the various cases, without reverting to the earlier presumption that everything obtained by read is necessarily well formed characters in all cases.

(0006515) geoffclare (manager) 2023-10-05 08:59 edited on: 2023-10-10 11:06	Re Note: 0006514 [misinformation about backslash and field splitting removed - see Note: 0006517] You have a point about character v. byte and we should address that.

(0006516) kre (reporter) 2023-10-05 11:08 edited on: 2023-10-12 18:45	Re Note: 0006515 You might be right that I am misinterpreting that section, but if I am, the current specification of how read works (including what is in Note: 0006500) is even more broken than I thought. You did note that I said in that note: all shells implement it as intended by which I (hope obviously) meant as I considered it to be intended. Try it yourself:
    printf %s\\n ' a x:c\:c d\ d e\:\ e f ' | $SHELL -c 'IFS=": "; read a b c d e f g h ; printf "%s=<%s>\\n" a "$a" b "$b" c "$c" d "$d" e "$e" f "$f" g "$g" h "$h"'
The results I get from every shell I have to test are:
    a=<a>
    b=<x>
    c=<c:c>
    d=<d d>
    e=<e: e>
    f=<f>
    g=<>
    h=<>
If you're correct about how that paragraph in the text is supposed to be interpreted, then what explains those results? Is every single shell broken? But I don't believe your interpretation makes sense, if that was what was intended, the obvious thing to do would have been to specify that any escape chars be removed from the line (or whatever we call it these days) that read fetched from stdin, after line joining, and before field splitting. But it doesn't, it explicitly says that the escape chars are removed after the fields have been split. The only reason for requiring that would be if those escape characters were intended to apply (somehow) to the field splitting operation. We just currently have the situation that whereas all shells agree on how that should be done (at least for single byte char sets, no idea about others) the standard doesn't say anything, and is simply assuming that "treated literally" will have the desired effect. And there I agree with your interpretation of that - those words don't actually say that. That was the point of Note: 0006514.
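The same test can be written so that it runs in the current shell, with a short contrast between read and read -r. This is a sketch only: the second input (k\:1:v) and the names x, y, rx, ry are invented for illustration, and only single-byte characters are used.

```shell
# The input from the note, kept in a variable so that the leading and
# trailing blanks survive editing.
input=' a x:c\:c d\ d e\:\ e f '

# Without -r: an escaped ":" or blank is kept as field data.
IFS=': ' read a b c d e f g h <<EOF
$input
EOF
printf '%s=<%s>\n' a "$a" b "$b" c "$c" d "$d" e "$e" f "$f" g "$g" h "$h"

# Same data read twice: once with backslash acting as an escape ...
IFS=: read x y <<'EOF'
k\:1:v
EOF
# ... and once with -r, where the backslash is ordinary data and the
# first ":" delimits the first field.
IFS=: read -r rx ry <<'EOF'
k\:1:v
EOF
printf 'x=<%s> y=<%s> rx=<%s> ry=<%s>\n' "$x" "$y" "$rx" "$ry"
```

The backslashes in $input are not reinterpreted by the here-document, since they arrive through an expansion; read is the only thing that treats them as escapes.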

(0006517) geoffclare (manager) 2023-10-05 14:53	Re Note: 0006516 Mea culpa. I have never seen backslash used to quote IFS characters in the input to read, so I have always assumed it doesn't work. I should have tested it before posting my note, so thank you for putting me right.

(0006519) kre (reporter) 2023-10-05 15:38	No problem. I also finally looked here, and saw that I had forgotten (yet again) that less-than b greater-than doesn't produce those 3 chars in mantis, and after a few attempts I couldn't see how to fix it. So I just changed the data (by editing the note - not rerunning) from 'b' to 'x' for what the b variable is set to.

(0006524) chet_ramey (reporter) 2023-10-06 21:41	Re: kre's observations in 6513: that's pretty clearly a bug in bash, and the intent is to assign the variables `late'. I'll fix it.

(0006526) geoffclare (manager) 2023-10-10 14:36	Suggested changes ... On page 3291 line 111859 section read, change: By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of ... to: If the -r option is not specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of a following <backslash> and shall prevent a following byte (if any) from being used to split fields, with the exception of ... On page 3291 line 111869 section read, replace the paragraph beginning The terminating logical line delimiter and continuing down past the three bullet points to line 11878 (inclusive) with the following: The terminating logical line delimiter (if any) shall be removed from the input. Then if the shell variable IFS (see [xref XCU 2.5.3]) is set, and its value is an empty string, the resulting data shall be assigned to the variable named by the first var operand, and the variables named by other var operands (if any) shall be set to the empty string. No other processing shall be performed in this case. If IFS is unset, or is set to any non-empty value, then a modified version of the field splitting algorithm specified in [xref XCU 2.6.5] shall be applied, with the modifications as follows: The input to the algorithm shall be the logical line (minus terminating delimiter) that was read from standard input, and shall be considered as a single initial field, all of which resulted from expansions, with any escaped byte and the preceding <backslash> escape character treated as if they were the result of a quoted expansion, and all other bytes treated as if they were the results of unquoted expansions. 
The loop over the contents of that initial field shall cease when either the input is empty or n output fields have been generated, where n is one less than the number of var operands passed to the read utility.
Any remaining input in the original field being processed shall be returned to the read utility ``unsplit''; that is, unmodified except that any leading or trailing IFS white space, as defined in [xref to XCU 2.6.5], shall be removed.
The specified var operands shall be processed in the order they appear on the command line, and the output fields generated by the field splitting algorithm shall be used in the order they were generated, by repeating the following checks until neither is true:
* If more than one var operand is yet to be processed and one or more output fields are yet to be used, the variable named by the first unprocessed var operand shall be assigned the value of the first unused output field.
* If exactly one var operand is yet to be processed and there was some remaining unsplit input returned from the modified field splitting algorithm, the variable named by the unprocessed var operand shall be assigned the unsplit input.
If there are still one or more unprocessed var operands, each of the variables named by those operands shall be assigned an empty string.
Note that in the case where just one var operand is given on the read command line, the modified field splitting algorithm ceases after producing zero output fields and simply returns the original input field, with any leading and trailing IFS white space removed, as unsplit input. This unsplit input is assigned to the variable named by the var operand.
On page 3292 line 111900 section read (OPERANDS), after: The name of an existing or nonexisting shell variable. append: If a var operand names the variable IFS, the behavior is unspecified.
On page 3292 line 111902 section read (STDIN), change: If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes. to: If the -d delim option is not specified, or if it is specified and delim is not the null string, the standard input shall contain zero or more bytes (which need not form valid characters) and shall not contain any null bytes. On page 3293 line 111959 section read (APPLICATION USAGE), change: When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable ... to: When reading a pathname it is inadvisable ... After page 3293 line 111967 section read (APPLICATION USAGE), add: Since the var operands are processed in the order specified on the command line, if any variable name is specified more than once as a var operand, the last assignment made is the one that is in effect when read returns, including when an empty string is assigned because no field data was available.
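A short sketch of how the proposed assignment rules play out when IFS holds only IFS white space. The inputs and variable names are illustrative, and cases with a trailing non-white-space delimiter are deliberately left out.

```shell
input='  one two   three   four  '

# Three var operands, so n = 2: two output fields are generated, and the
# remainder is handed back unsplit with only its leading and trailing
# IFS white space removed.
IFS=' ' read -r a b rest <<EOF
$input
EOF
printf 'a=[%s] b=[%s] rest=[%s]\n' "$a" "$b" "$rest"

# More var operands than data: the unprocessed operands are assigned
# the empty string.
IFS=' ' read -r p q r <<'EOF'
only
EOF
printf 'p=[%s] q=[%s] r=[%s]\n' "$p" "$q" "$r"
```

The blanks between three and four survive in rest because that part of the input is never split.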

(0006527) kre (reporter) 2023-10-10 17:19	That (Note: 0006526) mostly looks OK, but there is one minor nit... The change at line 111900 (unspecified what happens if IFS is one of the vars) probably should also mention the locale vars (or at least LANG, LC_CTYPE and LC_ALL) as changes to those can affect how field splitting interprets what is in IFS.

(0006528) geoffclare (manager) 2023-10-11 08:04	> changes to those can affect how field splitting interprets what is in IFS The word "can" there is doing a lot of work. Given this text from the resolution of bug 0001649 ... It is implementation defined whether other white-space characters which appear in the value of IFS are also considered as "IFS white space". The three characters above specified as IFS white-space bytes are always IFS white space, when they occur in the value of IFS, regardless of whether they are white-space characters in any relevant locale. For other locale specific white-space characters allowed by the implementation it is unspecified whether the character is considered as IFS white space if it is white space at the time it is assigned to the IFS variable, or if it is white space at the time field splitting occurs (the locale may have changed between those events). ... perhaps something like the following would work: If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what, if any, effect the change has on how read performs field splitting.

(0006529) kre (reporter) 2023-10-11 17:13	Re: Note: 0006528 It is more than working out what is IFS white space - since IFS is defined to contain characters, a change to the encoding (switching from UTF-8 to 8859-1 for example) can alter what characters appear there, a 2 byte encoded UTF-8 character might now be treated as 2 characters instead. Any change to whichever of those 3 locale vars actually applies (changes to ones that are overridden by another of them are obviously harmless) has the potential to affect the results that field splitting produces - if the vars actually change while read is still splitting the fields. Your text doesn't forbid that (whereas mine did) - and I understand that, as several shells work that way, but the effect is that changes to anything which can affect the results of field splitting needs to result in unspecified behaviour (as implementations aren't required to have the vars updates immediately a field is delimited either).

(0006530) geoffclare (manager) 2023-10-12 08:30	
> changes to anything which can affect the results of field splitting
> needs to result in unspecified behaviour
I agree, but the difficulty is how to describe, precisely, which changes qualify for this treatment. We should not make the behaviour unspecified in cases where field splitting would definitely not be affected, for example:
* Any locale change if IFS is empty, unset, or has its default value of <space><tab><newline>.
* A locale change that does not change the encoding and would not change which characters in IFS are considered to be IFS white space.
* Changes to LANG or LC_CTYPE that do not change the locale (e.g. if LC_ALL is set).
Here's my previous attempt with an added condition related to encoding: If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change how the bytes in IFS form characters or which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what, if any, effect the change has on how read performs field splitting.

(0006531) kre (reporter) 2023-10-12 18:33	Re Note: 0006530 For the 3 bullet points, OK for the first, the second would be impossible to explain accurately in any way almost anyone could comprehend, and for the third, yes, of course. But: We should not make the behaviour unspecified in cases where field splitting would definitely not be affected Normally I'd agree with the sentiment behind that, less unspecified stuff is better, when there isn't a very good reason for it. But here? Only a demented lunatic would use IFS or one of the locale vars, any of them, including the ones (like LC_TIME) that can't possibly affect field splitting, as the variables used in a read command. I doubt anyone has ever even considered doing that in the 44 years (or whatever it is) that Bourne type shells have existed, until I did (and you can draw whatever conclusions you like about my mental state for attempting it.) I had to fix a bug in the NetBSD shell (it still exists in FreeBSD's) to get sane results from more complex tests using IFS than I showed here. Chet has acknowledged a bug in bash in this area. All of these shells are very old, and as best I can tell, no-one has ever reported any bug in this area about any of them. The NetBSD one would have been detected by a use after free type sanitizer, if anyone had used one and thought of trying with it "read IFS xxx" (needed at least one more var after IFS, and data read to assign to it, to trigger, was only externally visible with at least 2 more, and did not occur at all if IFS was initially unset, or of course if it was initially set to "", as nothing is split in that case). So, in this case I don't think it would do any harm to simply make doing a read into any of these vars be unspecified (even add in 100% irrelevant ones like OPTIND etc if desired). 
After all, if someone really wanted to do:
    read LC_CTYPE
all they need do instead is
    read var && LC_CTYPE=$var
(or use ; instead of && if it is OK to set LC_CTYPE to "" in the case that nothing is read, which the simple "read LC_CTYPE" would do). It wouldn't exactly be a huge imposition. Better application code would be to validate the value read before using it for something that has consequences, of course; doing that demands the use of an intermediate variable. So I'd just make the line 111900 change say: If a var operand names any of the variables IFS, LC_ALL, LC_CTYPE, or LANG, the behavior is unspecified. That's clear, can be understood by anyone, and in practice, affects no-one, and doesn't even require shells with bugs in this area to fix them, unless perhaps they cause crashes, and none I have encountered do that - just strange results from read. (Personally I'd spell it behaviour, but never mind! No expl'n req'd.)
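The intermediate-variable approach can be sketched as a small function. Everything here other than read and LC_CTYPE is an assumption made for illustration: the function name set_ctype_from_stdin, the scratch variable candidate, and in particular the validation pattern, which is just a crude character check and not a real test that the locale exists.

```shell
# Hypothetical helper: read a locale name into a scratch variable,
# validate it, and only then assign LC_CTYPE.
set_ctype_from_stdin() {
    IFS= read -r candidate || return 1
    case $candidate in
        '' | *[!A-Za-z0-9_.@-]*) return 1 ;;   # reject empty or odd-looking names
    esac
    LC_CTYPE=$candidate
}

# Usage: the here-document keeps the function in the current shell.
set_ctype_from_stdin <<'EOF' && printf 'LC_CTYPE=%s\n' "$LC_CTYPE"
C
EOF
```

Because read never names LC_CTYPE as an operand, the question of what happens when a locale variable changes during field splitting does not arise for this code.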

(0006532) kre (reporter) 2023-10-12 18:54	One more thing, in Note: 0006510, I ended with the following paragraph:
    While I'm here, the current text also doesn't say what to do if there's an error making one of those assignments - certainly read exits with a status > 1 in that case, that's clear, but is read expected to go ahead and assign to the other variables, or may it abort processing as soon as the error is detected? That's unstated as well as best I can tell. As a hint: not all shells do the same thing here, there are actually 3 different behaviours I have detected. Further, not all shells do the >1 exit status either: some exit 1 in this case, and for some the shell (or subshell) exits (treating this as a variable assignment error and doing as specified by table 2.8.1, I presume).
As best I can see, the proposed resolution in Note: 0006526 does not address this issue. We should be fixing this one as well.

(0006533) geoffclare (manager) 2023-10-16 09:02 edited on: 2023-10-16 09:03
> Only a demented lunatic would use IFS or one of the locale vars, any of them,
> including the ones (like LC_TIME) that can't possibly affect field splitting,
> as the variables used in a read command.
I agree for IFS but not for locale vars. I can imagine situations in which an application writer might choose to do that, and the choice would be reasonable. For example, suppose an application has an installation script that needs to validate format strings in a file structured as pairs of lines with the first of each pair containing a locale name and the second containing the format string for that locale. This could be done using code like the following:
    unset LANG LC_ALL LC_CTYPE ...
    export LANG=C
    while IFS="" read -r LANG
    do
        ... validate $LANG ...
        IFS="" read -r format || ...
        ... validate $format ...
    done < file
> if someone really wanted to do:
>     read LC_CTYPE
> all they need do instead is
>     read var && LC_CTYPE=$var
True, but why force them to do that if there's no need? Doesn't my suggestion at the end of Note: 0006530 adequately distinguish cases that need to be unspecified from those that don't?

(0006534) geoffclare (manager) 2023-10-16 09:19	> the current text also doesn't say what to do if there's an error making one of those assignments If there's a problem here, it should be the subject of a separate bug as it isn't related to field splitting, which is what this bug is about. Currently this is what draft 3 has to say: An error in setting any variable (such as if a var has previously been marked readonly) shall be considered an error of read processing, and shall result in a return value greater than one. It is silent about when read exits (immediately or after processing later operands) so both behaviours are allowed.
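What a given shell actually does when read cannot assign to a var can be observed with a small sketch like the one below. It is only an illustration: the function name try_readonly_read and the variables ro and other are invented here, and the body runs in a subshell so that a shell which exits on the error does not take the calling script with it. The only thing the sketch relies on is that the failure yields a non-zero status; the draft text quoted above asks for a value greater than one, but the notes in this bug report that not every shell does that.

```shell
#!/bin/sh
# Observe how the local shell's read reacts to a readonly var operand.
# The function body is a subshell: some shells exit on this error rather
# than returning from read, and that must not end the calling script.
try_readonly_read() (
    readonly ro=fixed
    read -r ro other <<'EOF'
new value
EOF
)

status=0
try_readonly_read 2>/dev/null || status=$?

# Only "non-zero" is relied upon; the exact value differs between shells.
printf 'read (or the subshell running it) returned status %s\n' "$status"
```

Whether later var operands (other, in this sketch) still get assigned cannot be seen from the exit status alone, which matches the observation above that the text allows both behaviours.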

(0006537) kre (reporter) 2023-10-16 21:43	Re Note: 0006533 I can imagine situations in which an application writer might choose to do that, and the choice would be reasonable. I cannot, or not in any case where using IFS would not also be acceptable (perhaps even more acceptable than altering the locale vars). In the case in the example you gave IFS="" read -r LANG there is no problem, as field splitting never happens in this case, and so changes of locale cannot generate unspecified results, one could also do IFS= ; read -r IFS where I made a separate assignment to IFS, rather than using a var-assign, purely because the effects of a var-assign, and an assignment to the same variable in a non-special builtin are not very well defined (many shells will simply reset the var-assigned variable to its value in the shell env from before the command was executed, nullifying any change the builtin might have made to it) - that issue however has nothing to do with the current one (in this example returning IFS to some other more appropriate value for the commands that follow is assumed to be done elsewhere if required). Similarly the -r is irrelevant to this, whether \ escaping (and line joining) happens is also immaterial. Where things get weird is in a case like: unset IFS read [-r] LANG var1 var2 LANG var3 var4 Doesn't my suggestion at the end of Note: 0006530 adequately distinguish cases I have two issues with it. First it is too complex, and very few people are going to be able to work out just when it applies, and worse, as a script writer, how am I supposed to know whether the user supplied data being read is going to: change how the bytes in IFS form characters or which characters in IFS are considered to be IFS white space ?? If the script writer knew what value was in the data, they wouldn't use read to set it, they'd just do LANG=xxx and be done. 
Beyond that, the Note: 0006526 text doesn't specify whether vars are assigned as the fields are split, or after all the field splitting has completed - which I understand, as shells exist which do it both ways, so users cannot rely upon it doing anything useful at all for field splitting during the read command, nor can they assume it won't (even if the script writer somehow believes that the conditions in the Note: 0006530 text are met, and so no effect upon the field splitting results was planned).
Last, for this, this technique
    while IFS="" read -r LANG do .. validate $LANG ...
is OK for most variables, where in the time between when the var is set by read, and the time it is validated, we know the var is not going to be used (as it is only used when expanded in the script). For the variables the shell uses itself, this is a horrible, and very unreliable technique, and should never even be hinted at as acceptable practice. The "validate $LANG" (unspecified there how it is done) is likely to use some comparisons using test, or pattern matching in a case statement - for which the value of LANG can affect the results (different collating sequences, different char orders for [a-z] type expressions - lots of possible variations). The only safe way to update any of the variables the shell actually uses as part of its processing (PATH, IFS, all the locale vars, OPTIND, etc), based upon externally supplied input (however it is supplied, as a command line arg, via read, from the results of a command substitution which is just returning user data (like $(sed 1q < file)) or anything else) is to validate the data first, and if it is OK, assign it to the variable afterwards.
So: True, but why force them to do that if there's no need? There is a need. In your example, as you wrote it, what LC_MESSAGES file is going to be used for any message from the "validate $LANG" code, if that discovers there is a problem?
I suppose the script is supposed to reset things to LANG=C in that case. But then we'd have lost the value that we consider faulty, which we'd want to include in the error message, so we'd need to have another variable and do BADLANG=$LANG LANG=C printf whatever ... "${BADLANG}" ... so in practice, we're saving nothing, the extra var is still needed, we're making the validation process almost impossible to code, and making the language in the standard almost incomprehensible, and all for what purpose?
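The validate-first approach argued for in this note might look something like the following sketch. Everything named in it is hypothetical: the function set_lang_from_line, the intermediate variable candidate, and the two-entry allow-list are stand-ins chosen so that the check does not itself depend on the locale (no bracket ranges, no collation). A real script would substitute its own validation.

```shell
#!/bin/sh
# Sketch: read into an intermediate variable, validate, then assign.
# The allow-list is a placeholder, not a recommendation.

set_lang_from_line() {
    # Read one line from standard input without field splitting.
    IFS= read -r candidate || return 1
    case $candidate in
        C|POSIX)
            LANG=$candidate
            ;;
        *)
            # LANG is untouched here, so the message is produced in the
            # old locale and the rejected value is still available.
            printf 'rejected locale name: %s\n' "$candidate" >&2
            return 1
            ;;
    esac
}

LANG=C

first_status=0
set_lang_from_line <<'EOF' || first_status=$?
POSIX
EOF
first_lang=$LANG

second_status=0
set_lang_from_line 2>/dev/null <<'EOF' || second_status=$?
../../etc/passwd
EOF
second_lang=$LANG

printf 'first: status=%s LANG=%s\n' "$first_status" "$first_lang"
printf 'second: status=%s LANG=%s\n' "$second_status" "$second_lang"
```

Because the bad value never reaches LANG, no separate BADLANG copy is needed for the error message, which is the saving the note describes.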

(0006538) kre (reporter) 2023-10-16 22:10	Re Note: 0006534 OK (though the practice of rolling in other changes that are noticed when a bug is being fixed is not all that uncommon here) - 0001779 has now been filed for that issue.

(0006543) geoffclare (manager) 2023-10-17 13:34
> as a script writer, how am I supposed to know whether the user supplied data being read is going to:
>     change how the bytes in IFS form characters or which characters in IFS are
>     considered to be IFS white space
> ??
Fair point. Here's a new attempt that doesn't try to cover the second bullet point from Note: 0006530 (you "okayed" the other bullet points):
    If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL, the new value assigned to the variable would change the current locale, and IFS contains any characters other than <space>, <tab>, and <newline>, it is unspecified what, if any, effect the change has on how read performs field splitting.

(0006547) kre (reporter) 2023-10-18 00:08	Re Note: 0006543 This proposed new text has the same problem as (one of the ones) in the previous version: the new value assigned to the variable would change the current locale again, that's unknowable to the script writer, so their only possible response is to not read directly into whichever of those vars you imagine that they're using (LANG in the example from your previous note). It truly is not worth attempting to make this as specified as possible, there is no rational safe way to do a read into IFS, PATH, any of the locale vars (and more) directly - ones like PATH don't need mention here, as using that (or anything similar) would be perfectly defined for how read processes it, but that doesn't make it sane for an application to do (and perhaps the APPLICATION USAGE section should make that point) - vars which immediately affect how the shell works need to be validated before being altered if the input comes from outside the script itself. That's just sensible programming. So, I still claim there is no possible point attempting to find language to do other than simply specify that any attempt to read into any var whose value is (or ever could be) used as part of the read processing produces unspecified results (at least in cases when read processes the input, if you want to exempt the case when IFS="" then that's fine - that's simple enough - I'd also need to think about \ processing when -r isn't used, not for line joining, that's simple enough, but for removing it afterwards if the IFS='' read LANG has managed to make a change such that the code point that represents \ is no longer the same after the read as it was before, since even though no field splitting happens, the escape chars still need to be removed, and it would be truly weird, if the char that enabled lines to be joined, and the one later removed, were not the same one).

(0006548) geoffclare (manager) 2023-10-19 14:05
> the new value assigned to the variable would change the current locale
>
> again, that's unknowable to the script writer
Under certain circumstances that might be true, e.g. if the script just runs with inherited environment variables, but the script writer can certainly control whether the new value assigned to the variable would change the current locale. For example:
    export LC_ALL=C
    read LANG < file
is guaranteed not to change it.
My concern is that there might be scripts in existence which, for some strange reason, pass LANG or LC_CTYPE (or even LC_ALL) as an operand to read in circumstances where there is no reason to make the behaviour unspecified. In which case, doing so would unnecessarily affect the conformance of those scripts.

(0006550) kre (reporter) 2023-10-19 16:33 edited on: 2023-10-19 16:37	Re Note: 0006548 I mentioned earlier (not sure which note it was) that a change to an env var which is being overridden by a higher priority one would be OK, but to specify just that the language would have to be much more specific than your last attempt, rather than "would change the current locale" it would need to say something more like "is guaranteed to be unable to alter the current locale" (and you could add that example somewhere in the application usage showing one example of how to make that guarantee if that would be useful). WRT: My concern is that there might be scripts in existence [...] When making something unspecified here (as distinct from when deciding to specify some particular behaviour) this isn't an issue we need to worry about at all. Explicitly making something unspecified is our way of telling implementers that they do not need to change whatever they currently do. If a script writer wants a script that will work on any POSIX compat shell, they need to worry about this, but if someone has a script that is currently working in the environment in which they use it, that they are using unspecified behaviour is irrelevant - it works, there's no reason (from here) for the implementation they're using to alter, so it will just keep on working. But now, if they were to try to move the script to a different environment, where things currently (and will continue to) work differently, we would be avoiding arguments between users and implementations about "your shell is broken because my script works fine with this other shell" since the "your shell" maintainers can simply point to the standard to show that the script is using something not guaranteed to work. Note that the alternative to all of this concern about particular vars being unspecified to use with read would have been to simply specify late binding of the vars. 
All field splitting first, then all assignments after that, as I did in my original suggestion. But that one would have required some implementations to alter to remain conformant, and that might have broken some scripts (as unlikely as it seems, as there are really just these few variables where it makes any difference) in which case your concern would have been valid. So I'd still suggest opting for maximum simplicity here, along with maximum clarity, and minimum conditional "unspecified unless" type complexity. There really is no harm to that.

(0006592) nick (manager) 2023-11-27 17:29
On page 3291 line 111859 section read, change:
    By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of ...
to:
    If the -r option is not specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of a following <backslash> and shall prevent a following byte (if any) from being used to split fields, with the exception of ...
On page 3291 line 111869 section read, replace the paragraph beginning
    The terminating logical line delimiter
and continuing down past the three bullet points to line 111878 (inclusive) with the following:
    The terminating logical line delimiter (if any) shall be removed from the input. Then if the shell variable IFS (see [xref XCU 2.5.3]) is set, and its value is an empty string, the resulting data shall be assigned to the variable named by the first var operand, and the variables named by other var operands (if any) shall be set to the empty string. No other processing shall be performed in this case.
    If IFS is unset, or is set to any non-empty value, then a modified version of the field splitting algorithm specified in [xref XCU 2.6.5] shall be applied, with the modifications as follows:
      * The input to the algorithm shall be the logical line (minus terminating delimiter) that was read from standard input, and shall be considered as a single initial field, all of which resulted from expansions, with any escaped byte and the preceding <backslash> escape character treated as if they were the result of a quoted expansion, and all other bytes treated as if they were the results of unquoted expansions.
      * The loop over the contents of that initial field shall cease when either the input is empty or n output fields have been generated, where n is one less than the number of var operands passed to the read utility.
      * Any remaining input in the original field being processed shall be returned to the read utility ``unsplit''; that is, unmodified except that any leading or trailing IFS white space, as defined in [xref to XCU 2.6.5], shall be removed.
    The specified var operands shall be processed in the order they appear on the command line, and the output fields generated by the field splitting algorithm shall be used in the order they were generated, by repeating the following checks until neither is true:
      * If more than one var operand is yet to be processed and one or more output fields are yet to be used, the variable named by the first unprocessed var operand shall be assigned the value of the first unused output field.
      * If exactly one var operand is yet to be processed and there was some remaining unsplit input returned from the modified field splitting algorithm, the variable named by the unprocessed var operand shall be assigned the unsplit input.
    If there are still one or more unprocessed var operands, each of the variables named by those operands shall be assigned an empty string.
    Note that in the case where just one var operand is given on the read command line, the modified field splitting algorithm ceases after producing zero output fields and simply returns the original input field, with any leading and trailing IFS white space removed, as unsplit input. This unsplit input is assigned to the variable named by the var operand.
On page 3292 line 111900 section read (OPERANDS), after:
    The name of an existing or nonexisting shell variable.
append:
    If a var operand names the variable IFS, the behavior is unspecified. If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change how the bytes in IFS form characters, or which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what effects, if any, the change has on how read performs field splitting.
On page 3292 line 111902 section read (STDIN), change:
    If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes.
to:
    If the -d delim option is not specified, or if it is specified and delim is not the null string, the standard input shall contain zero or more bytes (which need not form valid characters) and shall not contain any null bytes.
On page 3293 line 111959 section read (APPLICATION USAGE), change:
    When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable ...
to:
    When reading a pathname it is inadvisable ...
After page 3293 line 111967 section read (APPLICATION USAGE), add:
    Since the var operands are processed in the order specified on the command line, if any variable name is specified more than once as a var operand, the last assignment made is the one that is in effect when read returns, including when an empty string is assigned because no field data was available.
On page 3293, lines 111979-111980, change:
    Since read affects the current shell execution environment, it is generally provided as a shell regular built-in.
to:
    Since read affects the current shell execution environment, it is required to be intrinsic.
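The effect of the accepted wording can be illustrated with a few simple cases. This is a sketch, not a conformance test: the input strings and variable names are invented for the example, it sticks to inputs where the shells discussed in this bug are not reported to differ (default IFS, IFS=:, and empty IFS, with no trailing non-white-space delimiters), and it avoids IFS and the locale variables as var operands since those are left unspecified.

```shell
#!/bin/sh
# Sketch of the read field splitting rules on uncontroversial inputs.

unset IFS                      # unset IFS acts like <space><tab><newline>
line='  one two   three four  '

# Three var operands: n = 2 fields are split off, the rest stays unsplit
# apart from losing its leading and trailing IFS white space.
read -r a b c <<EOF
$line
EOF
printf 'a=[%s] b=[%s] c=[%s]\n' "$a" "$b" "$c"  # a=[one] b=[two] c=[three four]

# One var operand: zero fields are split, only the outer white space goes.
read -r only <<EOF
$line
EOF
printf 'only=[%s]\n' "$only"                    # only=[one two   three four]

# Empty IFS: the whole line goes to the first var, the others become empty.
IFS= read -r whole rest <<EOF
$line
EOF
printf 'whole=[%s] rest=[%s]\n' "$whole" "$rest"

# Non-white-space delimiter.
IFS=: read -r user pw tail <<'EOF'
root:x:0:0
EOF
printf 'user=[%s] pw=[%s] tail=[%s]\n' "$user" "$pw" "$tail"  # tail=[0:0]

# Without -r, an escaped space does not split; with -r, it does.
read x y <<'EOF'
one\ two three
EOF
read -r xr yr <<'EOF'
one\ two three
EOF
printf 'x=[%s] y=[%s] xr=[%s] yr=[%s]\n' "$x" "$y" "$xr" "$yr"

# A name given twice: the last assignment is the one left in effect.
read -r v w v <<'EOF'
1 2 3
EOF
printf 'v=[%s] w=[%s]\n' "$v" "$w"              # v=[3] w=[2]
```

Here-documents are used instead of pipelines so that read runs in the current shell environment and the assigned variables remain visible afterwards.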

Issue History
Date Modified	Username	Field	Change
2023-10-02 13:58	kre	New Issue
2023-10-02 13:58	kre	Name	=> Robert Elz
2023-10-02 13:58	kre	Section	=> XCU 3/read
2023-10-02 13:58	kre	Page Number	=> 3291-3294
2023-10-02 13:58	kre	Line Number	=> 111869-111878, 111961-111963, 11946, 11979-11980
2023-10-02 14:00	kre	Note Added: 0006500
2023-10-02 14:02	kre	Note Edited: 0006500
2023-10-02 14:20	kre	Note Added: 0006502
2023-10-02 14:22	kre	Note Edited: 0006502
2023-10-02 14:24	kre	Note Edited: 0006502
2023-10-02 14:30	kre	Note Added: 0006503
2023-10-02 14:33	kre	Note Edited: 0006502
2023-10-02 14:34	kre	Note Edited: 0006502
2023-10-02 14:44	kre	Note Edited: 0006500
2023-10-02 16:18	geoffclare	Project	1003.1(2013)/Issue7+TC1 => Issue 8 drafts
2023-10-02 16:19	geoffclare	version	=> Draft 3
2023-10-02 16:20	geoffclare	Note Added: 0006507
2023-10-02 16:21	geoffclare	Relationship added	related to 0001649
2023-10-03 09:23	geoffclare	Note Added: 0006509
2023-10-03 11:59	kre	Note Added: 0006510
2023-10-03 12:11	kre	Note Added: 0006511
2023-10-03 13:42	geoffclare	Note Added: 0006512
2023-10-04 14:42	kre	Note Added: 0006513
2023-10-04 17:36	kre	Note Added: 0006514
2023-10-05 08:59	geoffclare	Note Added: 0006515
2023-10-05 11:08	kre	Note Added: 0006516
2023-10-05 14:53	geoffclare	Note Added: 0006517
2023-10-05 15:34	kre	Note Edited: 0006516
2023-10-05 15:35	kre	Note Edited: 0006516
2023-10-05 15:36	kre	Note Edited: 0006516
2023-10-05 15:38	kre	Note Added: 0006519
2023-10-05 15:41	kre	Note Edited: 0006513
2023-10-06 07:58	geoffclare	Note Edited: 0006515
2023-10-06 21:41	chet_ramey	Note Added: 0006524
2023-10-10 11:06	geoffclare	Note Edited: 0006515
2023-10-10 14:36	geoffclare	Note Added: 0006526
2023-10-10 17:19	kre	Note Added: 0006527
2023-10-11 08:04	geoffclare	Note Added: 0006528
2023-10-11 17:13	kre	Note Added: 0006529
2023-10-12 08:30	geoffclare	Note Added: 0006530
2023-10-12 18:33	kre	Note Added: 0006531
2023-10-12 18:45	kre	Note Edited: 0006516
2023-10-12 18:54	kre	Note Added: 0006532
2023-10-16 09:02	geoffclare	Note Added: 0006533
2023-10-16 09:03	geoffclare	Note Edited: 0006533
2023-10-16 09:19	geoffclare	Note Added: 0006534
2023-10-16 21:43	kre	Note Added: 0006537
2023-10-16 22:10	kre	Note Added: 0006538
2023-10-17 13:34	geoffclare	Note Added: 0006543
2023-10-18 00:08	kre	Note Added: 0006547
2023-10-19 14:05	geoffclare	Note Added: 0006548
2023-10-19 16:33	kre	Note Added: 0006550
2023-10-19 16:37	kre	Note Edited: 0006550
2023-11-27 17:29	nick	Note Added: 0006592
2023-11-27 17:30	nick	Final Accepted Text	=> See Note: 0006592
2023-11-27 17:30	nick	Status	New => Resolved
2023-11-27 17:30	nick	Resolution	Open => Accepted As Marked
2023-11-27 17:31	nick	Tag Attached: issue8
2023-11-30 16:18	nick	Relationship added	child of 0001779
2023-11-30 16:20	nick	Relationship replaced	parent of 0001779
2023-12-07 14:20	geoffclare	Status	Resolved => Applied
2023-12-07 14:20	geoffclare	Tag Attached: applied_after_i8d3
2024-06-11 09:12	agadmin	Status	Applied => Closed
