0000249: Add standard support for $'...' in shell

Notes
(0000548) nick (manager) 2010-09-16 16:17	The C standard also includes Universal Character constants, and these should also be considered.
6.4.3 Universal character names
Syntax
1 universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
Constraints
2 A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Description
3 Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.
Semantics
4 The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).

(0000560) geoffclare (manager) 2010-10-01 12:48	So far, discussion of this bug has been about which escapes to include. There are some problems with other aspects of the proposed text that need to be sorted out. The ones I have spotted are:
• Use of "A word beginning" is wrong, as the expansion can be used within a word. E.g. a$'b'c expands to abc.
• The sentence beginning "The expanded result is treated as if it was single-quoted" needs to use "shall", and the rest of that sentence should either be removed or reworked.
• This addition to 2.2.3 Double-Quotes is not right: On page 2298, after line 72385, add: Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4. It is not right because ksh and bash do not expand $'...' within double quotes. A possible alternative might be to append the following to line 72376: The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]).
• Finally, an additional change is needed in 2.3 Token Recognition in list item 4.
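The two quoting behaviours described in this note can be checked in a shell that already implements $'...'. The following is a minimal sketch, assuming bash (the notes in this report say ksh93 and mksh behave the same way); the variable names are illustrative only.

```shell
# Assumes bash; $'...' is not yet required by the standard.
word=a$'b'c        # $'...' used in the middle of a word
quoted="a$'b'c"    # inside double quotes the sequence is left as is

printf '%s\n' "$word"     # abc
printf '%s\n' "$quoted"   # a$'b'c
```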

(0000565) mirabilos (reporter) 2010-10-08 17:28	The following documents existing practice for $'…' in mksh, called “C style” there to distinguish from the printf builtin’s escape sequence parsing (where, for example, \123 is invalid and must be written as \0123).
Supported escapes: \a \b \c# \E \e \f \n \r \t \v \### \U######## \u#### \x#… \' \\
Notes on some:
• \c# where # is ? will yield 0x7F, every other octet is ANDed with 0x1F (using non-ASCII chars is undefined behaviour as I plan to let mksh use wide characters internally in $some $large time)
• \E and \e yield 0x1B
• \### looks for one, two or three octal digits and converts to octet, stops if encountering a non-digit octet
• \U and \u look for up to eight/four hexadecimal digits and convert to UTF-8 (actually, mksh only supports the BMP though; may as well be CESU-8)
• \x in “C style” mode gobbles up as many hexadecimal characters as it can (doesn’t just eat one or two) stopping at a non-digit octet and converts to octets if <= 0xFF, to UTF-8 (see above) otherwise
• \' allows this to work: x=a$'\''b
• \\ should be self-explanatory
Every other escaped character \X is replaced with itself X (although the number of supported escape sequences may vary, so it’s in practice UD too). (The print builtin mode excludes e.g. \? from this for compatibility reasons.)
This means we have the following differences:
• \E is new (from GNU bash… although why POSIX (GNU’s Not Unix) would look at it is beyond me), which counts as UD, so is ok
• \x doesn’t terminate after two hex digits; I’d like to make this a user error and standardise only for the case where the third octet after \x is NOT in the range [0-9a-fA-F] for the sake of compatibility
• \u supports manual creation of surrogates in CESU-8 encoding, they’re not forbidden
• \U for values outside the BMP is undesirable in mksh (but this can be viewed as limitation of the operating environment; while mksh doesn’t use the OS’ implementation it shares MirBSD’s which uses an encoding based on CESU-8 internally and has a 16-bit wchar_t)
• in practice, \u/\U > 0xFFFD yield 0xFFFD (0xFFFE and 0xFFFF are invalid anyway)
• since this escape method is more generic than identifiers, all values from 0x0000 (inclusive) and up are supported (this may be desirable)
In mksh, a$'b'c expands to abc and "a$'b'c" expands to a$'b'c as suggested above as well. There’s a sort of testsuite at https://www.mirbsd.org/cvs.cgi/src/bin/mksh/check.t?rev=HEAD – search for “dollar-quoted-strings” and follow it down, including “dollar-quotes-in-heredocs” (use <<$'…' delimiters) and “dollar-quotes-in-herestrings” (for the <<< extension). These are somewhat modelled after ksh93 while looking at GNU bash and zsh, with some real-life testing added.
---
So, to summarise, I’d like to get the following as implementation-defined behaviour: when the sequence backslash 'x' is followed by two octets in the range [0-9a-fA-F], the behaviour if the next (third) octet is also in that range. The rest is merely for consideration; it might be wise to not limit the u and U escapes to characters valid for identifiers. From a lesson learned with JSON, it should be explicitly stated that the octet following the backslash character is matched case-sensitively. (Do we assume ASCII here?)
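As an illustration of the two escapes singled out above, here is a small sketch, assuming bash and an ASCII-compatible codeset. It shows the \' escape from the note and one way to sidestep the \x ambiguity by ending the quoting after two hexadecimal digits. The variable names are made up for the example.

```shell
# Assumes bash or another shell with $'...' support, and ASCII.
x=a$'\''b          # \' inside $'...' yields a literal single-quote
y=$'\x41'1         # close the quoting after two hex digits, then a literal 1

printf '%s\n' "$x"   # a'b
printf '%s\n' "$y"   # A1
```

Writing the trailing digit outside the quotes keeps the result the same whether the shell stops after two hexadecimal digits or keeps consuming them.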

(0000590) Don Cragun (manager) 2010-10-25 06:17 edited on: 2015-08-20 15:25	NB: this note has been revised following the 2015-08-20 conference call:
The following is a complete replacement for the Desired Action to be taken to resolve this issue. It is based on other notes in this bug report, numerous e-mail messages on the austin-group-l alias, and discussions held during Austin Group conference calls:
In XCU section 2.2 (Quoting) change: "The various quoting mechanisms are the escape character, single-quotes, and double-quotes." on P2298, L72359 to: "The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes."
Add the following new paragraph to XCU section 2.2.3 (Double-Quotes) after P2298, L72385 (indented to the same level as the paragraph on L72382-72385): "The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4])."
In XCU 2.2.3 (Double-Quotes) change: "A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence" on P2299, L72392-72393 to: "A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence"
At the end of XCU section 2.2 after P2299, L72401 add a new subsection, with this text:
2.2.4 Dollar-Single-Quotes
A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote ($') shall preserve the literal value of all characters up to an unescaped terminating single-quote ('), with the exception of certain backslash escape sequences, as follows:
\" yields a <quotation-mark> (double-quote) character.
\' yields an <apostrophe> (single-quote) character.
\\ yields a <backslash> character.
\a yields an <alert> character.
\b yields a <backspace> character.
\e yields an <ESC> character.
\f yields a <form-feed> character.
\n yields a <newline> character.
\r yields a <carriage-return> character.
\t yields a <tab> character.
\v yields a <vertical-tab> character.
\cX yields the control character listed in the Value column of Table 4.21 on page 3203 (xref to XCU Table 4.21) in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped.
\uXXXX yields the character specified by ISO/IEC 10646 where XXXX is two to four hexadecimal digits (with leading zeros supplied for missing digits) whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX).
\UXXXXXXXX yields the character specified by ISO/IEC 10646 where XXXXXXXX is two to eight hexadecimal digits (with leading zeros supplied for missing digits) whose eight-digit short universal character name is XXXXXXXX.
\xXX yields the byte whose value is the hexadecimal value XX (one or two hex digits).
\XXX yields the byte whose value is the octal value XXX (one to three octal digits).
If a \xXX or \XXX escape sequence yields a byte whose value is 0, it is unspecified whether that nul byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded.
If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 0020 (<space>), the results are undefined.
If more than two hexadecimal digits immediately follow \x or if the octal value specified by \XXX will not fit in a byte, the results are unspecified.
If a backslash is followed by any sequence not listed above or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character that is not present in the current locale, the result is implementation-defined.
If a backslash escape sequence represents a single-quote character (for example \u0027, \U00000027, or \'), that sequence shall not terminate the dollar-single-quote sequence.
In XCU section 2.3 (Token Recognition) point 4, change: "If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotationmark> and the end of the quoted text. The token shall not be delimited by the end of the quoted field." on P2299-2300, L72425-72432 to: "If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field."
In XCU section 2.6, add: * a <single-quote> to the end of the list created by the changes identified in Mantis BugID 49. (BugID 49 modifies text on P2305, L72669-72671 in XCU7.)
In XCU 2.6.3 (Command Substitution) change: "A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence produces undefined results." on P2309, L72827-72829 to: "A quoted string that begins, but does not end, within the "`...`" sequence produces undefined results."
In XCU section 2.6.7 (Quote Removal), change: "The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted." on P2311, L72904-72905 to: "The quote character sequence <dollar-sign> single-quote and the characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted." In the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1, change: "Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words." on P2325, L73436-73437 to: "Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words." Add a new section in XRAT following P3649, L124068: <italic>C.2.2.4 Dollar-Single-Quotes</italic> The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the C Standard with the following exceptions: 1. The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X) 2. The \c escape sequence is not included in the C standard. There was also some disagreement in shells that historically supported \c escape sequences in $'...'. These include: A. whether \cA through \cZ produced the byte values 1 through 26, respectively or supported the codeset independent control character as specified by the stty utility. 
This standard requires codeset independence. B. whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively. This standard requires support for all of the control characters except NUL (matching what is done in the stty utility). C. whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent. The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard. Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). This standard leaves this behavior implementation-defined. 3. The \e escape sequence is not included in the C Standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use. 4. The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard. In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote to be evaluated and discarded. 
The latter (which was historic practice in bash, but not in ksh93) allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token. For example: printf a$'b\0c\''d is required by this standard to produce: abd while historic versions of ksh93 produced: ab 5. The \uXXXX and \UXXXXXXXX escape sequences must have exactly 4 and 8 hexadecimal digits, respectively, in C. In $'...' in the shell, fewer digits are accepted. If the character following either form is itself a hexadecimal digit, the application shall ensure that the full 4 or 8 character width ( using leading zeros as necessary ) is used to disambiguate the interpretation. 6. The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence. In C, if a character specified by \uXXXX or \UXXXXXXXX in "..." does not exist in the execution character set, the results are implementation-defined. Similarly, this standard makes the results implementation-defined if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that is not present in the current locale. Note that the escape sequences recognized by $'...', file format notation (see Table 5-1 on page 121), XSI-conforming implementations of the echo utility (see the echo utility's operands section on page 2615), and the printf utility's format operand (see the printf utility's extended description section on page 3050) are not the same. Some escape sequences are not recognized by all of the above, the \c escape sequence in echo is not at all like the \c escape sequence in $'...', octal escape sequences in some of the above accept one to four octal digits and require a leading zero while others accept one to three octal digits and do not require a leading zero.
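A few of the escape sequences from the proposed section 2.2.4 can be spot-checked against printf octal escapes. This is only a sketch, assuming bash and an ASCII-compatible codeset; the variable names are hypothetical, and the byte values given for \cA and \e hold for ASCII rather than being required in every codeset.

```shell
# Assumes bash and an ASCII-compatible codeset.
tab=$'\t'
esc=$'\e'
ctrl_a=$'\cA'
oct=$'\101'        # octal 101 is "A" in ASCII
hex=$'\x41'        # hexadecimal 41 is also "A"
quotes=$'\"\'\\'   # double-quote, single-quote, backslash

[ "$tab" = "$(printf '\t')" ] && echo "tab ok"
[ "$esc" = "$(printf '\033')" ] && echo "esc ok"
[ "$ctrl_a" = "$(printf '\001')" ] && echo "ctrl-A ok"
[ "$oct$hex" = "AA" ] && echo "octal and hex ok"
printf '%s\n' "$quotes"
```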

(0000599) nick (manager) 2010-11-04 16:06	The issue of adding \cX and \e was discussed at the WG14 meeting. A paper to add these character constants was requested and submitted (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm). However, since this paper was in response to a late liaison statement, objection to considering it was raised. The committee agreed that POSIX was free to use these characters, and encouraged resubmission of N1534 as a liaison ballot comment, probably reworded to be an update to the future directions section. WG14 has indicated that POSIX use of the "Future Standardization" is satisfactory.

(0000601) Don Cragun (manager) 2010-11-05 03:00 edited on: 2010-11-05 03:04	Note: 0000590 was updated today to account for another difference in the way bash and ksh93 processed $'...'. If an escape sequence expanded to a null byte, bash terminated the $'...' expansion, but still concatenated any following characters that appeared after the $'...' expansion that were in the same lexical token. On the other hand, ksh93 terminated the entire token in this case. The ksh93 developers agreed to change to match bash in this case if the standard requires this behavior. Also note that if the C1x standard adopts the changes suggested in the document referenced in Note: 0000599, the "\e, " in the last paragraph of the newly created 2.2.4 section can be removed (since the <ESC> character would be moved into the basic character set in C1X). If this happens we should also consider adding \e to XBD Table 5-1 and checking the places in the standard that refer to that table to determine whether \e processing should also be required in those places.
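The bash behaviour described in this note can be observed with a short sketch. It assumes bash specifically; as the note explains, other shells have historically differed, so the result should not be relied on elsewhere. The variable name is illustrative.

```shell
# Assumes bash: the NUL and the rest of the $'...' contents are dropped,
# but characters after the closing quote stay in the same token.
joined=a$'b\0c'd
printf '%s\n' "$joined"   # abd in bash
```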

(0000609) nick (manager) 2010-11-05 14:52	This issue was presented to WG14 (C). The initial reaction was to request a paper to add \cX and \e as character constants (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm, attached). This paper was subsequently procedurally rejected as it was late and made fundamental differences, which might also impact C++. Instead, the paper was revised to propose adding these characters to the Future Directions section, and a liaison comment on the CD ballot to ask for this would be appropriate. The WG14 committee agreed that there was no problem in POSIX adding these characters to the shell, and did not object to the use of the namespace.

(0002793) nick (manager) 2015-08-20 15:31 edited on: 2015-08-20 15:33	Note: 0000590 was updated today to allow for greater flexibility in the number of digits in a \uXXXX and \UXXXXXXXX sequence and other minor tweaks to reflect existing practice. The detailed changes recorded in the minutes are:
Proposed changes to bugnote #590:
Change: If a \xXX or \XXX escape sequence yields a byte whose value is 0, that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote shall be evaluated and discarded.
to: If a \xXX or \XXX escape sequence yields a byte whose value is 0, it is unspecified whether that nul byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded.
Change: \uXXXX yields the character specified by ISO/IEC 10646 whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX). \UXXXXXXXX yields the character specified by ISO/IEC 10646 whose eight-digit short universal character name is XXXXXXXX.
to: \uXXXX yields the character specified by ISO/IEC 10646 where XXXX is two to four hexadecimal digits (with leading zeros supplied for missing digits) whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX). \UXXXXXXXX yields the character specified by ISO/IEC 10646 where XXXXXXXX is two to eight hexadecimal digits (with leading zeros supplied for missing digits) whose eight-digit short universal character name is XXXXXXXX.
Change: If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), the results are undefined.
to: If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 0020 (<space>), the results are undefined.
Change: In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell, however, this standard requires the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. (This was historic practice in bash, but not in ksh93.) This allows ... to: In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows ... Change: 5. The double-quote (") character can be used ... to: 5. The \uXXXX and \UXXXXXXXX escape sequences must have exactly 4 and 8 hexadecimal digits, respectively, in C. In $'...' in the shell, fewer digits are accepted. If the character following either form is itself a hexadecimal digit, the application shall ensure that the full 4 or 8 character width ( using leading zeros as necessary ) is used to disambiguate the interpretation. 6. The double-quote (") character can be used ...
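The digit-count rule in the new rationale item 5 can be illustrated with a short sketch. It assumes bash 4.2 or later (where \u accepts one to four hexadecimal digits) and uses only an ASCII code point so that the result does not depend on the locale; the variable names are made up for the example.

```shell
# Assumes bash 4.2 or later.
short=$'\u41'         # fewer than four digits: leading zeros are supplied
full=$'\u0041BC'      # full four digits, then the literal characters B and C

printf '%s\n' "$short"   # A
printf '%s\n' "$full"    # ABC
```

Had the second assignment been written as $'\u41BC', all four characters after \u would be hexadecimal digits and would be read as one code point, which is why the full width is needed there.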

(0002809) rhansen (manager) 2015-09-03 16:40 edited on: 2015-09-10 20:27	The following is a work in progress, discussed during the 2015-09-03 telecon. The changes from Note: 0000590 are:
• allow the \u and \U escape sequences to have 1 hexadecimal digit
• refer to ISO/IEC 6429 for characters \u0 through \u1f (and \U0 through \U1F)
• change "yields the character specified by" to "yields the character named by"
• delete the paragraph that says that the results of characters less than 0020 (<space>) are undefined
• change "or if ... yields a character that is not present in the current locale" to "or if ... specifies a character that does not have an encoding in the locale in effect during token recognition"
• add a new paragraph to the end of the new rationale section
At the end of the telecon, we were still discussing whether the behavior of characters not in the current locale should be more specified than just implementation-defined.
Note: The bug report is against the original version of Issue 7, but the following changes are against Issue 7 with TC1 integrated (page and line numbers refer to Issue 7 with TC1).
At page 2320 line 73594 (XCU section 2.2, Quoting) change: The various quoting mechanisms are the escape character, single-quotes, and double-quotes. to: The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes.
After page 2320 line 73620 (XCU section 2.2.3, Double-Quotes) insert a new paragraph indented to the same level as the paragraph at lines 73617-73620: The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]).
At page 2321 lines 73626-73627 (XCU 2.2.3, Double-Quotes), change: A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence to: A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence
After page 2321 line 73635 (end of XCU section 2.2), insert a new subsection:
2.2.4 Dollar-Single-Quotes
A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote ($') shall preserve the literal value of all characters up to an unescaped terminating single-quote ('), with the exception of certain backslash escape sequences, as follows:
<tt>\"</tt> yields a <quotation-mark> (double-quote) character.
<tt>\'</tt> yields an <apostrophe> (single-quote) character.
<tt>\\</tt> yields a <backslash> character.
<tt>\a</tt> yields an <alert> character.
<tt>\b</tt> yields a <backspace> character.
<tt>\e</tt> yields an <ESC> character.
<tt>\f</tt> yields a <form-feed> character.
<tt>\n</tt> yields a <newline> character.
<tt>\r</tt> yields a <carriage-return> character.
<tt>\t</tt> yields a <tab> character.
<tt>\v</tt> yields a <vertical-tab> character.
<tt>\cX</tt> yields the control character listed in the Value column of [xref to XCU Table 4.21] in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped.
<tt>\uXXXX</tt> yields the character named by ISO/IEC 10646 or, for \u0 to \u1f, by ISO/IEC 6429 where XXXX is one to four hexadecimal digits (with leading zeros supplied for missing digits) whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX).
<tt>\UXXXXXXXX</tt> yields the character named by ISO/IEC 10646 or, for \U0 to \U1f, by ISO/IEC 6429 where XXXXXXXX is one to eight hexadecimal digits (with leading zeros supplied for missing digits) whose eight-digit short universal character name is XXXXXXXX.
<tt>\xXX</tt> yields the byte whose value is the hexadecimal value XX (one or two hex digits).
<tt>\XXX</tt> yields the byte whose value is the octal value XXX (one to three octal digits).
If a \xXX or \XXX escape sequence yields a byte whose value is 0, it is unspecified whether that nul byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded.
If more than two hexadecimal digits immediately follow \x or if the octal value specified by \XXX will not fit in a byte, the results are unspecified.
If a backslash is followed by any sequence not listed above or if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that does not have an encoding in the locale in effect during token recognition, the result is implementation-defined.
If a backslash escape sequence represents a single-quote character (for example \u0027, \U00000027, or \'), that sequence shall not terminate the dollar-single-quote sequence.
At page 2321 lines 73658-73664 (XCU section 2.3 (Token Recognition) point 4), change: 4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotationmark> and the end of the quoted text. 
The token shall not be delimited by the end of the quoted field. to: 4. If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in [xref to Section 2.2]. During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field. After page 2327 line 73900 (XCU section 2.6, Word Expansions), insert a new bullet point: a <single-quote> At page 2331 lines 74071-74073 (XCU 2.6.3, Command Substitution), change: A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence produces undefined results. to: A quoted string that begins, but does not end, within the "`...`" sequence produces undefined results. At page 2333 lines 74157-74158 (XCU section 2.6.7, Quote Removal), change: The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted. to: The quote character sequence <dollar-sign> single-quote and the characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted. At page 2348 lines 74718-74719 (the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1), change: Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words. 
to: Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words. After page 3677 line 125685 (XRAT C.2.2), insert a new subsection: C.2.2.4 Dollar-Single-Quotes The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the C Standard with the following exceptions: The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X) The \c escape sequence is not included in the C standard. There was also some disagreement in shells that historically supported \c escape sequences in $'...'. These include: whether \cA through \cZ produced the byte values 1 through 26, respectively or supported the codeset independent control character as specified by the stty utility. This standard requires codeset independence. whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively. This standard requires support for all of the control characters except NUL (matching what is done in the stty utility). whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent. The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard. Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). 
This standard leaves this behavior implementation-defined. The \e escape sequence is not included in the C Standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use. The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard. In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token. For example: printf a$'b\0c\''d is required by this standard to produce: abd while historic versions of ksh93 produced: ab The \uXXXX and \UXXXXXXXX escape sequences must have exactly 4 and 8 hexadecimal digits, respectively, in C. In $'...' in the shell, fewer digits are accepted. If the character following either form is itself a hexadecimal digit, the application shall ensure that the full 4 or 8 character width ( using leading zeros as necessary ) is used to disambiguate the interpretation. 
The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence. In C, if a character specified by \uXXXX or \UXXXXXXXX in "..." does not exist in the execution character set, the results are implementation-defined. Similarly, this standard makes the results implementation-defined if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that is not present in the current locale. Note that the escape sequences recognized by $'...', file format notation (see [xref to Table 5-1]), XSI-conforming implementations of the echo utility (see the echo utility's operands section on [xref to echo]), and the printf utility's format operand (see the printf utility's extended description on [xref to printf]) are not the same. Some escape sequences are not recognized by all of the above, the \c escape sequence in echo is not at all like the \c escape sequence in $'...', octal escape sequences in some of the above accept one to four octal digits and require a leading zero while others accept one to three octal digits and do not require a leading zero. Some historical implementations of $'...' incorrectly translated \uXXXX sequences into UTF-8 encoded characters even when the current locale does not use UTF-8 as its codeset. When a Unicode character specified by \uXXXX names a character that is present in the codeset in the current locale, that character's encoding in the current locale's character set must be used, not the UTF-8 encoding of that character. If the named character is not in the current locale's character set, the result is implementation-defined.
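Editorial illustration of the last paragraph: the sketch below shows that one character has different byte encodings in different codesets, which is why emitting UTF-8 bytes unconditionally would be wrong in a non-UTF-8 locale. It assumes an iconv utility that accepts the codeset names UTF-8 and ISO-8859-1; the helper name hex_bytes is hypothetical and not part of the proposed text.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes read from stdin as hex, no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# U+00E9 (e with acute accent) encoded in UTF-8 is the two bytes C3 A9.
utf8=$(printf '\303\251' | hex_bytes)

# The same character converted to ISO-8859-1 is the single byte E9.
latin1=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes)

printf 'UTF-8:      %s\n' "$utf8"
printf 'ISO-8859-1: %s\n' "$latin1"
```

Under the proposed wording, a shell in an ISO-8859-1 locale would be expected to produce the second form for $'\u00e9'.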

(0002824) rhansen (manager) 2015-09-10 20:25 edited on: 2015-09-10 20:26	We made a bit more progress on this issue during today's teleconference. For the latest version of the proposed wording, see: http://posix.rhansen.org:9001/p/bug249 (see the emailed minutes for the username and password) Feel free to make changes, but please highlight your changes (e.g., use strikeout when deleting and color the changes red).

(0002893) steffen (reporter) 2015-11-09 21:10	For the record i'm appending an item of mine from the mailing-list (austin-group-l:archive/latest/23158, 20150930122537.COnKa3bTi%sdaoden@yandex.com): the ISO C11 draft i have states in 6.4.3, Universal character names: The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn [.] And even though ISO C forbids for unknown reasons \u65\u301, it still supports ISO 10646 for the remains, which includes that characters for which no precomposed character is available a combination of multiple successive \u sequences must naturally be usable. Simply because that is ISO 10646. Furthermore ISO C _explicitly_ forbids many specific code ranges for identifiers, and thus _implicitly_ allows them for other use cases -- isn't that the usual way to treat it? Therefore back to my initial message, i think the standard should today say a word on wether software is allowed to perform the normalization and show replacement characters for final graphem boundaries instead of individual bytes or "characters". [...] I suggested that software which is capable to understand that \u65\u301 can actually be composed to \ue9 can do so if possible in the locale or if replacement has to be performed. Looking at the above i don't know wether that is the right approach for a shell, but if we end up with something that is visual to the user (even isatty(STDOUT_FILENO)) then i think it is a replacement just as valid (maybe not for the shell). Regarding \[Uu] in general: this escape carries an ISO 10646 character losslessly around whatever transport to the user, where it is to be transformed to something in the users current locale. \u65\u301 is \ue9 in ISO-8859-1 a.k.a. LATIN1. Period. 
For other locales similar composition examples can be assumed, the special thing about LATIN1 in this regard being simply that it has been chosen for the first 256 entries of Unicode / ISO 10646, likely because its first 128 entries are compatible to US-ASCII. It is one of the typical ISO C annoyances that the ASCII range plus C1 controls can't be used in \[Uu], real-life extensions exist. My opinion: wether software is capable to perform normalization and composition of \u65\u301 or not, thus ending up with \ue9 a.k.a. "é", or otherwise "e?", "e\u0301" or any other replacement is purely based upon its level of Unicode / ISO 10646 compatibility / compliance. It has nothing at all to do with \u65\u301, which _is nothing but \ue9 in LATIN1_. [...] This was a followup of (austin-group-l:archive/latest/23150, 20150929092823.nia8y-PmH%sdaoden@yandex.com) in which i wrote: it thus follows that those characters which have no precomposed code point in Unicode must be representable, since they may have a mapping in a given locale (even though i would assume that Unicode actually contains precomposed code points for characters in most, if not all well-known locales). Unicode (6.3) section 5.6, Normalization: Compared to the number of possible combinations, only a relatively small number of pre-composed base character plus nonspacing marks have independent Unicode character values. And more specific examples, e.g., section 8.2, Arabic: A basic Arabic letter plus any of these types of marks is never encoded as a separate, precomposed character, but must always be represented as a sequence of letter plus combining mark.. I'm pretty sure that a linguist or Unicode specialist would know more such examples (if Arabic by itself would not be sufficient; [.]

(0002922) shware_systems (reporter) 2015-11-13 19:19	Re:2893 From Chapter 2, p.12, of Unicode v7: "Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard." Such compositions, of combining sequence characters, are therefore left the province of final consumers with extended facilities for processing them, such as a font driver or Unicode-aware hardcopy output devices. Unicode provides code points that indicate which types of combining may be desired but no consumer of those characters is required to honor them. These types of devices are common, but still a subset of the devices supported by the General Terminal interface, directly or via emulation, so for POSIX those aspects of Unicode are optional. For compatibility with the POSIX locale, only the code points for precomposed Latn-script glyphs that use left to right advancing are required. It is not the province of intermediary filtering utilities such as the shell to attempt such combining conversions unilaterally given the range of devices currently allowed. Historically, where preprocessing of a character sequence is required because a system operator knows a device does not have sufficient capabilities to represent a glyph expected of a sequence, and for that operator using a supported replacement character is acceptable, this has been accommodated by iconv() as a situational elective, not a locale based requirement on implementations for character element processing. For backwards compatibility, code points for additional scripts may be compatible with the User Portability Option Group and be part of an extended repertoire. 
The remaining scripts, including the example Arabic, have to be part of a new Option Group that is a superset of the current User Portability option, as I see it, and will never be a candidate for the POSIX base because of the specialized device support required to get portably consistent results. It's certainly an XSI possibility for Issue 9, but the reference implementation of such a group for Issue 8 hasn't been done yet. The pieces exist, spread across various libraries like icu and freetype; a standardizable grouping of them that is suitably documented doesn't, that I'm aware of, that respects the POSIX extensions model as it exists.

(0002923) steffen (reporter) 2015-11-13 21:22	@2922: Glyphs..are the responsibility of individual font Yes - if you enter a decomposed Unicode graphem that spans multiple Unicode code points, each of which encoded in UTF-8, then it is up to the terminal / -driver to perform the composition to the glyph that is displayed. It is not the province of intermediary filtering utilities such as the shell to attempt such combining conversions The problem is when the \u sequence is to be displayed in a locale that is not Unicode aware, that requires conversion of the \u sequence. And ISO 10646 that POSIX refers to doesn't contain all characters in precomposed form. The remaining scripts, including the example Arabic.. will never be a candidate for the POSIX base.. specialized device support required.. Well it is a combination of base letters and combining marks. I think iconv(3) has to be extended to be able to deal with that fact when converting from Unicode data to some other character set in which there is a precomposed character that corresponds to a given decomposed sequence. Stéphane has given the example of his name on the mailing list several times: $'St\u65\u301phane'. This is representable in LATIN1, correct conversion provided. What else should be correct in LATIN1? "Ste?phane" can't be it, i'd say. But this issue is about $'' support in the shell, not iconv(3) nor any other improved text manipulation, including drop of strxfrm. I was suggesting that the forthcoming standard text says a word about character composition for \u sequences in $'' and Unicode graphem boundaries in general. Because ISO 10646 doesn't offer precomposed entries for all possible characters --- not because code points are yet unassigned or due to private range extensions, but by design of ISO 10646, already today. 
And therefore, if a utility that works in the LC_CTYPE locale offers the possibility to enter ISO 10646 / Unicode data -- which this issue is about to standardize for the shell via the $'' construct and the \Uu escape sequence --, it should be able to deal with it and try its best to ensure that the input ends up in the best possible LC_CTYPE form. The standard should include wording that heads towards this goal. (I don't think it is a show stopper that there is no string interface that deals with real language strings, because i think on the long run iconv(3) has to be capable to perform proper conversion, as above, and i guess that is therefore the natural choice for \uU since there is no wctomb_l(). And even if the existence of a Unicode locale would be a guess, and even if not then wchar_t is transparent and not necessarily in UTF-32. C11 brings c32rtomb(), but i fail to see how this helps regarding Unicode. It will, however and most likely, ensure that the iconv(3) available in the C library supports the Unicode character sets. Yay.)

(0002925) steffen (reporter) 2015-11-13 21:24	Erm, I said: "Yes - if you enter a decomposed Unicode grapheme that spans multiple Unicode code points, each of which encoded in UTF-8, then it is up to the terminal / -driver to perform the composition to the glyph that is displayed." I meant: if you are in a UTF-8 aware locale. Of course.

(0002929) shware_systems (reporter) 2015-11-13 23:23	No, it does not require conversion. The entire U0300 block is illegal code points for a basic ISO-8859-1 based locale, and most other single_byte locales. An extended device that reports to the implementation that it does handle the U0300 block in a way that substitutes a precomposed Unicode glyph code point might be usable with an extended locale, as part of that new Option Group, which would also use 8859-1 as its single byte encoding. Then the substitution of \uA9 as a character element would have a direct map to \xA9 as a code point. Such processing requires the device support this capabilities reporting, which precludes both regular and character special file types being used with those strings. Adding the capabilities to these types requires changes to all existing implementations, afaik. Adding an Option Group doesn't. Conversely, such a substitution should not be made if the device advertises it handles the U0300 block by the backspace/overprint method. What glyph is assigned to \uA9 may have an entirely different appearance than produced after overstrike. Yes, it would be silly for a font to do this, but it's not precluded either. If the shell substitutes \uA9 anyways, the results would be confusing visually, at least. I agree where appropriate such mappings are desirable. I see appropriate as being in a separate shell work alike utility compiled to require the availability of the new group, not the sh utility directly.

(0002942) steffen (reporter) 2015-11-14 14:26	No, it does not require conversion. It should have the option to do so. The entire U0300 block is illegal code points for a basic ISO-8859-1 based locale Performing a composition step may result in a valid one. Looking back it roots in english centric problem solutions, surely also caused by the resource restrictions of earlier computers (even relative given the power of average desktop computers say in 1999, only sixteen years ago). But other languages exist, and character encodings with state machines existed decades ago already to overcome the technologic shortcomings of the single-track=english environments, so that other people from different cultures (which still existed back then) had a chance to take part at modern technologies and keep up with the speed of the world trade etc. (Which, as a side note, i personally don't understand because of its negative side effects. But that's how sapiens sapiens acts. He doesn't like falling behind.) And to what extend is Unicode with its graphem boundaries different to those encodings with state machines? I would say it is because of the fact that this standard was designed with an integral, with a holistic point of view from the very beginning, in order to change that. So more than two decades ago the ISO 10646 / Unicode effort started and that is where it ended up: not every "user-receivable character" has its own codepoint, but multiple such may combine with exactly defined rules to form such a "graphem". If that graphem happens to have a representation in the charset of the currently active locale then a sensitive application will want to use that very representation, instead of using replacement characters of whatever form. You cannot do anything sensitive except iconv(3) anyway, since the POSIX standard doesn't offer any other interface. And ISO C11 comes with a function but which surely is backed by the very same mechanism that iconv(3) uses internally. 
But wording should ensure that multiple adjacent (at least \uU) escape sequences may together form a complete ISO 10646 character sequence. That is ISO 10646. Those passed in conjunction to a truly Unicode-aware iconv(3) should return a normalized composed variant. I would think too many actual uses of iconv(3) in the wild prevent If the input buffer ends with an incomplete character or shift sequence, conversion shall stop after the previous successfully converted bytes. to be extended to handle the case that a valid character at EOS is seen but which _may_ combine if further input is fed in. So this is what you refer to? Do you want me to open an issue to add niconv{,_open,_flush,_close}() that requires a _flush(), or without _flush() but with a niconv() requirement of a final call with inbuflen=0 to signal "EOF", before the output is truly complete? For the future even POSIX / ISO C software needs to be capable to deal with base characters plus combining marks. An extended niconv may be an easy way to upgrade a lot of otherwise charset agnostic software in the most minimal intrusive way possible.

(0003247) steffen (reporter) 2016-06-06 21:07	Re @2809 (the proposed text):
| The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard.
I think this should also be noted for \u and \U, since POSIX doesn't clamp the allowed range of Unicode / ISO 10646 characters, and therefore \u can yield any possible control character that \x and \0 can yield. P.S.: currently implementing such a thing, it.. doesn't amuse me.. that variable expansion hasn't been included in $''. We now have four (4!) different escape mechanisms with four (4!) different sets of rules, and not a single mechanism offers all the possibilities. I would assume this is a nightmare for any normal user, and anyway it ends up nearly as cryptic as C format strings, though maybe not just as dangerous when misused. I will implement a \$ escape that can be used to perform variable expansion. That way a user can end up using single-quotes for literal input and dollar-single-quotes for anything else.

(0003248) shware_systems (reporter) 2016-06-07 19:50 edited on: 2016-06-07 19:53	The \u and \U escapes for 1..1f may be mapped in the current locale to values other than 1 to 31, depending on what code point, single or multiple byte, the charmap assigns the referenced ISO-6429 functionality, if it supports the function to begin with. While few charmaps are expected to do any mapping of this nature, this possibility is why they weren't included. Such mapping does not occur with the \cX, \XXX or \xXX forms. Variable and other expansions were not included because sequences like $'text'$var$'end' do the same thing due to string concatenation while tokenizing. Treating $'..' sequences as a form of expansion inside double quotes was also discussed, but consensus on how to make the behavior portable couldn't be reached. At most, a character or two of source input might be saved, at the cost of an inordinate increase in the complexity of implementing it in existing $'' code, in my opinion. Folding the new escapes into double or single quotes was also discussed, so a new quoting form wasn't required, but it was felt this would break too many existing scripts that expect backslashes to have their current literal or escaping usage.

(0003249) steffen (reporter) 2016-06-07 20:25	Re @3248: what has that to do with \u0 or \U0? As you know well Unicode[0..256] = ISO 8859-1[0..127] = US-ASCII. I was thinking it is nothing but an oversight that \000 = \x00 = \u0, yet wording is missing.
| Variable and other expansions were not included because sequences like $'text'$var$'end'
Oh yes. Nice! Terribly cryptic. Since $'' adds expansion of backslash escape sequences i have added a \$VAR (and \${VAR}) extension. Really, this is C and sh(1), not brainfuck or something. Sorry. Btw. i've defined the implementation-defined \c@ a.k.a. NUL to mean the same as printf(1)s \c, since we have two (or, with \u0, three) incarnations of possibilities to end processing of the current quote, but nothing which extends beyond that, and i wonder how much time resources i should spend for implementing this. Granted: printf(1) i don't need.
| Folding the new escapes into double or single quotes was also discussed
Ok, it is understood that this wasn't possible to not break backward compatibility. Since at least bash(1) already added $"" with a meaning that i have not looked deeper into we've reached the end of the road. Yay.

(0003250) steffen (reporter) 2016-06-07 20:27	Re @3249:
| As you know well Unicode[0..256] = ISO 8859-1[0..127] = US-ASCII.
That should read 0..255, not 0..256.

(0003251) shware_systems (reporter) 2016-06-08 06:51	Unicode[0..31] is Unassigned, with the C0 group of ISO-6429 being the recommended, not mandatory, usage assignments for interchange. This applies to UTF8 also. Everyone on the calls agreed this was the case. POSIX via C requires the \0 code point have a constant usage that differs markedly from ISO-6429, so it is special to begin with and why I didn't include it in the range. Sorry, I could have made that clearer. The other differences in definitions are relatively inconsequential. With Unicode having it as unassigned it was felt the standard should leave open a future version of Unicode might assign it a third interpretation as required usage, which may or may not map to a code point of existing charmaps. It is not exempt from ever being mapped; that is simply the expected ongoing practice. So the full range is \U0..31. ISO-8859-1 just defines the G1 group [160..255] as a 96-char encoding; it defers to, and by convention most interchange uses, ISO-646-IRV as the default G0 group but any registered 94-char encoding used with it is technically a conforming variant. Variant announcement via ISO-2022 ESC sequences is expected if the default isn't used, and for alternate C0 and C1 groups than ISO-6429. POSIX currently assumes the announcement(s) is encoded in the locale label somehow and an application has some unspecified out of band means for getting the relevant label when context changes to use with setlocale(). The minimal C0 group usable as an alternate, per ISO-6429 deferring to ISO-2022, only requires 3 byte values have specific assignments, however, and \0 isn't one of them. So an 8859 variant may require mapping also to convert it to the C locale. The $"..." format was looked at, and tabled as having too many platform-specific variations being possible to warrant inclusion at this time, related to infrastructure aspects the standard has as implementation defined or unspecified. 
I won't belabor which aspects; the expectation is a future bug report will address them.

(0003252) steffen (reporter) 2016-06-08 10:43	Re @3251: i don't know what you are talking about. It is plain untrue. You seem to have become confused about reading ISO/IEC 10646:2012, section 11.

(0003254) shware_systems (reporter) 2016-06-08 19:55	No, I'm not confused. Look at the UCD XML directly; neither 00..1F nor 7F..9F have "na=" long names assigned to the code points. The ISO-6429 names are "na2=" commentary, not normative. For Unicode 8 the relevant text is Chapter 23.1, not 11. The text there surprised some, but all agreed that it did reflect the UCD. The code points are reserved for application usage, and it is up to the application producing Unicode text to use application specific means to communicate what arbitrary assignments it uses those code points for. Saying an application should assume ISO-6429 as the recommended default if it does not implement a compatible receiver of what the producer communicates is a bug that just hasn't bitten many people yet, imo.

(0003256) steffen (reporter) 2016-06-09 13:55	Re ..the proposed text: I object against
| whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent.
and i don't know which current real-life implementation the standard bases this upon.
    ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
    echo a$'\c\b'c
    bash:
    00000000  61 1c 62 63 0a  |a.bc.|
    00000005
    mksh:
    00000000  61 1c 62 63 0a  |a.bc.|
    00000005
    ?0[sdaoden@wales tmp]$
So in left-to-right parsing \c is a function that takes an argument, if that is the backslash \ it is consumed, no need for quoting it. Just as "i" (and meaning the quote) would expect it to be. Maybe the standard core developers ran into a bash bug:
    ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
    echo a$'\c\'c
    bash:
    t.sh: line 1: unexpected EOF while looking for matching `''
    t.sh: line 2: syntax error: unexpected end of file
    mksh:
    00000000  61 1c 63 0a  |a.c.|
    00000004
    ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
    echo a$'\c\\'c
    bash:
    00000000  61 1c 63 0a  |a.c.|
    00000004
    mksh:
    t.sh[2]: no closing quote
but that shouldn't be used as a template for a standard that yet needs to be born. Imho.

(0003257) stephane (reporter) 2016-06-09 14:10 edited on: 2016-06-09 14:26	ksh93 is the shell $'...' comes from (while $'\uxxxx' comes from zsh: http://www.zsh.org/mla/workers/2003/msg00223.html)
    $ ksh93 -c "printf %s $'\cA'" | hd
    00000000  01  |.|
    00000001
    $ ksh93 -c "printf %s $'\c\1'" | hd
    00000000  41  |A|
    00000001
So in ksh93 $'\c\x' seems to be a feature and the reverse of $'\cx'. Or more precisely, $'\cX' toggles bit 6 of the first byte of the expansion (\xxx expansion) of X.
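Editorial illustration: the "toggles bit 6" description can be checked with plain shell arithmetic. This is only a model of the rule as stated in the note, working on decimal byte values; it does not run ksh93, and the function name toggle_bit6 is hypothetical.

```shell
#!/bin/sh
# Model of the rule described above: \cX yields the byte value of X
# with bit 6 (0x40) toggled. Input and output are decimal byte values.
toggle_bit6() {
    echo $(( $1 ^ 0x40 ))
}

ctrl_a=$(toggle_bit6 65)   # 'A' is 0x41 (65); the result is 0x01
letter=$(toggle_bit6 1)    # the byte 0x01; the result is 0x41 ('A')

printf '%s -> %s\n' '\cA'  "$ctrl_a"
printf '%s -> %s\n' '\c\1' "$letter"
```

Because XOR with 0x40 is its own inverse, the same operation maps a letter to its control character and back, which matches the two hd dumps above.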

(0003258) shware_systems (reporter) 2016-06-09 14:15	It may be invention, but the purpose is to allow constructs like \c\u41 as a synonym for \cA, currently, and \c\x01 for A if the converse extension is implemented. Using \\ makes it consistent with the other escapes so lookahead isn't required.

(0003259) steffen (reporter) 2016-06-09 15:07	Re @3257:
| So in ksh93 $'\c\x' seems to be a feature and the reverse of $'\cx'
Nice, but the standard core developers don't allow this but _only_ and _explicitly_ enforce and allow it for \c\\ a.k.a. <FS>. Due to the context shown in @3256 it could be deduced that this desire de facto originates in something that is nothing but a bug, of bash(1) to be exact. (Luckily I just fixed one of my bugs, so i dare: ..that only looked for a single backslash at EOL to quote LF, instead of looking backwards whether the backslash is itself escaped (i.e., if there is a non-even number of backslashes before the LF): just in case you cannot parse left-to-right.)

(0003262) stephane (reporter) 2016-06-09 18:42	Re @3259 yes, from a cursory look, I agree it does look like something like: \c\<anything-other-than-\> -> unspecified is missing from the proposed specification in @590 (as of 2016-06-09) but it's kind-of implied. I would say you're free to implement $'\c\a' as either $'\x48' (as ksh93 does) or as $'\x1f'a or as "invalid control sequence".

(0005221) calestyo (reporter) 2021-01-31 05:38	Is this still under consideration? Would probably be quite useful to have a portable way for such escape sequences.

(0005222) stephane (reporter) 2021-01-31 06:48 edited on: 2021-01-31 18:34	As most of the remaining issues are with $'\uXXXX' and $'\UXXXXXXXX', I would suggest that those be dropped from the specification of $'...' for issue8 for now. That could be revisited in issue9 where we should also consider at the same time:
- adding those in printf format and %b ($'\uXXXX' which comes from zsh in 2003, was actually inspired from the same added to GNU printf in 2000. See https://www.zsh.org/mla/workers/2003/msg00223.html) and echo.
- specifying a C.UTF-8 locale
- more generally specify how and when utilities decode their argument from byte to characters (and the handling of when that fails).

(0005223) calestyo (reporter) 2021-01-31 17:08	It should perhaps not be forgotten that, if a standard fails to standardise things which are apparently needed, implementers will simply go their own way (like they did not only in this case, but also in other examples), in ways which are likely incompatible and/or non-portable, which is in turn obviously what standards are meant to prevent. It's quite clear that a standard cannot and should not jump on every hype train, but this ticket has been open for 10 years now, and there is still no easy/proper way to specify, e.g., Unicode characters or even simple things like <tab>. So hopefully this can be addressed then at least in issue9. :-)

(0005226) dwheeler (reporter) 2021-02-04 21:38	I agree with @stephane. I'd rather have half a loaf than no loaf, and adding this without \u or \U still makes it easier to insert tab, newline, and so on into strings. I'll note that my original proposal from 2010 didn't include proposing \u or \U.

(0005227) dwheeler (reporter) 2021-02-05 16:02	Robert Elz said: > At least once we either drop \u, or properly define how it is supposed > to work (if anyone actually has an idea what that is), $' is entirely the > same as ' once the internal expansions are done (as part of lexical analysis) > so is trivial to add, makes it easier to encode some strings (just easier, > nothing that cannot already be done) and is trivial to implement. Agreed. The goal is to make it easy to do basic things, like handle newline and tab. It's not hard to implement, and it's already implemented by many shells. > ps: unrelated to \$ in $' but while I am here, since I mentioned it above, > in the NetBSD sh, \u (which accepts any number of hex digits up to 4, or > up to 8 for \U, not just exactly 4 or 8, but that's a frill) the interpretation > is that the UTF-8 encoding of the code point specified is embedded in the > string. No more, no less. In particular it is not the shell's job to > validate the UTF sequences so that they make sense, or can rationally be > interpreted as anything at all (that's on the application). They're just > bit patterns. Similarly, since the author of the script cannot be assumed > to know what locale the user running it will have set, converting the \u > sequence to some other locale (while it is still being processed inside the > shell) cannot be correct either. ... > Since there doesn't seem to be to be a lot of > reason any more for anyone not to use non UTF-8 encodings, that would really > mean doing a whole lot of nothing most of the time (hopefully, always). I would be very happy with the interpretation that \u and \U simply generate the UTF-8 encoding, at least for the C locale, POSIX locale, and any locale ending in ".UTF-8". That is easy to implement, and I agree that trying to validate all sequences is absurd; just output it thank you. We could allow shells to convert to the local locale or generate UTF-8 for \u and \U. 
If we allow \u, I suggest ONLY standardizing \u 4-digits and \U 8-digits, and allowing but NOT requiring support for byte 0 in strings. Allowing less-than-4-digits makes it easy to make a mistake by having a hex character "after" a short sequence. But I could certainly live with it either way.
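Editorial illustration of the short-sequence pitfall: the sketch below models a parser that greedily takes up to four hexadecimal digits after \u, as the NetBSD sh behaviour is described above. The function name take_hex and its "digits|rest" output format are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# Split a string into the hex digits a \u escape would consume (at most
# $2 of them, taken greedily) and the text left over after the escape.
# The result is printed as "digits|rest".
take_hex() {
    rest=$1 max=$2 digits=
    while [ "${#digits}" -lt "$max" ]; do
        case $rest in
            [0-9A-Fa-f]*)
                after=${rest#?}                   # everything but the first character
                digits=$digits${rest%"$after"}    # append the first character
                rest=$after
                ;;
            *) break ;;
        esac
    done
    printf '%s|%s\n' "$digits" "$rest"
}

take_hex 'e9cole' 4      # meant as U+00E9 followed by "cole"
take_hex '00e9cole' 4    # zero-padded to the full four digits
```

In the first call the "c" of "cole" is itself a hex digit and is swallowed, so the escape names U+0E9C rather than U+00E9; padding to the full width removes the ambiguity.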

(0005228) dwheeler (reporter) 2021-02-05 16:08	On further reflection: I changed my mind. Robert Elz's proposal is simple and clear. So: If we add \u and \U within $'...' to POSIX, just have it always generate the equivalent UTF-8 (without trying to figure out if it's "legal"). Then the user always knows what it generates. If they want something other than UTF-8, they can use $'...' to generate the UTF-8, and then use some other tool like iconv to convert it (e.g., when printing).
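A minimal sketch of the "always generate UTF-8, no validation" reading described above, written in Python rather than inside a shell. The function name utf8_bytes is hypothetical, and the only check is that the value fits the four-byte form; surrogates and values above 0x10FFFF are encoded as plain bit patterns, which is an assumption about what "without validating" would mean in practice.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode a code point as UTF-8 purely by bit pattern.

    No check is made that the value is a legal Unicode scalar value
    (surrogates such as 0xD800 are encoded like anything else).
    """
    if code_point < 0 or code_point > 0x1FFFFF:
        # The four-byte form carries only 21 bits; larger values are out of
        # scope for this sketch.
        raise ValueError("value does not fit in a four-byte UTF-8 sequence")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For example, utf8_bytes(0x2208) gives the three bytes E2 88 88, regardless of any locale setting.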

(0005229) calestyo (reporter) 2021-02-05 16:42	Maybe it really would be better to leave out \u and \U for the time being? Forcing users with non-UTF-8 encodings to convert UTF-8 output again with some external tool is not much better than forcing them, as now, to produce such characters (or even basic ones like \t) in the first place. Maybe this is too simple-minded, but can't one just leave \u and \U to be Unicode code points, with it being up to the shell to convert them to the current encoding, and with unspecified results if no mapping is possible?

(0005273) mirabilos (reporter) 2021-03-15 20:38	I must say \c\\ irritates me. Otherwise (I cannot figure out how to find “XCU Table 4.21” in the HTML version) it looks like mksh is already compliant (plus supporting \E as alias for \e for GNU bash compatibility). mksh currently defines \cX for any X (I hope this matches the mysterious table I cannot find) as: The sequence “\c%”, where ‘%’ is any octet, translates to Ctrl-%, that is, “\c?” becomes DEL, everything else is bitwise ANDed with 0x9F. With this, I also get FS from \c< or \c\| if I must, but… meh. I also read \cX as eating X no matter what it was, at $'…' parse time. So, what am I supposed to do, change mksh now? So after \c we first have another level of character interpretation and the \c acts only then?
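The mksh rule quoted above ("\c?" becomes DEL, everything else is ANDed with 0x9F) can be sketched as follows. This is a Python illustration of the rule as stated in the note, not mksh source code; the name mksh_ctrl is hypothetical and ASCII octet values are assumed.

```python
def mksh_ctrl(octet: int) -> int:
    """Apply the rule quoted in the note to the octet following \\c."""
    if octet == ord("?"):
        return 0x7F          # "\c?" becomes DEL
    return octet & 0x9F      # everything else: bitwise AND with 0x9F
```

Under this rule "<", "|" and "\" all map to 0x1C (FS), which matches the remark that FS can be obtained from \c< or \c| as well.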

(0005275) geoffclare (manager) 2021-03-16 09:40	In the HTML version, the equivalent of XCU Table 4.21 is "Table: Circumflex Control Characters in stty" on the stty page.

(0005714) calestyo (reporter) 2022-02-25 02:56	Sorry if I state something which has been clarified somewhere already, but from
> If a backslash is followed by any sequence not listed above or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character that is not present in the current locale, the result is implementation-defined.
I'd assume that the general idea is that some escape sequences produce byte sequences for a given character with respect to the current locale (i.e. category LC_CTYPE, as set by shell variables like LANG, LC_CTYPE, LC_ALL), which would make quite some sense, of course. However, the standard should clearly specify which of them are:
- subject to the current locale: \e \cX \uXXXX \UXXXXXXXX => there should be some rationale why these are actually subject to the current locale.
- subject to the "lexical" locale (which is immutable): I'd assume at least \" \' \\ but possibly also any others which POSIX requires to be present in any locale: \a \b \f \t \v; these are invariant anyway: \n \r
- generally not subject to any locale: I'd assume \xXX and \XXX
Further... there are quite some "implementation-defined"s in the above text. Are these really necessary, because there are already differing implementations (for which there is no hope that they'd reconcile)? Something like the "implementation-defined" in the text quoted above just cries out for the problems of tomorrow. One implementation may e.g. simply ignore any \uXXXX that doesn't produce a valid char, another may use some replacement char. People will rely in their scripts on one or the other behaviour, thus we get some non-portability when we leave these things open. At the least one should reserve any not yet standardised escape sequences for the future, i.e. making the results implementation-defined, but not allow compliant implementations to use them for their own purposes (thereby introducing extensions, which again may cause portability issues in the future).

(0005746) calestyo (reporter) 2022-03-14 00:52	Another possible issue with the most recent(?) proposed text in https://www.austingroupbugs.net/view.php?id=249#c2809 [^] ... I'm not an expert, but my understanding was that the quotings are not "resolved" during token recognition ("During token recognition no substitutions shall be actually performed...") but in Quote Removal, right? Previously that said: "The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted." which was enough, because the desired literal string was already the result when any quotes were removed, but now with $'...', this is no longer the case. I miss somehow the step which says that any escape sequences (like \t, \xXX, etc.) have to be replaced depending on their respective definition.

(0005747) calestyo (reporter) 2022-03-14 01:22	Also, if \e \cX \uXXXX \UXXXXXXXX would be standardised as dependent on some locale (and not just always producing UTF-8, which I'm not sure would be a good idea), then the following should be highlighted somewhere for educational purposes. E.g. bash does right now:
$ LC_ALL=en_US.UTF-8
$ foo() { echo $'\u2208' ; }
$ foo
∈
$ LC_ALL=C
$ echo $'\u2208'
\u2208
$ foo
∈
$
However, I'd bet that many people would expect the 2nd invocation of foo to also result in \u2208, or whatever representation U+2208 has in the then current locale.

(0005751) shware_systems (reporter) 2022-03-14 21:22	Re: 5746 According to the Etherpad bug249 page, currently line 316, escaped sequences are converted after alias substitution completes, if needed (in case $' strings are part of an alias body), and before tilde or other expansions of XCU 2.6, with removal of the "$'" and trailing <single-quote> during the Quote Removal phase. At that point the converted text should be treated the same as a single-quoted string that didn't need escape sequences would.

(0005995) geoffclare (manager) 2022-10-18 10:42 edited on: 2022-10-18 10:50	These are the agreed changes from https://posix.rhansen.org/p/bug249 [^] (omitting \uXXXX and \UXXXXXXXX). Page and line numbers are for the 2013 edition (C138.pdf) At page 2319 line 73573 (XCU section 2.1, Shell Introduction, item 4) change: The shell performs various expansions (separately) on different parts of each command, resulting in a list of pathnames and fields to be treated as a command and arguments; see [xref to 2.6].x to: For each word within a command, the shell processes backslash escape sequences inside dollar-single-quotes (see [xref to 2.2.4]) and then performs various word expansions (see [xref to 2.6]). In the case of a simple command, the results usually include a list of pathnames and fields to be treated as a command name and arguments; see [xref to 2.9]. At page 2320 line 73594 (XCU section 2.2, Quoting) change: The various quoting mechanisms are the escape character, single-quotes, and double-quotes. to: The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes. At page 2320 lines 73609-73611 (XCU 2.2.3, Double-Quotes), change: $ The <dollar-sign> shall retain its special meaning introducing parameter expansion (see Section 2.6.2), a form of command substitution (see Section 2.6.3), and arithmetic expansion (see Section 2.6.4). to: $ The <dollar-sign> shall retain its special meaning introducing parameter expansion (see Section 2.6.2), a form of command substitution (see Section 2.6.3), and arithmetic expansion (see Section 2.6.4), but shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]). 
At page 2321 lines 73626-73627 (XCU 2.2.3, Double-Quotes), change: A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence to: A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence After page 2321 line 73635 (end of XCU section 2.2), insert a new subsection: 2.2.4 Dollar-Single-Quotes A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote (<tt>$'</tt>) shall preserve the literal value of all characters up to an unescaped terminating single-quote (<tt>'</tt>), with the exception of certain backslash escape sequences, as follows: <tt>\"</tt> yields a <quotation-mark> (double-quote) character. <tt>\'</tt> yields an <apostrophe> (single-quote) character. <tt>\\</tt> yields a <backslash> character. <tt>\a</tt> yields an <alert> character. <tt>\b</tt> yields a <backspace> character. <tt>\e</tt> yields an <ESC> character. <tt>\f</tt> yields a <form-feed> character. <tt>\n</tt> yields a <newline> character. <tt>\r</tt> yields a <carriage-return> character. <tt>\t</tt> yields a <tab> character. <tt>\v</tt> yields a <vertical-tab> character. <tt>\cX</tt> yields the control character listed in the Value column of [xref to XCU Table 4.21] in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped. <tt>\xXX</tt> yields the byte whose value is the hexadecimal value XX (one or more hex digits). If more than two hex digits follow \x, the results are unspecified. <tt>\ddd</tt> yields the byte whose value is the octal value ddd (one to three octal digits). The behavior of a <backslash> immediately followed by any other character, including <newline>, is unspecified. 
In cases where a variable number of characters can be used to specify an escape sequence (\xXX and \ddd), the escape sequence shall be terminated by the first character that is not of the expected type or, for \ddd sequences, when the maximum number of characters specified has been found, whichever occurs first. These backslash escape sequences shall be processed (replaced with the bytes or characters they yield) immediately prior to word expansion (see [xref to 2.6]) of the word in which the dollar-single-quotes sequence occurs. If a \xXX or \ddd escape sequence yields a byte whose value is 0, it is unspecified whether that null byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded. If the octal value specified by \ddd will not fit in a byte, the results are unspecified. If a \e or \cX escape sequence specifies a character that does not have an encoding in the locale in effect when these backslash escape sequences are processed, the result is implementation-defined. However, implementations shall not replace an unsupported character with bytes that do not form valid characters in that locale's character set. If a backslash escape sequence represents a single-quote character (for example \'), that sequence shall not terminate the dollar-single-quote sequence. At page 2321 lines 73658-73664 (XCU section 2.3 (Token Recognition) point 4), change: 4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). 
During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotationmark> and the end of the quoted text. The token shall not be delimited by the end of the quoted field. to: 4. If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in [xref to Section 2.2]. During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field. After page 2327 line 73900 (XCU section 2.6, Word Expansions), insert a new bullet point: a <single-quote> At page 2331 lines 74071-74073 (XCU 2.6.3, Command Substitution), change: A single-quoted or double-quoted string that begins, but does not end, within the <tt>"`...`"</tt> sequence produces undefined results. to: A quoted string that begins, but does not end, within the <tt>"`...`"</tt> sequence produces undefined results. At page 2333 lines 74157-74158 (XCU section 2.6.7, Quote Removal), change: The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted. to: The quote character sequence <dollar-sign> single-quote and the single-character quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted. 
Note that the single-quote character that terminates a <dollar-sign> single-quote sequence is itself a single-character quote character. Note that after quote removal the shell still remembers which characters were quoted. This is necessary for purposes such as matching patterns in a case conditional construct (see [xref to 2.9.4.3] and [xref to 2.13]). At page 2348 lines 74718-74719 (the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1), change: Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words. to: Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words. After page 3677 line 125685 (end of XRAT C.2.2.3), insert a new paragraph: The $'...' construct does not retain its special meaning inside double quotes. This was discussed by the standard developers and rejected. Note that $'...' is a quoting mechanism and not an expansion. Losing the special meaning inside double quotes is consistent with other quoting mechanisms losing their special meaning when quoted. After the above insertion and before page 3678 line 125686 (XRAT C.2.3), insert a new subsection: C.2.2.4 Dollar-Single-Quotes The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the ISO C standard with the following exceptions: The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X.) 
The \c escape sequence is not included in the ISO C standard. There was also some disagreement in shells that historically supported \c escape sequences in $'...'. These include: whether \cA through \cZ produced the byte values 1 through 26, respectively or supported the codeset independent control character as specified by the stty utility. This standard requires codeset independence. whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively. This standard requires support for all of the control characters except NULL (matching what is done in the stty utility). whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent. The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard. Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). This standard leaves this behavior implementation-defined. The \e escape sequence is not included in the ISO C standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use. The \ddd octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard. 
In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token. For example: printf a$'b\0c\''d is required by this standard to produce: abd while historic versions of ksh93 produced: ab The ISO C standard specifies \uXXXX and \UXXXXXXXX escape sequences. These need not be supported by $'...' in the shell. The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence. A <backslash> immediately followed by a <newline> has unspecified behavior. In C, this sequence is used for line continuations, where both the <backslash> and <newline> are deleted and a diagnostic is required if a closing quote is not encountered before a <newline> that is not preceded by <backslash>. Backslash escape sequences not described in the standard result in unspecified behavior. In C, the result is not a token and a diagnostic is required. This allows shells to recognize other backslash escape sequences in other ways as extensions to the standard's requirements. Furthermore, existing implementations already had different behaviors for some backslash escape sequences when $'...' processing was added to the standard. This standard makes the results implementation-defined if \e or \cX specifies a character that is not present in the current locale. 
Application authors should note that implementations are permitted to have a wide range of behaviors when encountering an unsupported character. For example: the shell might produce an error, possibly causing the shell to terminate the unsupported character might be silently discarded the unsupported character might be replaced with another character of a different character class the unsupported character might be replaced with a shell-special character (e.g., '?') the unsupported character might be replaced with multiple characters, shell-special or regular (e.g. if <ESC> is not supported $'\e' may be replaced by "???", "XXX" or "<ESC>") However, implementations must document their behavior, and they are prohibited from replacing an unsupported character with bytes that do not form valid characters in the current locale's character set (e.g., encoding in UTF-8 when the locale has a 7-bit character set). This standard does not specify a way for script authors to determine beforehand whether a particular \cX sequence specifies a character that exists in the current locale. At the time this feature was standardized, no known implementations provided such a capability. Note that the escape sequences recognized by $'...', file format notation (see [xref to Table 5-1]), XSI-conforming implementations of the echo utility (see the utility's operands section on [xref to echo]), and the printf utility's format operand (see the utility's extended description on [xref to printf]) are not the same. Some escape sequences are not recognized by all of the above, the \c escape sequence in echo is not at all like the \c escape sequence in $'...', octal escape sequences in some of the above accept one to four octal digits and require a leading zero while others accept one to three octal digits and do not require a leading zero.
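To make the escape rules of the proposed section 2.2.4 in the note above easier to follow, here is a rough Python sketch of a decoder for the body of a $'...' string (the text between the opening $' and the terminating single-quote). It is not taken from any shell. The names decode_body, SIMPLE, HEX and OCT are hypothetical, an ASCII-compatible codeset is assumed for the single-letter escapes, \cX is left out because it depends on the stty table, and every case the text leaves unspecified is reported as an error, which is only one of the permitted outcomes. A null byte is kept in the result, again one of the two behaviours allowed.

```python
# Single-character escapes from the proposed list (ASCII values assumed).
SIMPLE = {'"': 0x22, "'": 0x27, "\\": 0x5C, "a": 0x07, "b": 0x08, "e": 0x1B,
          "f": 0x0C, "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B}
HEX = "0123456789abcdefABCDEF"
OCT = "01234567"


def decode_body(body: str) -> bytes:
    """Replace backslash escapes in the body of a $'...' string."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")      # ordinary characters stay literal
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of body")
        nxt = body[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt == "x":
            j = i + 2
            while j < len(body) and j < i + 4 and body[j] in HEX:
                j += 1                     # take at most two hex digits
            if j == i + 2:
                raise ValueError("\\x without hex digits")
            if j < len(body) and body[j] in HEX:
                raise ValueError("more than two hex digits: unspecified")
            out.append(int(body[i + 2:j], 16))
            i = j
        elif nxt in OCT:
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in OCT:
                j += 1                     # take at most three octal digits
            value = int(body[i + 1:j], 8)
            if value > 0xFF:
                raise ValueError("octal value does not fit in a byte: unspecified")
            out.append(value)
            i = j
        elif nxt == "c":
            raise NotImplementedError("\\cX is not covered by this sketch")
        else:
            raise ValueError("backslash followed by other character: unspecified")
    return bytes(out)
```

With this sketch, the body \x41\102 decodes to the two bytes "AB", and \1234 decodes to "S4" because the octal escape stops after three digits.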

(0005997) kre (reporter) 2022-10-19 11:50 edited on: 2022-10-19 11:56	In Note: 0005995 in the proposed new section 2.2.4, the penultimate bullet point is: <tt>\ddd</tt> yields the byte whose value is the octal value ddd (one to three octal digits). Which suggests to me that \567 is required to yield the byte whose octal value is 0567, which is something of a challenge. There are multiple ways this could be fixed, but as far as I can tell, all shells which I can test, which also implement $' (as other than a $ preceding a single quoted string) all conform to: <tt>\ddd</tt> yields the byte whose value is the least significant 8 bits of the octal value ddd (one to three octal digits). I understand the need for the final bullet point (\<nl> gives unspecified results) -- I see all 3 reasonable interpretations for that in at least one shell which supports $' (treating it as a line join; simply pretending the backslash is not there, i.e. \<nl> is <nl>; or treating the sequence literally, i.e. \<nl> is \<nl>) [aside: only the first makes any sense to me, the ability to insert a line wrap in the middle of the string is useful, and while it can be achieved by ending the $' string, adding a \<nl> and then starting a new $' string, that is cumbersome - on the other hand the 2nd interp (simply drop the \) can be done by simply omitting the \ in the input string, and the last, by properly escaping the \ (resulting in \\<nl>), both of which are easy, but never mind]. However, there is nothing in the text being added to XRAT about why \<nl> is unspecified, and there probably should be, for future generations. Also, assuming that \u and \U are part of the C standard (I actually have no idea on that assumption) then explaining why they're not being included here would also help.

(0005998) hvd (reporter) 2022-10-19 12:22 edited on: 2022-10-19 13:32	> There are multiple ways this could be fixed, but as far as I can tell, all shells which I can test, which also implement $' (as other than a $ preceding a single quoted string) all conform to: > <tt>\ddd</tt> yields the byte whose value is the least significant 8 > bits of the octal value ddd (one to three octal digits). busybox ash does not behave this way. I am seeing $'\567' result in '.', not 'w'. This is because it treats it as $'\56', simply discarding the 7 that would result in an out-of-range byte.
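The two behaviours reported for $'\567' can be compared with a small Python sketch. Both function names are hypothetical. The second function is only a guess at a strategy that would produce the '.' result seen with busybox ash (stop accumulating digits once the value would exceed a byte); it is not taken from the busybox source, and it says nothing about what happens to the dropped digit afterwards.

```python
def low_eight_bits(digits: str) -> int:
    """Octal value reduced to its least significant 8 bits."""
    return int(digits, 8) & 0xFF


def stop_before_overflow(digits: str) -> int:
    """Accumulate octal digits, stopping before the value exceeds a byte."""
    value = 0
    for d in digits:
        candidate = value * 8 + int(d, 8)
        if candidate > 0xFF:
            break                # the remaining digit(s) are not used
        value = candidate
    return value
```

For the digits 567, the first approach gives the byte for 'w' and the second gives the byte for '.', matching the two results mentioned in the notes. For in-range values such as 101 the two agree.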

(0005999) kre (reporter) 2022-10-19 14:04	Re Note: 0005998 busybox is one I don't have, so cannot test (along with ksh88, though it probably doesn't have $' at all). Another of the possible fixes is to make it unspecified what happens when there are 3 octal digits, and the first is not 0 1 2 or 3.

(0006000) steffen (reporter) 2022-10-19 20:53	My MUA says
? : $'\567'
s-nail: \0 argument exceeds byte: $'\567': \567
? : $'\U11FFFF'
s-nail: \U argument exceeds 0x10FFFF: $'\U11FFFF': \U11FFFF
Because these are almost always written as constants rather than variable expansions, it seems to me that reporting an error is better than anything else. bash is known to parse all sorts of integers beyond any measure, wrapping around; I do not think it is desirable to standardize that, since it requires hand-rolling ascii-to-integer conversion instead of using standardized functions (busybox ash uses those, for example (in general, maybe not for \567)).

(0006001) steffen (reporter) 2022-10-19 21:29	re 5997:
> Also, assuming that \u and \U are part of the C standard (I actually have no idea on that assumption) then explaining why they're not being included here would also help.
Supporting them is easy (I share the implementation with \X and \x, just the length differs, and the aftermath). You get UTF-32 code points and convert them to UTF-8. In a UTF-8 locale you are done; if you have iconv(3) you pass it through and are done on success; otherwise I do
i = snprintf(stackbuf, sizeof stackbuf, "\\%c%0*X", (no > 0xFFFFu ? 'U' : 'u'), (int)(no > 0xFFFFu ? 8 : 4), (u32)no);
This is really easy, no? The problem with ISO, as I recall, was the mysterious codepoint holes that I did not understand (they render some valid ranges undefined, what for?). The other problem is grapheme sequences that span multiple UTF-32 codepoints. This arises only in the iconv(3) case anyway (and last I looked no iconv(3) cared for them, so it is entirely hypothetical). That is: if adjacent \U escapes exist (in the single quote), and an iconv(3) fallback is to be driven, pass them all as a continuous string. I do not do that in my mailer. But I should, as there _are_ graphemes which span multiple codepoints. I repeat that adding an additional \$VARiable sequence would turn $'' into a quoting mechanism that includes the capabilities of all other mechanisms. It would allow quoting entire sentences etc. even with embedded expansions.
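The fallback described in the note above (convert if the target encoding can represent the code point, otherwise emit the escape back as text with 4 or 8 hex digits) might look roughly like this in Python. Python's own codecs stand in for iconv(3) here, which is an assumption, and render_code_point is a hypothetical name; the format mirrors the snprintf call quoted in the note.

```python
def render_code_point(cp: int, encoding: str) -> bytes:
    """Encode cp in the given encoding, or fall back to a textual escape."""
    try:
        # chr() itself raises ValueError above 0x10FFFF; that case is not
        # handled in this sketch.
        return chr(cp).encode(encoding)
    except UnicodeEncodeError:
        letter = "U" if cp > 0xFFFF else "u"
        width = 8 if cp > 0xFFFF else 4
        return f"\\{letter}{cp:0{width}X}".encode("ascii")
```

So U+2208 becomes the bytes E2 88 88 when the target is UTF-8, and the six characters \u2208 when the target is plain ASCII.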

(0006002) geoffclare (manager) 2022-10-20 08:49 edited on: 2022-10-20 08:49	> Another of the possible fixes is to make it unspecified what happens when > there are 3 octal digits, and the first is not 0 1 2 or 3. Since the description is silent about what happens if the value is too large to be represented in a byte, the behaviour is implicitly unspecified. The same is already true for the printf utility, which says: "\ddd", where ddd is a one, two, or three-digit octal number, shall be written as a byte with the numeric value specified by the octal number and (for %b): "\0ddd", where ddd is a zero, one, two, or three-digit octal number that shall be converted to a byte with the numeric value specified by the octal number

(0006003) geoffclare (manager) 2022-10-20 09:03	> Also, assuming that \u and \U are part of the C standard (I actually have > no idea on that assumption) then explaining why they're not being included > here would also help. I have added a suggestion for this to the bug249 etherpad page in blue.

(0006004) geoffclare (manager) 2022-10-20 10:08	> However, there is nothing in the text being added to XRAT about why \<nl> > is unspecified, and there probably should be, for future generations. I found an old email from Jilles Tjoelker which says: I have found three different behaviours: 1. change backslash-newline to a newline, in ksh93, mksh and zsh; 2. leave backslash-newline unchanged, in bash (both non-POSIX and POSIX mode); 3. delete backslash and newline, in FreeBSD sh. I have added a suggestion based on this to the bug249 etherpad page in blue.
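The three behaviours listed in the note above can be shown side by side with a short Python sketch. The function name and the three mode labels are hypothetical, and the sketch is deliberately naive: it treats every backslash followed by a newline as the sequence in question and does not account for a preceding escaped backslash.

```python
def backslash_newline(text: str, behaviour: str) -> str:
    """Apply one of the three observed treatments of backslash-newline."""
    replacements = {
        "newline": "\n",       # 1. becomes a newline (ksh93, mksh, zsh)
        "unchanged": "\\\n",   # 2. left as is (bash)
        "delete": "",          # 3. both characters removed (FreeBSD sh)
    }
    return text.replace("\\\n", replacements[behaviour])
```

Given the input a, backslash, newline, b, the three modes yield "a<newline>b", the input unchanged, and "ab" respectively.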

(0006006) geoffclare (manager) 2022-10-20 15:12 edited on: 2022-10-20 15:12	Page and line numbers are for the 2013 edition (C138.pdf) At page 2319 line 73573 (XCU section 2.1, Shell Introduction, item 4) change: The shell performs various expansions (separately) on different parts of each command, resulting in a list of pathnames and fields to be treated as a command and arguments; see [xref to 2.6].x to: For each word within a command, the shell processes backslash escape sequences inside dollar-single-quotes (see [xref to 2.2.4]) and then performs various word expansions (see [xref to 2.6]). In the case of a simple command, the results usually include a list of pathnames and fields to be treated as a command name and arguments; see [xref to 2.9]. At page 2320 line 73594 (XCU section 2.2, Quoting) change: The various quoting mechanisms are the escape character, single-quotes, and double-quotes. to: The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes. At page 2320 lines 73609-73611 (XCU 2.2.3, Double-Quotes), change: $ The <dollar-sign> shall retain its special meaning introducing parameter expansion (see Section 2.6.2), a form of command substitution (see Section 2.6.3), and arithmetic expansion (see Section 2.6.4). to: $ The <dollar-sign> shall retain its special meaning introducing parameter expansion (see Section 2.6.2), a form of command substitution (see Section 2.6.3), and arithmetic expansion (see Section 2.6.4), but shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]). 
At page 2321 lines 73626-73627 (XCU 2.2.3, Double-Quotes), change: A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence to: A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence After page 2321 line 73635 (end of XCU section 2.2), insert a new subsection: 2.2.4 Dollar-Single-Quotes A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote (<tt>$'</tt>) shall preserve the literal value of all characters up to an unescaped terminating single-quote (<tt>'</tt>), with the exception of certain backslash escape sequences, as follows: <tt>\"</tt> yields a <quotation-mark> (double-quote) character, but note that <quotation-mark> can be included unescaped. <tt>\'</tt> yields an <apostrophe> (single-quote) character. <tt>\\</tt> yields a <backslash> character. <tt>\a</tt> yields an <alert> character. <tt>\b</tt> yields a <backspace> character. <tt>\e</tt> yields an <ESC> character. <tt>\f</tt> yields a <form-feed> character. <tt>\n</tt> yields a <newline> character. <tt>\r</tt> yields a <carriage-return> character. <tt>\t</tt> yields a <tab> character. <tt>\v</tt> yields a <vertical-tab> character. <tt>\cX</tt> yields the control character listed in the Value column of [xref to XCU Table 4.21] in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped. <tt>\xXX</tt> yields the byte whose value is the hexadecimal value XX (one or more hex digits). If more than two hex digits follow \x, the results are unspecified. <tt>\ddd</tt> yields the byte whose value is the octal value ddd (one to three octal digits). The behavior of a <backslash> immediately followed by any other character, including <newline>, is unspecified. 
In cases where a variable number of characters can be used to specify an escape sequence (\xXX and \ddd), the escape sequence shall be terminated by the first character that is not of the expected type or, for \ddd sequences, when the maximum number of characters specified has been found, whichever occurs first. These backslash escape sequences shall be processed (replaced with the bytes or characters they yield) immediately prior to word expansion (see [xref to 2.6]) of the word in which the dollar-single-quotes sequence occurs. If a \xXX or \ddd escape sequence yields a byte whose value is 0, it is unspecified whether that null byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded. If the octal value specified by \ddd will not fit in a byte, the results are unspecified. If a \e or \cX escape sequence specifies a character that does not have an encoding in the locale in effect when these backslash escape sequences are processed, the result is implementation-defined. However, implementations shall not replace an unsupported character with bytes that do not form valid characters in that locale's character set. If a backslash escape sequence represents a single-quote character (for example \'), that sequence shall not terminate the dollar-single-quote sequence. At page 2321 lines 73658-73664 (XCU section 2.3 (Token Recognition) point 4), change: 4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). 
During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotation-mark> and the end of the quoted text. The token shall not be delimited by the end of the quoted field.
to:
4. If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in [xref to Section 2.2]. During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field.

After page 2327 line 73900 (XCU section 2.6, Word Expansions), insert a new bullet point:
a <single-quote>

At page 2331 lines 74071-74073 (XCU 2.6.3, Command Substitution), change:
A single-quoted or double-quoted string that begins, but does not end, within the <tt>"`...`"</tt> sequence produces undefined results.
to:
A quoted string that begins, but does not end, within the <tt>"`...`"</tt> sequence produces undefined results.

At page 2333 lines 74157-74158 (XCU section 2.6.7, Quote Removal), change:
The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted.
to:
The quote character sequence <dollar-sign> single-quote and the single-character quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted.
Note that the single-quote character that terminates a <dollar-sign> single-quote sequence is itself a single-character quote character. Note that after quote removal the shell still remembers which characters were quoted. This is necessary for purposes such as matching patterns in a case conditional construct (see [xref to 2.9.4.3] and [xref to 2.13]).

At page 2348 lines 74718-74719 (the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1), change:
Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words.
to:
Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words.

After page 3677 line 125685 (end of XRAT C.2.2.3), insert a new paragraph:
The $'...' construct does not retain its special meaning inside double quotes. This was discussed by the standard developers and rejected. Note that $'...' is a quoting mechanism and not an expansion. Losing the special meaning inside double quotes is consistent with other quoting mechanisms losing their special meaning when quoted.

After the above insertion and before page 3678 line 125686 (XRAT C.2.3), insert a new subsection:

C.2.2.4 Dollar-Single-Quotes
The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the ISO C standard with the following exceptions:
The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X.)
The \c escape sequence is not included in the ISO C standard. There were also some disagreements among shells that historically supported \c escape sequences in $'...'. These include:
    whether \cA through \cZ produced the byte values 1 through 26, respectively, or supported the codeset-independent control character as specified by the stty utility. This standard requires codeset independence.
    whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively. This standard requires support for all of the control characters except NULL (matching what is done in the stty utility).
    whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent.
The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard. Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). This standard leaves this behavior implementation-defined.
The \e escape sequence is not included in the ISO C standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use.
The \ddd octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard.
In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote to be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token. For example:
printf a$'b\0c\''d
is required by this standard to produce:
abd
while historic versions of ksh93 produced:
ab
The ISO C standard specifies \uXXXX and \UXXXXXXXX escape sequences. These need not be supported by $'...' in the shell. They were omitted because current shell implementations that support them differ in behavior. In particular, some shells always convert them to the UTF-8 encoding for the named character, even if the current locale's character set does not use UTF-8 encoding.
The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence.
A <backslash> immediately followed by a <newline> has unspecified behavior. In C, this sequence is used for line continuations, where both the <backslash> and <newline> are deleted and a diagnostic is required if a closing quote is not encountered before a <newline> that is not preceded by <backslash>. In current shell implementations, three different behaviors have been observed.
Backslash escape sequences not described in the standard result in unspecified behavior. In C, the result is not a token and a diagnostic is required. This allows shells to recognize other backslash escape sequences in other ways as extensions to the standard's requirements.
Furthermore, existing implementations already had different behaviors for some backslash escape sequences when $'...' processing was added to the standard.

This standard makes the results implementation-defined if \e or \cX specifies a character that is not present in the current locale. Application authors should note that implementations are permitted to have a wide range of behaviors when encountering an unsupported character. For example:
    the shell might produce an error, possibly causing the shell to terminate
    the unsupported character might be silently discarded
    the unsupported character might be replaced with another character of a different character class
    the unsupported character might be replaced with a shell-special character (e.g., '?')
    the unsupported character might be replaced with multiple characters, shell-special or regular (e.g., if <ESC> is not supported $'\e' may be replaced by "???", "XXX" or "<ESC>")
However, implementations must document their behavior, and they are prohibited from replacing an unsupported character with bytes that do not form valid characters in the current locale's character set (e.g., encoding in UTF-8 when the locale has a 7-bit character set).

This standard does not specify a way for script authors to determine beforehand whether a particular \cX sequence specifies a character that exists in the current locale. At the time this feature was standardized, no known implementations provided such a capability.

Note that the escape sequences recognized by $'...', file format notation (see [xref to Table 5-1]), XSI-conforming implementations of the echo utility (see the utility's operands section on [xref to echo]), and the printf utility's format operand (see the utility's extended description on [xref to printf]) are not the same.
Some escape sequences are not recognized by all of the above; the \c escape sequence in echo is not at all like the \c escape sequence in $'...'; and octal escape sequences in some of the above accept one to four octal digits and require a leading zero, while others accept one to three octal digits and do not require a leading zero.
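
As an illustration only (not part of the accepted text), a few of the rules in the note above can be exercised in a shell that already implements $'...', such as bash. This is a minimal sketch under assumptions: the helper hex_of and all variable names are hypothetical, the \x41 and \101 results assume an ASCII-based codeset, and a shell conforming only to an earlier edition of the standard need not accept this syntax at all.

```shell
# Sketch only: assumes a shell that implements $'...' (for example bash).

# hex_of is a hypothetical helper: print the bytes of its argument in hex.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

tab=$'\t'                 # \t yields a <tab>
quote=$'it\'s'            # an escaped single-quote does not end the sequence
inner=a$'b'c              # the construct can be used within a word
hex_then_digit=$'\x41'1   # close the quotes so "1" is not taken as a hex digit
octal=$'\101'             # one to three octal digits
dq="$'b'"                 # inside double quotes, $' is not special

# Show the resulting words, one per line, then the byte value of $tab.
printf '%s\n' "$quote" "$inner" "$hex_then_digit" "$octal" "$dq"
hex_of "$tab"; echo
```

The sketch deliberately avoids the cases the note leaves unspecified or implementation-defined: more than two hex digits after \x, a \0 inside the quotes, and a <backslash> followed by an unlisted character.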

Issue History
Date Modified	Username	Field	Change
2010-04-30 21:42	dwheeler	New Issue
2010-04-30 21:42	dwheeler	Status	New => Under Review
2010-04-30 21:42	dwheeler	Assigned To	=> ajosey
2010-04-30 21:42	dwheeler	Name	=> David A. Wheeler
2010-04-30 21:42	dwheeler	Section	=> 2.2 Quoting
2010-04-30 21:42	dwheeler	Page Number	=> 2298-2299
2010-04-30 21:42	dwheeler	Line Number	=> 72348-72401
2010-09-16 16:17	nick	Note Added: 0000548
2010-09-18 18:12	Don Cragun	Relationship added	related to 0000322
2010-10-01 12:48	geoffclare	Note Added: 0000560
2010-10-06 01:26	nick	Tag Attached: c99
2010-10-08 17:28	mirabilos	Note Added: 0000565
2010-10-25 06:17	Don Cragun	Note Added: 0000590
2010-10-25 14:51	Don Cragun	Note Edited: 0000590
2010-10-25 15:55	Don Cragun	Note Edited: 0000590
2010-10-26 06:44	Don Cragun	Note Edited: 0000590
2010-10-26 20:39	Don Cragun	Note Edited: 0000590
2010-10-26 20:40	Don Cragun	Note Edited: 0000590
2010-10-26 20:40	Don Cragun	Note Edited: 0000590
2010-10-26 20:45	Don Cragun	Note Edited: 0000590
2010-10-26 21:04	Don Cragun	Note Edited: 0000590
2010-10-27 03:29	Don Cragun	Note Edited: 0000590
2010-11-04 16:07	nick	Note Added: 0000599
2010-11-05 02:34	Don Cragun	Note Edited: 0000590
2010-11-05 03:00	Don Cragun	Note Added: 0000601
2010-11-05 03:04	Don Cragun	Note Edited: 0000601
2010-11-05 14:52	nick	Note Added: 0000609
2010-11-05 14:54	nick	File Added: n1534.htm
2010-11-05 14:56	nick	File Added: n1534_original.htm
2010-11-11 16:40	Don Cragun	Note Edited: 0000590
2010-11-11 16:42	Don Cragun	Note Edited: 0000590
2010-11-11 16:44	Don Cragun	Interp Status	=> ---
2010-11-11 16:44	Don Cragun	Final Accepted Text	=> See Note: 0000590
2010-11-11 16:44	Don Cragun	Status	Under Review => Resolved
2010-11-11 16:44	Don Cragun	Resolution	Open => Accepted As Marked
2010-11-11 16:44	Don Cragun	Tag Attached: issue8
2010-12-09 16:12	Don Cragun	Note Edited: 0000590
2015-07-31 15:59	stephane	Issue Monitored: stephane
2015-08-20 15:24	nick	Note Edited: 0000590
2015-08-20 15:25	nick	Note Edited: 0000590
2015-08-20 15:31	nick	Note Added: 0002793
2015-08-20 15:32	nick	Note Edited: 0002793
2015-08-20 15:33	nick	Note Edited: 0002793
2015-09-03 16:40	rhansen	Note Added: 0002809
2015-09-03 16:45	rhansen	Note Edited: 0002809
2015-09-03 16:48	rhansen	Note Edited: 0002809
2015-09-03 17:00	rhansen	Note Edited: 0002809
2015-09-03 17:05	rhansen	Note Edited: 0002809
2015-09-03 17:10	rhansen	Tag Attached: UTF-8_Locale
2015-09-10 20:25	rhansen	Note Added: 0002824
2015-09-10 20:26	rhansen	Note Edited: 0002824
2015-09-10 20:26	rhansen	Note Edited: 0002824
2015-09-10 20:27	rhansen	Note Edited: 0002809
2015-10-08 06:43	Don Cragun	Final Accepted Text	See Note: 0000590 => Previously accepted text is in Note: 0000590.
2015-10-08 06:43	Don Cragun	Resolution	Accepted As Marked => Reopened
2015-10-08 16:32	Don Cragun	Status	Resolved => Under Review
2015-10-08 16:41	rhansen	Relationship added	related to 0000985
2015-11-09 21:10	steffen	Note Added: 0002893
2015-11-13 19:19	shware_systems	Note Added: 0002922
2015-11-13 21:22	steffen	Note Added: 0002923
2015-11-13 21:24	steffen	Note Added: 0002925
2015-11-13 23:23	shware_systems	Note Added: 0002929
2015-11-14 14:26	steffen	Note Added: 0002942
2016-06-06 21:07	steffen	Note Added: 0003247
2016-06-07 19:50	shware_systems	Note Added: 0003248
2016-06-07 19:53	shware_systems	Note Edited: 0003248
2016-06-07 20:25	steffen	Note Added: 0003249
2016-06-07 20:27	steffen	Note Added: 0003250
2016-06-08 06:51	shware_systems	Note Added: 0003251
2016-06-08 10:43	steffen	Note Added: 0003252
2016-06-08 19:55	shware_systems	Note Added: 0003254
2016-06-09 13:55	steffen	Note Added: 0003256
2016-06-09 14:10	stephane	Note Added: 0003257
2016-06-09 14:15	shware_systems	Note Added: 0003258
2016-06-09 14:18	stephane	Note Edited: 0003257
2016-06-09 14:26	stephane	Note Edited: 0003257
2016-06-09 15:07	steffen	Note Added: 0003259
2016-06-09 18:42	stephane	Note Added: 0003262
2021-01-31 05:37	calestyo	Issue Monitored: calestyo
2021-01-31 05:38	calestyo	Note Added: 0005221
2021-01-31 06:48	stephane	Note Added: 0005222
2021-01-31 17:08	calestyo	Note Added: 0005223
2021-01-31 18:31	stephane	Note Edited: 0005222
2021-01-31 18:34	stephane	Note Edited: 0005222
2021-02-04 21:38	dwheeler	Note Added: 0005226
2021-02-05 16:02	dwheeler	Note Added: 0005227
2021-02-05 16:08	dwheeler	Note Added: 0005228
2021-02-05 16:42	calestyo	Note Added: 0005229
2021-02-18 17:11	nick	Relationship added	parent of 0001413
2021-03-15 20:38	mirabilos	Note Added: 0005273
2021-03-16 09:40	geoffclare	Note Added: 0005275
2022-01-14 20:20	calestyo	Issue End Monitor: calestyo
2022-01-14 20:20	calestyo	Issue Monitored: calestyo
2022-02-25 02:56	calestyo	Note Added: 0005714
2022-03-14 00:52	calestyo	Note Added: 0005746
2022-03-14 01:22	calestyo	Note Added: 0005747
2022-03-14 21:22	shware_systems	Note Added: 0005751
2022-10-18 10:42	geoffclare	Note Added: 0005995
2022-10-18 10:50	geoffclare	Note Edited: 0005995
2022-10-19 11:50	kre	Note Added: 0005997
2022-10-19 11:52	kre	Note Edited: 0005997
2022-10-19 11:54	kre	Note Edited: 0005997
2022-10-19 11:56	kre	Note Edited: 0005997
2022-10-19 12:22	hvd	Note Added: 0005998
2022-10-19 13:32	hvd	Note Edited: 0005998
2022-10-19 14:04	kre	Note Added: 0005999
2022-10-19 20:53	steffen	Note Added: 0006000
2022-10-19 21:29	steffen	Note Added: 0006001
2022-10-20 08:49	geoffclare	Note Added: 0006002
2022-10-20 08:49	geoffclare	Note Edited: 0006002
2022-10-20 09:03	geoffclare	Note Added: 0006003
2022-10-20 10:08	geoffclare	Note Added: 0006004
2022-10-20 15:12	geoffclare	Note Added: 0006006
2022-10-20 15:12	geoffclare	Note Edited: 0006006
2022-10-20 15:14	geoffclare	Final Accepted Text	Previously accepted text is in Note: 0000590. => Note: 0006006
2022-10-20 15:14	geoffclare	Status	Under Review => Resolved
2022-10-20 15:14	geoffclare	Resolution	Reopened => Accepted As Marked
2022-10-20 15:14	geoffclare	Tag Detached: c99
2022-11-08 14:32	geoffclare	Status	Resolved => Applied
