|ID||Category||Severity||Type||Date Submitted||Last Update|
|0000249||[1003.1(2008)/Issue 7] Shell and Utilities||Objection||Enhancement Request||2010-04-30 21:42||2010-11-11 16:44|
|Priority||normal||Resolution||Accepted As Marked|
|Name||David A. Wheeler|
|Final Accepted Text||See Note: 0000590|
|Summary||0000249: Add standard support for $'...' in shell|
It is annoyingly difficult to set shell variables to values containing certain characters, for example, to set a shell variable to a value that ends with "\n". This is particularly a problem when manipulating IFS, but the problem occurs in many other situations too. One reason is that command substitution strips all trailing newlines, so constructs like this don't work:
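A minimal sketch of the kind of construct that fails, assuming any POSIX-style sh. The variable names are illustrative and not taken from the report; the second half shows one commonly used sentinel workaround.

```shell
# Command substitution strips every trailing newline,
# so this assignment leaves nl empty rather than holding "\n".
nl="$(printf '\n')"

# One common workaround (illustrative): emit a sentinel character
# after the newline, then strip the sentinel with a parameter expansion.
nl_sentinel="$(printf '\nx')"
nl_fixed="${nl_sentinel%x}"

printf 'naive length: %s\n' "${#nl}"
printf 'workaround length: %s\n' "${#nl_fixed}"
```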
It's possible to work around this (e.g., with clever parameter expansions), but the work-arounds are ugly and error-prone. It would be better to support simple, clear scripts when you want to insert characters like newline.
Thus, many shells add the $'...' construct, which is *widely* supported. It is supported by ksh, zsh, bash, and the busybox "sh" command at *least*. This widespread support suggests that this is useful and easy to implement. The fact that even *busybox* supports it, even though they focus on being very small, suggests that this is *important* functionality.
As the korn shell documentation says, "the purpose of $'...' is to solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all "..." strings handle ANSI-C escapes, but that would not be backwards compatible."
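As a sketch of what the construct buys, assuming a shell that already implements $'...' (bash is assumed here), the newline case becomes a plain assignment. The variable names are illustrative.

```shell
# Assumes a shell with $'...' support, e.g. bash.
nl=$'\n'               # exactly one newline, nothing stripped
tab=$'\t'              # a horizontal tab
default_ifs=$' \t\n'   # space, tab, newline

printf 'nl is %s character(s), default_ifs is %s character(s)\n' \
    "${#nl}" "${#default_ifs}"
```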
I used bash's list of constructs supported in $'...'. I think supporting hex values is a good one. \cX constructs are less necessary, but since they're in bash, may as well include it too (it's trivial to support).
Page 2299, line 72401: At the end of section 2.2, add a new subsection, "2.2.4 Dollar-single-quotes", with this text:
A word beginning with dollar-single-quote ($'), continuing all the way to an unescaped matching single-quote ('), shall preserve the literal value of all characters within, with the exception of certain backslash escape sequences.
The defined backslash escape sequences are:
\a alert (bell)
\e an escape character
\f form feed
\n new line
\r carriage return
\t horizontal tab
\v vertical tab
\' single quote
\nnn the eight-bit character whose value is the octal value nnn (one to three digits)
\xHH the eight-bit character whose value is the hexadecimal value HH (one or two hex digits)
\cx a control-x character; \cA is 1, while \cZ is 26.
If a backslash is followed by any other sequence, the result is implementation-defined.
The expanded result is treated as if it was single-quoted (as if the dollar sign had not been present); in particular, it is *not* split using the IFS values.
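A short sketch of the proposed escapes in use, assuming bash (whose escape list the proposal follows) and an ASCII-based codeset. The helper count_args is hypothetical and only reports how many arguments it received. The sketch illustrates that the octal and hexadecimal forms name the same byte and that the result is not field-split.

```shell
# Assumes bash and an ASCII-based codeset.
# count_args is a hypothetical helper for this sketch.
count_args() { printf '%s\n' "$#"; }

octal_a=$'\101'   # octal 101 is "A" in ASCII
hex_a=$'\x41'     # hexadecimal 41 is also "A"
ctrl_a=$'\cA'     # control-A, byte value 1

# The expansion behaves as if single-quoted: the space and tab inside it
# do not split the word, so exactly one argument is passed.
words=$(count_args $'one two\tthree')

printf '%s %s %s\n' "$octal_a" "$hex_a" "$words"
```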
On page 2298, after line 72385, add:
Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4.
In 0000049, need to add to the end of the list:
* a <single-quote>
See: http://austingroupbugs.net/view.php?id=49
n1534.htm (1,668 bytes) 2010-11-05 14:54
n1534_original.htm (3,813 bytes) 2010-11-05 14:56
The C standard also includes Universal Character constants, and these should also be considered.
6.4.3 Universal character names
\U hex-quad hex-quad
2 A universal character name shall not specify a character whose short identifier is less than
00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
3 Universal character names may be used in identifiers, character constants, and string
literals to designate characters that are not in the basic character set.
4 The universal character name \Unnnnnnnn designates the character whose eight-digit
short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal
character name \unnnn designates the character whose four-digit short identifier is nnnn
(and whose eight-digit short identifier is 0000nnnn).
So far, discussion of this bug has been about which escapes to include.
There are some problems with other aspects of the proposed text that
need to be sorted out. The ones I have spotted are:
Use of "A word beginning" is wrong, as the expansion can be used
within a word. E.g. a$'b'c expands to abc.
The sentence beginning "The expanded result is treated as if it was
single-quoted" needs to use "shall", and the rest of that sentence
should either be removed or reworked.
This addition to 2.2.3 Double-Quotes is not right:
On page 2298, after line 72385, add:
Within the string of characters from an enclosed "$'" to the
matching "'", processing shall occur as described in section 2.2.4.
since ksh and bash do not expand $'...' within double quotes.
A possible alternative might be to append the following to line 72376:
The <dollar-sign> shall not retain its special meaning introducing
the dollar-single-quotes form of quoting (see [xref to 2.2.4]).
Finally an additional change is needed in 2.3 Token Recognition in
list item 4.
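The first and third points can be checked with a sketch like the following, assuming bash, which the note says does not expand $'...' inside double quotes. The variable names are illustrative.

```shell
# Assumes bash.
inside_word=a$'b'c     # $'...' in the middle of a word: expands to abc
in_dquotes="a$'b'c"    # inside double quotes the $' is not special

printf '%s\n' "$inside_word"
printf '%s\n' "$in_dquotes"
```

The second printf prints the characters a$'b'c unchanged, which is the behaviour the alternative wording for line 72376 is meant to describe.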
The following documents existing practice for $'…' in mksh, called “C style” there to distinguish from the printf builtin’s escape sequence parsing (where, for example, \123 is invalid and must be written as \0123).
\a \b \c# \E \e \f \n \r \t \v \### \U######## \u#### \x#… \' \\
Notes on some:
\c# where # is ? will yield 0x7F, every other octet is ANDed with 0x1F (using non-ASCII chars is undefined behaviour as I plan to let mksh use wide characters internally in $some $large time)
\E and \e yield 0x1B
\### looks for one, two or three octal digits and converts to octet, stops if encountering a non-digit octet
\U and \u look for up to eight/four hexadecimal digits and convert to UTF-8 (actually, mksh only supports the BMP though; may as well be CESU-8)
\x in “C style” mode gobbles up as many hexadecimal characters as it can (doesn’t just eat one or two) stopping at a non-digit octet and converts to octets if <= 0xFF, to UTF-8 (see above) otherwise
\' allows this to work: x=a$'\''b
\\ should be self-explanatory
Every other escaped character \X is replaced with itself X (although the amount of supported escape sequences may vary, so it’s in practice UD too). (The print builtin mode excludes e.g. \? from this for compatibility reasons.)
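The \' and \\ entries in the notes above can be sketched as follows, assuming a shell with $'...' support such as mksh or bash. The variable names are illustrative.

```shell
# Assumes a shell with $'...' support (e.g. mksh or bash).
x=a$'\''b        # \' embeds a single quote without ending the string
y=$'C:\\temp'    # \\ yields one literal backslash

printf '%s\n' "$x"   # a'b
printf '%s\n' "$y"   # C:\temp
```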
This means we have the following differences:
• \E is new (from GNU bash… although why POSIX (GNU’s Not Unix) would look at it is beyond me), which counts as UD, so is ok
• \x doesn’t terminate after two hex digits; I’d like to make this a user error and standardise only for the case where the third octet after \x is NOT in the range [0-9a-fA-F] for the sake of compatibility
• \u supports manual creation of surrogates in CESU-8 encoding, they’re not forbidden
• \U for values outside the BMP is undesirable in mksh (but this can be viewed as limitation of the operating environment; while mksh doesn’t use the OS’ implementation it shares MirBSD’s which uses an encoding based on CESU-8 internally and has a 16-bit wchar_t)
• in practice, \u/\U > 0xFFFD yield 0xFFFD (0xFFFE and 0xFFFF are invalid anyway)
• since this escape method is more generic than identifiers, all values from 0x0000 (inclusive) and up are supported (this may be desirable)
In mksh, a$'b'c expands to abc and "a$'b'c" expands to a$'b'c as suggested above as well.
There’s a sort of testsuite at https://www.mirbsd.org/cvs.cgi/src/bin/mksh/check.t?rev=HEAD – search for “dollar-quoted-strings” and follow it down, including “dollar-quotes-in-heredocs” (use <<$'…' delimiters) and “dollar-quotes-in-herestrings” (for the <<< extension). These are somewhat modelled after ksh93 while looking at GNU bash and zsh, with some real-life testing added.
So, to summarise, I’d like to get the following as implementation-defined behaviour:
When the sequence backslash 'x' is followed by two octets in the range [0-9a-fA-F]: the behaviour if the next (third) octet is also in the range [0-9a-fA-F].
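One way for a script to sidestep the ambiguous case is to close the quote right after the two hexadecimal digits, so that no third hexadecimal digit ever follows \x. A minimal sketch, assuming a shell with $'...' support and an ASCII-based codeset; the variable name is illustrative.

```shell
# Assumes a shell with $'...' support (e.g. bash, ksh93, mksh).
# Ambiguous: $'\x41B' is "AB" where \x stops after two digits,
# but shells that consume more digits read \x41B as a single value.
# Unambiguous: end the quote after two digits, then append the literal.
portable=$'\x41'B

printf '%s\n' "$portable"   # AB
```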
The rest is merely for consideration; it might be wise to not limit the u and U escapes to characters valid for identifiers.
From a lesson learned with JSON, it should be explicitly stated that the octet following the backslash character is matched case-sensitively. (Do we assume ASCII here?)
Don Cragun (manager)
edited on: 2010-12-09 16:12
The following is a complete replacement for the Desired Action to
be taken to resolve this issue. It is based on other notes in this
bug report, numerous e-mail messages on the austin-group-l
alias, and discussions held during Austin Group conference calls:
In XCU section 2.2 (Quoting) change:
"The various quoting mechanisms are the escape character, single-quotes, and double-quotes."
on P2298, L72359 to:
"The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes."
Add the following new paragraph to XCU section 2.2.3 (Double-Quotes) after P2298, L72385 (indented to the same level as the paragraph on L72382-72385):
"The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4])."
In XCU 2.2.3 (Double-Quotes) change:
"A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence"
on P2299, L72392-72393 to:
"A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence"
At the end of XCU section 2.2 after P2299, L72401 add a new subsection, with this text:
2.2.4 Dollar-Single-Quotes
A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote ($') shall preserve the literal value of all characters up to an unescaped terminating single-quote ('), with the exception of certain backslash escape sequences, as follows:
\" yields a <quotation-mark> (double-quote) character.
\' yields an <apostrophe> (single-quote) character.
\\ yields a <backslash> character.
\a yields an <alert> character.
\b yields a <backspace> character.
\e yields an <ESC> character.
\f yields a <form-feed> character.
\n yields a <newline> character.
\r yields a <carriage-return> character.
\t yields a <tab> character.
\v yields a <vertical-tab> character.
\cX yields the control character listed in the Value column of Table 4.21 on page 3203 (xref to XCU Table 4.21) in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped.
\uXXXX yields the character specified by ISO/IEC 10646 whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX).
\UXXXXXXXX yields the character specified by ISO/IEC 10646 whose eight-digit short universal character name is XXXXXXXX.
\xXX yields the byte whose value is the hexadecimal value XX (one or two hex digits).
\XXX yields the byte whose value is the octal value XXX (one to three octal digits).
If a \xXX or \XXX escape sequence yields a byte whose value is 0, that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote shall be evaluated and discarded.
If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 00A0 other than 0024 ($), 0040 (@), or 0060 ('), the results are undefined.
If a universal character name specified by \uXXXX or \UXXXXXXXX is in the range D800 through DFFF inclusive, the results are undefined.
If more than two hexadecimal digits immediately follow \x or if the octal value specified by \XXX will not fit in a byte, the results are unspecified.
If a backslash is followed by any sequence not listed above or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character that is not present in the current locale, the result is implementation-defined.
If a backslash escape sequence represents a single-quote character (for example \u0060, \U00000060, or \'), that sequence shall not terminate the dollar-single-quote sequence.
In XCU section 2.3 (Token Recognition) point 4, change:
"If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298).
During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotation-mark> and the end of the quoted text. The token shall not be delimited by the end of the quoted field."
on P2299-2300, L72425-72432 to:
"If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field."
In XCU section 2.6, add:
* a <single-quote>
to the end of the list created by the changes identified in Mantis BugID 49. (BugID 49 modifies text on P2305, L72669-72671 in XCU7.)
In XCU 2.6.3 (Command Substitution) change:
"A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence produces undefined results."
on P2309, L72827-72829 to:
"A quoted string that begins, but does not end, within the "`...`" sequence produces undefined results."
In XCU section 2.6.7 (Quote Removal), change:
"The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted."
on P2311, L72904-72905 to:
"The quote character sequence <dollar-sign> single-quote and the characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted."
In the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1, change:
"Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words."
on P2325, L73436-73437 to:
"Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words."
Add a new section in XRAT following P3649, L124068:
<italic>C.2.2.4 Dollar-Single-Quotes</italic>
The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the C Standard with the following exceptions:
1. The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X.)
2. The \c escape sequence is not included in the C standard. There was also some disagreement in shells that historically supported \c escape sequences in $'...'. These include:
A. whether \cA through \cZ produced the byte values 1 through 26, respectively or supported the codeset independent control character as specified by the stty utility. This standard requires codeset independence.
B. whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively.
This standard requires support for all of the control characters except NULL (matching what is done in the stty utility).
C. whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent.
The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard. Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). This standard leaves this behavior implementation-defined.
3. The \e escape sequence is not included in the C Standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use.
4. The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard. In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell, however, this standard requires the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. (This was historic practice in bash, but not in ksh93.) This allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token.
For example:
    printf a$'b\0c\''d
is required by this standard to produce:
    abd
while historic versions of ksh93 produced:
    ab
5. The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence.
In C, if a character specified by \uXXXX or \UXXXXXXXX in "..." does not exist in the execution character set, the results are implementation-defined. Similarly, this standard makes the results implementation-defined if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that is not present in the current locale.
Note that the escape sequences recognized by $'...', file format notation (see Table 5-1 on page 121), XSI-conforming implementations of the echo utility (see the echo utility's operands section on page 2615), and the printf utility's format operand (see the printf utility's extended description section on page 3050) are not the same. Some escape sequences are not recognized by all of the above; the \c escape sequence in echo is not at all like the \c escape sequence in $'...'; and octal escape sequences in some of the above accept one to four octal digits and require a leading zero, while others accept one to three octal digits and do not require a leading zero.
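The closing remark about differing escape rules can be seen with octal escapes. A small sketch, assuming bash and an ASCII-based codeset; the variable names are illustrative. It writes the same byte three ways, and only the %b argument form takes a leading zero.

```shell
# Assumes bash and an ASCII-based codeset.
from_quotes=$'\101'                  # $'...': one to three octal digits
from_format=$(printf '\101')         # printf format operand: one to three octal digits
from_b_arg=$(printf '%b' '\0101')    # printf %b argument: leading zero, then up to three digits

printf '%s %s %s\n' "$from_quotes" "$from_format" "$from_b_arg"
```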
The issue of adding \cX and \e was discussed at the WG14 meeting. A paper to add these character constants was requested and submitted (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm). However, since this paper was in response to a late liaison statement, objection to considering it was raised.
The committee agreed that POSIX was free to use these characters, and encouraged resubmission of N1534 as a liaison ballot comment, probably reworded to be an update to the future directions section. WG14 has indicated that POSIX use of the "Future Standardization" is satisfactory.
Don Cragun (manager)
edited on: 2010-11-05 03:04
Note: 0000590 was updated today to account for another difference in the way bash and
ksh93 processed $'...'. If an escape sequence expanded to a null byte, bash terminated
the $'...' expansion, but still concatenated any following characters that appeared after
the $'...' expansion that were in the same lexical token. On the other hand, ksh93
terminated the entire token in this case. The ksh93 developers agreed to change to
match bash in this case if the standard requires this behavior.
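The agreed behaviour can be sketched as follows, assuming a bash version that behaves as this note describes. The variable name is illustrative.

```shell
# Assumes bash, where a null byte ends only the $'...' expansion.
# $'b\0c\'' contributes just "b"; the trailing d lies outside the quotes
# and is still concatenated into the same word.
word=a$'b\0c\''d

printf '%s\n' "$word"   # abd
```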
Also note that if the C1x standard adopts the changes suggested in the document
referenced in Note: 0000599, the "\e, " in the last paragraph of the newly created
2.2.4 section can be removed (since the <ESC> character would be moved into the
basic character set in C1x). If this happens we should also consider
adding \e to XBD Table 5-1 and checking the places in the standard that refer to
that table to determine whether \e processing should also be required in those places.
This issue was presented to WG14 (C). The initial reaction was to request a paper to add \cX and \e as character constants (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm, attached). This paper was subsequently rejected on procedural grounds, as it was late and made fundamental changes that might also impact C++.
Instead, the paper was revised to propose adding these characters to the Future Directions section, and a liaison comment on the CD ballot to ask for this would be appropriate.
The WG14 committee agreed that there was no problem in POSIX adding these characters to the shell, and did not object to the use of the namespace.
|2010-04-30 21:42||dwheeler||New Issue|
|2010-04-30 21:42||dwheeler||Status||New => Under Review|
|2010-04-30 21:42||dwheeler||Assigned To||=> ajosey|
|2010-04-30 21:42||dwheeler||Name||=> David A. Wheeler|
|2010-04-30 21:42||dwheeler||Section||=> 2.2 Quoting|
|2010-04-30 21:42||dwheeler||Page Number||=> 2298-2299|
|2010-04-30 21:42||dwheeler||Line Number||=> 72348-72401|
|2010-09-16 16:17||nick||Note Added: 0000548|
|2010-09-18 18:12||Don Cragun||Relationship added||related to 0000322|
|2010-10-01 12:48||geoffclare||Note Added: 0000560|
|2010-10-06 01:26||nick||Tag Attached: c99|
|2010-10-08 17:28||mirabilos||Note Added: 0000565|
|2010-10-25 06:17||Don Cragun||Note Added: 0000590|
|2010-10-25 14:51||Don Cragun||Note Edited: 0000590|
|2010-10-25 15:55||Don Cragun||Note Edited: 0000590|
|2010-10-26 06:44||Don Cragun||Note Edited: 0000590|
|2010-10-26 20:39||Don Cragun||Note Edited: 0000590|
|2010-10-26 20:40||Don Cragun||Note Edited: 0000590|
|2010-10-26 20:40||Don Cragun||Note Edited: 0000590|
|2010-10-26 20:45||Don Cragun||Note Edited: 0000590|
|2010-10-26 21:04||Don Cragun||Note Edited: 0000590|
|2010-10-27 03:29||Don Cragun||Note Edited: 0000590|
|2010-11-04 16:07||nick||Note Added: 0000599|
|2010-11-05 02:34||Don Cragun||Note Edited: 0000590|
|2010-11-05 03:00||Don Cragun||Note Added: 0000601|
|2010-11-05 03:04||Don Cragun||Note Edited: 0000601|
|2010-11-05 14:52||nick||Note Added: 0000609|
|2010-11-05 14:54||nick||File Added: n1534.htm|
|2010-11-05 14:56||nick||File Added: n1534_original.htm|
|2010-11-11 16:40||Don Cragun||Note Edited: 0000590|
|2010-11-11 16:42||Don Cragun||Note Edited: 0000590|
|2010-11-11 16:44||Don Cragun||Interp Status||=> ---|
|2010-11-11 16:44||Don Cragun||Final Accepted Text||=> See Note: 0000590|
|2010-11-11 16:44||Don Cragun||Status||Under Review => Resolved|
|2010-11-11 16:44||Don Cragun||Resolution||Open => Accepted As Marked|
|2010-11-11 16:44||Don Cragun||Tag Attached: issue8|
|2010-12-09 16:12||Don Cragun||Note Edited: 0000590|