Austin Group Defect Tracker

Aardvark Mark III


ID Category Severity Type Date Submitted Last Update
0000375 [1003.1(2008)/Issue 7] Shell and Utilities Objection Enhancement Request 2011-02-07 18:34 2017-05-19 21:44
Reporter dwheeler View Status public  
Assigned To ajosey
Priority normal Resolution Open  
Status Under Review  
Name David A. Wheeler
Organization
User Reference
Section test
Page Number 3224-3225
Line Number 107503-107513
Interp Status ---
Final Accepted Text
Summary 0000375: Extend test/[...] conditionals: ==, <, >, -nt, -ot, -ef
Description Many implementations of "test" (aka "["), including shell built-ins, implement conditionals beyond those specified in the current version of POSIX. What's more, many extant programs rely on these extensions. I recommend formally adding these widely-implemented extensions to the POSIX specification itself, as they have become widespread and are ready to be standardized.

Specifically, add:
* "s1 == s2" as a synonym for "s1 = s2". Since "=" is used for "assignment" in shell scripts, the spelling "==" is a clearer/less-ambiguous way to state that this is a test and not an assignment. This is already implemented in bash, ksh, and busybox sh.
* "s1 < s2" and "s1 > s2" for lexicographic comparison. Users must quote "<" and ">", but even so, it's a useful operation. Alternatives tend to be ugly, e.g., awk -v v1="1" -v v2="fcd" 'BEGIN{exit !(v1 "" < "" v2)}' This is implemented in bash, dash, busybox sh, and ksh.
* "pathname1 -nt pathname2" and "pathname1 -ot pathname2" - newer than/older than, comparing modification times. You can work around this by using expressions such as [ "$(find 'file1' -prune -newer 'file2')" ], but this is ugly compared to a straightforward test. Determining if something should be done, based on whether or not one file is newer than another, is a common operation, and thus makes sense to include. Implemented in ksh, bash, and busybox sh. The semantics when the files don't exist appear to be the same between bash and ksh, so I've included those semantics in the "desired action" below.
* "pathname1 -ef pathname2". True if pathname1 and pathname2 are hard linked (refer to the same device and inode numbers). Implemented in bash, ksh, and busybox sh.

Some are identified as "bashisms" in places like http://mywiki.wooledge.org/Bashism but in fact these are widely implemented and/or depended upon. But because they're not in the POSIX specification, they can't be universally depended upon, and I think that needs to change.
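The awk workaround quoted in the description above can be wrapped in a small function. This is only a sketch: the name str_lt is hypothetical, and LC_ALL=C is an assumption added here so that the ordering is byte-wise and repeatable.

```shell
# str_lt S1 S2: succeed if S1 sorts before S2 as a string.
# Hypothetical helper built around the awk one-liner quoted above.
# Caveat: awk applies backslash-escape processing to -v values.
str_lt() {
    # Concatenating "" forces a string comparison even when both
    # values look like numbers.
    LC_ALL=C awk -v v1="$1" -v v2="$2" 'BEGIN { exit !(v1 "" < "" v2) }'
}

if str_lt 10 9; then
    echo '"10" sorts before "9" as a string'
fi
```

The need to spawn awk for a single comparison is the kind of overhead the proposed "<" and ">" primaries are meant to avoid.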
Desired Action Add to the list of primaries for "test" (circa page 3224):

* s1 == s2 True if the strings s1 and s2 are identical; otherwise, false. This primary is equivalent to s1 = s2.
* s1 < s2 True if the string s1 is lexicographically less than s2; otherwise, false.
* s1 > s2 True if the string s1 is lexicographically greater than s2; otherwise, false.
* pathname1 -nt pathname2 True if pathname1 exists and pathname2 does not, or if pathname1 is newer than pathname2 according to their modification times; otherwise, false. If pathname1 does not exist, the result is false.
* pathname1 -ot pathname2 True if pathname2 exists and pathname1 does not, or if pathname1 is older than pathname2 according to their modification times; otherwise, false. If pathname2 does not exist, the result is false.
* pathname1 -ef pathname2 True if pathname1 and pathname2 are hard linked, that is, refer to the same file. See section 3.191 {the section number defining "hard link".}
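The -nt wording above can be approximated today with the find workaround from the description. The following is a minimal sketch, not a definitive implementation: newer_than is a hypothetical name, and the two existence checks are written out only to mirror the missing-file rules proposed above.

```shell
# newer_than P1 P2: approximate the "-nt" wording proposed above.
# Hypothetical helper; operands starting with "-" would confuse find.
newer_than() {
    [ -e "$1" ] || return 1      # pathname1 missing: false
    [ -e "$2" ] || return 0      # pathname1 exists, pathname2 missing: true
    # find prints pathname1 only if its modification time is newer.
    [ -n "$(find "$1" -prune -newer "$2")" ]
}

# Typical make-like use (file names are placeholders):
if newer_than source.c source.o; then
    echo "source.o is out of date"
fi
```

The -ot case would be the same function with its operands swapped.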
Tags No tags attached.
Attached Files Extendingshellconditionals.pdf (324,995 bytes) 2011-11-15 19:25
Extendingshellconditionals.txt (23,138 bytes) 2011-11-15 19:27
Extendingshellconditionals-2011-11-23.pdf (368,158 bytes) 2011-11-23 23:49
Extendingshellconditionals-2011-11-23.odt (97,978 bytes) 2011-11-23 23:50
Extendingshellconditionals-2011-11-23.txt (30,171 bytes) 2011-11-23 23:50
Extendingshellconditionals-2011-11-28.odt (53,716 bytes) 2011-11-28 15:51
Extendingshellconditionals-2011-11-28.pdf (248,508 bytes) 2011-11-28 15:52
Extendingshellconditionals-2013-10-24.odt (58,679 bytes) 2013-10-24 15:15
Extendingshellconditionals-2013-10-24.pdf (256,955 bytes) 2013-10-24 15:16
Extendingshellconditionals-2013-11-29.odt (60,345 bytes) 2013-11-29 18:56
Extendingshellconditionals-2013-11-29-track-changes.pdf (1,443,146 bytes) 2013-11-29 18:57
Extendingshellconditionals-2013-11-29.pdf (1,300,831 bytes) 2013-11-29 18:58
Extendingshellconditionals-2013-12-08.odt (64,218 bytes) 2013-12-09 01:32
Extendingshellconditionals-2013-12-08.pdf (1,269,805 bytes) 2013-12-09 01:33
Extendingshellconditionals-2013-12-09.odt (65,088 bytes) 2013-12-10 04:52
Extendingshellconditionals-2013-12-09.pdf (1,258,736 bytes) 2013-12-10 04:53

- Relationships
has duplicate 0000762 Closed ajosey 1003.1(2008)/Issue 7 Add "==" as synonym for "=" in test
related to 0000813 Closed 1003.1(2013)/Issue7+TC1 Utility numeric argument syntax requirements ambiguous

-  Notes
(0000666)
dwheeler (reporter)
2011-02-07 18:49

Since this is a request for new functionality in the specification, I believe "Severity" should be "Objection" (not "editorial") and "Type" should be "Enhancement Request" (not "Clarification requested"). Sorry for the confusion.
(0000667)
Don Cragun (manager)
2011-02-07 20:23
edited on: 2011-02-07 22:22

Severity, Type, Description, and Desired Action changed as requested by submitter.

(0000668)
dwheeler (reporter)
2011-02-07 21:56

Note: In the "Desired Action" text above, "file1" and "file2" should be "pathname1" and "pathname2" respectively.
(0000669)
dwheeler (reporter)
2011-02-07 22:50

One issue that needs discussion is whether "<" and ">" should be affected by the current locale's collation sequence (LC_ALL if set, else LC_COLLATE, else LANG). Unfortunately, historical implementations seem to ignore the collation setting. So I recommend a compromise position, adding to each operator the statement that "The effective collation sequence must be set to POSIX or C; whether or not another setting is applied is implementation-defined." Other options are certainly possible; here's the background.


The GNU bash and ksh93 documentation lead me to believe that they IGNORE the collation settings when using "<" or ">" inside "test" and "[". GNU bash 4 documentation suggests that it *does* pay attention to the collation setting if you use its nonstandard "[[" extension, but not with "test" or "[". A quick test of GNU bash leads me to believe that its documentation (http://www.gnu.org/software/bash/manual/bashref.html) is correct - LC_COLLATE is ignored by these two operators.

If the POSIX group decides that these operators should match historical precedent, then this should be specifically documented. E.G., for both "<" and ">", add: "This test operates as if LC_COLLATE is set to POSIX." That's the *easy* thing to do.

Now, it *would* make sense if the proposed "<" and ">" primitives were affected by the current locale's collation sequence (LC_ALL if set, else LC_COLLATE, else LANG). If a shell implementer doesn't want to include a collation library in their shell, yet "test" is a built-in, their shell could implement "<" and ">" as a call out to the 'test' command (and then have *test* implement the collation defined by the locale). You could make the case that ignoring these settings is a bug in current implementations, as "<" and ">" are fundamentally collation comparisons.

However, this would be a *change* to current behavior. It would require implementations to change their behavior, and applications that care would have to switch to notations like this:
  if LC_COLLATE=POSIX [ "$string1" '<' "$string2" ]
I don't know how the shell implementers feel about changing the semantics of "<" or ">" in test to pay attention to the locale, or how much that would affect real code.

A useful compromise could be that "<" and ">" only have a defined meaning when the effective collation (as determined by LC_COLLATE and friends) is POSIX or C. This would mean that technically you'd have to set LC_COLLATE to do the comparison. This would grandfather existing code, and create a transition process so that eventually paying attention to the collation would be required. Basically, to meet the POSIX spec, application authors would know that they should add LC_COLLATE=... to their code, and then it'd be easier to add support for LC_COLLATE.

So I suggest adding the following to "<" and ">":
"The effective collation sequence must be set to POSIX or C; whether or not another setting is applied is implementation-defined."

If no consensus can be reached on these two primitives (< and >), I'd still like to see the others added.
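Whether a built-in test honors a locale variable placed in front of it is the open question in this note, so the sketch below pins the collation with expr instead, which POSIX already specifies as comparing non-integer operands using the locale's collating sequence. The name collate_lt is hypothetical, and using LC_ALL rather than LC_COLLATE is an assumption made so that an inherited LC_ALL cannot override the setting.

```shell
# collate_lt S1 S2: succeed if S1 collates before S2 in the C locale.
# Hypothetical helper. LC_ALL is used rather than LC_COLLATE because a
# value of LC_ALL inherited from the environment would otherwise win.
collate_lt() {
    # The "x" prefix keeps expr from treating an operand as an integer
    # or as one of its own operators.
    LC_ALL=C expr "x$1" '<' "x$2" >/dev/null
}

if collate_lt B a; then
    echo 'In the C locale, "B" collates before "a"'
fi
```

A script that wants locale-aware ordering instead could drop the LC_ALL=C prefix; the result then depends on the caller's environment.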
(0000670)
eblake (manager)
2011-02-07 23:13

Due to the fact that < and > must be quoted with test (or [), and in light of the issues of some current test implementations failing to implement locale-specific collation, I'd much rather see an effort made to standardize [[ (where < and > do not have to be quoted, and where collation must be done according to locale), than to standardize < and > for test with loose semantics. But adding the other four operators (==, -ot, -nt, -ef) makes sense, although I think the wording for -ef should be "names the same file" rather than "are hard links", seeing as how 'touch a; ln -s a b; test a -ef b' sets $? to 0 on at least bash, coreutils, and ksh.
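The symlink observation can be reproduced with a short script. This is a sketch that assumes the running shell implements the -ef extension; ef_kind is a hypothetical helper, and mktemp is itself a common extension rather than a POSIX utility.

```shell
# ef_kind F1 F2: report how a shell's "-ef" extension treats two names.
# Hypothetical helper; assumes the running shell implements -ef.
ef_kind() {
    if [ "$1" -ef "$2" ]; then
        echo same
    else
        echo different
    fi
}

demo=$(mktemp -d)
touch "$demo/a" "$demo/c"
ln "$demo/a" "$demo/hard"   # hard link: same device and inode
ln -s a "$demo/sym"         # symbolic link that resolves to a
for name in hard sym c; do
    echo "$name: $(ef_kind "$demo/a" "$demo/$name")"
done
rm -rf "$demo"
```

On the implementations named in this note, the sym case is expected to report "same", which is why "names the same file" describes existing behavior better than "are hard links".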
(0000688)
dwheeler (reporter)
2011-03-08 03:15

FYI, here's a (partial) report on implementation effort, specifically on adding "==" to the test/[ implementation in dash.

It turns out to be trivial to modify dash to support "==". It's 1 line, plus a few lines to update comments and documentation. Here is the patch:
http://permalink.gmane.org/gmane.comp.shells.dash/498

I do not know if the dash project will add this functionality; that's up to them. However, several have commented that they'd prefer to add this (or just about any other functionality) only if it's specifically added to POSIX. There *did* seem to be universal agreement that they would *definitely* add this functionality if "==" was in the newer POSIX spec, at least, that's how I understand the comments.

As noted earlier, "==" is *already* in bash, busybox ash, and ksh, so it requires nothing to add to them. I now also know that it'd be trivial to add to dash, and based on that experience, it'd probably be trivial to add "==" to other shells as well. And, as I noted earlier, "==" is *already* in wide use in shell scripts. Not all "trivial to add" and "already in wide use" capabilities are added to the POSIX spec, but they *are* reasonable reasons to add something to the POSIX spec.

Thanks.
(0000754)
jsonn (reporter)
2011-04-24 13:16

I don't see any value in adding "==". The only reason why it is used by a bunch of scripts is because some popular shells like GNU bash accept it (even in POSIX mode). As such this only legalises the dependency of completely redundant vendor extensions. The GNU toolchain is not even consistent in this regard, since GNU coreutils has a test that doesn't support ==.
(0000755)
dwheeler (reporter)
2011-04-24 19:01

Why add "=="? The issue is that "=" is already used for shell variable assignment. Using the SAME spelling for both "is-equal-to" and "assignment" is confusing, and thus is something that *MANY* languages avoid. Many languages that use "=" for assignment use a SEPARATE "==" operator for comparison; this includes C, C++, Java, C#, Python, and Perl. Many languages (like Pascal) that use "=" for comparison use another spelling like ":=" for assignment, again, to keep their spellings separate. In some cases these languages can always disambiguate from context, and even then, they intentionally do NOT use the same spelling. It is inconsistent and misleading that the shell, unlike all these other languages, conflates the spellings of assignment and is-equal-to. I think it's too late to get rid of "=" for comparison, but it's easy to add "==" as a synonym, which is what is proposed.

In short, adding "==" (1) clarifies whether assignment or comparison is being used, and (2) improves consistency between shell and the MANY other languages that use "=" for assignment.

It's true that many shell scripts use "==" and that a lot of popular shell implementations support "==". But that is not an argument against standardization; that is an argument *FOR* standardization. If a lot of implementations support it (they do), and a lot of users use it (they do), then I think that's a pretty good argument that it should be added to the standard. This is EASY for implementations to add, it takes a trivial amount of space (even Busybox, which is for embedded systems, has it), and it's easy to add to the standard, too.

As far as GNU coreutils' test goes, FSF's development version of test added "==" support on 2011-03-22. So yes, the current GNU toolchain doesn't support "==", but that will be fixed soon. So GNU test is an argument *FOR* adding support for "==".

Here's the current state of "==" in some implementations of sh and test that I know of; there's a lot of support for it, and it'd be easy to add to others:
* (GNU) bash: Supports ==.
* pdksh (public domain korn shell): Supports ==. See: http://web.cs.mun.ca/~michael/pdksh/pdksh-man.html
* mksh (MirBSD(TM) Korn Shell): Supports ==. See http://www.mirbsd.org/mksh.htm
* GNU coreutils test: Added on 2011-03-22.
* Official ksh: N/A; Official ksh from AT&T doesn't have test/[ built-in. I originally said that ksh supports "==", but it turns out I was actually testing a pdksh that was named ksh, so my comments were really about pdksh noted above. Its "[[" condition DOES now support "==" as a synonym for "=", and "=" is considered obsolete.
* NetBSD: Doesn't support "==", but a patch has been submitted to add it. The last comment (2011-03-18) on it was positive, but I don't know what they will do about it (http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=44733).
* dash: Does not support "==", but doing so is a one-line patch. Patch submitted 2011-03-06. Developers seemed to agree that if POSIX added "==" as a requirement, dash would implement it.
* OpenBSD's /bin/sh supports "==" (it's not documented, but it DOES work).
* FreeBSD-current's /bin/sh and /bin/test have recently added "==". See http://svn.freebsd.org/base/head/bin/test/test.c
* busybox ash: Supports ==.

This is such a common vendor extension that I think it's proved itself; it's time to standardize it.

*ALL* of the implementations agree that "==" is a synonym for "=", so there's no question about "different semantics so which one should we implement?". Indeed, a good reason to add this to the standard is to make sure that everyone else does the same thing in the future.
(0000756)
jsonn (reporter)
2011-04-24 21:48

The "= looks like an assignment" argument is bogus. Unlike in C, Python, Pascal or any other language in this league, you cannot write "test $a=foo" or "test $a==foo" and expect it to work. By the same reasoning, "a = foo" doesn't work either. Mixing the two up is a sign of not knowing the shell language at all.

OpenBSD's /bin/sh is a ksh. FreeBSD added == for compatibility with bash etc. None of this makes a good reason other than "people don't know how to write portable scripts". With that argument, bash could just be declared the standard, since that's what everyone is using after all...
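The point about spacing can be illustrated with a short sketch (the helper names are hypothetical). Without spaces around the operator, test receives a single argument and only checks that it is a non-empty string, so the unspaced form succeeds for any operands.

```shell
# Hypothetical helpers contrasting the two spellings.
unspaced_eq() { [ "$1=$2" ]; }      # one argument: true if non-empty
spaced_eq()   { [ "$1" = "$2" ]; }  # three arguments: string equality

a=bar
if unspaced_eq "$a" foo; then echo 'unspaced form: true'; fi
if ! spaced_eq "$a" foo; then echo 'spaced form: false'; fi
```

Conversely, an assignment written with spaces ("a = foo") is parsed as running a command named a with two arguments.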
(0000758)
dwheeler (reporter)
2011-04-25 20:45

See part 2 of my argument, namely, that adding == would improve "consistency between shell and the MANY other languages that use '=' for assignment". There's value to working WITH "muscle memory" and having consistency with other languages when it's easy to do so. C, C++, Objective-C, Python, Perl, PHP, C#, Java, and many other languages that use "=" for assignment also use "==" for is-equal-to. It is inconsistent that test does not, which is why this addition is implemented so widely.

There are now at least 6 implementations of sh or test that accept "==" by my count; bash is only one of them. Most standards organizations prefer to only standardize stuff that's implemented at least once, and prefer to standardize something that's implemented more widely. This is already implemented widely, and that should be a point in its favor.

As far as "legalising the dependency of completely redundant vendor extensions", I disagree. Redundancy for backwards-compatibility is perfectly reasonable, first of all, and while the semantic is the same, the spelling is different. And as far as extensions go, I believe POSIX has always permitted extensions and that each version of POSIX has taken some vendor extensions and pulled them into the specification itself. It's the usual way to get better ideas tried out before they're standardized, and once they're in the standard, they aren't "vendor extensions" - they ARE part of the standard.

Anyway, I hope that clarifies things.
(0000759)
markh (reporter)
2011-04-26 09:25

There are many variants of is-equal-to. "==" is commonly used to test for numeric equality, which is -eq in "test". Some languages have one equality operator that tests for either string or numeric equality depending on the types of the operands ("test" obviously does not fall into that category since it is not typed) and in that case "==" may also be used to test for string equality.

Because "test" has separate operators for numeric and string equality, it seems particularly confusing to make "==" an alias for the string equality operator "=". In other languages, including C and Perl which also have separate numeric and string equality operators, "==" is used for numeric value comparisons. So if someone familiar with these languages sees the "==" they are more likely to incorrectly interpret it as a numeric comparison. For someone writing a shell script, it would seem to be better if their implementation did not support the "==" extension and instead produced an error. Besides the minor benefit of helping them to write more portable shell scripts (even if the standard is changed, "=" is portable even to older implementations), it would also help those familiar with "==" from other languages to avoid a programming error. Since the standard string comparison operator is "=", someone using "==" may do so because they are familiar with C/Perl and intended to write a numeric comparison (-eq), or because they are familiar with a language that uses "==" for both numeric and string comparisons and did not think about what kind of comparison they actually wanted, so they may end up unintentionally using the wrong comparison without realizing it.
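The difference between the two kinds of equality in test can be seen with operands that are numerically equal but spelled differently. This is a minimal sketch with hypothetical helper names; it assumes the implementation reads a leading zero as part of a decimal integer.

```shell
# Hypothetical helpers: the same operands under each kind of equality.
str_equal() { [ "$1" = "$2" ]; }    # compares the characters
num_equal() { [ "$1" -eq "$2" ]; }  # compares the integer values

if ! str_equal 1 01; then echo '1 = 01: false'; fi
if num_equal 1 01; then echo '1 -eq 01: true'; fi
```

A reader who expects "==" to behave numerically, as it does in C or Perl, would get the first result where the second was intended.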
(0000760)
wpollock (reporter)
2011-04-26 09:57

I disagree with those who think adding == will make shell scripts less clear and more error prone. Nearly every popular language uses == for equality testing: C, C++, C#, PHP, Java, JavaScript, Ruby, Python, ... , and many modern shells. It is true that many languages support several equality tests, but there is no good reason to break shell by redefining == and -eq to match e.g. perl.

What I wonder is, should expr allow == as well, for consistency? Note awk already uses ==.
(0000761)
jsonn (reporter)
2011-04-26 10:36

I am not arguing that adding == makes shell scripts less clear. I am arguing that it is completely redundant and therefore without merit. Please sit down for a moment and go back to the start. Imagine for a moment that GNU bash wouldn't silently provide this "feature" even in /bin/sh mode. The basic result would be that no one would use it and we would not have this discussion about adding a redundant operator.

Note that I don't have a problem with the other operators. I would suggest being consistent and having -nte etc. too. I am not sure about the mnemonic behind -ef, and "<" and ">" beg for a string name, but all those operators add value.
(0000967)
ajosey (manager)
2011-09-22 08:59

This defect report was discussed at the September 8 teleconference meeting.

It was agreed that the best way to progress this would be for the
submitter to produce a whitepaper expanding the proposal (similar
to proposals made in the past, for example the LFS proposal,
http://www.unix.org/version2/whatsnew/lfs20mar.html).

This could then be widely circulated amongst all interested parties
to look for consensus.

The standard developers recommend that the white paper should pay
particular attention to note 670.
(0000974)
dwheeler (reporter)
2011-09-25 00:36

That sounds very reasonable. I will create a first draft and post its URL here for discussion.
(0000975)
gber (reporter)
2011-09-25 18:51

It should be noted that there are widespread implementations of test -nt/-ot with different and incompatible semantics in FreeBSD/NetBSD/OpenBSD and dash. These test implementations all trace their roots to the test builtin of pdksh before version 5.2.14; the difference from the behavior described above is that test returns failure if the second file does not exist. Test implementations with this behavior have been used by NetBSD since 1994 and by FreeBSD since 1999, and dash seems to have used it since the first Linux port of ash in 1993.
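Given this divergence, a script that has to run on both families can decide the missing-file case itself and consult -nt only when both files exist. The sketch below assumes the running shell implements -nt and that implementations agree once both files are present, which is what this note implies; newer_if_both_exist is a hypothetical name.

```shell
# newer_if_both_exist P1 P2: only consult -nt when both files exist, so
# the missing-file case never reaches the implementation-specific part.
# Hypothetical helper; assumes the running shell implements -nt.
newer_if_both_exist() {
    [ -e "$1" ] && [ -e "$2" ] && [ "$1" -nt "$2" ]
}
```

Here a missing operand always yields false; a caller that wants "missing second file means newer" would test for that case explicitly before calling the helper.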
(0001011)
dwheeler (reporter)
2011-11-15 17:26

Per comment 967 above, I've created a whitepaper:

https://docs.google.com/document/d/1Gd9r0f0rmmUIZlBnO4NyhTz2q-_1PHHGVCHAglFbJoY/edit

I tried to take all of the comments into account. In particular, I've given up adding < and > into test/[ itself; as is noted above, they're awkward to use (they require quoting) and implementations differ on whether or not they use the locale. Instead, I propose adding "[[" ... "]]", with < and > primaries that are easier to use and always locale-aware.

I used the LFS proposal as a model, so the rationale text is separate from the proposal itself. I also propose a specific semantic for -nt and -ot, and I give the specific rationale for it.

Everything in the updated proposal is widely implemented and already in use in shell scripts. There's some variation in -nt and -ot; I'd rather define a specific semantic that would require minor changes in a few implementations, but if that's a bridge too far, I also describe an alternative semantic that would require no changes to anyone who implements -nt and -ot (though at the cost of weakening the semantics somewhat).

Comments are, of course, very welcome!
(0001013)
ajosey (manager)
2011-11-15 19:30

I have also added the proposal referenced in bugnote 1011 as an attached file to this bug in txt and pdf formats
(0001044)
dwheeler (reporter)
2011-11-23 23:57

I have tried to respond to the comments on this proposal, including those by David Korn and Geoff Clare.

The revised proposal for bug 375 is here:
https://docs.google.com/document/d/1Gd9r0f0rmmUIZlBnO4NyhTz2q-_1PHHGVCHAglFbJoY/edit
I have also attached the revised proposal in .pdf, .txt, and .odt formats to this bug report.

Please let me know what needs fixing, and preferably how to fix it.
(0001060)
geoffclare (manager)
2011-11-28 15:50
edited on: 2011-11-28 16:10

I have looked at the new Nov 23 draft of the white paper and I spotted quite a lot of changes that are still needed to the proposed wording changes in the standard. Rather than describe them, I thought I would take the more direct route of downloading the odt file and making the changes myself. The changes are recorded in the new odt file; you can choose whether to see them or not in OpenOffice/LibreOffice (and I assume in other alternatives). I have also produced a clean pdf that does not show the changes, just the end result. I will attach them to the bug after posting this note.

As with my emailed comments, I have only addressed the proposed changes to the standard, not the rationale, in the white paper. Also I have not attempted to fix the style issues.

(0001061)
dwheeler (reporter)
2011-11-28 16:38

Regarding comment #1060:

I like Geoff Clare's improved proposal, that's fantastic. Thanks for doing that.
(0001064)
Roger Marquis (reporter)
2011-11-29 03:26
edited on: 2011-11-29 03:27

The /bin/sh shell does not have these features because of the need, by systems administrators, for cross-platform, backwards and forwards compatibility.

Since /bin/sh is the system (admin's) shell it is technically outside of the scope of the Austin Group to make changes. This scope has been violated in the past; however, that does not change the charter. Most importantly, this proposed change to /bin/sh would degrade compatibility more than all previous systems-administration-related changes combined.

There certainly is a need for these operators in non-system shells and interpreted languages such as Perl, Ruby and Python but that does not mean there is a need for these operators in /bin/sh or that the damage to cross-platform compatibility would be offset by their addition. Indeed, people use /bin/sh instead of other shells or interpreters specifically because its syntax does not change, will not break across implementations, and does not need to be upgraded like non-system packages and applications.

The proposed changes would not improve compatibility between operating systems over time and would not increase compatibility across OS distributions; they would not improve compatibility between operating systems in any way. For this reason and the other reasons spelled out above, this change to /bin/sh is A) a bad idea and B) outside of the Austin Group's jurisdiction.

Perhaps this would be a good time to fork the POSIX shell definition to specify ksh and bash as unique, changeable and distinct from /bin/sh?

(0001065)
dwheeler (reporter)
2011-11-29 04:51

Regarding comment #1064:

I disagree.

1. POSIX.1-2008 is to enable "portability of application programs" (page iv) - and the shell interface sh is DEFINITELY used by application programs. For example, users certainly write shell scripts; make by default executes commands using the shell (modulo optimizations); and functions system() and popen(), which are used by user applications, also depend on the shell. Therefore, the definition of sh is clearly within the POSIX specification's scope. Yes, system administrators often use sh, but they also use ls, cp, mv, and many other commands that users and application programs use. The fact that system administrators and application programs share certain interfaces does not make these commands out-of-scope for POSIX. My understanding from page iv is that facilities ONLY for system administrators are excluded from POSIX... but clearly sh is not exclusively used by system administrators. Besides, POSIX.1-2008's XCU chapter 2 *defines* the shell interface; it "describes the syntax of that command language as it is used by the sh utility and the system( ) and popen( ) functions defined in the System Interfaces volume of POSIX.1-2008". I'm not creating a new scope, merely trying to improve what's already there. If the Austin Group has a radically different scope than POSIX.1-2008 page iv ("Purpose"), please let me know.

2. Many shell implementations, including whatever /bin/sh is, implement [[ and ]]. POSIX page 2301 lines 72489-72491 *specifically* reserves [[ ... ]] because of their widespread use. Thus, anyone using [[...]] is already NOT limiting themselves to the POSIX specification, and the POSIX specification already warned people about this. The primaries are likewise not in the POSIX specification, so no POSIX-compliant shell script uses them anyway.

3. You could argue that in some environments it's too much to ask /bin/sh to implement something. But that's hard to justify in the case of "==", "-nt", and "-ot" at least; even busybox's sh implements these (busybox is widely used in embedded systems). You could make a better case for that with "[[", I suppose. But that doesn't seem to be the argument you're making anyway.

4. The proposals add to the specification some capabilities that are already widely used and depended upon. They would certainly improve compatibility between operating systems and distributions over time, since people could then reliably depend on them across any POSIX-compliant system.

Thanks.
(0001869)
dwheeler (reporter)
2013-10-10 01:15

I now think I put too much into a single proposal, sorry about that.

I've separated the proposal about "==" into a separate bug report: http://austingroupbugs.net/view.php?id=762

My hope is that by separating that out, it will be easier to discuss each of these proposals.
(0001881)
dwheeler (reporter)
2013-10-12 04:08

Note that http://www.opengroup.org/austin/docs/austin_630.txt discusses this (including the now-separate 762).
(0001944)
geoffclare (manager)
2013-10-24 15:23

I have uploaded a new version of the white paper as
Extendingshellconditionals-2013-10-24.{odt,pdf}. As before, the odt
file records the changes since the previous version and the pdf is a
clean version.

This version addresses some new comments made by David Korn to the
core team mailing list and fixes some other problems I spotted while
working on them. I have also had a go at the context-dependent
recognition rules for the new DB_UNARY and DB_BINARY tokens.

As previously, I have only worked on the proposed changes to the
standard, not on the rationale part of the white paper.
(0001945)
geoffclare (manager)
2013-10-25 10:26

Just noticed something I missed in the update to the white paper.
It gives the format of the "[[" command as:

[[ conditional-expression ]]

but it should be:

[[ conditional-expression ]][io-redirect ...]
(0001950)
ranjit (reporter)
2013-11-01 15:36

I'd just like to get my opposition to == on the record for this bug, in line with the objections from markh and jsonn, and as outlined in http://austingroupbugs.net/view.php?id=762#c1923. Note that this is not just about redundancy, nor how easy it is to implement, but simply about language design and consistency.

AFAIK all [[ already implement == as a synonym for = and if POSIX merely specifies = all implementations will conform, and be free to add == if they choose. Thus = would work exactly as it does in bash [[, with fnmatch semantics, and quoting used to suppress special chars like '*'. And one would use = in both [ and [[; '=' is also the style recommended by #bash; based on many years of learning from that channel, and seeing which form greycat and others _always_ use.
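The pattern-versus-literal rule described for "=" inside "[[" (an unquoted right-hand side is an fnmatch-style pattern, and quoting suppresses the special characters) already exists in POSIX sh for case patterns. The sketch below uses case so that it runs without "[[" support; the helper names are hypothetical.

```shell
# Hypothetical helpers showing pattern versus literal matching with the
# POSIX case construct.
matches_glob()    { case $1 in f*)   return 0 ;; esac; return 1; }
matches_literal() { case $1 in "f*") return 0 ;; esac; return 1; }

if matches_glob foo; then echo 'foo matches the pattern f*'; fi
if ! matches_literal foo; then echo 'foo is not the literal string f*'; fi
```

In a shell that provides "[[", the corresponding spellings would be [[ $a = f* ]] and [[ $a = "f*" ]].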

greycat is the main op, and responsible for the wooledge.org[1] site which includes the BashFAQ, BashPitfalls, and BashGuide, which are in essence *the* best resources for learning BASH available. My point simply is that professional BASH scripters do not use '==', and stick to '=' for the reasons given: in a nutshell to remind ourselves that we are writing shell, not some other language we may also use to implement.

"Muscle memory" from other languages is a bad idea, as this is not another language, and the discussion in that note, simply underlines exactly why sh was *first* _designed_ to use = for string-comparison. There is no visual ambiguity as jsonn discussed, so that is a spurious argument as well, and should not be given any credence. It certainly does not belong in the Rationale, as it implies POSIX does not understand sh. Given: a=$foo vs: [[ $a = "$foo" ]] -- is anyone seriously claiming that those can be confused?

The proposed change for DB_BINARY from "one" to "two" chars is also not in the pdf; I'd ask that be included as well, again for consistency and clarity in language-design, and simplicity of implementation and use.

[1] http://mywiki.wooledge.org/
(0001953)
dwheeler (reporter)
2013-11-01 19:09

I'd like to record my disagreement about "==", in response to #1950. Yes, people will be free to add "==" to test if they wish. In fact, it'd be impractical to prevent it, since a vast number of shells support it (bash, pdksh, mksh, GNU coreutils test, OpenBSD /bin/sh, FreeBSD /bin/sh, busybox ash).

The problem is that I am constantly trying to change "==" to "=" to work with dash, which is the *only* shell that I normally encounter that does not already support "==" as equality in test. This is a losing game, and absurd. It would be far more helpful to accept "==" as a synonym for "=" in sh.
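
As a minimal sketch of the portability point, the comparison below uses only the "=" operator that POSIX already specifies for test/[, so it should behave the same in dash as in the shells that also accept "==". The helper name str_equal is hypothetical and used only for illustration.

```shell
# Hypothetical helper: portable string equality.
# Uses "=", the only string-equality operator POSIX specifies for test/[.
str_equal() {
    [ "$1" = "$2" ]
}

if str_equal "abc" "abc"; then
    echo "strings match"
fi
```

Writing [ "$1" == "$2" ] instead is the form that shells without the "==" extension reject.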

The "professional scripters" for bash I see use "==", not "="; clearly there are sampling differences.

I agree that [[...]] should be added to the spec, but I don't see an adequate justification for making test/[ obsolescent.
(0001954)
shware_systems (reporter)
2013-11-01 19:43

It was brought up during the 31-Oct phone call that ksh has 3 forms of equality test, namely
= or == WORD for glob matching,
= or == "TEXT" for straight equality,
=~ WORD or "EXPR" for ERE matching.

test and [ have
= WORD or "TEXT" for straight equality, with == as synonym in most cases.

What is missing is BRE's as the argument, so I was thinking for [[ to un-obsolete = and use
= WORD as straight equality like with test, and
= "EXPR" as a BRE match.

This gives = and == separate, useful functionality. They're not synonyms, in other words. Whether ksh changes to use this would be up to dgk. Some ksh scripts that are using = still in [[ may need editing to use == only, to work properly in sh, but if ksh doesn't change it can still execute those without edits. Those people that want to convert scripts from test to [[ are editing them anyways. I think backward incompatibility isn't really a factor, from the application side. As this is new for sh as a requirement there's no real issue from the implementation side either.

I'm not taking sides whether = or == should be used, or test deprecated or not. I think this is something that can be lived with and fills the hole of no BREs at all in shell.
(0001979)
dwheeler (reporter)
2013-11-10 14:46

I am the original proposer for adding [[...]], and I still think it is a good idea to add it. I'm glad it's being seriously discussed.

However, I am *opposed* to making "test" and "[ expr ]" obsolescent as mentioned in http://austingroupbugs.net/view.php?id=762#c1948

First of all, the test/[ construct is in WIDESPREAD use. Normally marking something as obsolescent is a warning that it is likely to be removed in the future. Yet this isn't true in this case. Comment 1948 admits that test/[ will "never be removed from the standard". And I agree that it should not be removed in the future; there are probably millions of lines of shell scripts that depend on this functionality. If it will never be removed, it is erroneous to mark it as obsolete.

Second, there are use cases where [[..]] is a bad idea. The obvious example is the widely-used autoconf tool; obsoleting "test" would mean that obvious comparisons would look like SOMETHING([[[[x == y]]]], ...).

I do agree test's "-a" and "-o" are problematic. If you want to mark something as obsolescent, mark those as obsolescent, and recommend the use of "&&" and "||" respectively when using sh. But don't throw out the baby with the bathwater.
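
A minimal sketch of that recommendation, assuming two hypothetical helpers (both_nonempty and either_nonempty) that exist only for illustration: each "-a" or "-o" inside a single test becomes two separate tests joined by the shell's own "&&" or "||".

```shell
# Hypothetical helper: true when both arguments are non-empty.
# Replaces the problematic form: [ -n "$1" -a -n "$2" ]
both_nonempty() {
    [ -n "$1" ] && [ -n "$2" ]
}

# Hypothetical helper: true when at least one argument is non-empty.
# Replaces the problematic form: [ -n "$1" -o -n "$2" ]
either_nonempty() {
    [ -n "$1" ] || [ -n "$2" ]
}
```

Each bracketed test then has a fixed, small number of operands, which avoids the parsing ambiguities that arise when operands happen to look like operators.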
(0002017)
dwheeler (reporter)
2013-11-22 22:13

This was discussed in the 21 November 2013 Teleconference; notes are given here:
http://www.opengroup.org/austin/docs/austin_635.txt

In that discussion there was a preference to only add == to [ and test, and add -nt -ot etc. to [[ only. I will modify the current proposal based on that discussion.
(0002031)
dwheeler (reporter)
2013-11-29 19:11

I've updated and posted "Extending shell conditionals" dated 2013-11-29. This responds to the decisions of 2013-11-21. The proposal adds "==" to test/[ (and nothing else), and adds "[[...]]" which in turn contains -nt, -ot, -ef, <, >, and pattern comparison operators (among others).

Since Geoff Clare and David Korn have made so many proposals, I've claimed all three of us as proposal authors. Please let me know if there's an objection; I'm just trying to give credit where credit is due.

This updated proposal adds various capabilities to POSIX that are already implemented (by multiple implementations) and are widely used. It'll be a big improvement to have them in the POSIX specification itself. As far as I know, all proposal issues have been resolved, but please let me know if there are additional issues or fixes that need to be made.
(0002032)
dwheeler (reporter)
2013-11-29 19:24

Regarding comment #0001954:

No, "=" should NOT mean BRE matching inside [[...]]. This would directly conflict with the MirBSD Korn shell, where its "=" means the same as the proposed "==". Indeed, the MirBSD Korn shell version I have on Cygwin doesn't even *support* "==" right now, never mind suggesting that it is obsolete.

And while it may be noted "obsolete" in Korn shell, it'd still represent a silent change for implementations. That is best avoided.

To be honest, I think supporting ERE matching is enough. That's usually what you want for regexes. But if it's REALLY important to support BRE matching (vs. ERE matching), I think a completely different operator syntax should be used, say "=~b" or some such.
(0002033)
jilles (reporter)
2013-12-01 22:52

Regarding [[ -v word ]], this is redundant with -n "${word+set}" if word is a literal. If not, you still need eval (or non-standard namerefs) to actually access the value, so why not do that for the test as well. It seems to make more sense to add [[ -v word ]] in a separate proposal, together with other features for specifying variable names indirectly.
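
The eval-based alternative mentioned above might look like the following sketch in plain POSIX sh. The function name is_set and its return code 2 for an invalid name are assumptions made for illustration; they are not part of any proposal text.

```shell
# Hypothetical helper: succeed if the variable whose NAME is given in $1
# is set, even if it is set to the empty string.
is_set() {
    # Accept only valid variable names so eval never sees arbitrary text.
    case $1 in
        ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*) return 2 ;;
    esac
    # Expands to: [ -n "${NAME+set}" ]
    eval "[ -n \"\${$1+set}\" ]"
}

demo_var=""
if is_set demo_var; then
    echo "demo_var is set (possibly empty)"
fi
```

For a literal name no eval is needed: [ -n "${demo_var+set}" ] gives the same answer directly.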

Regarding the new grammar rule, I think it makes sense to recognize ]] everywhere, so that things like
  [[ a == ]] && ...
or
  [[ -n ]] ...
can be rejected at the presumably erroneous ]].

Some of the rationale for 'test' (line 109222-109225 in posix.1-2008+tc1) discussing why [[ was not added should be removed if this is accepted.
(0002034)
shware_systems (reporter)
2013-12-02 08:59

Re: 0002032
Yes, you're more right than not... BREs would be nice but aren't that necessary. I overlooked a couple things that would make adding them overly ambiguous as presented, and even as a separate operator would mean possible grammar changes so is better left as a possible extension. One point, though, is that the proposed '==' text has "Characters that are quoted in pattern (e.g., inside double-quotes) shall not be treated as special.", which is not explicit about applying just to pattern special characters per XCU 2.13 or all shell special characters per XCU 2.2 and 2.13. The latter would have double quotes acting more like single quotes, which supports the straight compare "TEXT" form for inside '[[ ]]' as the intent I heard, but changes it from how "..." handled for test.
(0002035)
shware_systems (reporter)
2013-12-02 11:15

Re: 0002033
"Regarding [[ -v word ]], this is redundant with -n "${word+set}" if word is a literal."

While redundant if it doesn't exclude special parameter characters allowed by ${parameter+word} form, or ignores the set +u/-u option, it adds clarity of intent and is probably faster to process. If it is limited to WORDs that could be assignment variable names or ignores -u that should probably be mentioned, perhaps as an Application Usage or Rationale entry. If it isn't common to all current '[[' extensions, or has differing behavior amongst them, I agree it should be a separate proposal.
(0002052)
dwheeler (reporter)
2013-12-08 22:06

Re: 0002033 and 0002035
Regarding "-v":

The "-v" option is in ksh93:
http://www2.research.att.com/sw/download/man/man1/ksh.html

The "-v" option is also in the latest version of bash:
https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html#Bash-Conditional-Expressions
Note that this is a recent addition; bash version 4.0.23(1) does NOT have it.

The MirBSD ksh does NOT have "-v". However, there's no conflicting interpretation of "-v", so it could be added without trouble:
https://www.mirbsd.org/htman/i386/man1/mksh.htm

I tried out the ksh93 version, and there is NO limitation on word; "-v VAR" checks for literal VAR, while "-v $VAR" as expected retrieves the content of VAR, allowing indirect checking if a given value exists.

I could go either way on this one. Yes, it's a little redundant. However, the fact that it's implemented by multiple implementations makes me think we may as well include it in the [[ ... ]] proposal. For the moment I plan to continue to keep it, but I have no strong feelings about it.

Comments?
(0002053)
dwheeler (reporter)
2013-12-08 22:48

Re: 0002034

"One point, though, is that the proposed '==' text has "Characters that are quoted in pattern (e.g., inside double-quotes) shall not be treated as special.", which is not explicit about applying just to pattern special characters per XCU 2.13 or all shell special characters per XCU 2.2 and 2.13."

You're right, that needs to be much clearer. Here's my try below, suggestions welcome:

*****
In addition, the following conditional operators shall be supported inside a conditional command:
string == pattern
 true if the entire string matches the pattern as described in “Pattern Matching Notation” (section 2.13); otherwise false. Characters have a special meaning in pattern matching notation (“?”, “*”, and “[“) that are quoted in pattern (using <backslash>, single-quote, or double-quote) shall not have a special meaning, and shall instead match themselves.
string =~ regex
 true if string matches the extended regular expression (ERE) regex as defined in section 9.4; otherwise false. Characters that are quoted in regex (using <backslash>, single-quote, or double-quote) shall not be treated as an ERE special character, but shall instead be treated as an ERE ordinary character.
*****
(0002054)
dwheeler (reporter)
2013-12-08 23:05

Re: 0002033
"Regarding the new grammar rule, I think it makes sense to recognize ]] everywhere, so that things like
  [[ a == ]] && ...
or
  [[ -n ]] ...
can be rejected at the presumably erroneous ]]."

I *think* the grammar rule already does this. A "]]" is going to be considered "Rdbracket", so [[ a == ]] ... would fail to meet the syntax requirements.

We don't really want it recognized "everywhere", because that would interfere with backwards compatibility. For example, this should print [[ ]]:
  echo [[ ]]
I note that both bash and mksh work this way.

We could make that clearer by adding this:
"Once a conditional command is begun with the command “[[”, an unquoted “]]” word shall terminate it."
(0002055)
dwheeler (reporter)
2013-12-09 01:35

I've posted files that update the proposal per previous comments. I left "-v" in (unchanged), but I did try to clarify that quoting only (by itself) disables the special characters for pattern-matching.

Please let me know if there are any remaining issues.
(0002056)
shware_systems (reporter)
2013-12-09 06:55

Re: 0002052
"I tried out the ksh93 version, and there is NO limitation on word; "-v VAR" checks for literal VAR, while "-v $VAR" as expected retrieves the content of VAR, allowing indirect checking if a given value exists."

By "special parameter character" in 0002035 I meant those of XCU 2.5.2, which '${' provides the syntax hook prefix for, so "${*+set}" evaluates to 'set'. The possible differences would be in how does -v currently handle those, as in "-v $*", "-v $!" or "-v $$", or "VAR = '$*'; [[ -v $VAR ]];".

Re: 0002053
"In addition, the following conditional operators shall be supported inside a conditional command:
string == pattern
 true if the entire string matches the pattern as described in “Pattern Matching Notation” (section 2.13); otherwise false. Characters have a special meaning in pattern matching notation (“?”, “*”, and “[“) that are quoted in pattern (using <backslash>, single-quote, or double-quote) shall not have a special meaning, and shall instead match themselves."

Suggest:
"In addition, the following conditional operators shall be supported inside a conditional command:
string == pattern
 true if the entire string matches the pattern as described in “Pattern Matching Notation” (section 2.13); otherwise false. Characters with a special meaning in pattern matching notation (“?”, “*”, and “[“) that are quoted in the pattern (using <backslash>, or within single- or double-quote pairs) shall not have a special meaning, and shall instead match themselves."

Putting it that way avoids any people thinking they should use "?"*"[ instead of "?*[". Similar applies to '=~' text. Additionally, a WORD fully in double-quotes already inhibits field splitting and path expansions, so having that as a separate restriction doesn't appear warranted. A caveat that not using quotes may lead to mysterious behavior depending on expansions used does still apply.
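
The quoting rule in the suggested text mirrors how the existing POSIX case construct already treats patterns from section 2.13, so a case statement can serve as a sketch of the intended "==" behavior. The function name classify is hypothetical.

```shell
# Hypothetical helper: show how quoting changes pattern matching.
classify() {
    case $1 in
        "a*b") echo "literal" ;;  # quoted "*" matches only an actual asterisk
        a*b)   echo "glob" ;;     # unquoted "*" matches any string
        *)     echo "none" ;;
    esac
}

classify 'a*b'    # matches the quoted pattern first
classify 'axyzb'  # matches only the unquoted pattern
```

Under the proposed wording, [[ $s == "a*b" ]] and [[ $s == a*b ]] would be expected to differ in the same way.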
=============================================
Also, the issue about limiting additional named unary operators to single characters in Note 10 still up in air, though was discussed on ML more than here:
"When the TOKEN is exactly “-b”, “-c”, “-d”, “-e”, “-f”, “-g”, “-h”, “-L”, “-n”, “-p”, “-r”, “-S”, “-s”, “-t”, “-u”, “-w”, “-x”, or “-z”, and the preceding token was neither DB_UNARY nor DB_BINARY, the token identifier DB_UNARY shall result. Implementations may recognize additional DB_UNARY tokens consisting of a <hyphen> and an alphabetic character from the portable character set. When the TOKEN is exactly “==”, “!=”, “=~”, “<”, “>”, “-ef”, “-eq”, “-ne”, “-gt”, “-ge”, “-lt”, “-le”, “-nt”, or “-ot”, and the preceding token was neither DB_UNARY nor DB_BINARY, the token identifier DB_BINARY shall result. Implementations may recognize additional DB_BINARY tokens consisting of one or more punctuation characters from the portable character set or consisting of a <hyphen> and one or more alphabetic characters from the portable character set."

Both "-v" and "=" by itself need to be included now, and I still think it should use "-_" as a prefix to reserve a namespace for unary tokens where more than one alpha desired, or needed for a 53rd operator.

Suggest:
"When the TOKEN is exactly “-b”, “-c”, “-d”, “-e”, “-f”, “-g”, “-h”, “-L”, “-n”, “-p”, “-r”, “-S”, “-s”, “-t”, “-u”, “-v”, “-w”, “-x”, or “-z”, and the preceding token was neither DB_UNARY nor DB_BINARY, the token identifier DB_UNARY shall result. Implementations may recognize additional DB_UNARY tokens consisting of a <hyphen> and an alphabetic character from the portable character set, or <hyphen><underscore> and one or more alphabetic characters from the portable character set. When the TOKEN is exactly “=”, “==”, “!=”, “=~”, “<”, “>”, “-ef”, “-eq”, “-ne”, “-gt”, “-ge”, “-lt”, “-le”, “-nt”, or “-ot”, and the preceding token was neither DB_UNARY nor DB_BINARY, the token identifier DB_BINARY shall result. Implementations may recognize additional DB_BINARY tokens consisting of one or more punctuation characters from the portable character set, excluding '-' and '-_', or consisting of a <hyphen> and two or more alphabetic characters from the portable character set."

I've also a concern for the integer comparison operators, whether it should be explicit as to numerical range that shall be supported, and 'out of range' being a new reserved error code that sh may return. This was not a potential issue with test being a separate utility, as it glossed it over as one of the '>0' return values, but is for '[[', I think.

Additionally, as ( e ) already marked obsolete in test, I don't think it should be included in '[[' with the same semantic. I can see it for its usual precedence override usage, as indicated by the grammar additions, but the text doesn't reflect that.
=============================================
Lastly, do any use -m integer and -o integer for negation (-minusing) and ones complementing? All I see now is ! integer for logical negation. They'd be aliases of $((-integer)) and $((~integer)), I realize, but less typing like with -v.
(0002059)
dwheeler (reporter)
2013-12-10 04:46

Re: 0002056

"By "special parameter character" in 0002035 I meant those of XCU 2.5.2, which '${' provides the syntax hook prefix for, so "${*+set}" evaluates to 'set'. The possible differences would be in how does -v currently handle those, as in "-v $*", "-v $!" or "-v $$", or "VAR = '$*'; [[ -v $VAR ]];"."

These examples don't make sense to me:

1. First of all, in both ksh93 and bash, running "echo ${*+set}" produces a blank line, not "set". And that makes sense to me, too. After all, the "*" variable has not been set by an assignment (it's a special parameter). I don't see any text about whether or not special parameters are considered "set", but if so, that text probably belongs earlier in the ${...+...} text.

2. Also, I think you're missing the semantics of "-v"... perhaps they are not clearly stated? It would be unusual for the parameter after "-v" to begin with "$", and many of the examples listed above seem pointless. "[[ -v $$ ]]" would retrieve the current PID, and then see if a variable by that numeric name is set. The parameter after -v *could* begin with "$", to do indirect access, so it should be permitted... but typically users will want to do stuff like "[[ -v flag ]]".

I like your rewording of the condition text to make it clearer... I'll change it to match. Thanks!

I don't know how many implementations would use -_X for more unary operators, but reserving a namespace seems like a reasonable idea. Also, thanks for the list fixes.

"I've also a concern for the integer comparison operators, whether it should be explicit as to numerical range that shall be supported, and 'out of range' being a new reserved error code that sh may return. This was not a potential issue with test being a separate utility, as it glossed it over as one of the '>0' return values, but is for '[[', I think."

I think we should appeal to the "Arithmetic Expansion" section, which already specifies a range (at least signed integer). Here's what I propose:
"Minimum requirements for numeric processing in conditional expressions are defined in section 2.6.4 (“Arithmetic Expansion”)." The arithmetic expansion text already admits that arithmetic can fail, and just notes that there would be an error. Do we really need more detail? It's not clear that we could get more agreement than that.

"Additionally, as ( e ) already marked obsolete in test, I don't think it should be included in '[[' with the same semantic. I can see it for its usual precedence override usage, as indicated by the grammar additions, but the text doesn't reflect that."

The whole reason the "( e )" rule is given is *specifically* to override precedence. I think it already does that, however, it sounds like that's not clear enough. So I'll append to the text about "( e )" the following: "This is used to override other precedence rules."


"Lastly, do any use -m integer and -o integer for negation (-minusing) and ones complementing? All I see now is ! integer for logical negation. They'd be aliases of $((-integer)) and $((~integer)), I realize, but less typing like with -v. "

I don't know of any shell with "-m integer" or "-o integer" built-in. But as you mentioned, the $((...)) expansion already includes - negation, ~ bitwise complement, and lots of other goodies, and you can include $((...)) inside [[...]]. In addition, ksh93 uses "-o" for other functionality (e.g., to determine if a given option is on).
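A small sketch of that point, using only arithmetic expansion and the standard test utility. The helper names negate and complement are hypothetical, chosen to correspond to the suggested "-m" and "-o" operators.

```shell
# Hypothetical helper: arithmetic negation via $((...)).
negate() {
    echo "$((0 - $1))"
}

# Hypothetical helper: bitwise (ones') complement via $((...)).
complement() {
    echo "$((~$1))"
}

# The expanded results can be used directly as operands of a numeric test.
if [ "$(negate 5)" -lt 0 ] && [ "$(complement 5)" -eq -6 ]; then
    echo "negation and complement behave as expected"
fi
```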
(0002060)
shware_systems (reporter)
2013-12-10 09:41

Yes, for general usage the special parameters don't make a lot of sense, and the examples could be better, but they do expand to WORDs. As parameters, special or otherwise, can mostly be used where variable names are used, maybe it's better to ask if WORD is expected to be treated as the parameter of ${parameter} or just as a NAME like the non-switch arguments to unset. That concern is a special case of that general question. I could see someone wanting to do something like:
params=1; while ! [[ -v $params ]] { . . . ;params=$(($params+1));}
to walk an argument list with over 9 arguments in a script, where some might be explicitly "" when IFS includes a comma.

"I don't know how many implementations would use -_X for more unary operators,"

Yes, it looks a bit silly but is in keeping with leading underscore being reserved identifiers for variables in C. Here '-_' would be an operator more than '-' '_op' being the semantic. I'm open to just '_' as a leading char being used also, if var names are limited to starting with portable alpha, but that could be visually confusing, e.g. '_v' and '-v', and easier to typo.

"I think we should appeal to the "Arithmetic Expansion" section, which already specifies a range (at least signed integer)."

That limits it to signed long as integer, which would be mostly compatible, yes. I'm thinking for scripts dealing with testing file offsets or sizes, they may need long long. It somewhat begs the question should that be updated also, or be a set option to toggle long or long long arithmetic as range, as a separate issue.

"The whole reason the "( e )" rule is given is *specifically* to override precedence. I think it already does that, however, it sounds like that's not clear enough. So I'll append to the text about "( e )" the following: "This is used to override other precedence rules." "

That text being:
"( e )
true if expression e is true, otherwise false."
It has the additional behavior of an implied "-eq 0" above the override function, as stated, converting all grouping to logical when sub-expressions may need to be string or integer still. While with the current operators that's the net effect anyways, I'm trying to leave open that the text doesn't require all implementation defined operators have a logical result type. The -m and -o are more examples of that than I expected they are implemented yet, but it'd be nice, from my perspective.
(0002061)
dwheeler (reporter)
2013-12-10 16:08

Re: 0002060

"Yes, for general usage the special parameters don't make a lot of sense, and the examples could be better, but they do expand to WORDs. As parameters, special or otherwise, can mostly be used where variable names are used, maybe it's better to ask if WORD is expected to be treated as the parameter of ${parameter} or just as a NAME like the non-switch arguments to unset. That concern is a special case of that general question. I could see someone wanting to do something like:
params=1; while ! [[ -v $params ]] { . . . ;params=$(($params+1));}
to walk an argument list with over 9 arguments in a script, where some might be explicitly "" when IFS includes a comma."

That makes more sense. I tried this out with ksh when passing one parameter. ${1+set} returns "set" as expected, but "[[ -v 1 ]]" is false in the same circumstance. Ugh. I think people don't normally use special parameters with [[ -v .... ]], so nobody noticed. I think it's easily argued that this is a bug in ksh; it's hard to imagine anyone DEPENDING on this behavior.

We could say that '[[ -v VAR ]]' returns the same truth result as '[ "${VAR+set}" = "set" ]'. I like that idea; that would leave them consistent, at least. But I think we're exposing that the spec is unclear about when special parameters are set.

Should we require that [[ -v 1 ]] return true if $1 is set? Or should we just declare that the results are undefined for special parameters in either case?
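
For reference, here is a sketch of the ${...+set} form applied to positional parameter 1, the case where ${1+set} gives the expected answer. The function name first_arg_state is hypothetical.

```shell
# Hypothetical helper: report whether a first positional parameter was
# passed, using the same form as [ "${VAR+set}" = "set" ].
first_arg_state() {
    if [ "${1+set}" = "set" ]; then
        echo "set"
    else
        echo "unset"
    fi
}

first_arg_state          # called with no arguments
first_arg_state ""       # called with one explicitly empty argument
```

An explicitly empty argument still counts as set, which is the distinction a consistent [[ -v 1 ]] would need to preserve.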


I said:
"I think we should appeal to the "Arithmetic Expansion" section, which already specifies a range (at least signed integer)."

Comment reply:
"That limits it to signed long as integer, which would be mostly compatible, yes. I'm thinking for scripts dealing with testing file offsets or sizes, they may need long long. It somewhat begs the question should that be updated also, or be a set option to toggle long or long long arithmetic as range, as a separate issue."

Well, technically it doesn't limit *implementations* to signed long, instead, that is the limit that applications can portably depend on. But ignoring that nit...

I completely disagree with the premise. [[..]] would normally be implemented as a built-in in sh. As such, it doesn't make sense for [[..]] to mandate support for arithmetic calculations for data types larger than the type sh uses for arithmetic expressions. You can't compare data you can't process! If you think sh should handle larger values, I suggest arguing the case as a separate change to the "Arithmetic Expansion" section.

Now I agree that sometimes you need to compare larger values. But you can use other tools, like bc, in such special circumstances.

Regarding "( e )", "It has the additional behavior of an implied "-eq 0" above the override function, as stated, converting all grouping to logical when sub-expressions may need to be string or integer still. While with the current operators that's the net effect anyways, I'm trying to leave open that the text doesn't require all implementation defined operators have a logical result type. The -m and -o are more examples of that than I expected they are implemented yet, but it'd be nice, from my perspective."

Oh, I see. As you note, with the proposed standard operators it doesn't make any difference anyway. But I don't mind making the text more flexible so that extensions could allow value returns; that seems to provide flexibility with no real downside. Okay, I think we should change that text to be more flexible, probably to something like "returns the value of e".
(0002063)
jilles (reporter)
2013-12-10 22:36

Re: 0002054

"recognize ]] everywhere" was not clear enough. What I had meant to say was to add to the new grammar rule (that is, after "After line 74771 (2013 edition) add a new grammar rule" in the document), something like:

When the TOKEN is exactly "]]", the token identifier Rdbracket shall result.

Note that this leaves "!" as the only reserved word that is not recognized after a DB_UNARY or DB_BINARY token.

Re: 0002059

I think the numerical range should match the range for "test". The "test" page does not specify it explicitly, so XCU 1.1.2.1 Arithmetic Precision and Operations applies, stating that the type is signed long following the rules of the C standard (therefore, overflow is undefined and may happen to be implemented as a larger type).

Referring to arithmetic expressions may confuse things: the decimal numbers accepted by e.g. [ -eq are not a subset of arithmetic expressions. For example, "010" evaluated as a decimal number is ten, but "010" evaluated as an arithmetic expression is eight.
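The difference can be sketched with the standard test utility and arithmetic expansion. This assumes a shell whose test reads operands as plain decimal, as bash and dash do.

```shell
# As an operand of test's -eq, "010" is read as the decimal number ten.
if [ 010 -eq 10 ]; then
    echo "test: 010 equals 10"
fi

# In arithmetic expansion a leading 0 marks an octal constant, so this is eight.
echo "arithmetic: $((010))"
```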
(0002064)
dwheeler (reporter)
2013-12-11 00:46

Re: 2063

Comment: "recognize ]] everywhere" was not clear enough. What I had meant to say was to add to the new grammar rule (that is, after "After line 74771 (2013 edition) add a new grammar rule" in the document), something like: "When the TOKEN is exactly "]]", the token identifier Rdbracket shall result." Note that this leaves "!" as the only reserved word that is not recognized after a DB_UNARY or DB_BINARY token.

Ah, now I understand. Okay, I'll make that change in an updated proposal.


Comment says: "Referring to arithmetic expressions may confuse things: the decimal numbers accepted by e.g. [ -eq are not a subset of arithmetic expressions. For example, "010" evaluated as a decimal number is ten, but "010" evaluated as an arithmetic expression is eight."

I've run some tests, and it turns out that this is a subtle difference between bash and ksh93/mksh (ugh). Basically, bash is more like test, while ksh93/mksh omits the capability to handle base 8 and base 16 prefixes.

So... let's get into the details. The arithmetic expression says that, "Only the decimal-constant, octal-constant, and hexadecimal-constant constants specified in the ISO C standard, Section 6.4.4.1 are required to be recognized as constants." It turns out that bash implements this as well, so inside [[ ... ]], bash will convert 0xNUM and 0NUM from base 16 or 8 when asked to do a numeric comparison. In contrast, ksh93 and mksh do not. For example, this prints two "0" lines in bash:
 [[ 0x10 -gt 15 ]]
 echo $?
 [[ 010 -lt 9 ]]
 echo $?
In contrast, ksh93 prints "1" (false) both times, and will print "0" (true) for this:
 [[ 010 -eq 10 ]]
 echo $?


I think the bash behavior should be required in this case. These numeric operators (such as -eq and -lt) are *specifically* for numbers, so there's no ambiguity, and failing to support base 16 and base 8 is a gratuitous incompatibility and loss of functionality as compared to test. I think the goal is to encourage people to use [[...]] instead of [...] as much as possible; losing base 8 and base 16 is a strange incompatible omission.

While this would require changing ksh93 and mksh, I think it highly unlikely that people are supplying "0x10" and assuming this does NOT mean hexadecimal. It's more plausible for "010" meaning the same as "10", but again I doubt it, and it's not hard to strip leading 0s from data if you need to.
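
Stripping leading zeros can be sketched with POSIX parameter expansion alone. The function name strip_zeros and its scratch variable are hypothetical, and the sketch handles only unsigned digit strings.

```shell
# Hypothetical helper: remove leading zeros from an unsigned digit string.
strip_zeros() {
    _sz_n=$1
    # ${_sz_n%%[!0]*} is the leading run of zeros; remove it as a prefix.
    _sz_n=${_sz_n#"${_sz_n%%[!0]*}"}
    # An all-zero input collapses to the empty string, so restore a single 0.
    echo "${_sz_n:-0}"
}

strip_zeros 010
strip_zeros 000
```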

So for the moment I intend no change, but this is certainly worthy of discussion. Comments?
(0002065)
shware_systems (reporter)
2013-12-11 00:48

"Well, technically it doesn't limit *implementations* to signed long, instead, that is the limit that applications can portably depend on. But ignoring that nit..."

No, it doesn't, as extensions. For conforming mode it does limit them, though, as overflows are expected to both produce diagnostic messages on variable contents atoi evaluation or conversion when a real used to evaluate an expression internally, and for all cases when untrapped raise a SIGFPE abort as SIG_DFLT action. That's how I interpret the last paragraph of XCU 2.6.4 and "Only signed long integer arithmetic". It may be portable but appears limiting nowadays on 64-bit processors.

"I completely disagree with the premise. [[..]] would normally be implemented as a built-in in sh. As such, it doesn't make sense for [[..]] to mandate support for arithmetic calculations for data types larger than the type sh uses for arithmetic expressions. You can't compare data you can't process! If you think sh should handle larger values, I suggest arguing the case as a separate change to the "Arithmetic Expansion" section."

I agreed with that, that it should be consistent, but it appears current practice includes it as the de facto preference so is why I added 'mostly'. jilles' point about being consistent with test as an alternative is also worth discussion, IMO. As a separate grammar section is being used it isn't necessarily limited to how AEs handle it. I'm not saying it has to be one way or another, just that enough thought is given so whatever decided is least likely to come back as portable but non-useful.

"Now I agree that sometimes you need to compare larger values. But you can use other tools, like bc, in such special circumstances."

That's doable, but tends to obfuscate the intent and is implicitly slower to execute. If bc had a -c switch like sh, so $(bc -c " . . . ") could be used for single expression evaluation, that would still be pretty clear even with the exec() penalty. A self contained script needs a pipe or 'here doc' based construct for similar effect as things stand, which may have I/O related side effects also complicating things.
(0002066)
dwheeler (reporter)
2013-12-11 00:58

Re: 0002064 and 0002063.

Whups, please allow me to correct myself. I just reviewed the spec for "test", and it's not obvious to me that test/[ requires support for 0xNUM or 0NUM. I thought it did, but while $((...)) does require support for conversion, there doesn't seem to be a requirement to do so in test/[. I did a few tests and found that at least some test/[ implementations do not implement base conversions with -eq for example. Oddly enough, bash's implementation of "test" does not support 0x10, but its implementation of [[...]] *does* support 0x10.

So should base conversions in [[...]] be forbidden (since the test spec, ksh93, and mksh do not include it)? Required (as bash does)? Silently allowed (which is the easiest for the implementers, though not for the users)? Thoughts welcome.
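
As a rough illustration of the gap described here: arithmetic expansion is required to accept hexadecimal and octal constants, so a script that cannot rely on test/[ doing the conversion can normalise a value through $((...)) first. This is only a sketch. The helper name to_decimal is hypothetical and not part of any proposal, it handles unsigned values only, and it validates its input first because feeding arbitrary text to $((...)) is not safe in every shell.

```shell
# Hypothetical helper: print the decimal value of an unsigned decimal,
# octal (leading 0) or hexadecimal (leading 0x) constant.
to_decimal() {
    case $1 in
        0[xX]) return 1 ;;                    # prefix with no digits
        0[xX]*[!0-9a-fA-F]*) return 1 ;;      # non-hex character after 0x
        0[xX]*) ;;                            # well-formed hex constant
        '' | *[!0-9]*) return 1 ;;            # empty, signed or non-numeric
        0*[89]*) return 1 ;;                  # leading 0 means octal: no 8 or 9
    esac
    # Arithmetic expansion performs the base conversion.
    printf '%s\n' "$(($1))"
}
```

With this, a comparison such as [ "$(to_decimal 0x10)" -eq 16 ] does not depend on whether the test implementation itself understands the 0x prefix.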
(0002068)
shware_systems (reporter)
2013-12-12 01:38
edited on: 2013-12-12 06:04

Re: 0002066
Given the mailing list discussion, it appears XCU 1.1.2.1 supersets and supersedes XBD 12.1, so both test and [[ ]], as part of sh, should be handling those bases as required, with [0[:alnum:]#] being a common extension for arbitrary bases that I feel should be left as an extension at this point. Versions that handle only decimal, with or without multiple leading zeros, may be compliant with a pre-SUSV6 version but, it looks, not with the current wording, so they are buggy from this version's perspective and, I'd hope, from Issue 8's as well. The semantics of C insist that the next char after a leading 0 shall be one of 0 to 7, 'x', 'X', 'u', 'U', 'l', 'L' for integers, or '.', 'e', 'E', 'p' and 'P', which switch it to a real representation, so multiple leading zeros are only allowed for base 8, not base 10. While limiting it to signed long removes the need to process suffixes (though the text should state normatively that they'll be silently ignored), the otherwise blanket language about conversions doesn't exclude properly determining the base from the prefix. Arithmetic expressions in [[ ]] fall under the permitted extensions, because binary operators may be unused combinations of punct chars in the proposed text, but those should be left as extensions also, IMO.

(0002176)
geoffclare (manager)
2014-03-06 17:15

Should [[...]] be required to support numeric values with magnitude up to the maximum file size supported by the implementation? This is required for test (see XCU 1.5), or at least it will be once the conflict with XBD 12.1 is fixed via 0000813.
(0002177)
ranjit (reporter)
2014-03-06 19:13

In response to #1953:
You have avoided the substantive point, which is that this is not about making commands easier and more consistent, nor about making POSIX sh some sort of superset of current implementations (so everyone's scripts "work"), but about the design of a language.

Your justification for this addition is:
> The issue is that "=" is already used for shell variable assignment. Using
> the SAME spelling for both "is-equal-to" and "assignment" is confusing,
> and thus is something that *MANY* languages avoid.
...which completely ignores the point that shell is ALREADY a different language. You cannot disallow the use of [ bar = "$foo" ] (there are far too many portable sh scripters who have it in their muscle-memory already), so this "confusion" point is moot. In fact you _add_ confusion when there are two operators with the exact same meaning, as I have seen with newbs in #bash (another reason no-one in there uses ==).

And if your sampling included anyone from most Linux distros, or who's learnt from TLDP ABS, then you likely haven't really spoken to very competent scripters. Linux distros, having completely failed to script anything very well, have in the main decided that means shell is bad. Hence systemd (+ javascript!).

I'm in favour of -nt, etc, as said, and indeed [[ without field-splitting. Just not == and the reasons are nothing to do with convenience (although I feel it is more convenient with just =, based on experience of teaching others both bash and sh, and far too many hours working on scripts) and everything to do with language consistency. There is clear contextual introduction to strcmp in shell, and further it is *impossible* to confuse it with assignment (rendering the justification void). In addition it reads more like algol, or like a mathematical statement, in tune with its history. You are in a different language: respect that, don't try to twist it into a hybrid.

As for ease of typing: by the time I've got there, I just want to hit = and move on to the RHS. I'm well aware I'm inside a [ by that point (I have to remember to close it). So I don't recognise any of your reasons as valid, except for someone who isn't really a shell-scripter, just dabbling while their muscle-memory and brain are still in another language. We'll have to agree to disagree, evidently.

With respect, I'd like the Committee to consider that they are guardians of the sh language, and that changes to it are of a different nature than changes to command-line utilities; adding sed -E is basic common sense. Adding == with the exact same meaning as = is bad language design.

Can you imagine WG14 adding another operator to Standard C, with the exact same semantic and meaning as an existing one, but a different token? I certainly cannot.

Since the proposer also wishes '=' inside '[[' to mean the same thing as '==', again I would recommend simply adding '=' with the fnmatch() semantic. POSIX can, and I would argue should, simply standardise a *consistent* approach, regardless of what implementation extensions currently exist, exactly as you are doing with make, a format and not a language.
(0002178)
shware_systems (reporter)
2014-03-06 21:22

Re: 0002068
During the phone discussion of Bug #813, it was stated the term 'decimal number' in XBD 12.1 is to be construed as requiring all utilities in conforming mode to treat leading zeroes as permitted when converting arguments required to represent integer values, thereby precluding all syntax formats that use a leading zero to indicate an alternate base is desired.
(0002179)
dwheeler (reporter)
2014-03-07 15:46

Re: comment #2177:

The use and implementation of "==" instead of "=" is *already* widespread; see comment #755. This proposal merely updates the specification to match many users' expectations and widespread practice. A lot of people, including me, *prefer* "==" in sh (e.g., because it is visually distinct from "="). It would be possible to deprecate "=" and require all future scripts to use "==", but I think that would be an unnecessary burden on script writers for no good purpose.

There are other cases where POSIX provides more than one mechanism for a purpose to support worthy goals such as portability and backwards compatibility. Even test itself can be spelled "test" or "[", so it's not true that sh tests have "only one way to do things". This is just another example of allowing multiple approaches to support larger goals.
(0002180)
dwheeler (reporter)
2014-03-07 15:52

Re: 0002178

That's good to know, thanks for clarifying the point about leading zeros.
(0002181)
shware_systems (reporter)
2014-03-11 14:52

Re: 0002180
I'm just reporting it as how I heard what the ORs felt was existing practice, despite XCU 1.1.2.1 requiring otherwise because of its deferral to c99 section 6.5.1p3. Consensus was not reached, IOW, even if it is just me dissenting. For the utilities where XCU 1.1.2.1 applies, including sh and the c99 compiler, it is up to the application developer to detect and strip leading zeroes before use or storage of an argument, and to detect and convert to a decimal base any value stored using an alternate base format before using it as an argument to another utility.
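
The stripping step mentioned above can be sketched with parameter expansion alone. The function name strip_leading_zeros is hypothetical. It assumes its argument has already been validated as an optionally signed string of digits, and it uses plain global variables (n, sign, zeros) because portable sh has no local.

```shell
# Hypothetical helper: drop leading zeros so that the value cannot be
# mistaken for an octal constant by a later consumer.
strip_leading_zeros() {
    n=$1
    sign=
    case $n in
        -*) sign=-
            n=${n#-} ;;
        +*) n=${n#+} ;;
    esac
    # ${n%%[!0]*} removes everything from the first non-zero character on,
    # leaving just the run of leading zeros (or all of n if it is all zeros).
    zeros=${n%%[!0]*}
    n=${n#"$zeros"}
    if [ -z "$n" ]; then
        # The value was zero: print a single 0 with no sign.
        n=0
        sign=
    fi
    printf '%s%s\n' "$sign" "$n"
}
```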
(0002182)
dwheeler (reporter)
2014-03-12 23:05

Okay. It seems to me that the "easy" way forward is to not *require* support for leading 0 (and bases) for integers in the spec, though allowing it in implementations. It's not hard to get this functionality in other ways.
(0002186)
ranjit (reporter)
2014-03-14 13:25

Again you are ignoring the substantive point. I am not interested in hearing a repetition of an argument I have just addressed, as if that would change my mind. This is *not* a command-line utility. It is a language, so arguing from the point of view of "widespread practice" is muddle-headed. WG14 refuse to standardise things just on that basis, and so should POSIX when it comes to the syntax of a language under its purview.

Let's leave it there, since there is no point in talking past each other. Both sides have been adequately expressed and repeating the same points is futile, and no doubt tedious to read.
(0002418)
rhansen (manager)
2014-10-13 16:44

I noticed that the 'Requirements' section doesn't list backwards compatibility. Is this intentional? Are we not worried about breaking existing scripts by introducing '[['? (People are unlikely to encounter any problems, but it's still theoretically possible.)
(0002419)
geoffclare (manager)
2014-10-13 16:52

The standard already reserves '[[' in order to allow it as an extension. See XCU 2.4.
(0002420)
dwheeler (reporter)
2014-10-13 16:55

Regarding comment 2418: As comment #2419 noted, the sequence "[[" is already reserved. In addition, support for "[[" is widespread. The goal is to standardize widely-used practice. The [[...]] form adds important functionality and reduces the risks of certain kinds of errors.
(0002421)
rhansen (manager)
2014-10-13 17:54

Ah, right, I forgot that [[ was already reserved.

Additional questions:
  • Would it be appropriate to add '[[ -o <name> ]]' to this proposal? (This would be used to test if a shell option is enabled.)
  • Do existing implementations always expand all WORDs, including those that have been short-circuited via '||' or '&&'? (Some expansions have side effects.)
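
One way to investigate the second question empirically is a small probe, sketched here under the assumption that the shell named in $1 is installed and supports [[ ]]. The function name short_circuit_probe is hypothetical. It puts an arithmetic expansion with a side effect on the right of && after a left operand that is already false, then reports whether the side effect happened. No result is asserted here; it has to be run against each shell of interest.

```shell
# Hypothetical probe: prints 0 if the word to the right of && was NOT
# expanded (the assignment never ran), 1 if it was expanded anyway.
short_circuit_probe() {
    # $1: name of a shell that implements [[ ]], e.g. bash
    "$1" -c 'n=0; [[ -n "" && -n $((n=1)) ]]; echo "$n"'
}
```

For example, short_circuit_probe bash prints the result for whatever bash is first on PATH.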
(0003382)
stephane (reporter)
2016-09-19 21:46

For the =~ part, we'd need to define a new type of token for what's on the right of =~.

For instance in [[ aacc =~ (aa|bb)c ]], currently (aa|bb)c can't be a WORD as (, | and ) are token delimiters (note that [[ "a(a)" = *(a) ]] doesn't return true in bash or ksh93).

See also [[ a1 =~ (a)\1 ]] that returns true in bash and false in ksh93 (\ is not like the other quoting operators). See also [[ a =~ \b ]] (or \>...).

There's also the questions of the effect of quoting in things like [[ a =~ ["$chars"] ]], should [[ : =~ ["^:#"] ]] return true like in bash or false like in ksh93?

There's also the question of ~ expansion. Should

   HOME=/
   [[ / =~ ~ ]]

return true like in bash or false like in ksh93?
(0003383)
shware_systems (reporter)
2016-09-20 17:16

The intent in the proposal is that the right-hand side respects the extended_reg_exp production of the grammar in XBD 9.5, so it is already defined. I agree that adding a production clause to db_factor, or modifying Note 10, is desirable to make this cross reference clearer in the grammar. Also, the cross reference to 9.4 in the description of =~ is missing "XBD" as volume referent.

As the ~ is a non-special ERE character, I don't believe it should be expanded to / for the compare, so I think the ksh93 result is closer to the intent. For [[ / = ~ ]] as the expression, I'd expect that to return true.
(0003384)
stephane (reporter)
2016-09-20 20:34

Re: Note: 0003383

1- about grammar

That doesn't work. For instance the space character is not special in EREs, yet

[[ $a =~ a b ]]

doesn't work obviously.

The behaviour for the a\ b ERE is unspecified. We may need to acknowledge somehow that backslash is overloaded as a quoting operator in the shell and as an escaping operator for EREs.

For instance, is re='a\b'; [[ a =~ $re ]] meant to be treated the same as [[ a =~ a\b ]]? (It's not in bash.) And what about re='a"."b'; [[ a =~ $re ]] vs [[ a =~ a"."b ]]?

See also:

$ bash -c '[[ a =~ & ]]'
bash: -c: line 0: syntax error in conditional expression: unexpected token `&'
bash: -c: line 0: syntax error near `&'
bash: -c: line 0: `[[ a =~ & ]]'

(same for #, < , >)

IMO, the original [[ =~ ]] implementation (in bash31), which is also the one of zsh, where shell quoting doesn't make any difference in terms of RE matching, is a lot cleaner and easier to specify (applications just do [[ $var =~ '(..|X)&' ]], making sure the argument to =~ is a normal *shell* WORD).

Or we could just give up on specifying [[...]] altogether (my preference as it's broken by design except in zsh IMO as its =/== operator is not an equality operator and it has hardly any benefit over the "[" command) and add the =~ operator to "[" like in yash or zsh which would also be straightforward to specify and unambiguous.

2- about ~

Whether ~ is an ERE operator or not is irrelevant here. The fact that it's *not* an ERE operator (it is in ksh93 REs though, as in [[ a =~ (?K:~(E)a) ]]) is, if anything, a justification that ~ can be safely expanded. ` is not an ERE operator either, yet [[ Linux =~ `uname` ]] returns true on Linux with both ksh93 and bash.

It's just that we need to clarify what expansions are or may be performed in the argument to =~. Expanding ~ is not very useful because it's only expanded at the start of the RE, but I can't see why it wouldn't be when `...`, $param, $((arith)), $(cmd) are.

Also note that in bash the expansion of ~ ends up being quoted from an ERE point of view, which makes sense as ~ is meant to expand to a path, not a RE. As in HOME=.; [[ a/a =~ ~/a ]] returns false.

In bash, the same applies to process substitution.

[[ "<:" =~ <(:) ]] returns false (true in ksh93)

Both bash and ksh93 have taken some liberties with the standard inside [[...]] as that construct was not specified by the standard. It's going to be hard to come up with a specification that doesn't break either or both. And if you want to throw zsh in the picture as well, it's going to be more difficult... or simpler if we only specify [[ a =~ $var ]], that is leave it unspecified unless the word after =~ is an unquoted variable and the content of the variable contains a valid ERE. At the moment, that's the only thing that is portable across bash31, bash32-or-above, zsh and ksh93.
(0003385)
stephane (reporter)
2016-09-20 21:03
edited on: 2016-09-20 21:07

Re Note: 0003384
> Both bash and ksh93 have taken some liberties with the standard inside [[...]] as that construct was not specified by the standard

Amongst those liberties:

a='+(a)'; [[ a = $a ]]

currently returns true in both bash and ksh even though the current proposal would mandate it to return false.

While in case a in $a) ... it would not match, as required. Do we want to specify some of the ksh extended operators, or force bash and ksh to modify their [[...]] when in POSIX mode? Is there any good reason why one may want to use that pattern matching operator (with "=" being the operator!?) when we already have the much more sensible case foo in pattern) construct?
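
For reference, the case-based alternative mentioned above can be wrapped in a small function. This is a minimal sketch with a hypothetical name, glob_match. The pattern argument is deliberately left unquoted in the case arm so that it is treated as a shell pattern and not as a literal string.

```shell
# Hypothetical helper: succeed if string $1 matches shell pattern $2.
glob_match() {
    case $1 in
        $2) return 0 ;;   # unquoted: $2 is interpreted as a pattern
        *)  return 1 ;;
    esac
}
```

So glob_match bar 'b*r' succeeds and glob_match baz 'b*r' fails. Whether a ksh-style pattern such as '+(a)' is honoured here depends on the shell and on which extensions it has enabled.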

(0003386)
shware_systems (reporter)
2016-09-21 07:35

Re: 3384
Ok, I oversimplified a bit on 1. It is the result of the WORD, after substitutions and quote removal, but minus field splitting, globbing, and I was thinking tilde expansions, that is meant to be reparsed as an extended_reg_exp. It is left to the script author to figure out how much the input needs to be quoted so the result has blanks and other meta-characters EREs don't consider special appearing where required after the quote removal. I still think an additional token type isn't required for expressing this in the grammar. Possibly a Note 11 if this is cleaner than adding to Note 10, but the overall context is more similar to the pattern production than the context of ASSIGNMENT_WORD, that I see.

If tilde expansion is included then 2. should return true, otherwise no. I can see limiting tilde expansion to those operators that expect a filepath string as the WORD result, and non-special otherwise.
(0003387)
stephane (reporter)
2016-09-21 10:20

Let me rephrase. [[ =~ ]] in ksh93 and bash32+ is a bit broken by design, but to use it properly in both (which seems to be the intent of this proposal: to specify an operator compatible with both), it seems we need (for the argument after =~) to:

- quote &, <, >, blanks (locale-dependant (*)) and unmatched ) and some unmatched ] at least or we'd get errors in bash or ksh93
- *not* quote the ERE operators (.*?+[]{}()|$^) if we want them to retain their special ERE significance. You'll notice that several of those (()|) can't be in WORD
- quote `, ~, $ and quotes to remove their special meaning as shell tokens (but not inside [...] for some)
- to remove the special meaning of those ERE operators, we can use double quotes, single quotes, backslash or $'...' but not when they're inside [...].
- other characters can be quoted as well but not with backslash as that could introduce ERE extensions like \<, \>, \b, \w in ksh93.
- for parts of the arguments that are the result of an unquoted expansion other than ~ expansion, \ (in the content of the expansion) removes the special meaning of ERE operators and may introduce new ones.


Examples:

  a='<foo>' bash -c '[[ $a =~ <.*> ]]' gives an error
  a='foo' ksh93 -c '[[ $a =~ \<.*\> ]]' returns true (\<, \> taken as word boundaries)

You need to use:

  a='<foo>' shell -c '[[ $a =~ "<".*">" ]]' so it works in both shells or
  a='<foo>' shell -c 're="<.*>"; [[ $a =~ $re ]]' which also works in zsh and bash31

Other example:

  a='blah' shell -c '[[ $a =~ [xy)] ]]' gives an error in both ksh and bash
  a='\' shell -c '[[ $a =~ [xy\)] ]]' doesn't give an error but matches in ksh (not in bash), and [[ $a =~ [xy")"] ]] also matches on backslash in ksh93; bash used to have a similar bug.

  a='blah' shell -c 're="[xy)]"; [[ $a =~ $re ]]' works in all shells (zsh, bash31 included)

(*) as already discussed, since blank recognition is locale dependant, you get behaviours like:

$ LC_CTYPE=fr_FR.ISO8859-15@euro bash -c '[[ $a =~ tête-à-tête ]]'
bash: -c: line 0: syntax error in conditional expression
bash: -c: line 0: syntax error near `tête-à-tête'
bash: -c: line 0: `[[ $a =~ tête-à-tête ]]'

as that à UTF-8 character is 0xc3 0xa0 and 0xa0 happens to be a blank in ISO8859-15 locales on Solaris.

So in effect you may need to quote every character that is not in the portable character set in case they may be a blank in the user's locale.
(0003388)
shware_systems (reporter)
2016-09-21 22:44

This proposal, as a white paper, is a case where the behaviors are being merged to be future compatible more than backwards. This is also why test is being deprecated and marked LEGACY. Both it and the [[ extensions are broken one way or the other, so this proposal is taking what isn't broken and giving it a firmer logical model by adding it to the grammar with extensibility provisions. The caveat about what the standard says being reliable only in the C locale and locales using nationalized ASCII applies here too, for now; future Enhancement Reqs. against the V8 drafts that define additional required locales will address remaining localization concerns.

It is known this may break any current script using [[ as an extension. The changes to fix those scripts are expected to be minimal; as proposed, for EREs, enclosing the entire WORD in double quotes will suffice in most cases, only needing \", \` and \$ internally to disable substitutions processing. If tilde expansion is enabled and desired something like "head"~"tail" is required. The onus is already on script authors, if they don't enclose a WORD in quotes, to escape any character that implicitly delimits the token. New scripts are expected to do version checking that they're running on V8 or later to avoid conflicts of interpretation with older shells. It should not break any conforming scripts that were only using test with operators other than -a or -o.

Last, for portable scripts blank detection is currently not locale-dependant. For tokenization only the SPC and TAB code points count, as the default members of the blank charclass. Recognizing additional members of a locale's blank charclass is undefined behavior. Adding default members from ISO-646 is possible, but not likely.
(0003389)
stephane (reporter)
2016-09-22 08:42

Re: Note: 0003387
> - *not* quote the ERE operators (.*?+[]{}()|$^) if we want them
> to retain their special ERE significance. You'll notice that
> several of those (()|) can't be in WORD
[...]
> - to remove the special meaning of those ERE operators, we can
> use double quotes, single quotes, backslash or $'...' but not
> when they're inside [...].

Actually, I hadn't realised ksh93's implementation was so broken, so the above is incorrect. In ksh93, quotes other than backslash only escape the ERE operators that are also wildcard operators (?*[]()|&), not ^$.+, as in [[ a =~ "." ]] returns true (so does r=.; [[ a = "$r" ]]).

And [[ "\\" =~ ["?"] ]] (or any of the other glob operators in place of ?) returns true, like in old versions of bash for RE operators.

It looks like internally ksh93 replaces [[ x =~ y ]] with [[ x = ~(E)y ]]. [[ a = ~(E)"." ]] also returns true, while [[ a = "?" ]] returns false.
(0003390)
stephane (reporter)
2016-09-22 08:59

Re: Note: 0003388
> This proposal, as a white paper, is a case where the behaviors are being
> merged to be future compatible more than backwards. This is also why test is
> being deprecated and marked LEGACY. Both it and the [[ extensions are broken
> one way or the other, so this proposal is taking what isn't broken and giving
> it a firmer logical model by adding it to the grammar with extensibility
> provisions. The caveat about what the standard says being reliable only in
> the C locale and locales using nationalized ASCII applies here too, for now;
> future Enhancement Reqs. against the V8 drafts that define additional
> required locales will address remaining localization concerns.

With -a, -o, (, ) and an argument-less -t removed, "test"/"[" is no longer
broken.

What I'm saying is that the two requirements, that the =~ argument must be a WORD and that quoting ERE operators removes their special meaning, are incompatible (or at least make for a not very useful operator).

If you need to make it a WORD, then for instance, you can't do

[[ $string =~ foo|bar ]]

and if you do

[[ $string =~ "foo|bar" ]]

Then you're disabling that "|" ERE operator. You'd need:

re="foo|bar"
[[ $string =~ $re ]]

to make it work.

Then, you might as well specify the bash31/zsh behaviour where

[[ $string =~ "foo|bar" ]]

does test whether $string contains foo or bar, as opposed to containing "foo|bar".
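
For contrast, the bash32+ behaviour that the proposal follows can be checked with a small wrapper. This is a sketch only: the name ere_probe is hypothetical, and it assumes a bash of version 3.2 or later, running at its default compatibility level, is available on PATH as bash.

```shell
# Hypothetical probe: match string $1 against ERE $2 held in a variable,
# expanding that variable either quoted or unquoted after =~.
ere_probe() {
    if [ "$3" = quoted ]; then
        # Quoted expansion: ERE operators in $re lose their special meaning.
        bash -c 're=$2; [[ $1 =~ "$re" ]]' bash "$1" "$2"
    else
        # Unquoted expansion: the content of $re is used as an ERE.
        bash -c 're=$2; [[ $1 =~ $re ]]' bash "$1" "$2"
    fi
}
```

Under those assumptions, ere_probe foo 'foo|bar' unquoted succeeds, while ere_probe foo 'foo|bar' quoted fails because the quoted form only matches the literal text foo|bar.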

> It is known this may break any current script using [[ as an extension. The
> changes to fix those scripts are expected to be minimal; as proposed, for
> EREs, enclosing the entire WORD in double quotes will suffice in most cases,
> only needing \", \` and \$ internally to disable substitutions processing. If
> tilde expansion is enabled and desired something like "head"~"tail" is
> required. The onus is already on script authors, if they don't enclose a WORD
> in quotes, to escape any character that implicitly delimits the token. New
> scripts are expected to do version checking that they're running on V8 or
> later to avoid conflicts of interpretation with older shells. It should not
> break any conforming scripts that were only using test with operators other
> than -a or -o.

Tildes are not expanded in "head"~"tail"; they are expanded only at the beginning of a word and only when not quoted (though the rules are different in assignments).
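
The ordinary (outside [[ ]]) rule can be seen with a short demonstration. The function name tilde_demo and the HOME value are illustrative only. The body runs in a subshell so that the caller's HOME is left untouched.

```shell
# Illustrative only: shows where tilde expansion does and does not apply
# to ordinary command arguments.
tilde_demo() (
    HOME=/home/demo
    # 1st word: unquoted ~ at the start of the word, so it is expanded.
    # 2nd word: ~ in the middle of the word, so it is left alone.
    # 3rd word: quoted ~ at the start of the word, so it is left alone.
    printf '%s\n' ~/a "head"~"tail" "~"/a
)
```

This prints /home/demo/a, head~tail and ~/a on separate lines.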

> Last, for portable scripts blank detection is currently not locale-dependant.
> For tokenization only the SPC and TAB code points count, as the default
> members of the blank charclass. Recognizing additional members of a locale's
> blank charclass is undefined behavior. Adding default members from ISO-646 is
> possible, but not likely.

What makes you say that? IIRC both bash and yash explicitly made their tokenisation locale dependant for POSIX compliance. (I do agree it's not desirable though)
(0003391)
shware_systems (reporter)
2016-09-23 04:17

"foo|bar" after quote removal effectively gets passed to regcomp() as the C string foo|bar\0. A shell can use an internal workalike function that includes the effect of regexec()/free() too. That is what is expected to handle the extended_reg_exp production, not the code that handles shell operators. The result is then the same for
re="foo|bar"
[[ $str =~ $re ]]
or even =~ `printf %s%s%s foo '|' bar`.

The concerns you have apply if the ERE was an argument to [[ as a utility, as then the simple_command production governs interpretation, but as a keyword pair with its own productions those constraints don't hold.

The locale's charmap determines the bit patterns of each code point. So the shell is dependant on them for determining whether a byte is a defined char and which class it is. Hardcoding a reliance on the C locale's charmap may fail, as a result, if the charmap of a locale says the TAB code point is represented by 0xFF instead.

This is a separate issue from what code points are required to be in each charclass for all locales. For a script to be portable unquoted input with grammatical significance in terms of being token delimiters is limited to these code points. Relying on code points a locale may add to a class means the script can not function properly on a platform that doesn't provide localedef or a locale with the same additions, so to me is undefined behavior.

This does not limit scripts to using only the POSIX locale, but it is a subset of the locales scripts that don't care about portability might use.
(0003392)
stephane (reporter)
2016-09-23 06:59

Re: Note: 0003391

Mark, you're missing the point I'm trying to make since the
beginning. What you're describing (about quoting of RE
operators) is the zsh/bash31 behaviour, not the one of
bash32+/ksh93 or what this proposal specifies. Quoting the
proposal:

> string =~ regex
> true if string matches the extended regular expression (ERE)
> regex as defined in section 9.4; otherwise false. Characters
> that are quoted in regex (using <backslash>, single-quote, or
> double-quote) shall not be treated as an ERE special character,
> but shall instead be treated as an ERE ordinary character and
> thus only match themselves.

See also appendix B

So [[ $a =~ "foo|bar" ]] and s='foo|bar'; [[ $a =~ "$s" ]] is
testing whether $a contains "foo|bar" (as if by calling
regcomp("foo\\|bar")), [[ $a =~ foo|bar ]] is not allowed and
re='foo|bar'; [[ $a =~ $re ]] is matching $a against the foo|bar
ERE (regcomp("foo|bar")).

With bash-4.3:

$ ltrace -e regcomp bash -c '[[ a =~ "a|b" ]]'
bash->regcomp(0x7ffff013f3b0, "a\\|b", 1) = 0
+++ exited (status 1) +++
$ ltrace -e regcomp bash -c '[[ a =~ a|b ]]'
bash->regcomp(0x7fff33a64d50, "a|b", 1) = 0
+++ exited (status 0) +++
$ ltrace -e regcomp bash -c 're="a|b"; [[ a =~ "$re" ]]'
bash->regcomp(0x7fffba6ec710, "a\\|b", 1) = 0
+++ exited (status 1) +++
$ ltrace -e regcomp bash -c 're="a|b"; [[ a =~ $re ]]'
bash->regcomp(0x7ffe497b2700, "a|b", 1) = 0
+++ exited (status 0) +++
$ ltrace -e regcomp bash -c '[[ a =~ ["a|b"] ]]'
bash->regcomp(0x7ffe32decc20, "[a|b]", 1) = 0
+++ exited (status 0) +++
$ ltrace -e regcomp bash -c '[[ a =~ "[a|b]" ]]'
bash->regcomp(0x7ffea326a1e0, "\\[a\\|b]", 1) = 0
+++ exited (status 1) +++

With zsh-5.1.1:

$ ltrace -e regcomp zsh -c '[[ a =~ "a|b" ]]'
regex.so->regcomp(0x7ffc3aaa2ac0, "a|b", 1) = 0
+++ exited (status 0) +++
$ ltrace -e regcomp zsh -c '[[ a =~ a|b ]]'
zsh:1: parse error near `|'
+++ exited (status 1) +++
(0003393)
chet_ramey (reporter)
2016-09-23 13:05

Re: http://austingroupbugs.net/view.php?id=375#c3388

2.3 Token Recognition, rule 8:

"If the current character is an unquoted <blank>, any token containing the
previous character is delimited and the current character shall be
discarded."

<blank> is defined in 3.74 as

"One of the characters that belong to the blank character class as defined
via the LC_CTYPE category in the current locale. In the POSIX locale, a
<blank> character is either a <tab> or a <space>."

So unless you say that portable scripts are restricted to the POSIX locale,
which you may be saying, blank detection during tokenization depends on the
current locale.
(0003405)
shware_systems (reporter)
2016-10-10 11:35

Re: 3392
Yes, you're right; as stated it's a bit wonky. I was looking at it as that applied to escaping or single-quoted sequences inside double or single-quotes, like "foo'|'bar" or 'foo"|"bar' would match 'foo|bar', with quote removal still effective on the ERE as a WORD token before processing internal quoting like that according to the proposed text. The 'After quote removal, ' preamble is missing, though, and the caveat \' and \" are needed for them to be treated as the ERE ordinary chars they usually are. I'd expect
<foo\" =~ 'foo\"|\"bar'> to return true.

Re: 3393
I am not saying portable scripts are limited to the POSIX locale. I am saying that a script which limits the characters of the blank class it uses as token separators to the two, TAB and SPC, of the POSIX and C locales is portable whatever locale is in effect, as those two are required to be in every locale's blank charclass. A script author who knows that, on a particular system, a locale includes a <halfwidthspace> code point in its blank charclass can use it to separate tokens as well, but that script is not guaranteed to be portable.
(0003698)
mirabilos (reporter)
2017-05-19 21:20

Hello dwheeler,

please DO NOT add [[ to POSIX in a way that invalidates all existing scripts using it.

This means:

PLEASE *DO NOT* ADD [[ TO POSIX!


The *ksh world has been using [[ in scripts since forever and told people to use it exclusively for secure shell scripting, so basically *every* script uses it. If POSIX is going to invade on existing territory like that, it’s like a declaration of war.


Some notes on =~ (which mksh does not implement at this time): I’ve read attempts to put a RE grammar onto the RHS.
I don’t think this is the correct way because we also need to be able to put the RE/extglob into a variable, in order to make it possible to specify them dynamically:
x='b*r'; [[ bar = $x ]] # returns true

As for quoting… incidentally, I figured this out when adding the BSD [[:<:]] and [[:>:]] to mksh (while adding character classes in the first place):
[[ 'a foo b' = *[[:\<:]]foo[[:\>:]]* ]] # returns true

Combining the two from above:
x='[[:<:]]foo[[:>:]]'; [[ 'a foo b' = *$x* ]] # returns true
(0003700)
shware_systems (reporter)
2017-05-19 21:44
edited on: 2017-05-19 21:46

The proposal extends the ksh93 usage, not breaks it... Attention to being backwards compatible is part of the discussion. The main reason I see it's preferable to using [/test, security wise, is that it's built-in and can't be overridden, not due to any security specific features above what test provides. As it is an extension, no portable script uses it, since forever...


- Issue History
Date Modified Username Field Change
2011-02-07 18:34 dwheeler New Issue
2011-02-07 18:34 dwheeler Status New => Under Review
2011-02-07 18:34 dwheeler Assigned To => ajosey
2011-02-07 18:34 dwheeler Name => David A. Wheeler
2011-02-07 18:34 dwheeler Section => test
2011-02-07 18:34 dwheeler Page Number => 3224-3225
2011-02-07 18:34 dwheeler Line Number => 107503-107513
2011-02-07 18:49 dwheeler Note Added: 0000666
2011-02-07 20:23 Don Cragun Interp Status => ---
2011-02-07 20:23 Don Cragun Note Added: 0000667
2011-02-07 20:23 Don Cragun Severity Editorial => Objection
2011-02-07 20:23 Don Cragun Type Clarification Requested => Enhancement Request
2011-02-07 21:56 dwheeler Note Added: 0000668
2011-02-07 22:20 Don Cragun Description Updated
2011-02-07 22:20 Don Cragun Desired Action Updated
2011-02-07 22:22 Don Cragun Note Edited: 0000667
2011-02-07 22:50 dwheeler Note Added: 0000669
2011-02-07 23:13 eblake Note Added: 0000670
2011-03-08 03:15 dwheeler Note Added: 0000688
2011-04-24 13:16 jsonn Note Added: 0000754
2011-04-24 19:01 dwheeler Note Added: 0000755
2011-04-24 21:48 jsonn Note Added: 0000756
2011-04-25 20:45 dwheeler Note Added: 0000758
2011-04-26 09:25 markh Note Added: 0000759
2011-04-26 09:57 wpollock Note Added: 0000760
2011-04-26 10:36 jsonn Note Added: 0000761
2011-09-22 08:59 ajosey Note Added: 0000967
2011-09-25 00:36 dwheeler Note Added: 0000974
2011-09-25 18:51 gber Note Added: 0000975
2011-09-28 07:09 olof Issue Monitored: olof
2011-11-15 17:26 dwheeler Note Added: 0001011
2011-11-15 19:25 ajosey File Added: Extendingshellconditionals.pdf
2011-11-15 19:27 ajosey File Added: Extendingshellconditionals.txt
2011-11-15 19:30 ajosey Note Added: 0001013
2011-11-23 23:49 dwheeler File Added: Extendingshellconditionals-2011-11-23.pdf
2011-11-23 23:50 dwheeler File Added: Extendingshellconditionals-2011-11-23.odt
2011-11-23 23:50 dwheeler File Added: Extendingshellconditionals-2011-11-23.txt
2011-11-23 23:57 dwheeler Note Added: 0001044
2011-11-28 15:50 geoffclare Note Added: 0001060
2011-11-28 15:51 geoffclare File Added: Extendingshellconditionals-2011-11-28.odt
2011-11-28 15:52 geoffclare File Added: Extendingshellconditionals-2011-11-28.pdf
2011-11-28 16:10 geoffclare Note Edited: 0001060
2011-11-28 16:38 dwheeler Note Added: 0001061
2011-11-29 03:26 Roger Marquis Note Added: 0001064
2011-11-29 03:27 Roger Marquis Note Edited: 0001064
2011-11-29 04:51 dwheeler Note Added: 0001065
2011-12-29 19:01 antoinel Issue Monitored: antoinel
2013-10-10 01:15 dwheeler Note Added: 0001869
2013-10-12 04:08 dwheeler Note Added: 0001881
2013-10-24 15:15 geoffclare File Added: Extendingshellconditionals-2013-10-24.odt
2013-10-24 15:16 geoffclare File Added: Extendingshellconditionals-2013-10-24.pdf
2013-10-24 15:23 geoffclare Note Added: 0001944
2013-10-25 10:26 geoffclare Note Added: 0001945
2013-10-31 15:55 Don Cragun Relationship added has duplicate 0000762
2013-11-01 15:36 ranjit Note Added: 0001950
2013-11-01 19:09 dwheeler Note Added: 0001953
2013-11-01 19:43 shware_systems Note Added: 0001954
2013-11-10 14:46 dwheeler Note Added: 0001979
2013-11-22 22:13 dwheeler Note Added: 0002017
2013-11-29 18:56 dwheeler File Added: Extendingshellconditionals-2013-11-29.odt
2013-11-29 18:57 dwheeler File Added: Extendingshellconditionals-2013-11-29-track-changes.pdf
2013-11-29 18:58 dwheeler File Added: Extendingshellconditionals-2013-11-29.pdf
2013-11-29 19:11 dwheeler Note Added: 0002031
2013-11-29 19:24 dwheeler Note Added: 0002032
2013-12-01 22:52 jilles Note Added: 0002033
2013-12-02 08:59 shware_systems Note Added: 0002034
2013-12-02 11:15 shware_systems Note Added: 0002035
2013-12-08 22:06 dwheeler Note Added: 0002052
2013-12-08 22:48 dwheeler Note Added: 0002053
2013-12-08 23:05 dwheeler Note Added: 0002054
2013-12-09 01:32 dwheeler File Added: Extendingshellconditionals-2013-12-08.odt
2013-12-09 01:33 dwheeler File Added: Extendingshellconditionals-2013-12-08.pdf
2013-12-09 01:35 dwheeler Note Added: 0002055
2013-12-09 06:55 shware_systems Note Added: 0002056
2013-12-10 04:46 dwheeler Note Added: 0002059
2013-12-10 04:52 dwheeler File Added: Extendingshellconditionals-2013-12-09.odt
2013-12-10 04:53 dwheeler File Added: Extendingshellconditionals-2013-12-09.pdf
2013-12-10 09:41 shware_systems Note Added: 0002060
2013-12-10 16:08 dwheeler Note Added: 0002061
2013-12-10 22:36 jilles Note Added: 0002063
2013-12-11 00:46 dwheeler Note Added: 0002064
2013-12-11 00:48 shware_systems Note Added: 0002065
2013-12-11 00:58 dwheeler Note Added: 0002066
2013-12-12 01:38 shware_systems Note Added: 0002068
2013-12-12 06:04 shware_systems Note Edited: 0002068
2014-03-06 17:08 eblake Relationship added related to 0000813
2014-03-06 17:15 geoffclare Note Added: 0002176
2014-03-06 19:13 ranjit Note Added: 0002177
2014-03-06 21:22 shware_systems Note Added: 0002178
2014-03-07 15:46 dwheeler Note Added: 0002179
2014-03-07 15:52 dwheeler Note Added: 0002180
2014-03-11 14:52 shware_systems Note Added: 0002181
2014-03-12 23:05 dwheeler Note Added: 0002182
2014-03-14 13:25 ranjit Note Added: 0002186
2014-10-13 16:44 rhansen Note Added: 0002418
2014-10-13 16:52 geoffclare Note Added: 0002419
2014-10-13 16:55 dwheeler Note Added: 0002420
2014-10-13 17:54 rhansen Note Added: 0002421
2016-09-19 21:46 stephane Note Added: 0003382
2016-09-20 17:16 shware_systems Note Added: 0003383
2016-09-20 20:34 stephane Note Added: 0003384
2016-09-20 21:03 stephane Note Added: 0003385
2016-09-20 21:07 stephane Note Edited: 0003385
2016-09-21 07:35 shware_systems Note Added: 0003386
2016-09-21 10:20 stephane Note Added: 0003387
2016-09-21 22:44 shware_systems Note Added: 0003388
2016-09-22 08:42 stephane Note Added: 0003389
2016-09-22 08:59 stephane Note Added: 0003390
2016-09-23 04:17 shware_systems Note Added: 0003391
2016-09-23 06:59 stephane Note Added: 0003392
2016-09-23 13:05 chet_ramey Note Added: 0003393
2016-10-10 11:35 shware_systems Note Added: 0003405
2017-05-19 21:20 mirabilos Note Added: 0003698
2017-05-19 21:44 shware_systems Note Added: 0003700
2017-05-19 21:46 shware_systems Note Edited: 0003700

