0001564: clarify on what (character/byte) strings pattern matching notation should work

ID	Project	Category	View Status	Date Submitted	Last Update

0001564	Issue 8 drafts	Shell and Utilities	public	2022-02-23 01:54	2024-06-11 09:12

Reporter	calestyo	Assigned To
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	Closed	Resolution	Accepted As Marked
Product Version	Draft 2.1

Name	Christoph Anton Mitterer
Organization
User Reference
Section	2.13 Pattern Matching Notation
Page Number	2351
Line Number	76099
Final Accepted Text	0001564:0005796


Summary	0001564: clarify on what (character/byte) strings pattern matching notation should work
Description	On the mailing list, the question arose (from my side) what the current wording in the standard implies as to whether pattern matching works on byte or character strings.
- In some earlier discussion it was pointed out that shell variables should be strings (of bytes, other than NUL) => which could lead one to think that pattern matching must work on any such strings.
- 2.6.2 Parameter Expansion doesn't seem to say what the #, ##, % and %% special forms of expansion work on (bytes or characters); it just refers to the pattern matching chapter.
- 2.13 Pattern Matching Notation says: "The pattern matching notation described in this section is used to specify patterns for matching strings in the shell." => strings... would mean bytes (as per 3.375 String).
- 2.13.1 Patterns Matching a Single Character however says: "The following patterns matching a single character shall match a single character: ordinary characters,..."
I questioned whether one could deduce from that, that pattern matching is required to cope with any non-characters in the string it operates upon. This was however rejected on the list, and Geoff Clare pointed out that, since no behaviour is specified (i.e. how the implementation would need to handle such invalidly encoded characters), the use of pattern matching on arbitrary byte strings is undefined behaviour.
Desired Action	Either:
1) In line 76099, replace "strings" with "character strings" and perhaps mention that when pattern matching is applied to strings that contain any byte sequence that is not a character in the current locale, the results are undefined. Perhaps also clarify this in fnmatch() (page 879); it doesn't seem to mention locales at all, but if the above assumption is true and pattern matching operates on characters only, wouldn't it then need to be subject to the current LC_CTYPE?
2) Alternatively, some expert could check whether there are any shell/fnmatch() implementations which do not simply carry on any bytes that do not form characters. Probably there are (yash?). But if there weren't, POSIX might even choose to standardise that behaviour, which would probably be better than leaving it unspecified?!
Tags	tc3-2008

calestyo 2022-02-25 04:57 reporter bugnote:0005716	3) A third aspect, that should perhaps be considered by some more knowledgeable person than me: One main use of pattern matching is matching filenames (2.13.3 Patterns Used for Filename Expansion). Filenames however are explicitly byte strings. So if pattern matching is indeed intended to only have a specified meaning on character strings (in the current locale), then 2.13.3 should somehow explain this. Especially how such patterns should deal with filenames that aren't character strings in the current locale (e.g. error, unspecified, ignored?). For example, would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?
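
To make the question in this note concrete, here is a small Python sketch that separates filenames (as byte strings) into those that are valid character strings in a given encoding and those that are not. It assumes UTF-8 stands in for "the current locale", and the helper name `split_by_decodability` is hypothetical. It says nothing about what any shell actually does with the second group.

```python
def split_by_decodability(names, encoding="utf-8"):
    """Partition byte-string filenames by whether they decode in `encoding`."""
    valid, invalid = [], []
    for name in names:
        try:
            name.decode(encoding)
        except UnicodeDecodeError:
            # Contains at least one byte sequence that is not a character.
            invalid.append(name)
        else:
            valid.append(name)
    return valid, invalid


# "café" encoded as UTF-8 and as ISO 8859-1; only the first is a
# character string when the encoding is UTF-8.
valid, invalid = split_by_decodability([b"caf\xc3\xa9", b"caf\xe9"])
```

If pattern matching has a specified meaning only on character strings, the names in `invalid` are exactly the ones for which the result of `*` would be left open.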

mirabilos 2022-02-25 20:54 reporter bugnote:0005719	「would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?」 That’d be the logical consequence of treating it as characters in the current encoding. When I added locale “things” to my projects, I extended the definition of character. Instead of just accepting the characters that are valid in the current encoding (which is either 8-bit-transparent 7-bit ASCII or (back then BMP-only, but I’m moving to full 21-bit) UTF-8), so-called “raw octets” are also mapped into the wide character range. Every time a conversion fails, the first octet of it is handled as raw octet, then the conversion restarts on the next one. (This can obviously be optimised for illegal UTF-8 sequences if one is careful about the beginning of the next possibly valid sequence.) In 16-bit wchar_t times (basically “until 2022Q1”), this is mapped into a PUA range reserved by the CSUR for this. (Not quite optimal.) This is U+EF80‥U+EFFF. (What happens when you encounter \xEE\xBE\x80 can only be described as fun.) In the new scheme, I’m mapping them to U-10000080‥U-100000FF which is outside of the range of things, so not a problem (except now I’m wondering what to set WCHAR_MAX to, but I think 0x10FFFFU still, because only these are, strictly speaking, valid?) There’s a complication that has to do with the idiotic Standard C API for mbrtowc(3) in that “return value == 0” is the sole test for “pwc == L'\0'” and so cannot be used to signal that 0 octets have been eaten, which means I might need to use even higher numbers for 2‑, 3‑ and 4-byte raw octet sequences. (The latter of which has wcwidth() == 4…) But that’s detail.
The thing relevant here is that this is (could be, but anything else is either discarding the notion of character here (which would be hard to make congruent with the existence of character classes) or an active and certainly harmful disservice to users (the “only match filenames that are valid” I quoted above)) a middle ground between characters and bytes: bytes that are characters if possible and have character semantics applied, but may not. This is currently unspecified. I’d like to (continue) treat(ing) things in a way that means that, for example, ? is either a character or a single byte from an invalid multibyte sequence (of length 1 or more), with a subsequent ? catching a possible second byte, and so on. Raw octets are displayed as � with a wcwidth() of 1 each (or some application-local suitable encoding, where that is possible).
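
The decoding rule described in this note ("every time a conversion fails, the first octet of it is handled as raw octet, then the conversion restarts on the next one") can be sketched in Python as follows. This is an illustration only, not the code from the projects mentioned: it assumes UTF-8 as the encoding, uses the older 16-bit mapping where raw octets 0x80..0xFF land on U+EF80..U+EFFF, and the names `RAW_BASE` and `decode_with_raw_octets` are hypothetical.

```python
RAW_BASE = 0xEF00  # assumption: raw octet 0xNN is mapped to U+EF00 + 0xNN


def decode_with_raw_octets(data: bytes) -> list[int]:
    """Decode UTF-8, turning each undecodable octet into one 'raw octet' unit."""
    units = []
    i = 0
    while i < len(data):
        # A UTF-8 character is 1 to 4 bytes long; try the shortest first.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            units.append(ord(ch))
            i += n
            break
        else:
            # No valid character starts here: take one raw octet and
            # restart the conversion at the next byte.
            units.append(RAW_BASE + data[i])
            i += 1
    return units


mixed = decode_with_raw_octets(b"a\xffb")        # valid, raw, valid
truncated = decode_with_raw_octets(b"\xe2\x82")  # cut-off 3-byte sequence
```

Under this scheme each unit, whether a character or a raw octet, is what a single `?` would consume, so the truncated sequence above needs two `?` to match. The sketch also shows the collision the note calls "fun": the valid sequence `\xEE\xBE\x80` decodes to U+EF80, the same unit as the raw octet 0x80.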

calestyo 2022-03-03 03:37 reporter bugnote:0005729	Well I guess the whole thing is also, why your point had been earlier, that '.' as sentinel would be enough, and any implementation that wouldn't carry on invalid encodings (i.e. bytes that do not form characters), would be buggy in that respect already. I guess on the one hand, Geoff is clearly right, when he says that any such behaviour (especially the complicated mappings that you explained above) are not expected to be carried out by an implementation (at least not from the current standard)... and as such '.' would in fact not be enough as the sentinel for command substitution with trailing newlines, but the LC_ALL=C would be required. OTOH... the '*' example above was intended to question whether there are really any implementations which would filter out filenames which contain bytes that do not form characters. I'd guess not. So one more reason, why I think that this should be clearly specified (which I've requested with this issue)... i.e. if there's consensus that it (pattern matching) operates on character strings only - clearly name this, declare operation on non-character strings unspecified (with respect to their results) and remove all current references that indicate that it would operate on bytes.

geoffclare 2022-04-11 13:55 reporter bugnote:0005796	Suggested changes...
On page 2351 line 76098 section 2.13 Pattern Matching Notation, change:
The pattern matching notation described in this section is used to specify patterns for matching strings in the shell.
to:
The pattern matching notation described in this section is used to specify patterns for matching character strings in the shell.
After page 2351 line 76102 section 2.13 Pattern Matching Notation, add a new paragraph:
If an attempt is made to use pattern matching notation to match a string that contains one or more bytes that do not form part of a valid character, the behavior is unspecified. Since pathnames can contain such bytes, portable applications need to ensure that the current locale is the C or POSIX locale when performing pattern matching (or expansion) on arbitrary pathnames.

kre 2022-04-11 22:58 reporter bugnote:0005797	How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale (and if that is not to be allowed, how can the user ever sanely use pathnames containing characters that are not ASCII, and if that is not to be allowed, what good are locales ?) I truly wish we could simply stop attempting to make the standard consistent with regard to the current locale mess, it is all way too broken to be useful.

geoffclare 2022-04-12 08:51 reporter bugnote:0005798	> How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale
1. The vast majority of apps will never need to do that because they know (or can assume) that the pathnames they handle either always use the portable filename character set or use the user's locale. I.e. the pathnames are not arbitrary (a word I was careful to include in the proposed changes).
2. In apps that truly do need to do matching or expansion on arbitrary pathnames, a C program can call uselocale() before and after calls to fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before handling pathnames (and unset it or restore it afterwards).
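
The save, switch, match, restore pattern from point 2 of this note can be sketched in Python by calling the C library's fnmatch() through ctypes. Several things here are assumptions rather than anything stated in the note: the sketch uses the process-wide setlocale() (via Python's `locale` module) instead of the per-thread uselocale(), so it is not thread-safe; it assumes a POSIX system where the C library can be loaded this way; and the helper name `fnmatch_in_c_locale` is hypothetical.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. On POSIX systems, a None name makes ctypes fall
# back to the symbols already loaded into the running process.
_libc = ctypes.CDLL(ctypes.util.find_library("c"))
_libc.fnmatch.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
_libc.fnmatch.restype = ctypes.c_int


def fnmatch_in_c_locale(pattern: bytes, name: bytes) -> bool:
    """Match `name` against `pattern` with the C locale in effect."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    locale.setlocale(locale.LC_ALL, "C")     # every byte is now a character
    try:
        # fnmatch() returns 0 when the string matches the pattern.
        return _libc.fnmatch(pattern, name, 0) == 0
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the caller's locale


# A name that is not valid UTF-8 can still be matched byte by byte.
matched = fnmatch_in_c_locale(b"*.txt", b"caf\xe9.txt")
```

The trade-off raised elsewhere in this thread still applies: in the C locale, `?` and bracket expressions see single bytes, so a multibyte character in a user-supplied pattern no longer matches as one character.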

calestyo 2022-04-15 02:12 reporter bugnote:0005804	Re: https://www.austingroupbugs.net/view.php?id=1561#c5797 Well this proposal is not really changing anything, is it? Why do you think it's worse to name that something results in undefined behaviour than not saying anything at all and leave it ambiguous (especially when the wording is already contradictory)? Also... it doesn't seem as if locales would ever go away. And even if the POSIX/C community (and all other affected groups) would decide tomorrow to abolish locales or at least the choice of character encodings and make all UTF-8... there would be still millions of lines of code which assume different character encodings to exist and which thus somehow need to be defined in a proper manner.

calestyo 2022-04-15 02:17 reporter bugnote:0005805	Re: https://www.austingroupbugs.net/view.php?id=1561#c5796 In principle that clarifies my original point. (Although I'd probably have preferred if all (relevant) implementations just behaved the same already... and this could be made non-undefined.) Do you think something should be done about fnmatch(), page 879? While it refers to sections 2.13.1 and 2.13.2 (which also fall under your proposed changes in 2.13), it still uses "string" all over and doesn't mention any dependency on the locale's character encoding.

Date Modified	Username	Field	Change
2022-02-23 01:54	calestyo	New Issue
2022-02-23 01:54	calestyo	Name	=> Christoph Anton Mitterer
2022-02-23 01:54	calestyo	Section	=> 2.13 Pattern Matching Notation
2022-02-23 01:54	calestyo	Page Number	=> 2351
2022-02-23 01:54	calestyo	Line Number	=> 76099
2022-02-25 04:57	calestyo	Note Added: 0005716
2022-02-25 20:54	mirabilos	Note Added: 0005719
2022-03-03 03:37	calestyo	Note Added: 0005729
2022-04-07 16:30	geoffclare	Relationship added	related to 0001561
2022-04-11 13:55	geoffclare	Note Added: 0005796
2022-04-11 22:58	kre	Note Added: 0005797
2022-04-12 08:51	geoffclare	Note Added: 0005798
2022-04-15 02:12	calestyo	Note Added: 0005804
2022-04-15 02:17	calestyo	Note Added: 0005805
2022-10-31 16:13	geoffclare	Final Accepted Text	=> 0001564:0005796
2022-10-31 16:13	geoffclare	Status	New => Resolved
2022-10-31 16:13	geoffclare	Resolution	Open => Accepted As Marked
2022-10-31 16:13	geoffclare	Tag Attached: tc3-2008
2022-11-30 16:37	geoffclare	Status	Resolved => Applied
2024-06-11 09:12	agadmin	Status	Applied => Closed
