0001564: clarify on what (character/byte) strings pattern matching notation should work

Notes
(0005716) calestyo (reporter) 2022-02-25 04:57	3) A third aspect, that should perhaps be considered by some more knowledgeable person than me: One main use of pattern matching is matching filenames (2.13.3 Patterns Used for Filename Expansion). Filenames, however, are explicitly byte strings. So if pattern matching is indeed intended to only have a specified meaning on character strings (in the current locale), then 2.13.3 should somehow explain this. Especially how such patterns should deal with filenames that aren't character strings in the current locale (e.g. error, unspecified, ignored?). For example, would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?

(0005719) mirabilos (reporter) 2022-02-25 20:54	「would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?」 That’d be the logical consequence of treating it as characters in the current encoding.
When I added locale “things” to my projects, I extended the definition of character. Instead of just accepting the characters that are valid in the current encoding (which is either 8-bit-transparent 7-bit ASCII or (back then BMP-only, but I’m moving to full 21-bit) UTF-8), so-called “raw octets” are also mapped into the wide character range. Every time a conversion fails, the first octet of it is handled as raw octet, then the conversion restarts on the next one. (This can obviously be optimised for illegal UTF-8 sequences if one is careful about the beginning of the next possibly valid sequence.)
In 16-bit wchar_t times (basically “until 2022Q1”), this is mapped into a PUA range reserved by the CSUR for this. (Not quite optimal.) This is U+EF80‥U+EFFF. (What happens when you encounter \xEE\xBE\x80 can only be described as fun.) In the new scheme, I’m mapping them to U-10000080‥U-100000FF which is outside of the range of things, so not a problem (except now I’m wondering what to set WCHAR_MAX to, but I think 0x10FFFFU still, because only these are, strictly speaking, valid?)
There’s a complication that has to do with the idiotic Standard C API for mbrtowc(3) in that “return value == 0” is the sole test for “*pwc == L'\0'” and so cannot be used to signal that 0 octets have been eaten, which means I might need to use even higher numbers for 2‑, 3‑ and 4-byte raw octet sequences. (The latter of which has wcwidth() == 4…) But that’s detail.
The thing relevant here is that this is (could be, but anything else is either discarding the notion of character here (which would be hard to make congruent with the existence of character classes) or an active and certainly harmful disservice to users (the “only match filenames that are valid” I quoted above)) a middle ground between characters and bytes: bytes that are characters if possible and have character semantics applied, but may not be. This is currently unspecified. I’d like to (continue) treat(ing) things in a way that means that, for example, ? is either a character or a single byte from an invalid multibyte sequence (of length 1 or more), with a subsequent ? catching a possible second byte, and so on. Raw octets are displayed as � with a wcwidth() of 1 each (or some application-local suitable encoding, where that is possible).

(0005729) calestyo (reporter) 2022-03-03 03:37	Well, I guess the whole thing is also why your point earlier had been that '.' as sentinel would be enough, and that any implementation that wouldn't carry on invalid encodings (i.e. bytes that do not form characters) would be buggy in that respect already. I guess on the one hand, Geoff is clearly right when he says that any such behaviour (especially the complicated mappings that you explained above) is not expected to be carried out by an implementation (at least not from the current standard)... and as such '.' would in fact not be enough as the sentinel for command substitution with trailing newlines, but the LC_ALL=C would be required. OTOH... the '*' example above was intended to question whether there are really any implementations which would filter out filenames which contain bytes that do not form characters. I'd guess not. So one more reason why I think that this should be clearly specified (which I've requested with this issue)... i.e. if there's consensus that it (pattern matching) operates on character strings only: clearly name this, declare operation on non-character strings unspecified (with respect to their results) and remove all current references that indicate that it would operate on bytes.

(0005796) geoffclare (manager) 2022-04-11 13:55	Suggested changes...
On page 2351 line 76098 section 2.13 Pattern Matching Notation, change:
    The pattern matching notation described in this section is used to specify patterns for matching strings in the shell.
to:
    The pattern matching notation described in this section is used to specify patterns for matching character strings in the shell.
After page 2351 line 76102 section 2.13 Pattern Matching Notation, add a new paragraph:
    If an attempt is made to use pattern matching notation to match a string that contains one or more bytes that do not form part of a valid character, the behavior is unspecified. Since pathnames can contain such bytes, portable applications need to ensure that the current locale is the C or POSIX locale when performing pattern matching (or expansion) on arbitrary pathnames.

(0005797) kre (reporter) 2022-04-11 22:58	How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale? (And if that is not to be allowed, how can the user ever sanely use pathnames containing characters that are not ASCII, and if that is not to be allowed, what good are locales?) I truly wish we could simply stop attempting to make the standard consistent with regard to the current locale mess, it is all way too broken to be useful.

(0005798) geoffclare (manager) 2022-04-12 08:51	> How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale
1. The vast majority of apps will never need to do that because they know (or can assume) that the pathnames they handle either always use the portable filename character set or use the user's locale. I.e. the pathnames are not arbitrary (a word I was careful to include in the proposed changes).
2. In apps that truly do need to do matching or expansion on arbitrary pathnames, a C program can call uselocale() before and after calls to fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before handling pathnames (and unset it or restore it afterwards).
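For the shell-script case, the LC_ALL=C approach described above could look roughly like the following. This is only a minimal sketch assuming a POSIX sh; the helper names match_bytes and list_entries are hypothetical and appear neither in the standard nor in this thread. The first helper confines the locale change to a subshell, the second shows the "unset it or restore it afterwards" variant.

```sh
#!/bin/sh
# Sketch only: match_bytes and list_entries are hypothetical helper names.

# Usage: match_bytes STRING PATTERN
# Succeeds if STRING matches PATTERN, with matching done in the C locale.
match_bytes() (
    # The ( ... ) function body runs in a subshell, so the locale change
    # cannot leak into the caller and nothing has to be restored.
    LC_ALL=C
    export LC_ALL
    case $1 in
        $2) exit 0 ;;   # $2 is unquoted, so it is used as a pattern
        *)  exit 1 ;;
    esac
)

# Usage: list_entries DIRECTORY
# Prints the non-hidden entries of DIRECTORY, one per line, expanding the
# glob in the C locale and then putting LC_ALL back the way it was.
list_entries() {
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL saved_lc_all_set=1
    else
        saved_lc_all_set=0
    fi
    LC_ALL=C
    export LC_ALL
    for entry in "$1"/*; do
        # An unmatched pattern is left as-is, so skip names that do not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s\n' "$entry"
    done
    # Note: if LC_ALL was set but not exported before, it stays exported.
    if [ "$saved_lc_all_set" = 1 ]; then
        LC_ALL=$saved_lc_all
    else
        unset LC_ALL
    fi
}
```

A caller would then write something like `match_bytes "$file" '*.txt'` for a pathname of unknown encoding. The subshell form is simpler but costs a fork per call; the save-and-restore form avoids that at the price of more bookkeeping.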

(0005804) calestyo (reporter) 2022-04-15 02:12	Re: https://www.austingroupbugs.net/view.php?id=1561#c5797 Well this proposal is not really changing anything, is it? Why do you think it's worse to name that something results in undefined behaviour than not saying anything at all and leave it ambiguous (especially when the wording is already contradictory)? Also... it doesn't seem as if locales would ever go away. And even if the POSIX/C community (and all other affected groups) would decide tomorrow to abolish locales or at least the choice of character encodings and make all UTF-8... there would be still millions of lines of code which assume different character encodings to exist and which thus somehow need to be defined in a proper manner.

(0005805) calestyo (reporter) 2022-04-15 02:17	Re: https://www.austingroupbugs.net/view.php?id=1561#c5796 In principle that clarifies my original point. (Although I'd probably have preferred if all (relevant) implementations just behaved the same already... and this could be made non-undefined.) Do you think something should be done about fnmatch(), page 879? While it refers to sections 2.13.1 and 2.13.2 (which also fall under your proposed changes in 2.13), it still uses "string" all over and doesn't mention any dependency on the locale's character encoding.

Issue History
Date Modified	Username	Field	Change
2022-02-23 01:54	calestyo	New Issue
2022-02-23 01:54	calestyo	Name	=> Christoph Anton Mitterer
2022-02-23 01:54	calestyo	Section	=> 2.13 Pattern Matching Notation
2022-02-23 01:54	calestyo	Page Number	=> 2351
2022-02-23 01:54	calestyo	Line Number	=> 76099
2022-02-25 04:57	calestyo	Note Added: 0005716
2022-02-25 20:54	mirabilos	Note Added: 0005719
2022-03-03 03:37	calestyo	Note Added: 0005729
2022-04-07 16:30	geoffclare	Relationship added	related to 0001561
2022-04-11 13:55	geoffclare	Note Added: 0005796
2022-04-11 22:58	kre	Note Added: 0005797
2022-04-12 08:51	geoffclare	Note Added: 0005798
2022-04-15 02:12	calestyo	Note Added: 0005804
2022-04-15 02:17	calestyo	Note Added: 0005805
2022-10-31 16:13	geoffclare	Final Accepted Text	=> Note: 0005796
2022-10-31 16:13	geoffclare	Status	New => Resolved
2022-10-31 16:13	geoffclare	Resolution	Open => Accepted As Marked
2022-10-31 16:13	geoffclare	Tag Attached: tc3-2008
2022-11-30 16:37	geoffclare	Status	Resolved => Applied
