Anonymous | Login | 2024-10-09 02:59 UTC |
Main | My View | View Issues | Change Log | Docs |
Viewing Issue Simple Details [ Jump to Notes ] | [ Issue History ] [ Print ] | |||||||||||
ID | Category | Severity | Type | Date Submitted | Last Update | |||||||
0001857 | [1003.1(2024)/Issue8] Base Definitions and Headers | Objection | Error | 2024-09-14 12:54 | 2024-10-03 09:12 | |||||||
Reporter | dannyniu | View Status | public | |||||||||
Assigned To | ||||||||||||
Priority | normal | Resolution | Open | |||||||||
Status | New | |||||||||||
Name | DannyNiu/NJF | |||||||||||
Organization | Individual | |||||||||||
User Reference | ||||||||||||
Section | 9.1 Regular Expression Definitions # and others. | |||||||||||
Page Number | 179-180 and others | |||||||||||
Line Number | 6366-6368 and others. | |||||||||||
Interp Status | --- | |||||||||||
Final Accepted Text | ||||||||||||
Summary | 0001857: Several problems with the new "lazy" regex quantifier. | |||||||||||
Description |
1. Newly added text describe "shortest" match with the word "longest": ==== lines 6366-6368 on page 180: > However, matching the ERE "(.*?).*" against "abcdef", the subpattern "(.*?)" matches the empty string, since that is the **longest** possible match for the ERE ".*?". The qualifier "?" modifies the quantifier to make them "lazy", and the "longest" make the intention of the standard writer confusing. Maybe it should be "shortest"? 2. the supposed length of subpatterns. ==== Lines 6362-6363 on page 180: > Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string This is okay for longest matches and without the "lazy" qualifier. In a conceptual implementation of ERE, a back-tracking recursive-decending matcher greedily match, from left to right, each subpattern - so that they're longest before the final match. For each new match that're longer than the previous, the right-most subpatterns are contracted first, before left-side ones. Thus after all iterations, the result conform to the requirement laid out for subpatterns naturally. However, when "lazy" qualifier's applied to a component in a subpattern (one that's parenthesized), without other "greedy" quantifiers applied to other component(s) of the subpattern, the subpattern can only be "shortest". Therefore, the rules regarding the matched lengths of subpatterns with "lazy" qualifier(s) needs to be updated. Next, when `REG_MINIMAL` is applied to the whole regex, the quantifiers become "lazy" by default, therefore absant any qualifier, subpatterns matched can only be "shortest". Thus re-iterating the need to update the rules for matching subpatterns when "lazy" qualfiers/specifiers are used. |
|||||||||||
Desired Action | Various. The rules needs to be worked out carefully, and I have no definitive desired action at this moment. Also, this issue need to be related #793 and #1329, unless it's against procedure to relate new bugs to closed ones. | |||||||||||
Tags | No tags attached. | |||||||||||
Attached Files | ||||||||||||
|
Notes | |
(0006879) dannyniu (reporter) 2024-09-20 08:05 edited on: 2024-09-20 08:13 |
My first attempt at a proposed wording. Section 9.1, term definition **matched** 3rd para: Change: > ... If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. ... To > ... If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the length-fittest such sequence is matched. ... (Examples adapts accordingly) (I'll define length-fittest soon). 4th para: Change: > Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. To: > Within the entire match, where there's quantifier in the entire pattern and any and all of its subpatterns, for 2 adjacent quantifiers (in the sense that there's no other quantifier exists on any segment between the 2 quantifiers) that would result in multiple valid matches, the length-fittest rule applies to the former quantifiers in preference - in other word, the latter quantifier yields to the former when applying the length-fittest rule. Such precedence of preference/yielding is transitive across all quantifiers of the entire expression. Add a term definition **length-fittest**: > A concept defined for quantifiers for the purpose of determining the region and sub-regions of match. For quantifiers without the `?` lazy quantifier, the most number of possible repetition is the fittest in terms of length; likewise, for quantifiers with the `?` lazy quantifier, the least number of possible repetition is the fittest in terms of length. |
(0006880) geoffclare (manager) 2024-09-23 08:56 |
> For quantifiers without the `?` lazy quantifier, the most number of possible repetition is the fittest in terms of length; likewise, for quantifiers with the `?` lazy quantifier, the least number of possible repetition is the fittest in terms of length. This would change the established-for-decades "longest" requirement to "most repetitions", which is not the same thing. And it turns out that on macOS the '?' modifier does not change to matching the least repetitions, it is shortest match; the re_format(7) man page is wrong. Tested using the program at the end of https://posix.rhansen.org/p/2020-11-09 [^] with REG_MINIMAL removed: $ ./a.out '([ab]{6}|a)*?b' aaaabbbb regexec() returned 0 rm_so 0, rm_eo 5 (Least repetitions would give rm_eo 7.) Same test with grep, using -o to see what matched: $ echo aaaabbbb | grep -E -o '([ab]{6}|a)*?b' aaaab b b b This behaviour makes sense as the whole point of REG_MINIMAL and the '?' modifier is to change to the opposite greediness, and the opposite of longest is shortest. Having the default as longest and REG_MINIMAL/'?' as least repetitions would produce the same output in the above tests with and without the '?', making them pointless in such cases. |
(0006881) geoffclare (manager) 2024-09-24 10:46 edited on: 2024-09-26 13:28 |
Suggested changes ... On page 179 line 6348 section 9.1, add a sentence: The matching process is described in [xref to 9.2].and move the remaining paragraphs of the "matching" definition to after page 181 line 6413 section 9.2. On page 179 line 6357 section 9.1, change: If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights".to: If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the matched sequence shall be the longest such sequence for which any minimal repetitions (see [xref to 9.4.6]) used in the match have the shortest possible match. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights". However, the ERE "(aaa??)*" matches only the first four characters of the string "aaaaa", not all five, because in order to match all five, "a??" would match with length one instead of zero; the ERE "(aaa??)*|(aaa?)*" matches all five because the longest match is one which does not use any minimal repetitions. On page 180 line 6367 section 9.1, change: the subpattern "(.*?)" matches the empty string, since that is the longest possible match for the ERE ".*?"to: the subexpression "(.*?)" matches the empty string, since that is the longest possible match for which the minimal repetition ".*?" has the shortest possible match (zero length). On page 179 line 6370 section 9.1, change: the longest sequence shall be measuredto: the sequence length shall be measured After page 191 line 6814 section 9.5, add a note: <small>Note:The grammar defines syntax only and places no requirements on implementations as to how the parsed BRE or ERE is used for matching. The matching process is described in [xref to 9.2].</small> After XRAT page 3716 line 127617 section A.9.4.6, add a paragraph: Note that the repetition modifier '?' (<question-mark>) is specified as changing the matching behavior for the modified repetition from the leftmost longest possible match to the leftmost shortest possible match. This does not necessarily give the same result as matching with the least repetitions. For example, the ERE "([ab]{6}|a)*?b" matches the first five characters of the string "aaaabbbb" as this is the shortest for the minimal repetition "*?". Matching with the least repetitions would match the first seven characters by using one repetition of "[ab]{6}" instead of four repetitions of "a". This distinction is only possible because the alternatives in an ERE alternation are chosen according to which gives the longest (or shortest) match. Other types of regular expression exist (notably in perl, php, and python) where the alternatives are tried in order; for those there is no difference between longest and most repetitions or between shortest and least repetitions. |
(0006882) dannyniu (reporter) 2024-09-24 11:54 edited on: 2024-09-24 12:12 |
Now that we've got a wording: - how do we incorporate the `REG_MINIMAL` flag? - if both greedy **AND** lazy quantifiers're nested (possibly recursively!), how long should the nested and the overall match of the parenthesized submatches be? For the sake of public record, I was not fully aware of macOS ERE behavior when I proposed my initial wording. I took inspiration from perlre's "best/worst" match concept. I proposed the wording because I worried a single implementation whose document explicitly acknowledged some POSIX requirements were violated would be a little unclear/ambiguous/inconsistent to be accepted into the standard. |
(0006883) geoffclare (manager) 2024-09-24 14:04 |
> how do we incorporate the `REG_MINIMAL` flag? I think my suggested wording works with REG_MINIMAL because of the way it is specified in 9.4.6 list item 6 as changing the default (and reversing what the '?' modifier does). > if both greedy **AND** lazy quantifiers're nested ... That was the reason for wording it as "longest possible ... for which any minimal repetitions used ... have the shortest possible match". A minimal repetition nested inside a greedy one has precedence (if used); otherwise, each just follows its normal rule. |
(0006884) dannyniu (reporter) 2024-09-25 08:28 edited on: 2024-09-25 11:36 |
Geoff, I tested the `re` module from Python v3.10.15's standard library, and the PCRE module from PHP v8.3.11, they all match the initial 7 characters of the string "aaaabbbb" with '([ab]{6}|a)*?b'. It seems that it's the macOS implementation is an outcast. Given this, I question if we should standardize macOS behavior, given that its documentation isn't entirely clear & unambiguous & consistent, and explicitly acknowledge violation of POSIX requirements, as I mentioned before. If it's breaking an existing implementation that you worry, then hear me out: 1. POSIX-2024 is a new standard, and macOS is generally slow on adopting conformance to newer standard (they were still certified to 03 back when v7 was published for quite some time), they still have time to fix their implementation before claiming conformance to 2024. 2. Portable applications could not have used the new lazy quantifier, given that it's not yet clearly & unambiguously & consistently specified. It is almost certain that they haven't yet used the new feature, but for those that do use the lazy quantifier, they're probably using some other languages, runtime, and/or libraries (such as Python and PHP I've used as example). 3. The whole point of standardizing lazy quantifier is so that applications from point #2 I mentioned can use standardized facility, if they found that the semantics' different, then this won't be possible anymore. Source code: ```python3 #!/usr/bin/env python3 import re print(re.search(r'([ab]{6}|a)*?b', "aaaabbbb").group(0)) ``` ```php #!/usr/bin/env php <?php $m = []; preg_match('/([ab]{6}|a)*b/', "aaaabbbb", $m); var_dump($m); ``` |
(0006885) geoffclare (manager) 2024-09-25 13:17 edited on: 2024-09-27 07:09 |
Re Note: 0006884. I am strongly of the opinion that this is a misfeature in python and php. I suspect that most python and php programmers would be surprised to discover they behave that way. If there are two possible matches with different lengths, having the non-greedy RE give the longer of the two, the same as the greedy RE does, violates the principle of least surprise. [Later retracted - see Note: 0006892] As regards "macOS is generally slow on adopting conformance to newer standard", that is irrelevant. This new addition in POSIX is intended to be a standardisation of long-standing existing practice. We would need a very good reason to specify behaviour that is different than that existing practice, thus breaking not just the existing implementations[*] but also applications that rely on the existing behaviour. It doesn't matter that they were not "portable applications" when they were written if a forced change to existing implementations is what causes them to break. [*] Yes, "implementations" plural. Known to be at least macOS and DragonFly BSD. |
(0006886) dannyniu (reporter) 2024-09-25 15:08 edited on: 2024-09-26 06:50 |
Re Note: 6885. I actaully asked about this on StackOverflow before resorting to issue tracker as no one seem to understand the situation in enough detail to make informed statements. One of the comments said straightforwardly that the lazy qualifier applies to the immediate preceding repetition (i.e. rather than recursively to subpattern matches as implied as necessary to make the match shortest by Geoff's proposed wording) by echoing my words in the question. Relevant URL: https://stackoverflow.com/questions/78984645/regarding-posix-eres-new-repetition-modifier#comment139266744_78984645 [^] If a wording must be specified, it must be clear & unambiguous & consistent. As a torture test for any potential wording, I propose testing the wording with the following: - regex: [0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9] - string to be matched: 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78 The reason I propose this regex, is that you have to have some rules to resolve the ambiguity of recursively, alternatingly, nested lazy and greedy quantifier, and it's definitely not as easy as saying "minimal repetitions have a high precedence". And I'd like to set it as a challenge to propose a wording whose description covers the behavior of the torture test regex. Finally, I suggest doing a survey on GitHub, on the use of lazy quantifier in Python-language projects if anyone still see value in keeping interoperability with Python users. For starters: https://github.com/search?q=lang%3Apython+%22%29*%3F%22+OR+%22%29%2B%3F%22&type=code [^] |
(0006887) steffen (reporter) 2024-09-25 22:10 |
re:
#?2|kent:tmp$ ./p-pcre2 '([ab]{6}|a)*?b' aaaabbb 0: 0/7 <aaaabbb> 1: 0/6 <aaaabb> #?0|kent:tmp$ ./p-tre '([ab]{6}|a)*?b' aaaabbb HA at <6> HA at <|> MINIINININI 0 mini=0 MINIINININI 1 mini=1 rest:* HAHAHAH 0: 0/5 <aaaab> 1: 3/4 <a> #?0|kent:tmp$ ./p-c '([ab]{6}|a)*?b' aaaabbb 0: 0/7 <aaaabbb> 1: 0/6 <aaaabb> <small>now i start using hypertext in Mantis, that worn down i am ..</small> |
(0006888) steffen (reporter) 2024-09-25 22:33 |
ok and also re: https://www.austingroupbugs.net/view.php?id=1857#c6886 [^]- regex: [0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9] - string to be matched: 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78 And i say there is no challenge, and the word "lazy" was never heard before regarding ungreedy regular expression matching, the reasons why i have used "minimal" has already been told. #?0|kent:tmp$ ./p-c '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop12qr st34uvw56xyz78 0: 0/44 <12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78> 1: 1/43 <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7> 2: 1/43 <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7> 3: 1/43 <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7> 4: 1/43 <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7> 5: 1/43 <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7> 6: 38/43 <6xyz7> #?0|kent:tmp$ ./p-pcre2 '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop 12qrst34uvw56xyz78 0: 0/2 <12> 1: 1/1 <> 2: -1/-1 3: -1/-1 4: -1/-1 5: -1/-1 6: -1/-1 #?0|kent:tmp$ ./p-tre '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop12 qrst34uvw56xyz78 MINIINININI 0 mini=0 MINIINININI 0 mini=0 MINIINININI 1 mini=1 rest:* MINIINININI 0 mini=0 MINIINININI 0 mini=0 MINIINININI 1 mini=1 rest:* MINIINININI 0 mini=0 MINIINININI 0 mini=0 MINIINININI 1 mini=1 rest:* MINIINININI 0 mini=0 HAHAHAH 0: 0/2 <12> 1: 1/1 <> 2: 1/1 <> 3: 1/1 <> 4: 1/1 <> 5: 1/1 <> 6: -1/-1 |
(0006889) steffen (reporter) 2024-09-25 22:36 |
$ echo 12abc34def56ghi78jkl90mnop12qr | perl -e '$i=<STDIN>;if($i =~ "[0-9](((((([0-9][a- z]*[0-9])*?)*)*?)*)*?)*[0-9]"){print "i<$i>; 1<$1> 2<$2> 3<$3> &am p;<$&>\n"}else{print "no match\n"}' i<12abc34def56ghi78jkl90mnop12qr >; 1<> 2<> 3<> &<12> |
(0006890) geoffclare (manager) 2024-09-26 08:41 |
> One of the comments said straightforwardly that the lazy qualifier applies to the immediate preceding repetition (i.e. rather than recursively to subpattern matches as implied as necessary to make the match shortest by Geoff's proposed wording) by echoing my words in the question. There is certainly no intention to require the '?' modifier to act recursively, and I can't see any way to interpret my suggested wording as implying it. |
(0006891) dannyniu (reporter) 2024-09-26 11:43 edited on: 2024-09-26 11:50 |
Geoff, you said you have a strong opinion that the Python and PHP's PCRE behavior is mis-feature, violates "principle of least surprise", and question its real-world use if any. While I don't want to and can't say if it's a mis-feature, I believe the rules are well-defined and doesn't violate the principle of least surprise when developer carefully read the document and understand all implications of using the lazy qualifier. I'd like to propose that if (a big if) we ultimately cannot agree on a wording to use, at least we make it clear that implementations are split on this by saying the behavior is undefined (or implementation-defined) when the lazy quantifiers're used on groupings. |
(0006892) geoffclare (manager) 2024-09-26 12:16 edited on: 2024-09-26 13:30 |
> Geoff, you said you have a strong opinion that the Python and PHP's PCRE behavior is mis-feature I retract that statement. It was based on the incorrect assumption that python and php choose between alternatives the same way POSIX does. As per my recent emails to the mailing list, that assumption is false; they try the alternatives in order. This makes POSIX EREs and perl/python/php REs fundamentally different when alternatives are used, and nobody should expect to see the same behaviour for patterns such as "([ab]{6}|a)*?b". [Update: I have edited Note: 0006881 to include something about this in the suggested XRAT addition.] |
(0006896) dannyniu (reporter) 2024-09-27 11:34 edited on: 2024-09-27 11:37 |
Geoff, have you considered my torture testing case? How do you see your wording cover that case? I recall "minimal quantifier takes precedence", do you intend that to mean for my torture test regex, the inner greedy quantifiers yields? I'm all ears. |
(0006897) steffen (reporter) 2024-09-27 15:51 |
I think the XRAT addition is a very helpful text / example that addresses real-life knowledge deficiencies. |
(0006898) geoffclare (manager) 2024-09-30 09:26 |
> Geoff, have you considered my torture testing case? How do you see your wording cover that case? - regex: [0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9] - string to be matched: 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78 The key part of my suggested wording is "the longest such sequence for which any minimal repetitions used in the match have the shortest possible match". first [0-9] matches the first 1 ([0-9][a-z]+[0-9])+? matches 2abc3 with 1 repetition (and 3 of [a-z] obviously) (([0-9][a-z]+[0-9])+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetition has the shortest match ((([0-9][a-z]+[0-9])+?)+)+? matches 2abc3 with 1 repetition (((([0-9][a-z]+[0-9])+?)+)+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetitions have the shortest match ((((([0-9][a-z]+[0-9])+?)+)+?)+)+? matches 2abc3 with 1 repetition (((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetitions have the shortest match final [0-9] matches the first 4 matched string is 12abc34 |
(0006899) dannyniu (reporter) 2024-10-02 02:07 |
One more question do we measure length by bytes, characters, or collating element? |
(0006900) geoffclare (manager) 2024-10-03 09:12 edited on: 2024-10-03 09:15 |
> do we measure length by bytes, characters, or collating element? Characters. The collating element issue is explicitly addressed in 9.1: When a multi-character collating element in a bracket expression (see Section 9.3.5, on page 182) is involved, the longest sequence shall be measured in characters consumed from the string to be matched; that is, the collating element counts not as one element, but as the number of characters it matches. Note that my suggested changes include a change from "longest sequence" to "sequence length" here. |
Mantis 1.1.6[^] Copyright © 2000 - 2008 Mantis Group |