Anonymous | Login | 2024-10-09 03:00 UTC |
Main | My View | View Issues | Change Log | Docs |
Viewing Issue Simple Details [ Jump to Notes ] | [ Issue History ] [ Print ] | ||||||
ID | Category | Severity | Type | Date Submitted | Last Update | ||
0001329 | [1003.1(2013)/Issue7+TC1] Base Definitions and Headers | Objection | Clarification Requested | 2020-03-29 20:54 | 2024-06-11 08:54 | ||
Reporter | rhialto | View Status | public | ||||
Assigned To | |||||||
Priority | normal | Resolution | Accepted As Marked | ||||
Status | Closed | ||||||
Name | Olaf 'Rhialto' Seibert | ||||||
Organization | |||||||
User Reference | |||||||
Section | Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2 | ||||||
Page Number | 190 | ||||||
Line Number | 6195 | ||||||
Interp Status | --- | ||||||
Final Accepted Text | Note: 0005111 | ||||||
Summary | 0001329: Problem in resolution of 0000793: "Regular Expressions: add REG_MINIMAL and a minimum repitition modifier" | ||||||
Description |
Issue 793 (which has status Applied) contains this text to apply: - Vol. 1: Base Definitions, Chapter 9, «Regular Expressions». 9.4.6 EREs Matching Multiple Characters, p. 190, line 6195: insert after 6. Each of the duplication symbols ('+', '*', '?', and intervals) may be suffixed by the minimal repitition modifier '?' <question-mark>, in which case matching behaviour is changed from the «leftmost longest possible match» to the «leftmost shortest possible match», including the null match (see [reference to A.9, p. 3500 ff.]). For example, the ERE ".*c" matches the last character ('c') in the string "abc abc", whereas the ERE ".*?c" matches the first character 'c', the third character in the string. There seems to be a mistake in this text: For example, the ERE ".*c" matches the last character ('c') in the string "abc abc", which should likely be For example, the ERE ".*c" matches all characters up to the last character ('c') in the string "abc abc", As an existing implementation, NetBSD's egrep matches the whole string: $ echo abc-abc | egrep --color ".*c" abc-abc with the whole of abc-abc coloured red. The text that follows seems incorrect for a similar but potentially more serious reason: whereas the ERE ".*?c" matches the first character 'c', the third character in the string. [[ "abc abc" ]] Possibly it should match "abc", because that is leftmost; just matching "c" would be shortest but I don't see it as being leftmost. Which of these two is meant? Furthermore, "repitition" (used several times) seems to contain a typo. |
||||||
Desired Action | Correct the examples, and/or indicate whether "leftmost" or "shortest" is more important in a match with a minimal repetition modifier. | ||||||
Tags | issue8 | ||||||
Attached Files | |||||||
|
Relationships | ||||||
|
Notes | |
(0004804) kre (reporter) 2020-03-30 02:57 |
The major issue here is whether in "leftmost shortest possible match" which takes priority, "leftmost" or "shortest possible" when the results would be different (as in the example given) - the text of the example suggests that the shortest match should be preferred over leftmost, but I suspect that's probably just a similar example error than the obvious earlier one. In case of "leftmost longest possible match" this issue cannot arise, as moving the match further to the left necessarily makes it longer, It is likely that the text from the definition of "matched" in 9.1: The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where ``first'' is defined to mean ``begins earliest in the string''. will cover this case, but that's only if this is what the implementations that currently implement minimal matches implement it that way, and if the example is corrected to make it clear that is what happens. Additional words to make it clear that when there is an apparent conflict, the leftmost possible match is chosen rather than a shorter one beginning further to the right might be useful. |
(0004805) kre (reporter) 2020-03-30 03:14 edited on: 2020-03-30 05:59 |
0000073 should probably also have amended the text in 9.1 that immediately follows the quote given in Note: 0004804 ... If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. Since that will no longer be universally true. Further, when dealing with sub-matches, we need to know how the "minimal match" operator applies to submatches, both when those sub-matches do, and do not, also contain the minimal match operator. That is, in each of (([abc]*)([abc]*)([abc]*))+?x (([abc]*?)([abc]*)([abc]*))+?x (([abc]*)([abc]*?)([abc]*))+?x (([abc]*?)([abc]*?)([abc]*))+?x when applied to the string abcaabbccaaabbbcccx what is \1 \2 \3 and \4 in each case ? That will partly turn upon whether the added text If the REG_MINIMAL flag, as defined in the <regex.h>[REF] header, is used when compiling an ERE via regcomp(3)[REF], the «leftmost shortest possible match» is the default, and the minimal repitition modifier ’?’ can be used to select the «leftmost longest possible match». (added in the new s.6 on p 190 of 7-TC1) is intended to apply to the ? operator as well as the REG_MINIMAL flag. That is, when evaluating an ERE for which the "minimal match" operator applies, do sub-expressions inherit the mimimal match property, or not? And if they do, does the minimal match operator do what it does when REG_MINIMAL is used, and reverse its effect, to become a longest match operator? I don't know the answers to any of this, but I do know it all needs to be made clear. There might be more, RE's in general tend to be things I simply use, I generally prefer not to try and bend my mind around their definitions. |
(0004806) geoffclare (manager) 2020-03-30 08:36 |
When I applied bug 0000793 I spotted that "up to" was missing and inserted it. I should have added a note to the bug that I had done this - sorry. The source currently has: For example, the ERE .sG ".*c" matches up to the last character (\c .cH c ) in the string .sG "abc abc" , whereas the ERE .sG ".*?c" matches up to the first character .cH c , the third character in the string. I also fixed the spelling of "repetition" (and would not have added a note for that - minor editorial corrections like that are often needed when applying bugs). Re: Note: 0004805 I agree that a change to the definition of "matched" in 9.1 is needed, and we should address that problem here. |
(0004808) kre (reporter) 2020-03-30 10:27 |
I'd also like to see a better definition of what all of this really means, particularly REG_MINIMAL and the (new, as opposed to old) '?' operator when that is in use, and how all of this applies in some of the more difficult cases. I know I couldn't implement this in a way I would expect to be compatible with other implementations based only on what has been added in 0000793 (even assuming that we fix the definition of "matched"). |
(0005111) geoffclare (manager) 2020-11-09 17:06 |
Draft 1.1 page 163 after line 5700 add:leftmostThe characters closest to the beginning of the string. Draft 1.1 page 172 lines 6090-6100 change: to:Each of the duplication symbols ('+', '*', '?', and intervals) can be suffixed by the minimal repetition modifier '?' (<question-mark>), in which case matching behavior shall be changed from the leftmost longest possible match to the leftmost shortest possible match, including the null match (see Section A.9, on page 3410). For example, the ERE <tt>".*c"</tt> matches up to the last character (<tt>'c'</tt>) in the string <tt>"abc abc"</tt>, whereas the ERE <tt>".*?c"</tt> matches up to the first character <tt>'c'</tt>, the third character in the string. Each of the duplication symbols ('+', '*', '?', and intervals) can be suffixed by the repetition modifier '?' (<question-mark>), in which case matching behavior for that repetition shall be changed from the leftmost longest possible match to the leftmost shortest possible match, including the null match (see Section A.9, on page 3410). For example, the ERE ".*c" matches up to and including the last character ('c') in the string "abc abc", whereas the ERE ".*?c" matches up to and including the first character 'c', the third character in the string. Draft 1.1 page 163 line 5718-5722 after: Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE "\(.*\).*" against "abcdef", the subexpression "(\1)" is "abcdef", and matching the BRE "\(a*\)*" against "bc", the subexpression "(\1)" is the null string.add: However, matching the ERE "(.*?).*" against "abcdef", the subpattern "(.*?)" matches the empty string, since that is the longest possible match for the ERE ".*?". Draft 1.1 page 1719 line 56518, change: REG_MINIMALto:Change default matching behavior to leftmost shortest possible match. REG_MINIMALChange the matching behavior for duplication symbols to the leftmost shortest possible match, and invert the behavior of the repetition modifier '?' (<question-mark>) to match the longest possible match instead of the shortest. Draft 1.1 page 3411 line 116717-116719 change: EREs can optionally use a leftmost-shortest rule (enabled via the REG_MINIMAL flag or the '?' minimal repetition modifier), in which case the shortest possible matching prefix is instead identified as the matching sequence.to: EREs can optionally use a leftmost-shortest rule for repetitions (enabled via the REG_MINIMAL flag or the '?' repetition modifier), in which case the shortest possible matching prefix is instead identified as the matching sequence for the affected repetition(s). |
Issue History | |||
Date Modified | Username | Field | Change |
2020-03-29 20:54 | rhialto | New Issue | |
2020-03-29 20:54 | rhialto | Name | => Olaf 'Rhialto' Seibert |
2020-03-29 20:54 | rhialto | Section | => Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2 |
2020-03-29 20:54 | rhialto | Page Number | => 190 |
2020-03-29 20:54 | rhialto | Line Number | => 6195 |
2020-03-29 23:03 | Don Cragun | Relationship added | child of 0000793 |
2020-03-30 02:57 | kre | Note Added: 0004804 | |
2020-03-30 03:14 | kre | Note Added: 0004805 | |
2020-03-30 03:23 | kre | Note Edited: 0004805 | |
2020-03-30 05:59 | kre | Note Edited: 0004805 | |
2020-03-30 08:36 | geoffclare | Note Added: 0004806 | |
2020-03-30 10:27 | kre | Note Added: 0004808 | |
2020-11-09 17:06 | geoffclare | Note Added: 0005111 | |
2020-11-09 17:08 | geoffclare | Interp Status | => --- |
2020-11-09 17:08 | geoffclare | Final Accepted Text | => Note: 0005111 |
2020-11-09 17:08 | geoffclare | Status | New => Resolved |
2020-11-09 17:08 | geoffclare | Resolution | Open => Accepted As Marked |
2020-11-09 17:08 | geoffclare | Tag Attached: issue8 | |
2020-12-09 10:01 | geoffclare | Status | Resolved => Applied |
2024-06-11 08:54 | agadmin | Status | Applied => Closed |
Mantis 1.1.6[^] Copyright © 2000 - 2008 Mantis Group |