View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0001077 | 1003.1(2013)/Issue7+TC1 | System Interfaces | public | 2016-09-11 17:47 | 2022-05-12 15:52 |
| Reporter | deadpixi | Assigned To | |||
| Priority | normal | Severity | Editorial | Type | Enhancement Request |
| Status | Closed | Resolution | Rejected | ||
| Name | Rob King | ||||
| Organization | |||||
| User Reference | |||||
| Section | regcomp | ||||
| Page Number | 1783-1789 | ||||
| Line Number | 57399-57703 | ||||
| Interp Status | --- | ||||
| Final Accepted Text | |||||
| Summary | 0001077: Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior | ||||
| Description | The existing mandated regular expression interfaces are specified to work solely on regular expressions and inputs that are specified using single-byte character strings. The standard is silent on what the regular expression functions should do if either the regular expression or the input contains multi-byte encoded characters in the current (or any) multi-byte encoding. This makes it impossible to rely on the regular expression interfaces in a portable manner, as their behavior is unspecified with multi-byte characters. For example, in UTF-8 encoding, a single logical code point in a regular expression might be encoded as four individual bytes. If such a code point were used in, e.g., a regular expression character class, a naive implementation that expected each character to take up a single byte would treat each individual byte as a separate character to be matched in the character class, and not as a single character. | ||||
| Desired Action | The Standard should specify behavior of the regular expression interfaces when the expression to be compiled or the input to be matched contains multi-byte characters in the current character encoding. Alternatively (and perhaps preferably), the Standard should specify additional wide-character regular expression interfaces (perhaps named regwcomp and regwexec) to perform regular expression compilation and matching on expressions and inputs specified using wide characters; this would avoid any issue with multi-byte encoding. The specification of the additional wide-character interfaces would likely not be too burdensome: the GNU C library (when compiled with RE_ENABLE_I18N defined) already represents regular expressions and the input as wide characters internally. The Mac OS X standard library supports the "regwcomp" and "regwexec" interfaces. A free, portable, well-licensed, and popular implementation (libtre) exists and has been incorporated into several popular C libraries (musl, Darwin's libc, etc). | ||||
| Tags | No tags attached. | ||||
|
|
As an additional note, wide-character regular expressions are described in standard C++ as well, as of the C++11 language. |
|
|
I see nothing in the description of regular expressions in the standard nor in the description of the regcomp() and regexec() functions that restricts their use to single-byte character strings. I believe the requirements are perfectly clear and that they apply to multi-byte character strings (such as UTF-8) just as much as they do to single-byte character strings (such as ASCII, EBCDIC, and ISO 8859-*). |
|
|
I appreciate the rapid response, thank you. I agree now that the multi-byte concerns are likely unfounded. I still think that there is some significant merit to specifying the wide-character interfaces. The only way to portably do many things on strings is to first convert them to wide strings. For example, it is impossible to portably move a character pointer backwards to point at the previous character of a string, since in shifted encodings one would first have to scan from the beginning of the string. I appreciate everyone's time in considering this. |
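
The conversion the note above refers to might be sketched as follows. This is only an illustration, not part of the proposal: `to_wide` and `prev_char` are hypothetical helper names, and the sketch assumes the POSIX behavior of `mbstowcs()` in which a null destination pointer returns the number of wide characters required.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multi-byte string in the current locale's
 * encoding to a newly allocated wide string. On success, stores the number
 * of characters in *len and returns the buffer (caller frees). Returns NULL
 * on an invalid multi-byte sequence or allocation failure. */
wchar_t *to_wide(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);

    /* With a null destination, POSIX mbstowcs() only counts characters. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    mbstowcs(w, s, n + 1);   /* n + 1 leaves room for the terminator */
    *len = n;
    return w;
}

/* Hypothetical helper: once the string is wide, the previous character is
 * one element back; no rescan from the start of the string is needed.
 * Returns NULL when p already points at the first character. */
const wchar_t *prev_char(const wchar_t *start, const wchar_t *p)
{
    return p > start ? p - 1 : NULL;
}
```

A caller would typically invoke `setlocale(LC_ALL, "")` first so that `mbstowcs()` interprets the bytes in the user's encoding; without that call, the "C" locale applies.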
|
|
This issue failed to get a sponsor, and is therefore rejected. If complete wording can be specified and forwarded to a sponsor (IEEE, ISO/IEC JTC 1/SC 22, or The Open Group), it can be reconsidered in the future. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2016-09-11 17:47 | deadpixi | New Issue | |
| 2016-09-11 17:47 | deadpixi | Name | => Rob King |
| 2016-09-11 17:47 | deadpixi | Section | => regcomp |
| 2016-09-11 17:47 | deadpixi | Page Number | => - |
| 2016-09-11 17:47 | deadpixi | Line Number | => - |
| 2016-09-11 17:53 | deadpixi | Note Added: 0003376 | |
| 2016-09-11 21:03 | Don Cragun | Note Added: 0003377 | |
| 2016-09-11 21:08 | Don Cragun | Page Number | - => 1783-1789 |
| 2016-09-11 21:08 | Don Cragun | Line Number | - => 57399-57703 |
| 2016-09-11 21:08 | Don Cragun | Interp Status | => --- |
| 2016-09-11 23:54 | deadpixi | Note Added: 0003378 | |
| 2022-05-12 15:51 | nick | Note Added: 0005833 | |
| 2022-05-12 15:51 | nick | Note Edited: 0005833 | |
| 2022-05-12 15:52 | nick | Status | New => Closed |
| 2022-05-12 15:52 | nick | Resolution | Open => Rejected |