ID | Category | Severity | Type | Date Submitted | Last Update
0001077 | [1003.1(2013)/Issue7+TC1] System Interfaces | Editorial | Enhancement Request | 2016-09-11 17:47 | 2022-05-12 15:52
Reporter | deadpixi | View Status | public
Assigned To |
Priority | normal | Resolution | Rejected
Status | Closed
Name | Rob King
Organization |
User Reference |
Section | regcomp
Page Number | 1783-1789
Line Number | 57399-57703
Interp Status | ---
Final Accepted Text |
Summary | 0001077: Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior
Description
The existing mandated regular expression interfaces are specified solely in terms of regular expressions and inputs given as single-byte character strings. The standard is silent on what regcomp() and regexec() should do if either the regular expression or the input contains characters encoded as multiple bytes in the current (or any) multi-byte encoding. Because that behavior is unspecified, the interfaces cannot be relied on portably for such input. For example, in UTF-8 a single logical code point in a regular expression may be encoded as four individual bytes. If such a character were used in, e.g., a character class (bracket expression), a naive implementation that expected each character to occupy a single byte would treat each of the four bytes as a separate member of the class, rather than treating the sequence as one character.
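The failure mode described above can be illustrated with a small sketch. This is not how any particular regcomp() implementation works; naive_class_contains() and naive_class_span() are hypothetical helpers that model a matcher assuming one byte per character, and the byte strings are the UTF-8 encodings of U+1F600 and U+1F601.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical naive matcher: every byte of `set` is treated as its own
 * class member, as a single-byte-only implementation would do. */
static bool naive_class_contains(const char *set, unsigned char byte)
{
    return byte != '\0' && memchr(set, byte, strlen(set)) != NULL;
}

/* Number of leading bytes of `input` that the naive matcher accepts as
 * members of `set`. */
static size_t naive_class_span(const char *set, const char *input)
{
    size_t n = 0;
    while (naive_class_contains(set, (unsigned char)input[n]))
        n++;
    return n;
}
```

With the class contents "\xF0\x9F\x98\x80" (U+1F600, four bytes) and the input "\xF0\x9F\x98\x81" (U+1F601, a different character), naive_class_span() returns 3: the first three bytes of the wrong character are each accepted as class members, although the input does not contain the character named in the class at all.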
Desired Action
The Standard should specify the behavior of the regular expression interfaces when the expression to be compiled or the input to be matched contains multi-byte characters in the current character encoding. Alternatively (and perhaps preferably), the Standard should specify additional wide-character regular expression interfaces (perhaps named regwcomp and regwexec) that perform compilation and matching on expressions and inputs given as wide-character strings; this would avoid any issue with multi-byte encoding. Specifying the additional wide-character interfaces would likely not be too burdensome for implementors: the GNU C library (when compiled with RE_ENABLE_I18N defined) already represents regular expressions and input as wide characters internally, and the Mac OS X standard library supports the "regwcomp" and "regwexec" interfaces. A free, portable, permissively licensed, and popular implementation (libtre) exists and has been incorporated into several popular C libraries (musl, Darwin's libc, etc.).
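As a rough sketch of what callers of such wide-character interfaces would need to do first, the helper below converts a multi-byte string in the current locale's encoding to a wide string using only the standard mbstowcs(). The name to_wide() is hypothetical, and the proposed regwcomp/regwexec calls themselves are not shown because they are not part of the Standard.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multi-byte string in the current locale's encoding to a newly
 * allocated wide string. Returns NULL on an invalid multi-byte sequence or
 * on allocation failure; the caller frees the result. */
static wchar_t *to_wide(const char *mbs)
{
    /* With a null destination, mbstowcs() only counts the wide
     * characters needed, excluding the terminator. */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    if (mbstowcs(ws, mbs, n + 1) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}
```

Once pattern and input are both wide strings, each element is one character, so a class member can no longer be split into separate bytes. The conversion depends on the locale selected with setlocale(), so a program would normally call setlocale(LC_ALL, "") before using it on non-ASCII text.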
Tags | No tags attached.
Attached Files |