|Viewing Issue Simple Details|
|ID||Category||Severity||Type||Date Submitted||Last Update|
|0001077||[1003.1(2013)/Issue7+TC1] System Interfaces||Editorial||Enhancement Request||2016-09-11 17:47||2016-09-11 23:54|
|Final Accepted Text|
|Summary||0001077: Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior|
The existing mandated regular expression interfaces are specified to work solely on regular expressions and inputs that are specified using single-byte character strings. The standard is silent on what the regular expression functions should do if either the regular expression or the input contains multi-byte encoded characters in the current (or any) multi-byte encoding.
This makes it impossible to rely on the regular expression interfaces in a portable manner, as their behavior is unspecified with multi-byte characters.
For example, in UTF-8, a single logical code point in a regular expression may be encoded as up to four bytes. If such a character appeared in, e.g., a bracket expression (character class), a naive implementation that expected each character to occupy a single byte would treat each byte as a separate character to be matched, and not as a single character.
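The concern above can be probed with a minimal sketch along these lines. The helper name `match_in_locale` is hypothetical (it is not part of POSIX), and the locale name a caller passes, such as "C.UTF-8", is an assumption about the host system, so the helper reports -1 rather than guessing when the locale cannot be set.

```c
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of POSIX): compile `pattern` and test it
 * against `input` while the locale `locale_name` is in effect.
 * Returns 1 on match, 0 on no match, and -1 if the locale is
 * unavailable or the pattern fails to compile.
 */
int match_in_locale(const char *locale_name, const char *pattern,
                    const char *input)
{
    /* Copy the current locale name so it can be restored afterwards. */
    const char *cur = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    int result = -1;

    if (cur != NULL) {
        saved = malloc(strlen(cur) + 1);
        if (saved != NULL)
            strcpy(saved, cur);
    }

    if (setlocale(LC_ALL, locale_name) != NULL) {
        regex_t re;
        /* regcomp() interprets the pattern in the locale now in effect. */
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
            result = (regexec(&re, input, 0, NULL, 0) == 0) ? 1 : 0;
            regfree(&re);
        }
    }

    /* Restore the caller's locale. */
    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return result;
}
```

Calling `match_in_locale("C.UTF-8", "^[" "\xc3\xa9" "x]$", "\xc3\xa9")` asks whether the two-byte UTF-8 encoding of U+00E9 is treated as one character inside the bracket expression. An implementation that handles multi-byte characters should report a match; a byte-at-a-time implementation would not, because the anchored pattern allows only a single character.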
The Standard should specify behavior of the regular expression interfaces when the expression to be compiled or the input to be matched contains multi-byte characters in the current character encoding.
Alternatively (and perhaps preferably), the Standard should specify additional wide-character regular expression interfaces (perhaps named regwcomp and regwexec) to perform regular expression compilation and matching on expressions and inputs specified using wide characters; this would avoid any issue with multi-byte encoding.
Specifying the additional wide-character interfaces would likely not be too burdensome: the GNU C library (when compiled with RE_ENABLE_I18N defined) already represents regular expressions and the input as wide characters internally. The Mac OS X standard library supports the "regwcomp" and "regwexec" interfaces. A free, portable, well-licensed, and popular implementation (libtre) exists and has been incorporated into several popular C libraries (musl, Darwin's libc, etc.).
|Tags||No tags attached.|
|As an additional note, wide-character regular expressions are described in standard C++ as well, as of the C++11 language.|
Don Cragun (manager)
|I see nothing in the description of regular expressions in the standard nor in the description of the regcomp() and regexec() functions that restricts their use to single-byte character strings. I believe the requirements are perfectly clear and that they apply to multi-byte character strings (such as UTF-8) just as much as they do to single-byte character strings (such as ASCII, EBCDIC, and ISO 8859-*).|
I appreciate the rapid response, thank you.
I agree now that the multi-byte concerns are likely unfounded. I still think that there is significant merit in specifying the wide-character interfaces. The only portable way to do many things with strings is to first convert them to wide strings. For example, it is impossible to portably move a character pointer backwards to the previous character of a string, since in shifted encodings one would first have to scan from the beginning of the string.
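The convert-first approach described above can be sketched as follows. Both helper names (`to_wide`, `prev_wchar`) are hypothetical, and the sketch relies on the POSIX behavior of `mbstowcs()` returning the required length when the destination pointer is null.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the multibyte string `s` (in the current
 * locale's encoding) to a newly allocated wide string. On success the
 * character count is stored in *len_out and the caller frees the result.
 * Returns NULL on an invalid sequence or allocation failure.
 */
wchar_t *to_wide(const char *s, size_t *len_out)
{
    /* With a null destination, POSIX mbstowcs() returns the number of
     * wide characters needed, not counting the terminator. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    *len_out = n;
    return w;
}

/*
 * Hypothetical helper: step back one character in a wide string.
 * Each element is one character, so this is plain pointer arithmetic.
 * Returns NULL when `p` is already at `start`.
 */
const wchar_t *prev_wchar(const wchar_t *start, const wchar_t *p)
{
    return (p > start) ? p - 1 : NULL;
}
```

After conversion, stepping backwards needs no knowledge of shift states. The cost is the up-front scan and allocation, which is the same work the multi-byte case would force on every backwards move.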
I appreciate everyone's time in considering this.
|2016-09-11 17:47||deadpixi||New Issue|
|2016-09-11 17:47||deadpixi||Name||=> Rob King|
|2016-09-11 17:47||deadpixi||Section||=> regcomp|
|2016-09-11 17:47||deadpixi||Page Number||=> -|
|2016-09-11 17:47||deadpixi||Line Number||=> -|
|2016-09-11 17:53||deadpixi||Note Added: 0003376|
|2016-09-11 17:53||deadpixi||Issue Monitored: deadpixi|
|2016-09-11 21:03||Don Cragun||Note Added: 0003377|
|2016-09-11 21:08||Don Cragun||Page Number||- => 1783-1789|
|2016-09-11 21:08||Don Cragun||Line Number||- => 57399-57703|
|2016-09-11 21:08||Don Cragun||Interp Status||=> ---|
|2016-09-11 23:54||deadpixi||Note Added: 0003378|