View Issue Details

ID: 0001959
Project: 1003.1(2024)/Issue8
Category: Shell and Utilities
View Status: public
Last Update: 2025-12-15 06:55
Reporter: collinfunk
Assigned To:
Priority: normal
Severity: Editorial
Type: Clarification Requested
Status: Interpretation Required
Resolution: Accepted As Marked
Name:
Organization: GNU
User Reference:
Section: XCU dd
Page Number: 2778
Line Number: 91990 - 91996
Interp Status: Proposed
Final Accepted Text: 0001959:0007335
Summary: 0001959: dd conv=lcase and conv=ucase should only translate single byte locales
Description: The `conv=lcase` and `conv=ucase` conversions of `dd` require that conversion be done as specified by `LC_CTYPE`. Here is the description of `conv=lcase`, for reference:

> Map uppercase characters specified by the LC_CTYPE keyword tolower to the corresponding lowercase character. Characters for which no mapping is specified shall not be modified by this conversion.

This requirement means that if `LC_CTYPE` is a multibyte locale, one may have to read across block boundaries to obtain a full character for case conversion. For example, with 512-byte blocks, a UTF-8 character like д, encoded as 0xd0 0xb4, might have 0xd0 as byte 512 of the first block and 0xb4 as byte 1 of the second block. This is not a problem if `dd` operates on single bytes instead of characters.
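
The block-boundary split described above can be reproduced with a short Python sketch. The 511 bytes of ASCII filler and the `decodes_alone` helper are assumptions made for illustration; this models a block-oriented reader, not any particular `dd` implementation.

```python
# Hypothetical input: 511 ASCII bytes followed by "д" (0xd0 0xb4 in UTF-8).
BLOCK_SIZE = 512
data = b"a" * 511 + "д".encode("utf-8")

# Split the byte stream into fixed-size blocks, as a block-oriented reader would.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

print(hex(blocks[0][-1]))  # last byte of the first block
print(hex(blocks[1][0]))   # first byte of the second block


def decodes_alone(block: bytes) -> bool:
    """Return True if the block is valid UTF-8 on its own."""
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Neither block can be decoded in isolation: the first ends mid-character
# and the second starts with a continuation byte.
print([decodes_alone(b) for b in blocks])
```

A character-based conversion therefore cannot treat each block independently; it has to carry the trailing 0xd0 over to the next read.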

However, introducing case conversion means we must read entire multibyte characters, even if they extend across a block boundary. A further complicating factor is that case conversion may change the length of a character in Unicode. Take the following example:

    $ python3 -c 'print(len("ß"))'
    1
    $ python3 -c 'print(len("ß".upper()))'
    2

If we have an input block containing all ASCII characters and `ß` as the last character, using `conv=ucase,sync bs=512` would result in a 512-byte output block followed by a second block containing the second byte of uppercase `ß` and 511 NUL bytes. This is probably not what someone expects when using `dd`.
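
What matters for block sizes is the encoded byte length, whereas Python's `len()` on a string counts code points. The following sketch measures bytes instead. It is a simplified model under stated assumptions: `pad_blocks` is a hypothetical stand-in for NUL padding and does not reproduce the real semantics of `conv=sync`, and U+0149 (ŉ) is chosen as an assumed example because its uppercase mapping in Python is one byte longer in UTF-8.

```python
BLOCK_SIZE = 512


def pad_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Split data into size-byte blocks and NUL-pad the last one.

    Hypothetical helper: a simplified stand-in for padding, not dd itself.
    """
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    return [b.ljust(size, b"\x00") for b in blocks]


# Assumed input: 510 ASCII bytes plus U+0149, which is 2 bytes in UTF-8,
# so the input fills exactly one 512-byte block.
text = "a" * 510 + "\u0149"
original = text.encode("utf-8")
converted = text.upper().encode("utf-8")

# Byte lengths before and after case conversion.
print(len(original), len(converted))

# The extra byte spills into a second, almost entirely NUL-padded block.
out = pad_blocks(converted)
print(len(out), out[1][:1], out[1].count(b"\x00"))
```

Python applies Unicode full case mappings here, which may differ from what a given C locale's `toupper` mapping provides.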
Desired Action: I see that this was discussed during a 2011 Austin Group meeting. Here is the relevant text, for reference [1]:

> We concluded that we want to change the wording to limit the lcase and ucase translations to single byte locales as they cannot work on multibyte locales. We will need to have an aardvark filed. We were not happy with limiting the conversion to ASCII. There will need to be some discussion about conv=ascii, ibm and ebcdic overriding the current locale for case conversions.

Before seeing this, Pádraig Brady and I discussed the issue and came to the conclusion that this was the correct behavior. For more complete locale-aware case conversion, something like `tr '[:lower:]' '[:upper:]'` is more intuitive.
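
To illustrate why a stream-oriented filter suits this job better than a block-oriented one, here is a minimal Python sketch that converts case while carrying incomplete characters across block boundaries. The function name `upper_stream` and the sample data are assumptions; this is not how `tr` is implemented, and it relies on Python's Unicode tables rather than `LC_CTYPE`.

```python
import codecs


def upper_stream(blocks, encoding="utf-8"):
    """Yield uppercased bytes for an iterable of byte blocks.

    The incremental decoder buffers a trailing partial character until
    the bytes that complete it arrive in a later block.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for block in blocks:
        text = decoder.decode(block)
        if text:
            yield text.upper().encode(encoding)
    # Flush; raises UnicodeDecodeError if the input ends mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.upper().encode(encoding)


# Assumed input: "д" straddles the boundary between two 512-byte blocks.
data = b"a" * 511 + "д".encode("utf-8")
blocks = [data[i:i + 512] for i in range(0, len(data), 512)]
result = b"".join(upper_stream(blocks))
print(result.decode("utf-8")[-3:])
```

Note that the output chunks no longer line up with the input blocks, which is acceptable for a filter but at odds with the fixed block sizes `dd` works with.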

I could not find the original email, but it seems this was never acted upon since there was no bug report. How about making it explicit that `conv=lcase` and `conv=ucase` only operate on single-byte locales? I am not sure what the best behavior is regarding overriding the current locale.

[1] https://www.opengroup.org/austin/docs/austin_539.txt
Tags: tc1-2024

Activities

geoffclare

2025-12-11 17:17

manager   bugnote:0007335

Interpretation response
------------------------
The standard states that conv=lcase and conv=ucase map characters as specified by LC_CTYPE, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
Mapping multibyte characters would cause problems for block sizes and across block boundaries and is not current implementation practice.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------

On page 2778 line 91992,91996 section dd (lcase,ucase), after:
Characters for which no mapping is specified shall not be modified by this conversion.

add a sentence:
If a character to be mapped or a character resulting from the mapping is not a single-byte character, the behavior is unspecified.
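
As a rough illustration of a conversion that stays within single-byte characters, the following Python sketch builds a 256-entry byte table. It is only one possible choice under assumptions: the helper name `single_byte_upper_table` is hypothetical, ISO-8859-1 is an arbitrary example encoding, Python's Unicode case tables stand in for the locale's `toupper` mapping, and leaving unmappable bytes unchanged is a design decision of this sketch, since the proposed wording makes that case unspecified.

```python
def single_byte_upper_table(encoding: str) -> bytes:
    """Build a 256-entry table mapping each byte to its uppercase byte.

    Bytes whose uppercase form is not a single byte in the same encoding
    are mapped to themselves (an assumption of this sketch).
    """
    table = bytearray(range(256))
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
            up = ch.upper().encode(encoding)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # undefined byte or unrepresentable result: unchanged
        if len(ch) == 1 and len(up) == 1:
            table[b] = up[0]
    return bytes(table)


table = single_byte_upper_table("iso-8859-1")
sample = "straße café".encode("iso-8859-1")

# Byte-for-byte translation never changes the length of the data.
print(sample.translate(table).decode("iso-8859-1"))
```

Because every input byte maps to exactly one output byte, block sizes and block boundaries are unaffected by this kind of conversion.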


After page 2778 line 92010 section dd (conv=value) add a paragraph after the list:
If the ucase or lcase conversion is used together with ascii, ebcdic, or ibm, the behavior is unspecified.



After page 2783 line 92100 section dd (APPLICATION USAGE) add:
Since the conversion from upper to lower or vice-versa is unspecified if the character to be mapped or a character resulting from the mapping is not a single-byte character, or if converting to ASCII or EBCDIC and attempting to change case, it is recommended to use tr to perform such conversions.


On page 2783 line 92110 section dd (EXAMPLES), change:
dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

to:
dd if=/dev/tape ibs=800 cbs=80 conv=ascii | tr '[:upper:]' '[:lower:]' > x


The following command should not be used for this purpose as its behavior is unspecified:
dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

ajosey

2025-12-15 06:55

manager   bugnote:0007338

Interpretation Proposed: 15 December 2025

Issue History

| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2025-11-13 23:13 | collinfunk | New Issue | |
| 2025-12-11 17:17 | geoffclare | Note Added: 0007335 | |
| 2025-12-11 17:18 | geoffclare | Status | New => Interpretation Required |
| 2025-12-11 17:18 | geoffclare | Resolution | Open => Accepted As Marked |
| 2025-12-11 17:18 | geoffclare | Name | Your Name Here => |
| 2025-12-11 17:18 | geoffclare | Interp Status | => Pending |
| 2025-12-11 17:18 | geoffclare | Final Accepted Text | => 0001959:0007335 |
| 2025-12-11 17:19 | geoffclare | Tag Attached: tc1-2024 | |
| 2025-12-15 06:55 | ajosey | Interp Status | Pending => Proposed |
| 2025-12-15 06:55 | ajosey | Note Added: 0007338 | |