0001959: dd conv=lcase and conv=ucase should only translate single byte locales

ID	Project	Category	View Status	Date Submitted	Last Update

0001959	1003.1(2024)/Issue8	Shell and Utilities	public	2025-11-13 23:13	2026-01-23 12:09

Reporter	collinfunk	Assigned To
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	Interpretation Required	Resolution	Accepted As Marked

Name
Organization	GNU
User Reference
Section	XCU dd
Page Number	2778
Line Number	91990 - 91996
Interp Status	Proposed
Final Accepted Text	0001959:0007365


Summary	0001959: dd conv=lcase and conv=ucase should only translate single byte locales
Description	The `conv=lcase` and `conv=ucase` conversions of `dd` require that the mapping be done as specified by `LC_CTYPE`. Here is the description of `conv=lcase`, for reference:

> Map uppercase characters specified by the LC_CTYPE keyword tolower to the corresponding lowercase character. Characters for which no mapping is specified shall not be modified by this conversion.

This requirement means that if `LC_CTYPE` is a multibyte locale, one may have to read across block boundaries to have a full character for case conversion. For example, using 512-byte blocks, a UTF-8 character like д, encoded as 0xd0 0xb4, might have 0xd0 as byte 512 of the first block and 0xb4 as byte 1 of the second block.

This is not a problem if `dd` operates on single bytes instead of characters. However, introducing case conversion means we must read entire multibyte characters, even if they extend across a block boundary. A further complicating factor is that case conversion may change the length of the character in Unicode. Take the following example:

$ python3 -c 'print(len("ß"))'
1
$ python3 -c 'print(len("ß".upper()))'
2

If we have an input block containing all ASCII characters and `ß` as the last character, using `conv=ucase,sync bs=512` would result in a 512-byte output block followed by a second block containing the second byte of uppercase `ß` and 511 NUL bytes. This is probably not what someone expects when using `dd`.
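The block-boundary case described above can be reproduced with a small sketch. This is only an illustration in Python, not how any `dd` implementation works; the `BLOCK_SIZE` constant and the `decodes_cleanly` helper are names made up for this example.

```python
# Sketch: a 2-byte UTF-8 character straddling a 512-byte block boundary.
BLOCK_SIZE = 512  # assumed block size, matching the bs=512 example

# 511 ASCII bytes followed by "д" (0xd0 0xb4), 513 bytes in total.
data = b"a" * (BLOCK_SIZE - 1) + "д".encode("utf-8")

# Split the input the way a block-oriented reader would.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def decodes_cleanly(block: bytes) -> bool:
    """Return True if the block is valid UTF-8 on its own (hypothetical helper)."""
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The first block ends with the lead byte, the second holds the trail byte.
print(blocks[0][-1:].hex(), blocks[1].hex())
# Neither block can be decoded to characters in isolation.
print([decodes_cleanly(b) for b in blocks])
```

A character-aware conversion would therefore have to carry the trailing 0xd0 over to the next block before it could map the character.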
Desired Action	I see that there was discussion about this in the past during a 2011 Austin Group meeting. Here is the relevant text for reference [1]:

> We concluded that we want to change the wording to limit the lcase and ucase translations to single byte locales as they cannot work on multibyte locales. We will need to have an aardvark filed. We were not happy with limiting the conversion to ASCII. There will need to be some discussion about conv=ascii, ibm and ebcdic overriding the current locale for case conversions.

Before seeing this, Pádraig Brady and I discussed this and came to the conclusion that this was the correct behavior. For more complete locale-aware case conversion, something like `tr '[:lower:]' '[:upper:]'` is more intuitive.

I could not find the original email, but it seems this was never acted upon since there was no bug report.

How about making it explicit that `conv=lcase` and `conv=ucase` only operate on single-byte locales? I am not sure of the best behavior regarding overriding the current locale.

[1] https://www.opengroup.org/austin/docs/austin_539.txt
Tags	tc1-2024

geoffclare 2025-12-11 17:17 manager bugnote:0007335 Last edited: 2026-01-22 16:54
OLD Interpretation response
------------------------
The standard states that conv=lcase and conv=ucase map characters as specified by LC_CTYPE, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
Mapping multibyte characters would cause problems for block sizes and across block boundaries and is not current implementation practice.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
On page 2778 line 91992,91996 section dd (lcase,ucase), after:

    Characters for which no mapping is specified shall not be modified by this conversion.

add a sentence:

    If a character to be mapped or a character resulting from the mapping is not a single-byte character, the behavior is unspecified.

After page 2778 line 92010 section dd (conv=value) add a paragraph after the list:

    If the ucase or lcase conversion is used together with ascii, ebcdic, or ibm, the behavior is unspecified.

After page 2783 line 92100 section dd (APPLICATION USAGE) add:

    Since the conversion from upper to lower or vice-versa is unspecified if the character to be mapped or a character resulting from the mapping is not a single-byte character, or if converting to ASCII or EBCDIC and attempting to change case, it is recommended to use tr to perform such conversions.

On page 2783 line 92110 section dd (EXAMPLES), change:

    dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

to:

    dd if=/dev/tape ibs=800 cbs=80 conv=ascii | tr '[:upper:]' '[:lower:]' > x

    The following command should not be used for this purpose as its behavior is unspecified:

    dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

ajosey 2025-12-15 06:55 manager bugnote:0007338	Interpretation Proposed: 15 December 2025

stephane 2025-12-16 07:16 reporter bugnote:0007340
I don't think:

> dd if=/dev/tape ibs=800 cbs=80 conv=ascii | tr '[:upper:]' '[:lower:]'

makes much sense.

That conv=ascii is meant to be an "EBCDIC" to "ASCII" conversion, but with tables that cover values 0 to 255 even though ASCII only has characters for bytes 0 to 127. One may think it's about conversion from some superset of ASCII such as one of the ISO8859-x character sets to some national variant of EBCDIC, but it does not appear to be.

Those conversion tables appear to be taken straight from the original dd implementation from Unix v5 in the early 70s (https://github.com/dspinellis/unix-history-repo/blob/Research-V5/usr/source/s1/dd.c#L31-L100), and PWB Unix for the "IBM" variant.

Those tables reference OCR characters ⑀⑁⑂ (https://en.wikipedia.org/wiki/OCR-A) which are not found in any modern single-byte character set (let alone ASCII-based or EBCDIC-based ones) and suggest those conversion tables (and the "print train" terminology) are about interaction with long-forgotten technology from the era (long before Unix localisation, long before ISO8859-x let alone Unicode and UTF-8).

The dd text mentions "standard EBCDIC" as if there was one such standard. Same in iconv:

> The iconv utility may support the conversion between ASCII and EBCDIC-based
> encodings, but is not required to do so. In an XSI-compliant implementation,
> the dd utility is the only method guaranteed to support conversion between
> these two character sets

In the rationale section of dd:

> 2. EBCDIC 0137 (') translates to/from ASCII 0236 ('^'). In the standard
> table, EBCDIC 0232 (no graphic) is used.

ASCII ^ is 0136, not 0236 (ASCII only covers 0 to 0177). In the "EBCDIC" table, we indeed see 0232 for ASCII 0136, and for the "IBM" one, we see 0137 and a ´ glyph (not in ASCII).

So the output of dd conv=ascii would be in a charset that is a superset of ASCII but is not in use today, and in particular not found as the charmap in any locale.

dd if=/dev/tape ibs=800 cbs=80 conv=ascii | LC_ALL=C tr '[:upper:]' '[:lower:]'

may make sense (maybe to process a tape recovered from some museum?) on ASCII-based systems, as it would transliterate only A-Z to a-z, but then again, you'd rather use:

LC_ALL=C dd if=/dev/tape ibs=800 cbs=80 conv=ascii,lcase

then.

On non-ASCII systems, the output of dd conv=ascii would not be text that can be processed by tr in any locale.

I would remove any attempt to do case conversion on the output of dd conv=ascii/ebcdic/ibm, and (though that would likely be for a separate bug) clarify that those conv=ascii/ebcdic/ibm are legacy stuff of no relevance today and modernise the text so it can be understood by someone who hasn't happened to have worked at AT&T in the 70s/80s. Or maybe drop those altogether.
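As a rough illustration of what transliterating only A-Z to a-z means at the byte level, here is a minimal Python sketch. It is not the implementation of `tr` or `dd`; `ASCII_LOWER`, `lcase_block`, and `BLOCK_SIZE` are names invented for this example. It also shows why a byte-wise mapping is insensitive to where block boundaries fall.

```python
import string

# 256-entry table: A-Z map to a-z, every other byte value maps to itself.
ASCII_LOWER = bytes.maketrans(
    string.ascii_uppercase.encode("ascii"),
    string.ascii_lowercase.encode("ascii"),
)


def lcase_block(block: bytes) -> bytes:
    """Lowercase ASCII letters only; bytes >= 0x80 pass through (hypothetical helper)."""
    return block.translate(ASCII_LOWER)


BLOCK_SIZE = 512  # assumed block size for the demonstration

# Mixed input: ASCII letters, a UTF-8 sequence (0xd0 0xb4) and a stray 0xff byte.
data = b"HELLO \xd0\xb4 World\xff" * 100

blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
per_block = b"".join(lcase_block(b) for b in blocks)
whole = lcase_block(data)

# Converting block by block gives the same bytes as converting everything
# at once, and the length never changes.
print(per_block == whole, len(per_block) == len(data))
```

Because each output byte depends on exactly one input byte, no state has to be carried from one block to the next.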

stephane 2025-12-16 07:39 reporter bugnote:0007341
> However, introducing case conversion means we must read entire multibyte characters, even if they extend across a block boundary. A further complicating factor is that case conversion may change the length of the character in Unicode. Take the following example:
>
> $ python3 -c 'print(len("ß"))'
> 1
> $ python3 -c 'print(len("ß".upper()))'
> 2
>
> If we have an input block containing all ASCII characters and `ß` as the last character, using `conv=ucase,sync bs=512` would result in a 512-byte output block followed by a second block containing the second byte of uppercase `ß` and 511 NUL bytes. This is probably not what someone expects when using `dd`.

POSIX case conversion is from character to character; it cannot translate "ß" to "SS" as per Unicode (or like perl/python do). It can, however, translate between characters whose encodings differ in size, including ASCII ones such as "i", whose uppercase translation would be "İ" in some locales and which is encoded on 2 bytes in UTF-8.
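The distinction above can be checked with a short Python sketch. Python's `str.upper()` is not locale-aware, so the "i"/"İ" pair is written out by hand here rather than produced by a locale; the variable names are illustrative only.

```python
# One-to-one pairing as found in some locales: "i" <-> "İ" (U+0130).
lower_i = "i"
upper_dotted_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Both are single characters, but their UTF-8 encodings differ in size.
lower_bytes = len(lower_i.encode("utf-8"))
upper_bytes = len(upper_dotted_i.encode("utf-8"))
print(len(lower_i), lower_bytes)          # character count, byte count
print(len(upper_dotted_i), upper_bytes)

# Python applies Unicode full case mapping, which is one-to-many for "ß".
# A character-to-character mapping cannot produce this result.
full_upper = "ß".upper()
print(full_upper, len(full_upper))
```

So even a strictly one-to-one mapping can change the number of bytes in a block once characters outside the single-byte range are involved.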

stephane 2025-12-16 07:52 reporter bugnote:0007342	I would rather recommend awk '{print toupper($0)}' for case conversion. tr was really meant for byte-based transliteration, and several implementations still work that way.

geoffclare 2025-12-16 15:11 manager bugnote:0007344
> That conv=ascii is meant to be a "EBCDIC" to "ASCII", but with tables that cover values 0 to 255 even though ASCII only has characters for bytes 0 to 127.

Good catch. We just assumed the output is ASCII so using tr (with current LC_CTYPE) would be fine.

> dd if=/dev/tape ibs=800 cbs=80 conv=ascii | LC_ALL=C tr '[:upper:]' '[:lower:]'

I think this is the right solution.

> you'd rather use: LC_ALL=C dd if=/dev/tape ibs=800 cbs=80 conv=ascii,lcase then.

The standard doesn't specify the order in which those conversions are applied (perhaps it should), so to be portable you'd need:

    dd if=/dev/tape ibs=800 cbs=80 conv=ascii | LC_ALL=C dd conv=lcase

which has the disadvantage over the tr solution that it would produce two sets of block counts.

> clarify that those conv=ascii/ebcdic/ibm are legacy stuff of no relevance today

They are relevant today to users of z/OS (which is a certified UNIX system that uses EBCDIC).
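To see why the order of the two conversions matters, consider the following Python sketch. It uses Python's `cp037` codec as a stand-in EBCDIC code page, which is an assumption made purely for illustration; it is not the table used by `dd conv=ascii`, and `ASCII_LOWER` is a name invented for the example.

```python
import string

# ASCII-only lowercase table, as in the C locale: A-Z -> a-z, rest unchanged.
ASCII_LOWER = bytes.maketrans(
    string.ascii_uppercase.encode("ascii"),
    string.ascii_lowercase.encode("ascii"),
)

CODEC = "cp037"  # assumption: stand-in EBCDIC code page, NOT dd's own table

ebcdic_input = "HELLO.".encode(CODEC)

# Order 1: convert EBCDIC to ASCII first, then lowercase the ASCII bytes.
convert_then_lower = (
    ebcdic_input.decode(CODEC).encode("ascii").translate(ASCII_LOWER).decode("ascii")
)

# Order 2: apply the ASCII lowercase table to the raw EBCDIC bytes,
# then convert. EBCDIC letters do not sit at the ASCII A-Z byte values,
# so the letters are left alone while other bytes may be altered.
lower_then_convert = ebcdic_input.translate(ASCII_LOWER).decode(CODEC)

print(convert_then_lower)
print(lower_then_convert)
print(convert_then_lower == lower_then_convert)
```

Under these assumptions only the first order yields lowercase text, which is consistent with piping the converted output through a separate case-conversion step.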

stephane 2025-12-16 19:36 reporter bugnote:0007345
Re: 0001959:0007344

> They are relevant today to users of z/OS (which is a certified UNIX system that uses EBCDIC).

Some form of EBCDIC may be relevant to z/OS, but I doubt the conv=ascii, conv=ebcdic, conv=ibm would be relevant to them as they have little to do with any EBCDIC charset that may still be in use today. I'd expect they'd use iconv if they needed to convert between any ASCII-based charset and any of the EBCDIC national variants.

geoffclare 2026-01-22 16:53 manager bugnote:0007365
Interpretation response
------------------------
The standard states that conv=lcase and conv=ucase map characters as specified by LC_CTYPE, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
Mapping multibyte characters would cause problems for block sizes and across block boundaries and is not current implementation practice.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
On page 2778 line 91992,91996 section dd (lcase,ucase), after:

    Characters for which no mapping is specified shall not be modified by this conversion.

add a sentence:

    If a character to be mapped or a character resulting from the mapping is not a single-byte character, the behavior is unspecified.

After page 2778 line 92010 section dd (conv=value) add a paragraph after the list:

    If the ucase or lcase conversion is used together with ascii, ebcdic, or ibm, the behavior is unspecified.

After page 2783 line 92100 section dd (APPLICATION USAGE) add:

    Since the conversion from upper to lower or vice-versa is unspecified if the character to be mapped or a character resulting from the mapping is not a single-byte character, or if converting to ASCII or EBCDIC and attempting to change case, it is recommended to use tr to perform such conversions.

On page 2783 line 92110 section dd (EXAMPLES), change:

    dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

to:

    dd if=/dev/tape ibs=800 cbs=80 conv=ascii | LC_ALL=C tr '[:upper:]' '[:lower:]' > x

    The following command should not be used for this purpose as its behavior is unspecified:

    dd if=/dev/tape of=x ibs=800 cbs=80 conv=ascii,lcase

After page 2783 line 92100 (APPLICATION USAGE) add a new paragraph:

    Applications that need to convert between ASCII (ISO/IEC 646:1991 US) and some variant of EBCDIC should check to see if the implementation supports such conversion under iconv rather than using dd.

agadmin 2026-01-22 17:11 administrator bugnote:0007367	Interpretation Proposed: 22 Jan 2026

stephane 2026-01-23 06:51 reporter bugnote:0007368
Re: 0001959:0007365

> Applications that need to convert between ASCII (ISO/IEC 646:1991 US) and some variant of EBCDIC should check to see if the implementation supports such conversion under iconv rather than using dd.

Thanks. The:

> The iconv utility may support the conversion between ASCII and EBCDIC-based encodings, but is not required to do so. In an XSI-compliant implementation, the dd utility is the only method guaranteed to support conversion between these two character sets.

part in the rationale section of the iconv utility specification would also need to be amended or removed.

And maybe we should obsolete those conv={ascii,ibm,ebcdic} in a separate ticket, or at least reword the text so it is comprehensible to people not familiar with technology from the 70s and note that those conversion tables have no relevance today.

geoffclare 2026-01-23 10:02 manager bugnote:0007369	Re: 0001959:0007368 I don't see any reason to change that part of the iconv rationale; it is still a valid statement. The fact that ASCII<->EBCDIC conversion is not mandatory in iconv is the reason the new text we're adding to dd says "... should check to see if the implementation supports ...".

stephane 2026-01-23 12:09 reporter bugnote:0007370
Re: 0001959:0007369

My point was that the "the dd utility is the only method guaranteed to support conversion between these two character sets" is at best misleading, as the dd conversion tables refer to variants of EBCDIC that have no relevance today (if ever) and should be obsoleted, and are not /converting between "ASCII" and "EBCDIC"/ by any modern definition of those.

Now, I see on Debian that dd's conv=ibm, if limited to characters covered by ASCII (byte values 0 to 127), gives the same result as iconv -t CP1047, which https://en.wikipedia.org/wiki/EBCDIC#Code_pages_with_Latin-1_character_sets suggests is the "Open Systems (MVS C compiler)" codeset, whatever that is, and as iconv -t CP1137, which https://en.wikibooks.org/wiki/Character_Encodings/Code_Tables/EBCDIC/EBCDIC_1137 indicates "is an EBCDIC code page used on IBM mainframes to support Devanagari script". So maybe some Indian users of IBM mainframes can make some use of that.

And there is no iconv translation for ASCII characters (0 to 127) that matches dd's conv=ebcdic.

Date Modified	Username	Field	Change
2025-11-13 23:13	collinfunk	New Issue
2025-12-11 17:17	geoffclare	Note Added: 0007335
2025-12-11 17:18	geoffclare	Status	New => Interpretation Required
2025-12-11 17:18	geoffclare	Resolution	Open => Accepted As Marked
2025-12-11 17:18	geoffclare	Name	Your Name Here =>
2025-12-11 17:18	geoffclare	Interp Status	=> Pending
2025-12-11 17:18	geoffclare	Final Accepted Text	=> 0001959:0007335
2025-12-11 17:19	geoffclare	Tag Attached: tc1-2024
2025-12-15 06:55	ajosey	Interp Status	Pending => Proposed
2025-12-15 06:55	ajosey	Note Added: 0007338
2025-12-16 07:16	stephane	Note Added: 0007340
2025-12-16 07:39	stephane	Note Added: 0007341
2025-12-16 07:52	stephane	Note Added: 0007342
2025-12-16 15:11	geoffclare	Note Added: 0007344
2025-12-16 19:36	stephane	Note Added: 0007345
2026-01-22 16:53	geoffclare	Note Added: 0007365
2026-01-22 16:54	geoffclare	Note Edited: 0007335
2026-01-22 16:55	geoffclare	Interp Status	Proposed => Pending
2026-01-22 16:55	geoffclare	Final Accepted Text	0001959:0007335 => 0001959:0007365
2026-01-22 17:11	agadmin	Interp Status	Pending => Proposed
2026-01-22 17:11	agadmin	Note Added: 0007367
2026-01-23 06:51	stephane	Note Added: 0007368
2026-01-23 10:02	geoffclare	Note Added: 0007369
2026-01-23 12:09	stephane	Note Added: 0007370
