View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001942 | 1003.1(2024)/Issue8 | Shell and Utilities | public | 2025-08-31 22:16 | 2025-09-12 14:45 |
Reporter | dwheeler | Assigned To | ajosey | ||
Priority | normal | Severity | Objection | Type | Enhancement Request |
Status | Under Review | Resolution | Open | ||
Name | David A. Wheeler | ||||
Organization | |||||
User Reference | diff | ||||
Section | diff | ||||
Page Number | 1 | ||||
Line Number | 1 | ||||
Interp Status | |||||
Final Accepted Text | |||||
Summary | 0001942: Add common options to diff | ||||
Description | # Proposal: Add options to diff

The current POSIX diff specification <https://pubs.opengroup.org/onlinepubs/9799919799/utilities/diff.html> lacks many options that are widely supported across major implementations. Adding them would make portable use of `diff` far more capable.

I did a cross-comparison of GNU, macOS, FreeBSD, and BusyBox to see which options are widely implemented in diff with the same semantics. I think that widespread implementation is a good argument that they're useful and should be standardized. (If I made a mistake, my apologies; please correct me.)

| Option | GNU | macOS | FreeBSD | BusyBox | Notes |
|--------|-----|-------|---------|---------|--------|
| `-a` | YES | YES | YES | YES | Consistent |
| `-B` | YES | YES | YES | YES | Consistent |
| `-d` | YES | YES | YES | YES* | *BusyBox: Partial implementation |
| `-i` | YES | YES | YES | YES* | *BusyBox: ASCII-only |
| `-I` | YES | YES | YES | NO | Not in BusyBox |
| `-N` | YES | YES | YES | YES | Consistent |
| `-q` | YES | YES | YES | YES | Consistent |
| `-s` | YES | YES | YES | YES | Consistent |
| `-T` | YES | YES | YES | YES | Consistent |
| `-w` | YES | YES | YES | YES | Consistent |
| `-x` | YES | YES | YES | NO | Not in BusyBox |
| `-X` | YES | YES | YES | NO | Not in BusyBox |

Obviously there are other implementations, but I thought this would be a helpful starting point.

While `-I`, `-x`, and `-X` aren't implemented in BusyBox, I think they'd be easy to add since it already has a regex engine.

The `-i` (ignore case) option honors the locale in every implementation *except* BusyBox. BusyBox is usually compiled for only the C locale, so although it currently uses only ASCII comparison, its users probably don't notice in practice. I think `-i` should be standardized with locale support, for the locales the implementation is compiled to support. BusyBox can note that it only fully complies when in the C/POSIX locale, and has limited support beyond that.
Full disclosure: I used Claude Code to do some of this analysis, and then hand-checked it. It's kind of an experiment in doing things this way. I think it's correct, but welcome corrections. Here are the details.

## `-a` (Treat as Text)

**Purpose**: Force diff to treat all files as ASCII text, even binary files.

**Behavior**: When specified, diff will attempt to compare files character-by-character as text even if they appear to contain binary data or null bytes. Without this option, most diffs try to detect binary files and report only "Binary files differ."

**Use Cases**: Comparing files that contain some binary data but are primarily text, or forcing comparison of files with unusual content that diff might misidentify as binary.

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-B` (Ignore Blank Lines)

**Purpose**: Ignore blank lines when comparing files.

**Behavior**: Lines containing only whitespace characters (spaces, tabs) or completely empty lines are treated as if they don't exist for comparison purposes. This is particularly useful when comparing code or configuration files where blank line differences are not semantically meaningful.

**Technical Detail**: A "blank line" is defined as a line containing only characters that match the POSIX `[[:space:]]` character class or an empty line (length zero after removing the trailing newline).

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-d` (Minimal)

**Purpose**: Use a minimal difference algorithm to find the smallest set of changes.

**Behavior**: Instructs diff to expend extra computational effort to find the minimal set of differences between files. While POSIX states that diff should "make every effort" to be minimal, the reality is that by default implementations wisely don't make *every* effort. This option explicitly requests the most thorough algorithm available, even if it may cause a crash or many hours of analysis.
**Performance Impact**: May significantly increase processing time for large files, but produces more readable and smaller difference sets.

**Implementation Variance**:
- GNU diff: Uses Myers algorithm with additional optimizations
- BSD variants: May use different algorithmic approaches but achieve similar minimality
- BusyBox diff: Partial implementation affecting the "bound" calculation in stone algorithm

## `-i` (Ignore Case)

**Purpose**: Perform case-insensitive comparison of file contents.

**Behavior**: When specified, diff treats uppercase and lowercase letters as equivalent during comparison. The locale-specific case conversion follows the LC_CTYPE category, ensuring proper handling of international character sets.

**Implementation Notes**:
- GNU diff: Implements full Unicode case folding
- macOS diff: Uses system locale for case conversion
- FreeBSD diff: Follows POSIX locale conventions
- BusyBox diff: Simple ASCII-only case conversion (does not use LC_CTYPE locale)

## `-I` (Ignore Lines Matching Pattern)

**Purpose**: Ignore lines that match a specified regular expression pattern.

**Syntax**: `-I regexp` where regexp is a regular expression in the style specified by POSIX Extended Regular Expressions.

**Behavior**: Lines in either file that match the specified pattern are excluded from comparison. This is invaluable for ignoring timestamp lines, version comments, or other automatically-generated content that changes frequently but isn't semantically important.

**Multiple Patterns**: Most implementations allow multiple `-I` options to specify several patterns. If only one is supported, alternation such as `(first|second|third)` would do the same thing.

**Example**: `diff -I '^#.*Date:' file1 file2` would ignore any lines beginning with `#` and containing `Date:`.

**Implementation Status**: Supported by GNU, macOS, and FreeBSD. **Not implemented in BusyBox**.

## `-N` (New File)

**Purpose**: Treat absent files as empty files during directory comparison.
**Behavior**: When comparing directories, if a file exists in one directory but not the other, treat the missing file as if it were an empty file. This generates output showing all lines of the existing file as additions or deletions.

**Rationale**: Essential for meaningful directory comparisons where files may be added or removed. Without this option, diff typically reports "Only in..." messages that provide less useful information for automated processing.

**Implementation**: All major implementations (GNU, macOS, FreeBSD, BusyBox) support this consistently.

## `-q` (Brief/Quiet Mode)

**Purpose**: Report only whether files differ, without showing the actual differences.

**Behavior**: Instead of displaying line-by-line differences, diff outputs only a brief message stating whether files differ. If files are identical, no output is produced. For different files, it outputs "Files X and Y differ."

**Use Cases**: Shell scripts that only need to know if files are different without processing the actual differences, or batch operations where detailed output would be excessive.

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-s` (Report Same Files)

**Purpose**: Report when two files are identical.

**Behavior**: When files are identical, output "Files X and Y are identical." Normally, diff produces no output for identical files. This option is particularly useful in combination with `-q` or when processing multiple file pairs.

**Use Cases**: Verification scripts that need confirmation of file identity, or batch operations where positive confirmation is required.

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-T` (Tab Alignment)

**Purpose**: Insert a tab character instead of a space before line text for consistent alignment.

**Behavior**: In unified and context diff formats, lines are normally prefixed with a space, `+`, or `-` followed by another space.
This option replaces that second space with a tab character, ensuring consistent alignment regardless of the display environment's tab stop settings.

**Use Cases**: When diff output will be processed by tools expecting tab-delimited fields, or for consistent formatting in environments with varying tab stop configurations.

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-w` (Ignore All Whitespace)

**Purpose**: Ignore all whitespace differences when comparing files.

**Behavior**: All sequences of whitespace characters (spaces and tabs within a line) are treated as equivalent. This means `hello world` would match `hello\t\tworld` or `hello   world`.

**Distinction from `-b`**: POSIX `-b` ignores trailing whitespace and treats any other run of whitespace as equal to any other run; `-w` goes further and ignores all whitespace differences throughout each line.

**Use Cases**: Particularly valuable when comparing code that may have been reformatted or when different editors have inconsistent whitespace handling.

**Implementation**: Supported by GNU, macOS, FreeBSD, and BusyBox.

## `-x` (Exclude Pattern)

**Purpose**: Exclude files and directories matching a specified shell pattern during recursive comparison.

**Syntax**: `-x pattern` where pattern is a shell-style glob pattern (e.g., `*.tmp`, `backup*`).

**Behavior**: When performing recursive directory comparisons (with `-r`), skip files and directories whose names match the specified pattern. Multiple `-x` options can be used to specify multiple exclusion patterns.

**Use Cases**: Comparing source trees while ignoring temporary files, backup files, or build artifacts.

**Example**: `diff -r -x '*.o' -x '*.tmp' dir1 dir2` excludes object files and temporary files.

**Implementation**: Supported by GNU, macOS, and FreeBSD. **Not implemented in BusyBox**.

## `-X` (Exclude From File)

**Purpose**: Exclude files and directories matching patterns listed in a specified file.

**Syntax**: `-X file` where file contains one exclusion pattern per line.
**Behavior**: Similar to `-x`, but reads exclusion patterns from a file instead of the command line. Each line in the file is treated as a separate exclusion pattern.

**Use Cases**: Projects with standard exclusion lists that are maintained as configuration files, or when the number of exclusion patterns would make command-line specification unwieldy.

**Example**: `diff -r -X .diffignore dir1 dir2` where `.diffignore` contains patterns like `*.log`, `temp/`, `*.bak`.

**Implementation**: Supported by GNU, macOS, and FreeBSD. **Not implemented in BusyBox**. | ||||
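As a rough illustration of the content-comparison options described above, the following sketch exercises `-q`, `-s`, `-i`, `-w`, and `-B` on two small files and records the exit status of each run. It assumes a `diff` that already implements these options (GNU diffutils, for example) and a `mktemp` utility; the file contents and variable names are invented for the demonstration.

```shell
# Sketch only: -q, -s, -i, -w and -B are not in POSIX yet, so this assumes
# a diff that already implements them (for example GNU diffutils).

tmp=$(mktemp -d) || exit 1

printf 'Hello world\n\nsecond line\n' > "$tmp/a"
printf 'hello\t\tworld\nsecond line\n' > "$tmp/b"

# Plain comparison: the files differ, so diff exits with status 1.
plain_status=0
diff "$tmp/a" "$tmp/b" > /dev/null || plain_status=$?

# -q: report only that the files differ, not how.
brief_status=0
brief_output=$(diff -q "$tmp/a" "$tmp/b") || brief_status=$?

# -i ignores case, -w ignores whitespace, -B ignores the extra blank line.
relaxed_status=0
diff -i -w -B "$tmp/a" "$tmp/b" > /dev/null || relaxed_status=$?

# -s: say so explicitly when the files are identical.
same_status=0
same_output=$(diff -s "$tmp/a" "$tmp/a") || same_status=$?

printf 'plain=%d brief=%d relaxed=%d same=%d\n' \
    "$plain_status" "$brief_status" "$relaxed_status" "$same_status"
printf '%s\n%s\n' "$brief_output" "$same_output"

rm -rf "$tmp"
```

The statuses are captured with `|| var=$?` so that the sketch also works in scripts that run under `set -e`, where a bare `diff` returning 1 would otherwise abort the script.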
Desired Action | # Changes to POSIX diff specification

## Synopsis Section

Change from:

```
diff [-c|-e|-f|-u|-C n|-U n] [-br] file1 file2
```

To:

```
diff [-c|-e|-f|-u|-C n|-U n] [-abBdiNqrsTw] [-I regexp] [-x pattern] [-X file] file1 file2
```

## Description Section

Change from:

```
This list should be minimal.
```

To:

```
This list should be reasonably minimal.
```

## Options Section

Add the following options in alphabetical order after existing options:

**-a** Treat all files as text. Files that would otherwise be identified as binary files shall be treated as text files.

**-B** Ignore lines that are blank. A blank line is a line that contains only <space> characters, <tab> characters, or is empty.

**-d** Use a more aggressive algorithm to minimize the number of changes in the output. This may require significantly more time and memory.

**-i** Ignore differences in case when comparing lines. Case conversion shall be performed according to the LC_CTYPE category.

**-I regexp** Ignore lines in both files that match the Extended Regular Expression regexp. Multiple -I options may be specified; lines matching any of the patterns shall be ignored.

**-N** If file1 or file2 is a directory and the other is not, or if one file is missing during directory comparison, treat the missing file as an empty file.

**-q** Output only whether the files differ. Write to standard output a single line for each pair of files that differ: "Files file1 and file2 differ". Write nothing if the files are identical.

**-s** Report identical files. When files are identical, write to standard output: "Files file1 and file2 are identical".

**-T** On output, replace the second <space> character with a <tab> character in the line prefix for context and unified format outputs.

**-w** Ignore all <space> and <tab> character differences when comparing lines. Any sequence of one or more <space> or <tab> characters shall be equivalent to any other such sequence.
**-x pattern** During recursive directory comparison, exclude files and directories whose basename matches the shell pattern specified by pattern. Multiple -x options may be specified. Pattern matching follows the rules specified in XBD Pattern Matching Notation.

**-X file** During recursive directory comparison, exclude files and directories whose basenames match any pattern in file. Each line in file shall be treated as a shell pattern following the same matching rules as -x. | ||||
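The directory-related options proposed above (`-N`, `-x`, `-X`) can be tried out with a sketch like the following. It assumes a `diff` that already implements them (GNU diffutils, for example; BusyBox lacks `-x` and `-X`) and a `mktemp` utility. The directory names, file names, and the `exclude.list` pattern file are hypothetical.

```shell
# Sketch only: assumes a diff that implements -N, -x and -X
# (for example GNU diffutils; BusyBox lacks -x and -X).
# All file and directory names below are made up for the demonstration.

tmp=$(mktemp -d) || exit 1
mkdir "$tmp/old" "$tmp/new"

printf 'int main(void) { return 0; }\n' > "$tmp/old/main.c"
cp "$tmp/old/main.c" "$tmp/new/main.c"
printf 'stale object\n' > "$tmp/old/build.o"            # build artifact
printf 'update the changelog\n' > "$tmp/new/notes.txt"  # file added in "new"

# -x skips names matching the pattern; -N shows the added file's contents
# as added lines instead of an "Only in" message.
tree_status=0
tree_output=$(cd "$tmp" && diff -r -u -N -x '*.o' old new) || tree_status=$?

# -X reads the same kind of pattern from a file, one pattern per line.
printf '%s\n' '*.o' > "$tmp/exclude.list"
list_status=0
list_output=$(cd "$tmp" && diff -r -X exclude.list old new) || list_status=$?

printf '%s\n' "$tree_output" "$list_output"
rm -rf "$tmp"
```

In both runs the only reported difference should be `notes.txt`, since `build.o` is excluded by the `*.o` pattern and `main.c` is identical in both trees.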
Tags | No tags attached. |
|
Note: historically the `-a` option stands for "ASCII", but I suspect it can be redefined as "as text". In *practice* these systems do line-by-line comparisons. I did some brief testing on macOS diff, and it handles UTF-8 just fine with `-a`. |
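For readers who want to repeat this kind of check, here is a minimal sketch of the effect of `-a` on input that would otherwise be classified as binary. It assumes a `diff` that detects binary input by default and implements `-a` (GNU diffutils, for example), plus a `mktemp` utility; the file contents are invented, and the wording of the default message varies between implementations.

```shell
# Sketch only: assumes a diff that detects binary input by default and
# implements -a (for example GNU diffutils).

tmp=$(mktemp -d) || exit 1

# Both files share a first line containing a NUL byte and differ on line 2.
printf 'alpha\0\nbeta\n'  > "$tmp/a"
printf 'alpha\0\ngamma\n' > "$tmp/b"

# Default: the NUL byte typically makes diff report only that the files differ.
default_output=$(diff "$tmp/a" "$tmp/b") || true

# -a: compare line by line anyway.
text_status=0
text_output=$(diff -a "$tmp/a" "$tmp/b") || text_status=$?

printf 'default: %s\n' "$default_output"
printf 'with -a:\n%s\n' "$text_output"
rm -rf "$tmp"
```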
|
At the meeting on 2025-09-12 I was asked to make some improvements to this, which I plan to do. I'm recording this information here so the plan will be documented and so people can correct me if I misunderstood something. There appeared to be general agreement that these were widely implemented and worthy of including in POSIX. However, at the meeting some needed clarifications were identified:
Once I've completed my investigations, I intend to post an updated proposal that uses a restricted HTML as documented in https://www.austingroupbugs.net/proj_doc_page.php ; this limits HTML tags to p, li, ul, ol, br, pre, i, b, u, em, blockquote. If there's anything wrong/missing in this plan, please let me know. |
Date Modified | Username | Field | Change |
---|---|---|---|
2025-08-31 22:16 | dwheeler | New Issue | |
2025-08-31 22:16 | dwheeler | Status | New => Under Review |
2025-08-31 22:16 | dwheeler | Assigned To | => ajosey |
2025-09-01 15:08 | dwheeler | Note Added: 0007248 | |
2025-09-11 15:50 | geoffclare | Project | 1003.1(2008)/Issue 7 => 1003.1(2024)/Issue8 |
2025-09-12 14:43 | dwheeler | Note Added: 0007257 | |
2025-09-12 14:44 | dwheeler | Note Edited: 0007257 | |
2025-09-12 14:45 | dwheeler | Note Edited: 0007257 |