Austin Group Defect Tracker

ID 0001592
Category [Issue 8 drafts] Shell and Utilities
Severity Comment
Type Enhancement Request
Date Submitted 2022-07-15 08:36
Last Update 2022-07-18 14:51
Reporter geoffclare View Status public  
Assigned To
Priority normal Resolution Open  
Status New   Product Version Draft 2.1
Name Geoff Clare
Organization The Open Group
User Reference
Section printf
Page Number 3085
Line Number 104307
Final Accepted Text
Summary 0001592: Add %n$ support to the printf utility
Description During the work on adding gettext, it was suggested that we should also add support for %n$ in format strings to the printf utility, to allow argument reordering by translators.

This is not as straightforward as it might seem, as per the email thread "Adding %n$ conversions to the printf utility" in Sept 2021.

One of the points raised in the thread was that it would be preferable for printf to treat a missing argument as an error when %n$ is used. The suggested changes recommend this (via the use of "should") but also allow the null/zero behaviour that happens for unnumbered argument conversions.
Desired Action On page 3085 line 104307 section printf, insert a new item 8:
Conversions can be applied to the nth argument operand rather than to the next argument operand. In this case, the conversion specifier character '%' is replaced by the sequence "%n$", where n is a decimal integer in the range [1,{NL_ARGMAX}], giving the argument operand number. This feature provides for the definition of format strings that select arguments in an order appropriate to specific languages.

The format can contain either numbered argument conversion specifications (that is, ones beginning with "%n$"), or unnumbered argument conversion specifications, but not both. The only exception to this is that "%%" can be mixed with the "%n$" form. The results of mixing numbered and unnumbered argument specifications that consume an argument are unspecified.

and renumber the later items.

On page 3085 line 104307 section printf, change:
For each conversion specification that consumes an argument, the next argument operand shall be evaluated and converted to the appropriate type for the conversion as specified below.
to:
For each conversion specification that consumes an argument, an argument operand shall be evaluated and converted to the appropriate type for the conversion as specified below. The operand to be evaluated shall be determined as follows:

  • If the conversion specification begins with a "%n$" sequence, the nth argument operand shall be evaluated.

  • Otherwise, the evaluated operand shall be the next argument operand after the one evaluated by the previous conversion specification that consumed an argument; if there is no such previous conversion specification the first argument operand shall be evaluated.

If the format operand contains no conversion specifications that consume an argument and there are argument operands present, the results are unspecified.
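
The operand-selection rule above can be pictured with a small model. The following Python sketch is only an illustration and is not part of the proposed wording: the pattern SPEC and the function name operands_for_one_pass are invented here, and the pattern recognizes only a simplified subset of conversion specifications. It reports which argument operand numbers a format would evaluate on a single use, and it is meant for formats that use only one of the two forms, since the results of mixing them are unspecified.

```python
import re

# Simplified pattern (an assumption of this sketch): "%%", or "%" with an
# optional "n$", optional flags, width and precision, then a conversion letter.
SPEC = re.compile(r"%%|%(?:(\d+)\$)?[-+ #0]*\d*(?:\.\d*)?([A-Za-z])")


def operands_for_one_pass(fmt):
    """Return the 1-based operand numbers evaluated by one use of fmt."""
    picked = []
    next_unnumbered = 1
    for match in SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue  # "%%" writes a percent sign and consumes no argument
        if match.group(1) is not None:
            picked.append(int(match.group(1)))  # "%n$": evaluate the nth operand
        else:
            picked.append(next_unnumbered)  # otherwise: the next operand in turn
            next_unnumbered += 1
    return picked


print(operands_for_one_pass("%s costs %d\n"))          # [1, 2]
print(operands_for_one_pass("%2$d units of %1$s\n"))   # [2, 1]
print(operands_for_one_pass("%1$d%% done\n"))          # [1]
```

A format containing only "%%" and literal text yields an empty list here, which corresponds to the case whose results the text leaves unspecified when argument operands are present.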

On page 3085 line 104310 section printf, change:
9.
The format operand shall be reused as often as necessary to satisfy the argument operands. Any extra b, c, or s conversion specifiers shall be evaluated as if a null string argument were supplied; other extra conversion specifications shall be evaluated as if a zero argument were supplied. If the format operand contains no conversion specifications and argument operands are present, the results are unspecified.
to:
10.
The format operand shall be reused as often as necessary to satisfy the argument operands. If conversion specifications beginning with a "%n$" sequence are used, on format reuse the value of n shall refer to the nth argument operand following the highest numbered argument operand consumed by the previous use of the format operand.
11.
If an argument operand to be consumed by a conversion specification does not exist:

  • If it is a numbered argument conversion specification, printf should write a diagnostic message to standard error and exit with non-zero status, but may behave as for an unnumbered argument conversion specification.

  • If it is an unnumbered argument conversion specification, any extra b, c, or s conversion specifiers shall be evaluated as if a null string argument were supplied and any other extra conversion specifiers shall be evaluated as if a zero argument were supplied.

and renumber the later items.
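
The two missing-operand cases in the new item 11 can be sketched as a single lookup. This Python model is a hedged illustration rather than proposed wording: the names MissingOperand and operand_value and the strict parameter are hypothetical, with strict=True standing for the recommended ("should") diagnostic and strict=False for the permitted ("may") fallback to the unnumbered behaviour.

```python
class MissingOperand(Exception):
    """Raised by the strict policy when a numbered operand does not exist."""


def operand_value(args, number, conversion, numbered, strict=True):
    """Return the operand text for a conversion, or a default if it is missing.

    args       -- list of argument operands (strings)
    number     -- 1-based operand number the conversion refers to
    conversion -- conversion specifier character, such as "s" or "d"
    numbered   -- True if the conversion specification used the "%n$" form
    strict     -- True models the recommended diagnostic for numbered forms
    """
    if number <= len(args):
        return args[number - 1]
    if numbered and strict:
        raise MissingOperand(f"no argument operand {number}")
    # Unnumbered behaviour, also permitted for numbered conversions:
    # b, c and s see a null string; every other conversion sees zero.
    return "" if conversion in ("b", "c", "s") else "0"


args = ["apple"]
print(repr(operand_value(args, 1, "s", numbered=False)))               # 'apple'
print(repr(operand_value(args, 2, "s", numbered=False)))               # ''
print(repr(operand_value(args, 2, "d", numbered=True, strict=False)))  # '0'
```

With the default strict=True, the last call would raise MissingOperand instead, which stands in for writing a diagnostic and exiting with non-zero status.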

After page 3087 line 104372 section printf, add a new paragraph:
Unlike the printf() function, when numbered conversion specifications are used, specifying the Nth argument does not require that all the leading arguments, from the first to the (N-1)th, are specified in the format string. For example, "%3$s %1$d\n" is an acceptable format operand which evaluates the first and third argument operands but not the second.

On page 3089 line 104455 section printf, change:
The EXTENDED DESCRIPTION section almost exactly matches the printf() function in the ISO C standard, although it is described in terms of the file format notation in [xref to XBD Chapter 5].
to:
The format strings for the printf utility are handled differently than for the printf() function in several respects. Notable differences include:

  • Reuse of the format until all arguments have been consumed.

  • No provision for obtaining field width and precision from argument values.

  • No requirement to support floating-point conversion specifiers.

  • An additional b conversion specifier.

  • Special handling of leading single-quote or double-quote for integer conversion specifiers.

  • Hexadecimal character constants are not recognized in the format.

  • Formats that use numbered argument conversion specifications can have gaps in the argument numbers.

Although printf implementations have no difficulty handling formats with mixed numbered and unnumbered conversion specifications (unlike the printf() function where it is undefined behavior), existing implementations differ in behavior. Given the format operand "%2$d %d\n", with some implementations the "%d" evaluates the first argument and with others it evaluates the third. Consequently this standard leaves the behavior unspecified (as opposed to undefined).
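
The two existing behaviours described for "%2$d %d\n" can be modelled as two counting policies. In this Python sketch the policy names "separate" and "follow" are hypothetical labels chosen for illustration and are not tied to any particular implementation; the pattern SPEC is a simplified assumption.

```python
import re

# Simplified conversion-specification pattern (an assumption of this sketch).
SPEC = re.compile(r"%%|%(?:(\d+)\$)?[-+ #0]*\d*(?:\.\d*)?([A-Za-z])")


def mixed_operands(fmt, policy):
    """Return the operand numbers evaluated by fmt under a counting policy.

    "separate": unnumbered conversions keep their own counter, starting at 1.
    "follow":   an unnumbered conversion takes the operand after the one
                evaluated by the previous conversion, numbered or not.
    """
    if policy not in ("separate", "follow"):
        raise ValueError("unknown policy: " + policy)
    picked = []
    counter = 0   # last operand taken by an unnumbered conversion
    previous = 0  # last operand taken by any conversion
    for match in SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue
        if match.group(1) is not None:
            number = int(match.group(1))
        elif policy == "separate":
            counter += 1
            number = counter
        else:
            number = previous + 1
        previous = number
        picked.append(number)
    return picked


print(mixed_operands("%2$d %d\n", "separate"))  # [2, 1]
print(mixed_operands("%2$d %d\n", "follow"))    # [2, 3]
```

Both policies give the same answer for formats that use only one form, which is why the disagreement only shows up for mixed formats.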

On page 3089 line 104471 section printf, change FUTURE DIRECTIONS from:
None.
to:
A future version of this standard may require that a missing argument operand to be consumed by a numbered argument conversion specification is treated as an error.

Tags No tags attached.

-  Notes
(0005890)
kre (reporter)
2022-07-16 08:23
edited on: 2022-07-16 08:57

It is probably no big surprise, but I am opposed to this as it
stands currently.

I accept that having %n$ in printf is close to essential for
handling internationalised scripts, but it does not need to be
as described.

I also find it weird that, from the mailing list discussion, the one
issue considered worthy of adoption was the:
   "it would be preferable for printf to treat a missing argument as an error"
issue, which defeats the one and only potential solution to the real problem
with this proposal.

That is, a script which uses this mechanism controls the arg list
which is presented to printf (so, after the format is removed, $1 is
something known to the invoking script, as are $2 ...) but what it does
not control is the format. That comes from the message translator, who
provides the text of the format string, and interpolates whatever of the
arguments are needed for that particular translation, in the order that
is appropriate for the culture and language being presented.

In the most extreme case, a script might call printf with several args
which it believes would be useful to present to the user, but the
translator simply provides a constant translation of the intent of the
message, which uses none of the args.

That would currently (both with and without %n$ being added) run afoul of the

    If the format operand contains no conversion specifications that
    consume an argument and there are argument operands present,
    the results are unspecified.

restriction. That can be defeated (in an extremely ugly fashion) by
adding a %999$s conversion somewhere in the format - knowing that there
won't be 999 args given, so if we can rely on an undefined arg presenting
an empty string to a %s conversion, that can be used anywhere, with no
effect on the output.

But if that's allowed to be an error, even that won't work.

I believe it is time to remove that "results are unspecified" and change
the "no conversions with args" to result in "print the format string once,
and exit, ignoring unused args" - this change isn't really required for
normal printf, where the situation normally shouldn't occur. But once
we lose control of the format string, all bets are off.

For the same reason, this:

     The format operand shall be reused as often as necessary to satisfy the
     argument operands. If conversion specifications beginning with a "%n$"
     sequence are used, on format reuse the value of n shall refer to the nth
     argument operand following the highest numbered argument operand consumed
     by the previous use of the format operand.

does not work either. Since we have no idea which %n$ values will be used
in a particular translation, we cannot supply args that are correct for
this to work as desired.

Consider:
         printf translated-format 10 36.3 string 5 19.4 word

where one translation of the format is

         "text %1$d %3$s # %2$5.2f\n"
another "total %2$7.1g for %1$d %3$s\n"
and "That will be %2$.2f for %1$d objects\n"

(all in diverse languages of course). Consider what the proposed rule
does with the 3rd format.
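
To make the concern concrete, here is a hedged Python model of one reading of the proposed reuse rule, in which each reuse of the format starts counting after the highest operand number used by the format. The function name passes and the pattern NUMBERED are invented for this sketch, and the third format is read as using %1$d for its second conversion.

```python
import re

# Matches the "%n$" prefix of a numbered conversion (simplified: a literal
# "%%" directly followed by digits and "$" is not handled).
NUMBERED = re.compile(r"%(\d+)\$")


def passes(fmt, argc):
    """List, for each use of fmt, the absolute operand numbers it refers to.

    An entry is None where that operand does not exist.
    """
    numbers = [int(n) for n in NUMBERED.findall(fmt)]
    if not numbers:
        raise ValueError("this sketch only handles numbered conversions")
    highest = max(numbers)
    result = []
    base = 0  # operands already passed over by earlier uses of the format
    while True:
        result.append([base + n if base + n <= argc else None for n in numbers])
        base += highest
        if base >= argc:
            return result


args = ["10", "36.3", "string", "5", "19.4", "word"]
first = "text %1$d %3$s # %2$5.2f\n"
third = "That will be %2$.2f for %1$d objects\n"

print(passes(first, len(args)))  # [[1, 3, 2], [4, 6, 5]]
print(passes(third, len(args)))  # [[2, 1], [4, 3], [6, 5]]
print([[args[n - 1] for n in p] for p in passes(third, len(args))])
# [['36.3', '10'], ['5', 'string'], ['word', '19.4']]
```

Under this reading the first format walks the operands in triples as the script intended, while the third advances by two, so its second use hands "string" to the integer conversion.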

Again, the %999$s trick could save things - essentially it causes the format
string to be used exactly once, which for cases where we don't control what
is in the format, is really what we need.

The best solution is simply to not reuse the format string when any %n$
conversions are present. That avoids all kinds of issues. Further,
it is hard to imagine a sane case for the need for that.

Format reuse is not used all that often; when it is, it tends to be
for simple formats like
      printf " %s" "$@"
where internationalization is not an issue, or for tabular data, like

      printf '%5d\t%-12s\t%c\t%5d\n' ......

again, where there is not really anything in the format to translate
(on the other hand, the args, particularly strings, might want to be
translated, but that's a side issue).

So, I'd suggest changing things so the format is not reused if any %n$
conversion has been used, and forbidding it to be an error to refer to an
arg that doesn't exist (at least for one conversion, even if it is a new
made-up one which converts nothing). For example, %Z could be guaranteed to
print an empty string, but we could still use %20$Z to make this a format
using %n$ and thus, assuming the other change is accepted, avoid the
unspecified results when no args are referenced.

Without changes like these, the proposal is simply unworkable - which is
why I have resisted implementing %n$ in NetBSD printf. It would be truly
easy to simply do what the other implementations have done, and do the
easy thing (parsing and implementing it is trivial) but if the result is
as hopeless as all of those are, I am not at all interested.

(0005891)
geoffclare (manager)
2022-07-18 11:17

Re Note: 0005890

The reason the behaviour is unspecified when there are arguments but no conversions is because some implementations write a warning message about the unused arguments. If it stops being unspecified they will have to change.

A better work-around than %999$s is %1$.0s i.e. consume the argument(s) without them producing any contribution to the output. A similar trick also solves the "repeat the format" situation where a translator only wants to use two of three arguments - they can insert %3$.0s anywhere so that the third argument in each triple is consumed.
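
The effect of the %3$.0s work-around can be sketched by rendering a numbered-only format in Python. This is a hedged model, not a printf implementation: the names SPEC, render_once and render are invented here, only the d, f and s conversions are handled, every referenced operand is assumed to exist, and format reuse follows one reading of the proposed rule (each reuse starts after the highest operand number used by the format). The example format is kre's third translation, read with %1$d, plus the suggested %3$.0s.

```python
import re

# "%%", or a numbered conversion: "%n$", optional flags/width/precision,
# then one of the conversions this sketch supports.
SPEC = re.compile(r"%%|%(\d+)\$([-+ #0]*\d*(?:\.\d*)?)([dfs])")


def render_once(fmt, args, base):
    """Render one use of fmt, with "%n$" meaning operand base + n."""
    def convert(match):
        if match.group(0) == "%%":
            return "%"
        text = args[base + int(match.group(1)) - 1]
        kind = match.group(3)
        if kind == "d":
            value = int(text)
        elif kind == "f":
            value = float(text)
        else:
            value = text
        # Reuse Python's own %-formatting for the flags, width and precision.
        return ("%" + match.group(2) + kind) % value
    return SPEC.sub(convert, fmt)


def render(fmt, args):
    """Reuse fmt until all operands have been passed over."""
    highest = max(int(m.group(1)) for m in SPEC.finditer(fmt) if m.group(1))
    out = []
    base = 0
    while True:
        out.append(render_once(fmt, args, base))
        base += highest
        if base >= len(args):
            return "".join(out)


args = ["10", "36.3", "string", "5", "19.4", "word"]
print(render("That will be %2$.2f for %1$d objects%3$.0s\n", args), end="")
# That will be 36.30 for 10 objects
# That will be 19.40 for 5 objects
```

Without the %3$.0s, the second use of the format in this model would start at the third operand and try to convert "string" as an integer.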

You say "it is hard to imagine a sane case for the need for" format reuse with numbered arguments, but in the mailing list discussion someone (Stephane I think) said they often use it for reordering columns and would object to this feature being disallowed.

The restrictions for the printf utility as proposed are already less stringent than the printf() function, where if you consume the Nth argument you must also consume all arguments from 1 to (N-1). Translators seem to have been able to cope with the printf() restrictions for many years, so they should be able to do so equally well for the printf utility. The only "extra" problem if printf is used with translated formats in a similar fashion to printf() is the arguments-but-no-conversions issue and as I said above there is a simple work-around for that (%1$.0s).
(0005892)
bhaible (reporter)
2022-07-18 14:14

The proposed specification looks perfectly OK to me.
It marks a couple of cases as unspecified or error, that
would not be needed in practical uses, and thus avoids
backward-incompatible changes in existing implementations.

There is no handling for a width or precision that comes from an
argument, unlike in C, because POSIX printf(1) already does not
support this. Example:
LC_ALL=C printf '%.*f %s\n' 2 3.1415926 pi
So this is OK as well.

Replying to Robert Elz's comment:

> having %n$ in printf is close to essential for handling
> internationalised scripts

No. There can be different ways to internationalize a shell script.
For example, these four statements all do the same thing:

1) pid=$$; eval_gettext "Running as process number \$pid."; echo
2) printf_gettext "Running as process number %d." $$; echo
3) printf "`gettext 'Running as process number %d.'`" $$; echo
4) printf $(gettext 'Running as process number %d.') $$; echo

GNU gettext currently implements only the first one. When POSIX
printf(1) comes with %n$ support, it will be possible for GNU
gettext to support the three others as well. Namely, when
preparing a template PO file from the given script, xgettext
will add a marker '#, sh-printf-format' to the extracted message:

#, sh-printf-format
msgid "Running as process number %d."
msgstr ""

When translators then submit localizations, the '-c' option
of 'msgfmt -c' will ensure that the msgstr consumes the
same arguments, with the same formats. For example,

#, sh-printf-format
msgid "Running as process number %d."
msgstr "Laufe als Prozess Nummer %1$d."

will pass, whereas

#, sh-printf-format
msgid "Running as process number %d."
msgstr "Laufe als Prozess Nummer %1$s."

will be rejected (conversion specifier does not match), and

#, sh-printf-format
msgid "Running as process number %d."
msgstr "Laufe als Prozess Nummer."

will be rejected as well (number of consumed arguments does not match).
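
The kind of check described above can be approximated in a few lines. The following Python sketch is not how msgfmt is implemented; the names SPEC, consumed and same_arguments are invented for illustration, and the comparison covers only operand numbers and conversion specifier characters, ignoring flags, width and precision.

```python
import re

# Simplified conversion-specification pattern (an assumption of this sketch).
SPEC = re.compile(r"%%|%(?:(\d+)\$)?[-+ #0]*\d*(?:\.\d*)?([A-Za-z])")


def consumed(fmt):
    """Map each consumed operand number to its conversion specifier character."""
    result = {}
    position = 0
    for match in SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue  # consumes no argument
        if match.group(1) is not None:
            number = int(match.group(1))
        else:
            position += 1
            number = position
        result[number] = match.group(2)
    return result


def same_arguments(msgid, msgstr):
    """True if the translation consumes the same operands with the same conversions."""
    return consumed(msgid) == consumed(msgstr)


msgid = "Running as process number %d."
print(same_arguments(msgid, "Laufe als Prozess Nummer %1$d."))  # True
print(same_arguments(msgid, "Laufe als Prozess Nummer %1$s."))  # False
print(same_arguments(msgid, "Laufe als Prozess Nummer."))       # False
```

Because precision is ignored, a translation that changes "%.2f" to "%.3f" still passes this comparison.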

This checking of format strings is essential for C, because badly
localized printf(3) format strings would make the program crash.
The same mechanism, applied to printf(1) format strings, will prevent
the script from running into the unspecified/error cases of the
specification.

In other words, for the purpose of internationalization, it does not
really matter what the specification says about these corner cases,
because 'msgfmt -c' will ensure that these corner cases are not
exercised.

The important point is that POSIX defines the syntax with %n$ at all;
this will be the foundation for the format string checking of
'#, sh-printf-format' in msgfmt.

> In the most extreme case, a script might call printf with several args
> which it believes would be useful to present to the user, but the
> translator simply provides a constant translation of the intent of the
> message, which uses none of the args.

'msgfmt -c' will prevent this.

> what it does
> not control is the format. That comes from the message translator

This is intentional. When printing an amount of money, for example,

#, sh-printf-format
msgid "Price: %.2f"
msgstr "Preis: %.3f"

is OK, since the translator knows better than the program whether
amounts of money are better printed with 2 or with 3 decimals.

> and "That will be %2$.2f for %1$d objects\n"
> Consider what the proposed rule does with the 3rd format.

'msgfmt -c' will reject that translation (number of consumed
arguments does not match).
(0005893)
bhaible (reporter)
2022-07-18 14:51

Replying to Robert Elz:
> the one issue considered worthy of adoption was the:
> "it would be preferable for printf to treat a missing argument as an error"
> issue

Format strings in C also have the same restrictions that:
  (A) Numbered and unnumbered arguments cannot be used in the same format string.
  (B) Missing arguments in a format string with numbered arguments are an error.

Restriction (A) serves to ensure a clear specification, and also to catch some kinds of unintentional programmer/translator mistakes.

Restriction (B) exists in C, because sizeof(argument) can only be deduced from the conversion specifier. In languages where each argument is merely a pointer, such as Awk, Boost, JavaScript, Ruby, Tcl, this restriction is unnecessary. For shell programming, where each argument is a plain string, it is unnecessary as well.

- Issue History
Date Modified Username Field Change
2022-07-15 08:36 geoffclare New Issue
2022-07-15 08:36 geoffclare Name => Geoff Clare
2022-07-15 08:36 geoffclare Organization => The Open Group
2022-07-15 08:36 geoffclare Section => printf
2022-07-15 08:36 geoffclare Page Number => 3085
2022-07-15 08:36 geoffclare Line Number => 104307
2022-07-15 08:39 geoffclare Desired Action Updated
2022-07-16 08:23 kre Note Added: 0005890
2022-07-16 08:47 kre Note Edited: 0005890
2022-07-16 08:53 kre Note Edited: 0005890
2022-07-16 08:57 kre Note Edited: 0005890
2022-07-18 11:17 geoffclare Note Added: 0005891
2022-07-18 14:14 bhaible Note Added: 0005892
2022-07-18 14:51 bhaible Note Added: 0005893

