Austin Group Defect Tracker

ID: 0001122
Category: [1003.1(2016)/Issue7+TC2] System Interfaces
Severity: Editorial
Type: Enhancement Request
Date Submitted: 2017-02-28 16:51
Last Update: 2017-04-05 05:49
Reporter: joerg
View Status: public
Assigned To:
Priority: normal
Resolution: Open
Status: New
Name: Jörg Schilling
Organization:
User Reference:
Section: 3 + 4
Page Number: 1102...and others
Line Number: somewhere in section 3 and 4
Interp Status: ---
Final Accepted Text:
Summary: 0001122: POSIX should include gettext() and friends
Description: POSIX defines catgets() but most software rather uses gettext()
Desired Action: Add the interfaces needed for gettext().

This should include at least gettext(), dgettext(), textdomain(), bindtextdomain(), dcgettext()

and the programs msgfmt, gettext, xgettext

This change seems to be a good candidate for issue 8.
-  Notes
(0003575)
steffen (reporter)
2017-03-01 16:05

I posted on the ML (which no longer includes a sequence number; the resent-message-id was "dhdJXC.A.-oG.z8rtYB"@Phoebe.vpn.opengroup.org), a superset of what follows:

  [..]it definitely must be possible to implement this as a macro,
  that is to say. And possibly GETTEXT() should at least be
  reserved, just in case ISO C does not go that route, so that
  GETTEXT(target-store, msgid) can be a #define with do{}while(0).[..]

This is because implementations of gettext() (-alike interfaces) have been seen, or may still exist, which avoid successive, and therefore practically always unnecessary database lookups by injecting code into compilation units (e.g., a modification-counter reference and a character pointer, both "static", to be initialized first, and only comparing the reference against the database-specific counter).
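
A minimal sketch of the kind of macro described above, not taken from any real implementation. The names demo_db_generation, demo_db_lookup and demo_greeting are hypothetical stand-ins for a message database, and the sketch is not thread-safe. Each expansion site owns a static modification-counter stamp and a static character pointer, and consults the database only when the stamp is stale:

```c
#include <stddef.h>

/* Hypothetical stand-ins for a message database; these names do not
 * come from any real gettext implementation. */
unsigned long demo_db_generation = 1; /* bumped when the catalogue changes */
unsigned long demo_lookup_calls = 0;  /* counts real database lookups */

const char *demo_db_lookup(const char *msgid)
{
    ++demo_lookup_calls;
    return msgid; /* no catalogue loaded: return the key unchanged */
}

/* GETTEXT(target-store, msgid): a statement macro. The two statics live
 * in the block of each expansion site, so every call site caches its own
 * result and only compares its stamp against the database counter. */
#define GETTEXT(store, msgid)                                          \
    do {                                                               \
        static unsigned long gt_seen_;                                 \
        static const char *gt_cached_;                                 \
        if (gt_cached_ == NULL || gt_seen_ != demo_db_generation) {    \
            gt_cached_ = demo_db_lookup(msgid);                        \
            gt_seen_ = demo_db_generation;                             \
        }                                                              \
        (store) = gt_cached_;                                          \
    } while (0)

/* Example call site: repeated calls look the string up once per
 * catalogue generation. */
const char *demo_greeting(void)
{
    const char *text;
    GETTEXT(text, "Hello, world");
    return text;
}
```

Because the macro expands to a statement rather than an expression, it cannot be used as a function argument, which is why a separate reserved name such as GETTEXT() would be needed next to gettext().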
(0003576)
shware_systems (reporter)
2017-03-01 16:54

While this looks obvious, certainly, catgets() and gettext() have both been obsolete, in terms of thoroughness, since before C99 came out. The gettext package's utilities rely on GNU extensions that make them non-conforming anyway, so the package isn't suitable to replace cat*() to begin with. Gettext is usable for a variety of languages, but it's still a subset of the languages currently in use. Both models, as compiler and implementation extensions, leave as application-defined those aspects of language translation that Unicode addresses in a more systematic fashion. Unicode also has known omissions, noted on their to do lists, but not as many. I believe this was touched upon in rejecting Bug 861, which also mentions gettext.

Imo, Issue 8 should include something, and I have a tentative design in mind (that includes Bug 861), but addressing this completely is part of Phase 3 or later, to account for how POSIX adds the Unicode requirements, so this can be part of a POSIX Option Group. I consider forward usability, as the final design implemented will be invention, as more important than backwards compatibility. Only a bulk conversion utility for msg and po files, and Win mc sources as a cross platform consideration, is planned for upgrading current applications. Source code and cat/mo file conversions I consider compiler and platform dependent, as does the standard, and overly impractical to automate. An application using gettext or catgets now won't have to upgrade, but those that choose to refactor their source files will stand a much better chance of being certified as Conforming with Extensions.
(0003577)
joerg (reporter)
2017-03-01 17:10
edited on: 2017-03-01 17:10

Why do you believe that gettext() is obsolete?

Why do you believe that gettext() relies on non-conforming GNU extensions?

gettext() was designed by Sun Microsystems in the 1980s and I see no problems with it.

There are a few GNU enhancements that are worth including, but these extensions are still not non-conforming.

(0003578)
steffen (reporter)
2017-03-01 17:11

No, you may feel the desire to turn gettext() into a Waldorf Astoria, but i am afraid you will neither be able to close its doors for renovation, nor turn it into a set of nice Unicode luxury apartments: because gettext() is the actual workhorse of internationalisation in Unix C programs, and charset agnostic. Compare this with catgets(), which is simply not used in any software that i know, except old and practically unmaintained codebases that still use it, but are asleep development-wise, except maybe for adjusting slightly to avoid the bitrot that modern compilers, well, not reveal, but rather generate. Ah, and the ISO C standard committee is not to be forgotten.
(0003579)
steffen (reporter)
2017-03-01 17:13

Sorry for the tune, Jörg Schilling mentioned the actual facts (to me of interest mostly that the C extensions that are in use are not needed for functioning, but only for a performance improvement).
(0003580)
joerg (reporter)
2017-03-01 17:23
edited on: 2017-03-01 18:09

The original Sun design for gettext() indeed was charset agnostic and
for this reason, every locale needed its own version of a .po and .mo file.

Fortunately, there is a GNU extension that allows the character set
used in the .po file to be recorded in the .po file itself: the first
entry has an empty source string that is "translated into" a header
naming the character encoding.

Even the Solaris implementation supports this and in turn uses iconv()
to convert to the actual character set in use. As a result, the .mo
files for locales with the same language but different encoding may be
linked together and make localization much easier.

BTW: the main problem with catgets() is that it needs a globally
maintained message number.

The main problem with gettext() is the ambiguity of the English language
when English is used as the base language for the texts.
This problem may however be circumvented by avoiding English texts
that may have more than one meaning, by either using better wording
or by adding more text to differentiate and by translating the long
English text into the short and ambiguous English text.

Another way would be to use German or French as the base language.

(0003581)
steffen (reporter)
2017-03-01 22:37

Hm, with charset agnostic i referred to the fact that the "char const *MSGID", the message itself, is used as the key of the message, so this is just a NUL terminated byte stream; which is then either mapped to a different byte stream that has been loaded from a LC_MESSAGES dependent database file, or can simply be returned "as-is".
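
A minimal sketch of that model, with a hypothetical in-memory table (demo_catalogue_de) standing in for the LC_MESSAGES dependent database file, and demo_gettext as an illustrative name rather than a real interface. The key is compared as a plain NUL terminated byte stream, and a miss returns the key itself:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical catalogue entry: original byte string and its translation. */
struct demo_entry {
    const char *msgid;
    const char *msgstr;
};

/* Stand-in for a database loaded from a locale-specific file. */
static const struct demo_entry demo_catalogue_de[] = {
    { "File not found", "Datei nicht gefunden" },
    { "Permission denied", "Zugriff verweigert" },
};

/* Look msgid up byte by byte; no character set knowledge is involved.
 * On a miss the very pointer passed in is returned "as-is". */
const char *demo_gettext(const char *msgid)
{
    size_t i;
    size_t n = sizeof demo_catalogue_de / sizeof demo_catalogue_de[0];

    for (i = 0; i < n; ++i) {
        if (strcmp(demo_catalogue_de[i].msgid, msgid) == 0)
            return demo_catalogue_de[i].msgstr;
    }
    return msgid;
}
```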

Too much time has passed, i cannot say anything to the actual plain text input format of .po files, without rereading all the available information that is.

 |BTW: the main problem with catgets() is that it needs a globally maintained message number.

I agree. It is in practice unmaintainable unless you have software which generates the MESSAGE-NUMBER = STRING mappings in an automated way. POSIX does not include such software, and it is, also in practice, not as easy as it seems to write such software, let alone portably, since strings can come from preprocessor expansions, spread across several lines, .. whatever.
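
To illustrate the maintenance burden, here is a sketch of what hand-kept message numbers look like at a catgets() call site. The DEMO_* constants and demo_cat_message are hypothetical; only catgets() and nl_catd are the POSIX interfaces. Every number must stay in step with the message source file by hand:

```c
#include <nl_types.h>

/* Manually maintained numbers, as a project header would hold them.
 * They must match the set and message numbers in the catalogue source. */
#define DEMO_SET_MOUNT      1
#define DEMO_MSG_USAGE      1
#define DEMO_MSG_NO_DEVICE  2

/* Fetch a message by number; fall back to the built-in English text
 * when no catalogue could be opened. */
const char *demo_cat_message(nl_catd cat, int msg_id, const char *fallback)
{
    if (cat == (nl_catd)-1)
        return fallback; /* catopen() failed: do not touch the descriptor */
    return catgets(cat, DEMO_SET_MOUNT, msg_id, fallback);
}
```

Note that the fallback string is repeated at every call site, while the number carries the identity; with gettext() the string itself is the identity.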

 |This problem may however be circumvented by avoiding english texts
 | that may have more than one meaning, by either using better wording
 | or by adding more text to differentiate and by translating the long
 | english text into the short and ambiguous english text.

The usual way is normally to put a specially crafted comment on the line before the string is defined, something like

  colour.c: /* I18N: ..of colour mapping */
  colour.c- n_err(_("`colour': %s: invalid precondition: %s\n"),

or

  send.c: /* I18N: Filename input prompt with file type indication */
  send.c- str_concat_csvl(&prompt, _("Enter filename for part "),

Most others use "L10N:" instead, however, for "localization".

And iconv(3) on the fly so that a German LATIN1 message catalog can be used in a German UTF-8 locale thus seems to be common. I think conversion errors should be ignored in, hopefully uncommon, translator-error cases.
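
A minimal sketch of such an on-the-fly conversion using iconv(3). The function name demo_latin1_to_utf8 is hypothetical, and the codeset names "ISO-8859-1" and "UTF-8" are assumed to be accepted by the platform's iconv_open(), since POSIX leaves the set of names implementation-defined. Unlike the lenient behaviour suggested above, this sketch simply reports any conversion error to the caller:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL terminated ISO-8859-1 message into UTF-8 in out[outsz].
 * Returns 0 on success, -1 on any failure (unknown codeset, invalid
 * input, or output buffer too small). */
int demo_latin1_to_utf8(const char *in, char *out, size_t outsz)
{
    iconv_t cd;
    char *src = (char *)in; /* iconv() takes char ** even for input */
    char *dst = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsz == 0)
        return -1;
    outleft = outsz - 1; /* reserve room for the terminating NUL */

    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return 0;
}
```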
(0003582)
shware_systems (reporter)
2017-03-02 06:39

Re: 3577
For something on top of c89 to support locales that only need SBCS charmaps that use the subset of control codes C limits applications to gettext is a decent design. I agree, for that limited scope, it can be considered feature complete. When c99 came out the design was not updated to handle the potential use of L"" string form for wchar constants, as in no interfaces returning or using wchar * types, or systems still using arrays of short to represent UCS-2 strings. It may do this for other computer languages, but it's not in the C interfaces. Linguistically, there is no support of mixing R2L or T2B scripts in a L2R string, which has been part of Unicode and ISO-6429 for years now, as well, as just 2 examples of "non-thorough". The gettext manual's advice, in Section 4.3, on using the BiDi control codes is "Don't do that!", because a representation of them a translator could understand easily isn't defined. It is left up to application developers to split or post-process strings needing features like this so the applications accommodate gettext, not gettext accommodating what the standards allow.

The most glaring extension, also visible in the manual, is requiring -- switches without corresponding short forms on the utility command lines for various options, and so the code base is dependent on GNU's getopts, not POSIX's. Admittedly, I haven't looked at the autoconf sources to see how many other package dependencies there are, but that alone is disqualifying. Even if all the options had short forms, limiting POSIX versions to only using those would break any script using the long forms.

Re: 3578
I have no desire to do a Waldorf Astoria, or even a Motel 6, on something like this. I'd prefer packages like gettext kept getting updated, and did not use extensions that went against the utility guidelines. Supporting Unicode means little niceties like BiDi support can no longer be considered luxuries, however, as any Unicode data stream, including u8", u", and U" strings in C11 source files, may request features like this. Ensuring there are no potential conflicts with implementation defined aspects catgets and gettext allow means something entirely new is the least pita means of going about it that I see, that's all, so string translation source sets have the symbolic tagging that preserves that content. If enough isn't included, any country needing missing features to accurately represent their historical scripting or encoding conventions will be justified, imho, in voting against ratifying Issue 8. If enough is, an elegant enough design provides a framework that various to do lists stand a chance at getting completed. This I would like to see happen. I suspect the people that wrote the lists to begin with would too.
(0003583)
joerg (reporter)
2017-03-02 09:09

Re Note: 0003581
SCO used catgets() in their Svr4 product and managed the numbers for catgets()
manually in include files. I know this because I wrote the ISO-9660:1999 and
Joliet drivers for SCO and needed to enhance the mount program.

I am not sure whether a user of catgets wrote an automated mapping program.

The problem with the ambiguous english language cannot be circumvented
by comments in the source code as long as there are two or more identical
strings intended to mean something different.

Once you manage to have different strings for different meanings, you
could add such special comments and GNU xgettext has an option to tell
it to transfer such a comment into the .po file.

If you provide an English -> English .po file for your program, you may
handle the problem with non-7-Bit-ASCII texts in case you use the
encoding tags in the .po file.

Here is a sample .po file with the typical mime based hints that
are encoded into the .mo file:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, afuzzy
msgid ""
msgstr "Project-Id-Version: BLA 9.99\n"
"Report-Msgid-Bugs-To: me@\n"
"POT-Creation-Date: 2010-12-19 22:23+0100\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=iso-8859-1\n"
"Content-Transfer-Encoding: 8bit\n"

#: cdrecord.c:319
#, c-format
msgid "This is an evaluation version of cdrecord,\n"
msgstr "Dies ist eine Beurteilungsversion von cdrecord\n"
...

The fact that MIME is used here makes it possible to add more enhancements
in the future.
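
As a sketch of how a reader of such a header could find the encoding before calling iconv(): the helper below (demo_header_charset, a hypothetical name) extracts the value after "charset=" from the header text. In GNU gettext the header text is what a lookup of the empty msgid returns; that detail is an assumption about the implementation, not something POSIX specifies.

```c
#include <stddef.h>
#include <string.h>

/* Copy the charset named in a catalogue header into out[outsz].
 * Returns 0 on success, -1 if no charset is present or it does not fit. */
int demo_header_charset(const char *header, char *out, size_t outsz)
{
    static const char key[] = "charset=";
    const char *p = strstr(header, key);
    size_t len;

    if (p == NULL)
        return -1;
    p += sizeof key - 1;            /* skip past "charset=" */
    len = strcspn(p, " \t\r\n;");   /* value ends at whitespace or ';' */
    if (len == 0 || len >= outsz)
        return -1;

    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```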
(0003584)
keld (reporter)
2017-03-02 09:40

I agree that gettext and friends should be added. It is a much more used set of APIs than catgets(), indeed most of the GUIs in Linux, like KDE and Gnome are built with gettext(), and also POSIX shell utilities.


I would then like to see the APIs specified in a thread-safe version.

Is there text which we can use as a base for normative wording in POSIX?
(0003585)
keld (reporter)
2017-03-02 09:56

In gettext, there are easy ways to differentiate between different meanings of the same text string, and there are practices established for that. An easy way is to mark the msgid with a tag and explain this tag in an accompanying comment, and then translate the string into the target language (which may be English, even if the msgid is in English too).
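
A sketch of one such tagging practice, under stated assumptions: the "tag|text" convention and the names demo_lookup and demo_tagged are illustrative only, and demo_lookup is a stub standing in for gettext(). The tag keeps the two msgids distinct in the catalogue; if no translation exists, the helper strips the tag so the user never sees it:

```c
#include <string.h>

/* Hypothetical lookup standing in for gettext(): it knows one German
 * entry and returns every other msgid unchanged. */
const char *demo_lookup(const char *msgid)
{
    if (strcmp(msgid, "adjective|Open") == 0)
        return "Offen";
    return msgid;
}

/* Look up a tagged msgid. When the lookup returns the key itself
 * (untranslated), drop everything up to and including the first '|'. */
const char *demo_tagged(const char *msgid)
{
    const char *text = demo_lookup(msgid);

    if (text == msgid) {
        const char *bar = strchr(msgid, '|');
        if (bar != NULL)
            return bar + 1;
    }
    return text;
}
```

With an English to English catalogue, as mentioned in the thread, the stripping step is unnecessary because "verb|Open" is simply translated to "Open".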

There is Unicode support in gettext, indeed most corpora are in Unicode. It is mostly UTF-8. I believe you can also use UTF-16 and others, you just declare this in the header of the .po-file.

I am puzzled to hear that gettext should not support r2l languages sufficiently. Myriads of programs are translated in adequate ways for Arabic and Hebrew languages.
(0003586)
steffen (reporter)
2017-03-02 14:41

Re 0003582. Oh my god; i am talking Daihatsu and you roll over it with a F150. You pass in a byte buffer, and you get back a byte buffer. That is that interface, is it? I seem to know that some vendors have chosen to use UTF-16, or even ebcdic.utf-16. This should be ok for the returned data (?) as long as the key is a NUL terminated byte buffer. Note that for POSIX wide I/O streams and iconv() anything is pretty transparent.

All this can't be helped, you need a different sort of interface to handle all languages of the world. You need to look at entire strings rather than a wchar_t at a time, for example, since many languages have very complicated rules for case conversion, let alone collation.

I don't see BiDi in gettext()? And i don't understand what this has to do with gettext().

I know there is the problem with identical strings. You have to find a solution for that, but then again with the xgettext(1) tool you will get your .pot template file that contains all strings which are marked for translation, and if you used those special comments (--add-comments=I18N, for example), then these will be attached to the string to translate.

You then have the option to decide what to do, and can interact with translators, which will also see the comments; e.g., if a French translator would give you feedback that the two use cases of "No?" cannot be translated to French without introducing ambiguities, it is very easy to apply changes, and xgettext will merge them in.
This is a massive improvement to the catalogues.

I also don't think that gettext() is very scalable, but then again it is the de-facto tool and already was seventeen years ago, at least. E.g., singular and plural will not work out in a single string, as in "%s part%s" for "one part" and "two parts". Note this has nothing to do with gettext(), it is an incarnation of a restricted western world brain to think it can be done like that. I guess it is ok for an airplane message, since pilots all talk English. It won't work out in a global multi-language software like this. This is not the fault of gettext().

Note that there are languages which use special names for some numbers, and simply printing 1000 instead of giving the symbolic name is _possibly_ understood as being very impolite. For this to work out, you for one have to have the knowledge of the culture, and then you need a completely different way of programming based upon this understanding.
It seems to me this will as such never go into POSIX, let alone ISO C.
(0003587)
keld (reporter)
2017-03-02 17:01
edited on: 2017-03-05 00:17

The different texts for singular and plural (and other cases) are dealt with by gettext: you declare which scheme is to be used in the header of the po file and then you just list the cases for the translated texts like msgstr[0] and msgstr[1].
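
A sketch of the C side of this, assuming a libintl that provides the GNU ngettext() extension (glibc does); demo_format_parts is a hypothetical helper. The comment shows what the matching catalogue entries could look like for German:

```c
#include <libintl.h>
#include <stddef.h>
#include <stdio.h>

/* Matching .po fragment (illustrative):
 *
 *   msgid ""
 *   msgstr "Plural-Forms: nplurals=2; plural=(n != 1);\n"
 *
 *   msgid "%lu part"
 *   msgid_plural "%lu parts"
 *   msgstr[0] "%lu Teil"
 *   msgstr[1] "%lu Teile"
 */

/* Format "N part(s)" into buf. ngettext() selects the form for n; with
 * no catalogue loaded it returns the first string for n == 1 and the
 * second string otherwise. */
int demo_format_parts(char *buf, size_t bufsz, unsigned long n)
{
    const char *fmt = ngettext("%lu part", "%lu parts", n);

    return snprintf(buf, bufsz, fmt, n);
}
```

The application passes whole phrases for each form instead of gluing a suffix onto a word, so a language with three or more plural forms only needs a different Plural-Forms line and more msgstr[N] entries.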

(0003588)
shware_systems (reporter)
2017-03-03 04:42

Re: 3585
I see it supports using UTF8 text when the locale specifies UTF8 as the charset, for C programs. I do not see it supports u8"", or u"" strings, where it's supposed to be UTF8 or wide chars, whatever the locale. Without a separate way to identify these in the po and mo files, gettext() will treat them as being of the .mo file's locale on read. I don't see any handling of \u or \U escapes in regular strings either, as a separate issue, for conversion to UTF8 or the locale's charset. The examples assume any UTF or UCS values will be specified as octal escapes instead. For wide strings in an .po file, if one is generated manually, reading it as chars using the equivalent of getline() or fscanf("%s") will terminate on the first code with a high byte of 0, or is a multiple of 256. The gettext utilities will not put wide chars in a .mo file anyways (see 10.3), even though the format uses counted strings.
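
The termination problem with wide strings read as chars can be seen in a few lines. This is only an illustration with hypothetical names (demo_utf16le_hi, demo_visible_bytes), assuming a UTF-16LE encoding of the text "Hi":

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 16-bit terminator. Every
 * ASCII character carries a zero high byte. */
const char demo_utf16le_hi[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* Number of bytes a char-based interface sees before it stops at the
 * first zero byte. */
size_t demo_visible_bytes(const char *bytes)
{
    return strlen(bytes);
}
```

Here demo_visible_bytes(demo_utf16le_hi) yields 1: the string appears to end after 'H', although four bytes of text are present.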

For atomic R2L strings the .po format is fine, where all the chars have R2L or neutral directionality, or T2B or neutral for Asian languages. It's when you have strings that need the BiDi codes to override a charmap's assumed directionality that it has problems, say in a text in one language describing another language. Ecma's TR-053 covers the issues involved from the ISO-6429 perspective; Unicode discusses it related to the Bidi_Class property (see UAX9-35, #3.1.2, for examples) for each code point. EBCDIC has the hooks to support BiDi also, for that matter, but how much it does is still undocumented publicly.
(0003596)
joerg (reporter)
2017-03-03 16:44

Could we please change the way to discuss the topic?

Please do not claim "gettext() is bad - I know that", but rather
explain a specific real issue. It is then possible to reply to
such a description and either explain why this is no problem,
explain how to solve the problem or mark it as an issue that
really has not yet been solved.

Thank you!
(0003603)
EdSchouten (reporter)
2017-03-03 21:21

What I think is pretty bad about gettext(): modern code tends to be written in such a way that it doesn't keep track of global state. This makes APIs by definition thread-safe, easier to reuse, potentially easier to test, etc. gettext() doesn't offer anything in this regard. For example, gettext() doesn't allow you to write a networked service that returns different responses, based on a per-user language preference, as the interface depends on the process-wide locale.

Lots of effort has been spent within this working group to add APIs that are thread-safe. The POSIX 2008 locale_t work is a good example of that. Though I have no objection against standardising tools like msgfmt, etc. I think that adding the C APIs (in their current shape) doesn't make any sense at all.
(0003608)
keld (reporter)
2017-03-05 00:12
edited on: 2017-03-05 00:14

Re: 3588
Case conversion and collation is handled by the locale, which also chooses which message catalogue to use. Quite elaborate schemes for case conversions and also collation can be specified with current glibc (ISO 30112) functionality, and that is indeed in widespread use - building on ISO 14651 for the collation.

Where is the need for u8"" or u"" support? Gettext handles localization of a program, where the strings to be translated are just given to the API as native strings, which are character encoding independent, and you can port the program to another environment with another internal encoding.

I believe \U and \u escapes are available as normal native string processing.

For UTF-16 or other 16-bit handling, I believe the normal libc APIs treat 16 bits as one byte, but I am not sure. I believe IOS is using UTF-16 so there must be a lot of experience of porting normal standard C/C++/POSIX utilities to that environment. Does anybody know how this is done?

(0003609)
joerg (reporter)
2017-03-06 10:28
edited on: 2017-03-06 10:28

Re Note: 0003603

Shouldn't gettext() return text for the thread's current locale that
has been set up by uselocale()?

I know that Solaris (even Solaris 11) does not yet implement the new
locale_t, but other OS may do...
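
A sketch of what that would look like from the caller's side, using the POSIX 2008 newlocale(), uselocale() and freelocale() interfaces. The helper name demo_gettext_in is hypothetical, and whether gettext() actually consults the thread's locale is implementation-dependent, which is exactly the open question here:

```c
#define _POSIX_C_SOURCE 200809L
#include <libintl.h>
#include <locale.h>

/* Translate msgid while the calling thread temporarily uses the named
 * locale for LC_MESSAGES. Returns msgid unchanged if the locale cannot
 * be created. Assumes gettext() honours the locale set by uselocale(). */
const char *demo_gettext_in(const char *locale_name, const char *msgid)
{
    locale_t loc;
    locale_t old;
    const char *text;

    loc = newlocale(LC_MESSAGES_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return msgid;

    old = uselocale(loc);   /* affects only the calling thread */
    text = gettext(msgid);
    uselocale(old);         /* restore the previous thread locale */
    freelocale(loc);
    return text;
}
```

A networked service could call such a helper per request with the user's preferred locale, without touching the process-wide setlocale() state.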

(0003610)
joerg (reporter)
2017-03-06 10:33
edited on: 2017-03-06 10:33

Re Note: 0003608

I believe that gettext(3), xgettext(1) and friends support all encodings
that are based on 8-bit character streams.

I am not sure whether I ever saw a real-life program that included
wide character string constants that are subject to translation. If so,
we may need to add a wgettext() entry ;-)

The translation files (*.po and *.mo) could still be in 8-bit format.

(0003659)
shware_systems (reporter)
2017-04-05 05:49

Re: 3596
It was asked, so I put a few reasons I thought were obvious. Ed S. brought up another of them. I'm not trying to solve those issues here, simply pointing them out. It doesn't matter whether an individual or application does or doesn't consider the concerns raised here are issues because the package works well enough for their purposes, enough variants of the package or gencat already exist because people have considered them issues and done something about it. Some variants are wrappers of gettext, some use other logical models; the Mono runtime does both. Just as we look at how different shells do things, there's reason to expect those variants will be considered in due time also so the standard can be said to at least acknowledge existing practice, even if it doesn't get included.

Most shell maintainers have already made formal statements concerning a willingness to adapt to how we modify the shell's requirements, so discussion here is relevant; gettext and those variants haven't. Their maintainers, not us, have to decide whether they want to produce and propose an Issue 8 conforming package that includes their extensions and maybe addresses those concerns. The GNU people at FSF are aware of what's involved for packages being conforming interfaces and utilities, i.e. the general implications of POSIXLY_CORRECT, and further discussion should be on their lists, not here. The evidence is more they don't want to with gettext or it could have been part of the standard already for Issue 7. Similar discussion on the Solaris lists for what Sun did can also be started, if they want that to be considered part of the standard.

- Issue History
Date Modified Username Field Change
2017-02-28 16:51 joerg New Issue
2017-02-28 16:51 joerg Name => Jörg Schilling
2017-02-28 16:51 joerg Section => 3 + 4
2017-02-28 16:51 joerg Page Number => 1102...and others
2017-02-28 16:51 joerg Line Number => somewhere in section 3 and 4
2017-03-01 16:05 steffen Note Added: 0003575
2017-03-01 16:54 shware_systems Note Added: 0003576
2017-03-01 17:10 joerg Note Added: 0003577
2017-03-01 17:10 joerg Note Edited: 0003577
2017-03-01 17:11 steffen Note Added: 0003578
2017-03-01 17:13 steffen Note Added: 0003579
2017-03-01 17:23 joerg Note Added: 0003580
2017-03-01 17:27 joerg Note Edited: 0003580
2017-03-01 18:09 joerg Note Edited: 0003580
2017-03-01 22:37 steffen Note Added: 0003581
2017-03-02 06:39 shware_systems Note Added: 0003582
2017-03-02 09:09 joerg Note Added: 0003583
2017-03-02 09:40 keld Note Added: 0003584
2017-03-02 09:56 keld Note Added: 0003585
2017-03-02 14:41 steffen Note Added: 0003586
2017-03-02 17:01 keld Note Added: 0003587
2017-03-03 04:42 shware_systems Note Added: 0003588
2017-03-03 16:44 joerg Note Added: 0003596
2017-03-03 21:21 EdSchouten Note Added: 0003603
2017-03-05 00:12 keld Note Added: 0003608
2017-03-05 00:13 keld Note Edited: 0003608
2017-03-05 00:14 keld Note Edited: 0003608
2017-03-05 00:17 keld Note Edited: 0003587
2017-03-06 10:28 joerg Note Added: 0003609
2017-03-06 10:28 joerg Note Edited: 0003609
2017-03-06 10:33 joerg Note Added: 0003610
2017-03-06 10:33 joerg Note Edited: 0003610
2017-04-05 05:49 shware_systems Note Added: 0003659