Austin Group Defect Tracker

ID Category Severity Type Date Submitted Last Update
0000983 [1003.1(2013)/Issue7+TC1] Shell and Utilities Editorial Enhancement Request 2015-08-31 21:41 2019-10-21 09:25
Reporter stephane View Status public  
Assigned To
Priority normal Resolution Accepted As Marked  
Status Applied  
Name Stephane Chazelas
Organization
User Reference
Section (section number or name, can be interface name)
Page Number 2467
Line Number 79046-79048
Interp Status Approved
Final Accepted Text See Note: 0003371
Summary 0000983: awk's rand()/srand() spec not useful
Description For awk's rand()/srand() builtin functions, the spec currently
says:

> rand()
> Return a random number n, such that 0<=n<1.
> srand([expr])
> Set the seed value for rand to expr or use the time
> of day if expr is omitted. The previous seed value
> shall be returned.

It mentions a "seed" but doesn't actually mention how that
"seed" is being used. One has to suppose it's a seed in a
pseudo-random number generator algorithm. I'm not sure we can derive
that srand(4); print rand() always returns the same value (even
by two invocations of the same awk utility on the same system).
In practice it seems it does with all implementations I've
tested.

It doesn't say what type the seed should be (integer, any
floating point number, string...). Is srand("test") meant to be
the same as srand(0)? (it's not in mawk).

The reference to "time of day" is very confusing. Is that meant
to say that if one calls srand() at the same time every day, the
PRNG would be seeded in the same way?

I suppose it comes from the fact that historically, awk
implementations have used the return value of time() to seed the
PRNG but that specification is not telling us that. It's not
telling us either that two calls of srand() done in the same
second/minute... would seed the PRNG in the same way (as is the case in
many awk implementations) as it doesn't tell us how the "time of
day" is being used.

I suppose, the intention behind using time() in early awk
implementations was to give a "random" number, that is a number
that is not always the same. Like a poor man's, very limited
source of external entropy. I'm not sure it would make sense to
restrict awk implementations to using that.

Also, the spec doesn't say what should happen if srand() is
never called (should rand() always return the same thing), or
what the first srand() may return, or what srand() may return
after a srand() without argument (a number, a string, which
ones?).
Desired Action Here is the gist of what I think should be mentioned in the specification, as a request for comment:

 - In srand(seed), the seed should be numerical and used to seed a PRNG
 - After a call of srand(seed) with a given value of seed, the sequence of numbers returned by rand() shall always be the same with a given awk implementation on one system but need not be the same between implementations. That is, awk '{srand(3); print rand(), rand()}' shall always give the same output for each record.
 - If the seed passed to srand() is not an integer, it's unspecified whether srand(seed) should have the same effect as srand(int(seed)). Applications are encouraged to use integers for seeds (in practice, for many awk implementations srand(1.1) is the same as srand(1). In particular, srand(rand()) is the same as srand(0)).
 - If the seed is a string, even a numerical string, the behaviour is unspecified (srand("123") is not the same as srand(123) in mawk, for instance; awk 'BEGIN{srand("test"); print srand()}' doesn't print "test" in many awk implementations).
 - The return value of the first srand() is undefined but is guaranteed to be numerical (though not necessarily an integer; it's not in OpenBSD, for instance). One cannot expect it to be the same between two invocations of awk.
 - If srand() is not called, it's unspecified whether a fixed seed is used or not. Applications wishing to have a random initial seed should either call srand() without argument (but see the limitation below) or call srand() with some externally obtained random number. On the other hand, applications wishing to have a constant series of rand() values should call srand() with a fixed seed.
 - If srand() is called without argument, the PRNG should be seeded with an external source of entropy. The exact value of the seed is undefined. Historical implementations have been using the epoch time in seconds at the time of invocation of srand(). As a result, two calls of srand() within the same second may seed the PRNG in the same manner with some implementations. Not sure how to convey that in the spec, but it's an important note/limitation.
 - The value returned by srand() after a call of srand() without arguments shall be numerical but otherwise undefined. However, using that value as an argument to srand() shall yield the same series of values returned by rand() as those returned after that previous call of srand() without argument.
 That is, awk 'BEGIN{srand(); r1=rand(); srand(srand()); r2=rand(); if (r1 == r2) print "yes"}' should output yes. It's not the case in OpenBSD, but the behaviour looks bogus over there.
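
The reproducibility requirement proposed above can be checked with a short script. This is only a sketch: the seed 3 and the variable names are arbitrary, and it says nothing about whether two different awk implementations produce the same numbers.

```shell
# Re-seed with the same integer and compare the first two rand() values.
# The seed 3 and the names a1, a2, b1, b2 are illustrative only.
same_seq=$(awk 'BEGIN {
    srand(3); a1 = rand(); a2 = rand()
    srand(3); b1 = rand(); b2 = rand()
    # The whole expression is parenthesized so that print sees one argument.
    print ((a1 == b1 && a2 == b2) ? "yes" : "no")
}')
echo "$same_seq"
```

An implementation that honours the proposal would be expected to report "yes" here for any integer seed in its supported range.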
Tags tc3-2008
Attached Files

- Relationships

-  Notes
(0003265)
joerg (reporter)
2016-06-16 16:07

People use this awk program as a portable way to get time_t from a shell script:

awk 'BEGIN { srand(); t = srand(); print t } '
(0003271)
Don Cragun (manager)
2016-06-23 16:27
edited on: 2016-09-01 16:26

OLD Interpretation response (superseded by Note: 0003371)
------------------------

The standard does not speak to this issue, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
The behavior in this area varies between various implementations of awk.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Replace the description of rand() in the awk EXTENDED DESCRIPTION on P2467 L79046 with:
Return a floating point pseudo-random number n, such that 0<=n<1.


Replace the description of srand() in the awk EXTENDED DESCRIPTION on P2467, L79047-79078 with:
Set the seed value for rand to expr or use the seconds since the Epoch if expr is omitted. The previous seed value shall be returned. The behavior is unspecified if expr is not an integer expression. The initial seed value is unspecified if rand is called without calling srand first. The srand function uses the argument as a seed for a new sequence of pseudo-random numbers to be returned by subsequent calls to rand. If srand is then called with the same seed value, the sequence of pseudo-random numbers shall be repeated.


(0003272)
stephane (reporter)
2016-06-23 19:59

For the record, that issue was discussed last year on the mailing list at
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11344/focus=11346

Again, I don't like the idea of forcing implementations to use a non-random seed (time()) upon srand(). That breaks OpenBSD's implementation at least. We do, however, need to warn applications that in some implementations, calls to srand() made within the same second may use the same seed.
(0003273)
shware_systems (reporter)
2016-06-23 22:15

??? Possibly add the following to Application Usage section as consecutive paragraphs, to address concerns discussed but left implicit in the normative changes:

For the srand() function, the "seconds since the epoch" value used may have integer resolution granularity. It is up to the application to provide a delay mechanism of one second duration between srand() calls to ensure a different start value is used for subsequent rand() call series. A system("sleep 1"); srand(); sequence should work for most applications.

A platform may provide a non-clock based source of entropy for the initial seed to rand() if srand() is not called first. On these platforms it is suggested an implementation have the initial call to srand() return the value derived from accessing that source at program startup, as the "previous seed value", and this may be a negative or positive numeric value that should not be construed as a "seconds before/since the Epoch" quantity, but may be used to reinitialize the sequence. It is nominally implementation-defined whether a means to get additional values from the source is provided within awk as an extension, suitable for use as the expr argument to srand(), for different sequences. Lack of existing implementations prevents anything more specific being considered normative; it is known some platforms have such sources, but no awk implementation examined makes use of them.
(0003274)
ajosey (manager)
2016-06-27 09:37

Interpretation Proposed: 27 June 2016
(0003275)
stephane (reporter)
2016-06-27 11:27

Re: 0003273

From what I understand, though it's not clear, the text in 0003271 implies that the seed set by srand() is the epoch seconds as an integer, so that the value returned by the next call to srand() can itself be passed back to srand(). That is:

    awk 'BEGIN{srand(); a=rand(); srand(srand()); print a == rand()}'

should print "1". If srand() is allowed to seed with a non-integer and if srand(non-integer) is unspecified, that means one cannot use that code above. Now, maybe that's fine. After all, when you call srand() without a seed, you mean to have random values, you don't care for the reproducibility.

Also, at the moment, one has to use srand() or srand(externallyprovidedentropysource) to get random numbers. Many implementations, including the traditional ones, always return the same sequence when srand() is not called.

So allowing (and not requiring) implementations to do the reasonable thing when srand() is not called is not going to help with writing portable scripts.

I suppose early awk implementations used time() as the best portable source of entropy they had at the time. IMO, 30 years later, we shouldn't have to be constrained by those limitations from the 80s.

I think here we should allow implementations to use time() as that's what historical implementations have done. But also allow implementations to use a more modern source of entropy when srand() is called without a seed (and also when srand() is not called). In order to allow that, in addition to making the returned value of srand() after a call of srand() without seed unspecified, we would need to also make it unspecified whether that value can be used as an argument to srand() again to reproduce the same sequence.

Or in other words, if you use srand() without seed, the sequence may not be reproducible (it could be truly random, a different algorithm may be used from the one used when a seed is provided). If you want a reproducible sequence, call srand() with a number. If you want that sequence to be random and reproducible, you can use srand(); seed = int(rand() * 2^32); srand(seed).
(0003279)
Don Cragun (manager)
2016-07-01 00:53
edited on: 2016-08-25 15:16

In response to Note: 0003275: We discussed this again during the 2016-06-30 Austin Group Conference Call.
Just because one implementation of awk seeds the random number generator with a random value, does not mean that that is what applications expect. On all other implementations of awk, the command:

    awk 'BEGIN{print srand(srand())}'

provides the only way to portably get the current number of seconds since the Epoch in a shell script (using any version of awk except yours), and many shell scripts depend on this behavior. We believe that it was a bug in the standard to have not required this behavior.

Furthermore, the command:

    awk 'BEGIN{srand(seed_value); a=rand(); srand(srand()); print (a == rand())}'

with the added set of parentheses (since the precedence between the print command and the == comparison operator is not defined by the standard) is still required to return 1 no matter what seed_value is specified by the user in the first call to srand(). The standard allows srand() to ignore non-integer parts of the seed_value when setting the random number seed; but it does not loosen the requirement that the next call to srand(withORwithoutARG) return the value used to set the seed on the previous call to srand() (if there was a previous call) or the default seed used (if srand() had not been previously called). If the implementation doesn't truncate an input value to an integer, it can't truncate the value a subsequent call to srand() returns either.

(0003281)
stephane (reporter)
2016-07-01 15:29
edited on: 2016-09-05 10:53

Re: Note: 0003279

Let's be serious. srand() is the function to seed the pseudo-random number generator, not the one to retrieve the epoch time. And srand() without arguments is meant to seed it in a random fashion, so that two invocations of awk with srand() would give different numbers.

The fact that it's the only POSIX way to get the epoch time is POSIX's failure to specify "date +%s" or $EPOCHSECONDS. We can't reasonably specify that the way to get the epoch time is awk 'BEGIN{printf "%d\n", srand(srand())}'.

What about:

echo "$((0$(touch /tmp/
  pax -d -x ustar -w /tmp/ |
    dd bs=1 skip=136 count=11 2> /dev/null))
)"

While we're at convoluted ones.

BTW, I believe awk 'BEGIN{srand();print srand()}' is my invention (from 2004): https://groups.google.com/forum/#!msg/comp.unix.shell/e6M0JtaKLaw/_3hOn1Dw7GgJ
or at least I contributed to popularise it (I can't find earlier references in comp.unix.shell or comp.lang.awk, but it's possible others would have found it before).

A few weeks later, it was discussed for inclusion in the comp.unix.shell FAQ. In that thread, you'll find (https://groups.google.com/forum/message/raw?msg=comp.unix.shell/eFEv05hed-4/u9ycHtmkWwsJ) Geoff mentioning that it should be written awk 'BEGIN{srand();printf "%d\n", srand()}' as some implementations may produce things like 1.08307e+09 instead.

And that's what the comp.unix.shell FAQ has now: http://cfajohnson.com/shell/cus-faq.html#Q6

(Note that OpenBSD is not /my/ implementation. I'm not even a (regular) user of OpenBSD. I'm not an awk developer, only a user. It's just a comment on how to improve the awk command. The fact that srand() called within the same second gives the same sequence is a bug IMO.)

(0003284)
stephane (reporter)
2016-07-01 20:46
edited on: 2016-07-01 20:49

There's also the question of the range of the seed accepted by srand().

If we're going to settle with forcing srand() to use time() (and even if we only allow it), then that means one can't use srand() without a seed if it may be called several times within a second.

Things like:

rand() {
  awk -v range="$1" 'BEGIN{srand();print int(rand()*range)}'
}

for file in *.foo; do
  mv -- "$file" "dir$(rand 5)"
done

would (except on OpenBSD) likely put all the files in the same directory, as all those "awk"s (and thus all the srand() calls) would likely be run within the same second.

So one would have to seed by hand (or run only one awk that feeds a sequence of numbers via a pipe for instance). And you'd want to maximize the initial entropy you give to awk's pseudo-random generator.

My: seed = int(rand() * 2^32); from Note: 0003275 is actually wrong with mawk at least.

In mawk, srand(anything-greater-than-0x7fffffff) gives the same rand() numbers as srand(0x7fffffff). So with seed = int(rand() * 2^32), half of the time on average, I'd get the same sequence.

With Solaris 11 nawk or bwk's awk or busybox awk, srand(anything-greater-than-0x100000000) gives the same sequence as with srand(0).

With gawk, srand(x) gives the same result as srand(x & 0xffffffff).

So it would seem that, portably, we can only give at most 31 bits of entropy to srand() (and mawk has a year 2038 bug). My shell rand() function from above would have to be something like:

seed=
rand() {
  eval "$(
    awk -v range="$1" '
      BEGIN {
        srand('"$seed"')
        print "seed=" int(rand() * 2^31)
        print "echo " int(rand() * range)
      }')"
}
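
The other option mentioned above, running a single awk that feeds a sequence of numbers through a pipe, could look roughly like the following. This is a sketch only: the function name rand_ints and its two arguments are hypothetical, and the seed still comes from whatever srand() without an argument uses on the implementation at hand.

```shell
# Hypothetical helper: print COUNT integers in [0, RANGE) from ONE awk process,
# so the generator is seeded once instead of once per number.
rand_ints() {
  awk -v count="$1" -v range="$2" 'BEGIN {
    srand()                        # seeded once (historically from the clock)
    for (i = 0; i < count; i++)
      print int(rand() * range)    # rand() < 1, so the result is below range
  }'
}

# Usage sketch (left commented out because it moves files):
# set -- *.foo
# rand_ints "$#" 5 | for file do IFS= read -r n && mv -- "$file" "dir$n"; done
```

Because all the numbers come from one seeded sequence, files processed within the same second no longer all receive the same value, although two separate runs started in the same second may still repeat each other.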

(0003362)
rhansen (manager)
2016-08-25 17:33
edited on: 2016-08-25 17:36

This was discussed during the 2016-08-25 teleconference. We ran out of time before finishing the conversation, so we'll pick it up again next week. For our notes, see the 2016-08-25 etherpad page.

Regarding the default value srand() uses if one isn't provided: We cannot change the default value for Issue 7 because that would be a requirements change that would break existing applications. Changing it for Issue 8 might be appropriate, but that would have to be addressed in a separate bug report because this bug report is tagged tc3-2008. (We are considering adding a "future directions" note for this bug report to let implementation and application authors know that we might be changing the requirements in this area.)

Regarding the range issue: For this bug report we can say that a value between 0 and 2^31-1 (inclusive) shall be supported and any other value has unspecified behavior. For Issue 8 we could expand the range or say something more specific, but that would have to be addressed in a separate bug report.

(0003371)
rhansen (manager)
2016-09-01 16:20
edited on: 2016-09-01 16:27

Interpretation response
------------------------
The standard does not speak to this issue, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
The behavior in this area varies between various implementations of awk.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Replace the description of rand() in the awk EXTENDED DESCRIPTION on P2467 L79046 with:
Return a floating point pseudo-random number n, such that 0<=n<1.

Replace the description of srand() in the awk EXTENDED DESCRIPTION on P2467, L79047-79078 with:
Set the seed value for rand to expr or use the seconds since the Epoch if expr is omitted. The previous seed value shall be returned. The behavior is unspecified if expr is not an integer expression or if the value of expr is not within the range 0 through 2^31-1 (2147483647), inclusive. The initial seed value is unspecified if rand is called without calling srand first. The srand function uses the argument as a seed for a new sequence of pseudo-random numbers to be returned by subsequent calls to rand. If srand is then called with the same seed value, the sequence of pseudo-random numbers shall be repeated.

On page 2486 line 79914, replace the awk FUTURE DIRECTIONS with:
A future version of this standard may require srand to accept any numeric value and calculate the seed by taking the provided value, converting it to an integer, and calculating the integer value modulo 2^n where n is an implementation-defined value greater than or equal to 32.

A future version of this standard may require the initial seed for the rand function (the seed value used if srand is not called) to be an integer between 0 and 2^n-1 inclusive where n is an implementation-defined value greater than or equal to 32. Additionally, the initial seed value may be required to be a (pseudo-)random value such that two invocations of awk are unlikely to emit the same sequence of random values (unless the seed is explicitly set to the same value via srand).

A future version of this standard may define a new posix_srand function that enables application authors to set the seed to a (pseudo-)random value generated by the system. Alternatively, the specification of the srand function may be altered to provide some means to set the default seed value to a (pseudo-)random value.
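
For applications, the practical consequence of the range restriction is that a seed obtained elsewhere should be brought into 0 through 2^31-1 before it is passed to srand(). A minimal sketch follows, assuming a hypothetical helper named portable_seed; reducing modulo 2^31 is a choice made for this example (loosely modelled on the modulo wording in the FUTURE DIRECTIONS text), not something the interpretation requires.

```shell
# Hypothetical helper: fold an arbitrary number into the range 0..2^31-1,
# the only seed range whose behavior is specified by the text above.
portable_seed() {
  awk -v x="$1" 'BEGIN {
    m = 2147483648           # 2^31
    s = int(x) % m           # drop any fraction, then reduce modulo 2^31
    if (s < 0) s += m        # awk % keeps the sign of the dividend
    printf "%d\n", s
  }'
}

# Usage sketch: seed from an externally supplied number.
# awk -v seed="$(portable_seed 4294967301)" 'BEGIN { srand(seed); print rand() }'
```

Note that folding discards information: distinct inputs that differ by a multiple of 2^31 end up with the same seed.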


(0003372)
ajosey (manager)
2016-09-02 11:24

Interpretation proposed: 2 Sep 2016
(0003394)
ajosey (manager)
2016-10-03 08:48

Interpretation approved: 3 October 2016

- Issue History
Date Modified Username Field Change
2015-08-31 21:41 stephane New Issue
2015-08-31 21:41 stephane Name => Stephane Chazelas
2015-08-31 21:41 stephane Section => (section number or name, can be interface name)
2015-08-31 21:41 stephane Page Number => 2467
2015-08-31 21:41 stephane Line Number => 79046-79048
2016-06-16 16:07 joerg Note Added: 0003265
2016-06-23 16:27 Don Cragun Note Added: 0003271
2016-06-23 16:29 Don Cragun Note Edited: 0003271
2016-06-23 16:32 Don Cragun Note Edited: 0003271
2016-06-23 16:33 Don Cragun Interp Status => Pending
2016-06-23 16:33 Don Cragun Final Accepted Text => See Note: 0003271.
2016-06-23 16:33 Don Cragun Status New => Interpretation Required
2016-06-23 16:33 Don Cragun Resolution Open => Accepted As Marked
2016-06-23 16:40 Don Cragun Tag Attached: tc3-2008
2016-06-23 19:59 stephane Note Added: 0003272
2016-06-23 22:15 shware_systems Note Added: 0003273
2016-06-27 09:37 ajosey Interp Status Pending => Proposed
2016-06-27 09:37 ajosey Note Added: 0003274
2016-06-27 11:27 stephane Note Added: 0003275
2016-07-01 00:53 Don Cragun Note Added: 0003279
2016-07-01 00:55 Don Cragun Note Edited: 0003279
2016-07-01 00:56 Don Cragun Note Edited: 0003279
2016-07-01 15:29 stephane Note Added: 0003281
2016-07-01 15:33 stephane Note Edited: 0003281
2016-07-01 15:34 stephane Note Edited: 0003281
2016-07-01 15:37 stephane Note Edited: 0003281
2016-07-01 15:40 stephane Note Edited: 0003281
2016-07-01 20:46 stephane Note Added: 0003284
2016-07-01 20:47 stephane Note Edited: 0003284
2016-07-01 20:48 stephane Note Edited: 0003281
2016-07-01 20:49 stephane Note Edited: 0003284
2016-08-25 15:16 Don Cragun Note Edited: 0003279
2016-08-25 17:33 rhansen Note Added: 0003362
2016-08-25 17:36 rhansen Note Edited: 0003362
2016-08-25 17:36 rhansen Note Edited: 0003362
2016-09-01 16:20 rhansen Note Added: 0003371
2016-09-01 16:21 rhansen Note Edited: 0003371
2016-09-01 16:22 rhansen Interp Status Proposed => Pending
2016-09-01 16:22 rhansen Final Accepted Text See Note: 0003271. => See Note: 0003371
2016-09-01 16:26 geoffclare Note Edited: 0003271
2016-09-01 16:27 rhansen Note Edited: 0003371
2016-09-02 11:24 ajosey Interp Status Pending => Proposed
2016-09-02 11:24 ajosey Note Added: 0003372
2016-09-05 10:53 stephane Note Edited: 0003281
2016-10-03 08:48 ajosey Interp Status Proposed => Approved
2016-10-03 08:48 ajosey Note Added: 0003394
2019-10-21 09:25 geoffclare Status Interpretation Required => Applied

