From: Mok-kong Shen
Subject: Re: random number generation
Date: 1998/01/02
Message-ID: <34AD053B.4F14233C@stud.uni-muenchen.de>
References: <349AC365.45277E06@stud.uni-muenchen.de>
Newsgroups: comp.lang.ada
Organization: Studenten; LRZ; Universitaet Muenchen

Since I received comments and critiques from several readers of my article, I think it is appropriate to respond to them all in one follow-up rather than in separate pieces. Formally, it is a response to Herman Rubin.

Herman Rubin wrote on 23 Dec 1997:

>In article <349AC478.47988774@stud.uni-muenchen.de> you write:
>>Hello,
>>
>>If you are interested in random number sequences, I like to invite you
>>to look at a combined pseudo-random number generator designed by me.
>>The underlying algorithm, which to my knowledge has not appeared
>>previously in the literature, is in fact extremely simple:
>>
>>    j:=0; do output(G(j)); j:=trunc(n*G(j)) od;
>>
>>where G(j) (0<=j<=n-1) are n pseudo-random number generators (with values
>>in [0,1)) that can be arbitrarily given by the user, i.e. no particular
>>requirements of statistical qualities are imposed on them. These n
>>constituent generators can be of any kind; their parameters can be,
>>in particular, randomly chosen with the help of one additional pseudo-
>>random number generator.
>>The basic idea is the following: Each of the n constituent generators
>>produces an output stream that is, depending on its quality, more or
>>less random. If one randomly interleaves these streams, then the
>>result should be more random. In other words, the auto-correlations
>>that exist in the n individual output streams get lost through the
>>mixing process since elements that are at distance d (d>=1) apart
>>from one another in any particular one of the n output streams are no
>>longer at any fixed distance, say d', apart on the combined output.
>>Besides, half of the output of each generator is consumed internally
>>and does not appear as the output of the combined generator. From
>>the algorithm it can be seen that auto-correlations of elements at
>>distance d=1 apart of the n individual output streams are totally
>>lost on the combined output stream.
>
>It is not totally clear what you are doing. Something like this
>is discussed by Knuth, and the results are not that good.

Like the method of MacLaren and Marsaglia, my algorithm effects a shuffling of the outputs of PRNGs. MacLaren and Marsaglia shuffle the output of one single PRNG with the help of another PRNG, while I shuffle the outputs of n PRNGs with the help of these n PRNGs themselves. The outputs of the n PRNGs are interleaved (mixed up) in a (pseudo-)random way, because the order in which the PRNGs are activated is not fixed (e.g. round-robin) but is determined at runtime by the current outputs of the PRNGs. This can easily be seen from the algorithm. (If not, a paper-and-pencil exercise with, say, 3 PRNGs should make the interleaving clear.)
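For readers who would rather run a small program than do the paper-and-pencil exercise, the following sketch shows the interleaving in executable form. It is purely illustrative: the constituents are simple linear congruential generators with arbitrarily chosen parameters, not those of the implementation on my Web page, and any generators delivering real values in [0,1) would do equally well.

    # Minimal sketch of the combined generator (illustrative only).
    def make_lcg(seed, a, c, m):
        state = [seed]
        def gen():
            state[0] = (a * state[0] + c) % m
            return state[0] / m              # real value in [0,1)
        return gen

    # n constituent PRNGs G(0)..G(n-1) with arbitrarily chosen parameters
    G = [make_lcg(12345 + 7 * k, 69069, 1, 2**31) for k in range(10)]
    n = len(G)

    def combined(count):
        j = 0
        out = []
        for _ in range(count):
            out.append(G[j]())               # this value is output
            j = int(n * G[j]())              # this value is consumed internally
        return out

    print(combined(20))

Working through a few iterations by hand shows how the index j, and hence the order in which the constituents are activated, depends on the internally consumed values.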
Knuth may have found that some existing methods of combining PRNGs or of shuffling are poor (I haven't checked through his book), but I am quite sure that he didn't make categorical statements to the effect that such approaches were doomed to failure from the very beginning, for that would have been an unscientific way of arguing. It is well known that shuffling appears not to be amenable to rigorous formal mathematical analysis and hence has the disadvantage of having to be treated on a more or less heuristic basis. Thus J. Dagpunar writes the following in his Principles of Random Variate Generation, Clarendon, Oxford, 1988, p. 35:

    Given the inevitable dependencies that will exist in a 'random'
    sequence, it seems natural that one should try to shuffle the
    output of a sequence, in the hope that this will make it more
    random. Examples of such approaches are MacLaren and Marsaglia
    (1965) and Bays and Durham (1976).

without giving pointers to any text where a rigorous formal mathematical analysis of the merits of shuffling could be found. The consequence is that such methods need more extensive testing before they can be accepted for application, as compared to methods that are easily amenable to mathematical analysis. On the other hand, this in no way rules out that certain shuffling methods could nevertheless be very successful. This can be seen from the following, taken from the User's Manual of the IMSL STAT/Library, vol. 3, p. 947:

    The user can also select a shuffled version of these generators
    using RNOPT. The shuffled generators use a scheme due to Learmonth
    and Lewis (1973a). ....... This scheme is similar to that of Bays
    and Durham (1976), and their analysis would be applicable to this
    scheme as well.

Since several readers mentioned Knuth, it is interesting to see what Knuth writes in his book (1969, p. 30) about the method of MacLaren and Marsaglia, which is one of the best known shuffling methods (and hence not amenable to rigorous mathematical analysis to date):

    A much better method has been suggested by MacLaren and Marsaglia,
    and is eminently suited to computer programming. .... .......
    This method can be highly recommended.

Let me now expand my arguments about correlations. Consider the output of a particular one of the n PRNGs, say G(k) for a fixed k. Let its output be (t_i) (i=1,2,...). Depending on the parameters of G(k), this sequence has more or less strong auto-correlations. There is dependency between t_i and t_(i+1) for i=1,2,..., which makes for the auto-correlation at distance d=1. There is dependency between t_i and t_(i+2) for i=1,2,..., which makes for the auto-correlation at distance d=2. Analogously for auto-correlations at other values of d.

My algorithm is designed such that t_i and t_(i+1) never both appear as output of the combined generator: each time G(k) is selected, it delivers two consecutive values, of which the first is output and the second is consumed internally. Therefore the auto-correlation at distance d=1 originally present in the output sequence of G(k) is completely lost on the output stream of the combined generator.

Let us now investigate what happens to the auto-correlation at distance d=2 of the sequence output by G(k). Depending on the value of i, t_i and t_(i+2) either both appear as output of the combined generator or both do not appear as output of the combined generator.
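This structural property is easy to check mechanically. The following small check, building on the purely illustrative sketch given earlier, records for each value reaching the combined output which constituent produced it and at which position of that constituent's own stream; recorded positions belonging to the same constituent are never consecutive:

    # Check of the d=1 claim: no two consecutive values t_i, t_(i+1)
    # of any constituent G(k) both reach the combined output.
    def combined_traced(gens, count):
        n = len(gens)
        pos = [0] * n                    # values drawn so far from each G(k)
        trace = []
        j = 0
        for _ in range(count):
            gens[j](); pos[j] += 1       # value that is output
            trace.append((j, pos[j]))    # record (constituent, position)
            v = gens[j](); pos[j] += 1   # value consumed internally
            j = int(n * v)
        return trace

    last = {}
    for k, p in combined_traced(G, 10000):
        assert p != last.get(k, -1) + 1  # never two consecutive positions of G(k)
        last[k] = p
    print("no pair t_i, t_(i+1) of any constituent reaches the output")

Of course, the check only confirms what the algorithm guarantees by construction; it says nothing by itself about the correlations at d>=2, which are addressed next.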
Of those pairs t_i and t_(i+2) that do appear as output of the combined generator, the following happens: While t_i and t_(i+2) are at the constant distance d=2 apart within the original output sequence of G(k), they are in general at another distance d' apart on the output sequence of the combined generator, and this d' is, in consequence of the (pseudo-)random interleaving of the n output streams of the PRNGs, not a constant for all i; in other words, d' takes on a spectrum of different values. Consequently, the auto-correlation at d=2 originally present in the output of G(k) becomes 'distributed' over auto-correlations at a number of distances d' on the output stream of the combined generator. Analogous arguments apply to the auto-correlations at d=3,4,... that are present in the output sequence of G(k).

Lastly, one should consider possible dependency between any two elements in the output sequence of the combined generator that originate from different constituent PRNGs. This question turns out to be very simple: the n PRNGs are assumed to have been chosen independently of one another, hence there is no dependency between a pair of elements of this kind.

There are two aspects of my algorithm particularly worthy of notice. The first is that n, the number of constituent PRNGs, can be chosen arbitrarily large (practically without limit) without penalty in computing cost, which means that one can freely tune the quality of the output of the combined generator through the value of n. (As I found through experiments, the results of the serial test improve asymptotically with increasing n.) As far as my knowledge goes, this is not true for other existing shuffling methods employing more than one PRNG; as a consequence, they are limited, for the practical reason of economy, to very small values of n. The second aspect is that only one half of the total output of the n constituent generators appears as output of the combined generator, while the other half is consumed internally, thus rendering the inference of the parameters of the n constituent PRNGs (n itself has to be inferred) much more difficult. This is of essential significance for information security applications, though perhaps of no particular relevance to normal statistical computations.

It may be mentioned that my design originally started from a simpler one:

    j:=0; do t:=G(j); output(t); j:=trunc(n*t) od;

This has the advantage of a computing cost factor of 1 instead of 2 for my final design. However, I prefer the scheme posted in my article because of (1) the complete elimination of the original auto-correlation at d=1, which is normally the most significant of the spectrum of auto-correlations, and (2) the much enhanced difficulty of inferring the parameters of the constituent PRNGs as a result of the suppression of one half of their outputs.
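In terms of the illustrative sketch given earlier, this simpler design is only a small change (again a sketch, not my actual implementation):

    # The simpler original design: the value that is output is also the
    # value that selects the next constituent (cost factor 1 instead of 2).
    def combined_simple(count):
        j = 0
        out = []
        for _ in range(count):
            t = G[j]()               # output value ...
            out.append(t)
            j = int(n * t)           # ... is reused to select the next PRNG
        return out

Here every generated value appears in the output, so the d=1 pairs of each constituent are no longer suppressed; this is precisely the trade-off described under (1) and (2) above.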
Some readers have pointed to the following two issues of interest:

(a) Consider the PRNG G(0). If it continues to generate numbers close to 0.0 for a long time, then G(0) will always be selected during that period by the algorithm and the remaining n-1 PRNGs will not contribute to the output of the combined generator. This is certainly undesirable. An analogous argument applies to G(k) for any k. However, the pathological case for G(0) could only arise if its parameters are extremely bad, and this is rendered impossible through a bit of careful programming of the algorithm. See in this regard the implementation on my Web page, where the linear congruential relations have sufficiently large moduli. Those who still feel uncomfortable may consider the following variant of the algorithm:

    j:=0;
    do output(G(j)); j1:=trunc(n*G(j));
       if j=j1 then j:=(j+1) mod n else j:=j1 fi od;

(b) The implementation provided on my Web page uses only two input values, namely n and a seed for the standard PRNG that is used to generate the parameters of the n PRNGs. Consequently anyone attempting inference has only these two numbers to deal with. However, my implementation is meant simply to show how my algorithm may essentially be realized. If one chooses n=10000, one could let the parameters of the 10000 constituent PRNGs be input by the user, if one likes. A practically more useful set of input values could contain m seeds, with each seed being used for the construction of a subset of the n PRNGs. One could also use several standard PRNGs instead of the single one employed in my implementation. Finally, it is to be noted that the algorithm does not prescribe the type of the n constituent PRNGs. These can, in particular, be non-linear PRNGs, instead of being based on linear congruential relations as in my implementation.

>One should test for the randomness of bits, rather than numbers.
>Congruential generaters are know to be atrocious for tail bits.
>Also, it is easy to fix up congruential generators to give low
>serial correlations without really giving independence. Even
>the non-random quasi-random streams have 0 correlation, but
>essentially complete dependence (except for rounding).
>--
>This address is for information only. I do not claim that these views
>are those of the Statistics Department or of Purdue University.
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
>hrubin@stat.purdue.edu  Phone: (765)494-6054  FAX: (765)494-0558

I must admit that I possess little knowledge of the details of the 'non-randomness' of the tail bits of numbers produced by congruential generators. If I don't err, this concerns directly the integer-valued numbers that are computed. My algorithm requires that the constituent PRNGs output real-valued numbers in [0,1), so the integer-valued numbers obtained from the congruential relations have to be divided by the corresponding moduli in order to produce the real-valued numbers in [0,1). I have a strong feeling that any 'non-randomness' of the tail bits of the integer-valued numbers will be much reduced, if not rendered entirely negligible, once this division is performed and the output streams of the n constituent PRNGs are (pseudo-)randomly interleaved, in particular when n is chosen sufficiently large. In any case, I should appreciate being able to access any concrete materials on the 'non-randomness' of the tail bits of either the integer-valued numbers or the mantissa of the real-valued numbers originating from congruential relations.
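As one elementary illustration of the kind of regularity presumably meant (a well-known property, sketched here with the same purely illustrative parameters as before, not those of my implementation): with a power-of-two modulus and the usual odd multiplier and odd increment, the least significant bit of a mixed congruential generator simply alternates with period 2.

    # Lowest bit of a mixed LCG with modulus 2**31 (illustrative parameters)
    a, c, m = 69069, 1, 2**31
    x = 12345
    bits = []
    for _ in range(16):
        x = (a * x + c) % m
        bits.append(x & 1)
    print(bits)                          # 0 and 1 strictly alternating

Whether the division by the modulus and the subsequent interleaving dilute such regularities sufficiently is exactly the kind of question that only systematic testing can settle.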
To conclude, I think I have provided sufficient (informal) explanation of why my algorithm should produce output with low auto-correlations and why the inference of the parameters of the PRNGs is likely to be very difficult. On the other hand, I have as yet done very few tests. While the serial test is considered in the literature to be fairly reliable for detecting dependencies, it is only one of the huge battery of tests that are currently recommended.

My personal problem is that I do not have the hardware resources to carry out even a small portion of these tests, not to mention the graver issue that running the software packages involved and evaluating their results could demand expertise well beyond my humble knowledge. Thus I hope that my posted article may attract the attention of some experts who, being knowledgeable and possessing sufficient hardware resources, would be kind enough to critically examine my algorithm by subjecting it to a wide range of tests which I could not perform, thus refuting or confirming my subjective opinion of its merits. Lastly, I would like to mention that the essentials of the above arguments, plus some other points, may be found on my Web page, and that a tiny reward of US$300 is offered to the expert who is kind enough to help me.

M. K. Shen

______________________________________________________

------------------------------------------------------
Apology: On 29-30 Dec 1997 someone, having cracked my login password and using my e-mail address, sent a number of faked mails of schizophrenic (fortunately non-criminal) content. These probably all have at the bottom the name of a certain Molana, as could be deduced from some mails that were bounced back because the target addresses happened to be invalid. I sincerely regret the inconvenience that these faked mails have caused to their receivers.
------------------------------------------------------

M. K. Shen, Postfach 340238, D-80099 Muenchen, Germany
+49 (89) 831939 (6:00 GMT)
mok-kong.shen@stud.uni-muenchen.de
http://www.stud.uni-muenchen.de/~mok-kong.shen/
(containing 2 mathematical problems with rewards totalling US$500)

----------------------------------------------
Sunshine is the best disinfectant.
(citation of a citation in B. Schneier and D. Banisar,
The Electronic Privacy Papers. John Wiley, New York, 1997.)
----------------------------------------------
The words of a man's mouth are as deep waters, and the wellspring
of wisdom as a flowing brook.  (Proverbs 18:4)

A little that a righteous man hath is better than the riches of
many wicked.  (Psalms 37:16)