From: Mok-kong Shen
Subject: Re: random number generation
Date: 1998/01/02
Message-ID: <34AD053B.4F14233C@stud.uni-muenchen.de>
References: <349AC365.45277E06@stud.uni-muenchen.de>
Newsgroups: comp.lang.ada
Organization: Studenten; LRZ; Universitaet Muenchen

Since I received comments and critiques from several readers of my article, I think it is appropriate to respond to them all in one follow-up rather than in separate pieces. Formally, it is a response to Herman Rubin.

Herman Rubin wrote on 23 Dec 1997:

>In article <349AC478.47988774@stud.uni-muenchen.de> you write:
>>Hello,
>>
>>If you are interested in random number sequences, I like to invite you
>>to look at a combined pseudo-random number generator designed by me.
>>The underlying algorithm, which to my knowledge has not appeared
>>previously in the literature, is in fact extremely simple:
>>
>>    j:=0; do output(G(j)); j:=trunc(n*G(j)) od;
>>
>>where G(j) (0<=j<=n-1) are n pseudo-random number generators (with values
>>in [0,1)) that can be arbitrarily given by the user, i.e. no particular
>>requirements of statistical qualities are imposed on them. These n
>>constituent generators can be of any kind; their parameters can be,
>>in particular, randomly chosen with the help of one additional pseudo-
>>random number generator.
>>The basic idea is the following: Each of the n constituent generators
>>produces an output stream that is, depending on its quality, more or
>>less random. If one randomly interleaves these streams, then the
>>result should be more random. In other words, the auto-correlations
>>that exist in the n individual output streams get lost through the
>>mixing process since elements that are at distance d (d>=1) apart
>>from one another in any particular one of the n output streams are no
>>longer at any fixed distance, say d', apart on the combined output.
>>Besides, half of the output of each generator is consumed internally
>>and does not appear as the output of the combined generator. From
>>the algorithm it can be seen that auto-correlations of elements at
>>distance d=1 apart of the n individual output streams are totally
>>lost on the combined output stream.
>
>It is not totally clear what you are doing. Something like this
>is discussed by Knuth, and the results are not that good.

Like the method of MacLaren and Marsaglia, my algorithm effects a shuffling of the outputs of PRNGs. MacLaren and Marsaglia shuffle the output of one single PRNG with the help of another PRNG, while I shuffle the outputs of n PRNGs with the help of these n PRNGs themselves. The outputs of the n PRNGs are interleaved (mixed up) in a (pseudo-)random way, because the order in which the PRNGs are activated is not fixed (e.g. round-robin) but is determined at runtime by the current outputs of the PRNGs. This can easily be seen from the algorithm. (If not, a paper-and-pencil exercise with, say, 3 PRNGs should make the interleaving clear.)
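For readers who would rather run a small program than do the paper-and-pencil exercise, the following sketch shows the interleaving in executable form. It is purely illustrative: the constituents are simple linear congruential generators with arbitrarily chosen parameters, not those of the implementation on my Web page, and any generators delivering real values in [0,1) would do equally well.

    # Minimal sketch of the combined generator (illustrative only).
    def make_lcg(seed, a, c, m):
        state = [seed]
        def gen():
            state[0] = (a * state[0] + c) % m
            return state[0] / m              # real value in [0,1)
        return gen

    # n constituent PRNGs G(0)..G(n-1) with arbitrarily chosen parameters
    G = [make_lcg(12345 + 7 * k, 69069, 1, 2**31) for k in range(10)]
    n = len(G)

    def combined(count):
        j = 0
        out = []
        for _ in range(count):
            out.append(G[j]())               # this value is output
            j = int(n * G[j]())              # this value is consumed internally
        return out

    print(combined(20))

Working through a few iterations by hand shows how the index j, and hence the order in which the constituents are activated, depends on the internally consumed values.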
Knuth may have found that some existing methods of combining PRNGs or of shuffling are poor (I haven't checked through his book), but I am quite sure that he didn't make categorical statements to the effect that such approaches were doomed to failure from the very beginning, for that would have been an unscientific way of arguing. It is well known that shuffling appears not to be amenable to rigorous formal mathematical analysis and hence has the disadvantage of having to be treated on a more or less heuristic basis. Thus J. Dagpunar writes the following in his Principles of Random Variate Generation, Clarendon, Oxford, 1988, p. 35:

    Given the inevitable dependencies that will exist in a 'random'
    sequence, it seems natural that one should try to shuffle the
    output of a sequence, in the hope that this will make it more
    random. Examples of such approaches are MacLaren and Marsaglia
    (1965) and Bays and Durham (1976).

without giving pointers to any text where a rigorous formal mathematical analysis of the merits of shuffling could be found. The consequence is that such methods need more extensive testing before they can be accepted for application, as compared to methods that are easily amenable to mathematical analysis. On the other hand, this in no way rules out that certain shuffling methods could nevertheless be very successful. This can be seen from the following, taken from the User's Manual of the IMSL STAT/Library, vol. 3, p. 947:

    The user can also select a shuffled version of these generators
    using RNOPT. The shuffled generators use a scheme due to Learmonth
    and Lewis (1973a). ....... This scheme is similar to that of Bays
    and Durham (1976), and their analysis would be applicable to this
    scheme as well.

Since several readers mentioned Knuth, it is interesting to see what Knuth writes in his book (1969, p. 30) about the method of MacLaren and Marsaglia, which is one of the best known shuffling methods (and hence not amenable to rigorous mathematical analysis to date):

    A much better method has been suggested by MacLaren and Marsaglia,
    and is eminently suited to computer programming. .... .......
    This method can be highly recommended.

Let me now expand my arguments about correlations. Consider the output of a particular one of the n PRNGs, say G(k) for a fixed k. Let its output be (t_i) (i=1,2,...). Depending on the parameters of G(k), this sequence has more or less strong auto-correlations. There is dependency between t_i and t_(i+1) for i=1,2,..., which makes for the auto-correlation at distance d=1. There is dependency between t_i and t_(i+2) for i=1,2,..., which makes for the auto-correlation at distance d=2. Analogously for auto-correlations at other values of d.

My algorithm is designed such that t_i and t_(i+1) never both appear as output of the combined generator: each time G(k) is selected, it delivers two consecutive values, of which the first is output and the second is consumed internally. Therefore the auto-correlation at distance d=1 originally present in the output sequence of G(k) is completely lost on the output stream of the combined generator.

Let us now investigate what happens to the auto-correlation at distance d=2 of the sequence output by G(k). Depending on the value of i, t_i and t_(i+2) either both appear as output of the combined generator or both do not appear as output of the combined generator.
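This structural property is easy to check mechanically. The following small check, building on the purely illustrative sketch given earlier, records for each value reaching the combined output which constituent produced it and at which position of that constituent's own stream; recorded positions belonging to the same constituent are never consecutive:

    # Check of the d=1 claim: no two consecutive values t_i, t_(i+1)
    # of any constituent G(k) both reach the combined output.
    def combined_traced(gens, count):
        n = len(gens)
        pos = [0] * n                    # values drawn so far from each G(k)
        trace = []
        j = 0
        for _ in range(count):
            gens[j](); pos[j] += 1       # value that is output
            trace.append((j, pos[j]))    # record (constituent, position)
            v = gens[j](); pos[j] += 1   # value consumed internally
            j = int(n * v)
        return trace

    last = {}
    for k, p in combined_traced(G, 10000):
        assert p != last.get(k, -1) + 1  # never two consecutive positions of G(k)
        last[k] = p
    print("no pair t_i, t_(i+1) of any constituent reaches the output")

Of course, the check only confirms what the algorithm guarantees by construction; it says nothing by itself about the correlations at d>=2, which are addressed next.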
Of those pairs t_i and t_(i+2) that do appear as output of the combined generator, the following happens: While t_i and t_(i+2) are at the constant distance d=2 apart within the original output sequence of G(k), they are in general at another distance d' apart on the output sequence of the combined generator, and this d' is, in consequence of the (pseudo-)random interleaving of the n output streams of the PRNGs, not a constant for all i; in other words, d' takes on a spectrum of different values. Consequently, the auto-correlation at d=2 originally present in the output of G(k) becomes 'distributed' over auto-correlations at a number of distances d' on the output stream of the combined generator. Analogous arguments apply to the auto-correlations at d=3,4,... that are present in the output sequence of G(k).

Lastly, one should consider possible dependency between any two elements in the output sequence of the combined generator that originate from different constituent PRNGs. This question turns out to be very simple: the n PRNGs are assumed to have been chosen independently of one another, hence there is no dependency between a pair of elements of this kind.

There are two aspects of my algorithm particularly worthy of notice. The first is that n, the number of constituent PRNGs, can be chosen arbitrarily large (practically without limit) without penalty in computing cost, which means that one can freely tune the quality of the output of the combined generator through the value of n. (As I found through experiments, the results of the serial test improve asymptotically with increasing n.) As far as my knowledge goes, this is not true for other existing shuffling methods employing more than one PRNG; as a consequence, they are limited, for the practical reason of economy, to very small values of n. The second aspect is that only one half of the total output of the n constituent generators appears as output of the combined generator, while the other half is consumed internally, thus rendering the inference of the parameters of the n constituent PRNGs (n itself has to be inferred) much more difficult. This is of essential significance for information security applications, though perhaps of no particular relevance to normal statistical computations.

It may be mentioned that my design originally started from a simpler one:

    j:=0; do t:=G(j); output(t); j:=trunc(n*t) od;

This has the advantage of a computing cost factor of 1 instead of 2 for my final design. However, I prefer the scheme posted in my article because of (1) the complete elimination of the original auto-correlation at d=1, which is normally the most significant of the spectrum of auto-correlations, and (2) the much enhanced difficulty of inferring the parameters of the constituent PRNGs as a result of the suppression of one half of their outputs.
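In terms of the illustrative sketch given earlier, this simpler design is only a small change (again a sketch, not my actual implementation):

    # The simpler original design: the value that is output is also the
    # value that selects the next constituent (cost factor 1 instead of 2).
    def combined_simple(count):
        j = 0
        out = []
        for _ in range(count):
            t = G[j]()               # output value ...
            out.append(t)
            j = int(n * t)           # ... is reused to select the next PRNG
        return out

Here every generated value appears in the output, so the d=1 pairs of each constituent are no longer suppressed; this is precisely the trade-off described under (1) and (2) above.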
Some readers have pointed to the following two issues of interest:

(a) Consider the PRNG G(0). If it continues to generate numbers close to 0.0 for a long time, then G(0) will always be selected during that period by the algorithm and the remaining n-1 PRNGs will not contribute to the output of the combined generator. This is certainly undesirable. An analogous argument applies to G(k) for any k. However, the pathological case for G(0) could only arise if its parameters are extremely bad, and this is rendered impossible through a bit of careful programming of the algorithm. See in this regard the implementation on my Web page, where the linear congruential relations have sufficiently large moduli. Those who still feel uncomfortable may consider the following variant of the algorithm:

    j:=0;
    do output(G(j)); j1:=trunc(n*G(j));
       if j=j1 then j:=(j+1) mod n else j:=j1 fi od;

(b) The implementation provided on my Web page uses only two input values, namely n and a seed for the standard PRNG that is used to generate the parameters of the n PRNGs. Consequently anyone attempting inference has only these two numbers to deal with. However, my implementation is meant simply to show how my algorithm may essentially be realized. If one chooses n=10000, one could let the parameters of the 10000 constituent PRNGs be input by the user, if one likes. A practically more useful set of input values could contain m seeds, with each seed being used for the construction of a subset of the n PRNGs. One could also use several standard PRNGs instead of the single one employed in my implementation. Finally, it is to be noted that the algorithm does not prescribe the type of the n constituent PRNGs. These can, in particular, be non-linear PRNGs, instead of being based on linear congruential relations as in my implementation.

>One should test for the randomness of bits, rather than numbers.
>Congruential generaters are know to be atrocious for tail bits.
>Also, it is easy to fix up congruential generators to give low
>serial correlations without really giving independence. Even
>the non-random quasi-random streams have 0 correlation, but
>essentially complete dependence (except for rounding).
>--
>This address is for information only. I do not claim that these views
>are those of the Statistics Department or of Purdue University.
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
>hrubin@stat.purdue.edu  Phone: (765)494-6054  FAX: (765)494-0558

I must admit that I possess little knowledge of the details of the 'non-randomness' of the tail bits of numbers produced by congruential generators. If I don't err, this concerns directly the integer-valued numbers that are computed. My algorithm requires that the constituent PRNGs output real-valued numbers in [0,1), so the integer-valued numbers obtained from the congruential relations have to be divided by the corresponding moduli in order to produce the real-valued numbers in [0,1). I have a strong feeling that any 'non-randomness' of the tail bits of the integer-valued numbers will be much reduced, if not rendered entirely negligible, once this division is performed and the output streams of the n constituent PRNGs are (pseudo-)randomly interleaved, in particular when n is chosen sufficiently large. In any case, I should appreciate being able to access any concrete materials on the 'non-randomness' of the tail bits of either the integer-valued numbers or the mantissa of the real-valued numbers originating from congruential relations.
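As one elementary illustration of the kind of regularity presumably meant (a well-known property, sketched here with the same purely illustrative parameters as before, not those of my implementation): with a power-of-two modulus and the usual odd multiplier and odd increment, the least significant bit of a mixed congruential generator simply alternates with period 2.

    # Lowest bit of a mixed LCG with modulus 2**31 (illustrative parameters)
    a, c, m = 69069, 1, 2**31
    x = 12345
    bits = []
    for _ in range(16):
        x = (a * x + c) % m
        bits.append(x & 1)
    print(bits)                          # 0 and 1 strictly alternating

Whether the division by the modulus and the subsequent interleaving dilute such regularities sufficiently is exactly the kind of question that only systematic testing can settle.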
To conclude, I think I have provided sufficient (informal) explanation of why my algorithm should produce output with low auto-correlations and why the inference of the parameters of the PRNGs is likely to be very difficult. On the other hand, I have as yet done very few tests. While the serial test is considered in the literature to be fairly reliable for detecting dependencies, it is only one of the huge battery of tests that are currently recommended.

My personal problem is that I do not have the hardware resources to carry out even a small portion of these tests, not to mention the graver issue that running the software packages involved and evaluating their results could demand expertise well beyond my humble knowledge. Thus I hope that my posted article may attract the attention of some experts who, being knowledgeable and possessing sufficient hardware resources, would be kind enough to critically examine my algorithm by subjecting it to a wide range of tests which I could not perform, thus refuting or confirming my subjective opinion of its merits. Lastly, I would like to mention that the essentials of the above arguments, plus some other points, may be found on my Web page, and that a tiny reward of US$300 is offered to the expert who is kind enough to help me.

M. K. Shen

______________________________________________________

------------------------------------------------------
Apology: On 29-30 Dec 1997 someone, having cracked my login password and using my e-mail address, sent a number of faked mails of schizophrenic (fortunately non-criminal) content. These probably all have at the bottom the name of a certain Molana, as could be deduced from some mails that were bounced back because the target addresses happened to be invalid. I sincerely regret the inconvenience that these faked mails have caused to their receivers.
------------------------------------------------------

M. K. Shen, Postfach 340238, D-80099 Muenchen, Germany
+49 (89) 831939 (6:00 GMT)
mok-kong.shen@stud.uni-muenchen.de
http://www.stud.uni-muenchen.de/~mok-kong.shen/
(containing 2 mathematical problems with rewards totalling US$500)

----------------------------------------------
Sunshine is the best disinfectant.
(citation of a citation in B. Schneier and D. Banisar,
The Electronic Privacy Papers. John Wiley, New York, 1997.)
----------------------------------------------
The words of a man's mouth are as deep waters, and the wellspring
of wisdom as a flowing brook.  (Proverbs 18:4)

A little that a righteous man hath is better than the riches of
many wicked.  (Psalms 37:16)