From: miker3@ix.netcom.com (Mike Rubenstein)
Subject: Re: Should I learn C or Pascal?
Date: 1996/08/22
Message-ID: <321bb94c.6596465@nntp.ix.netcom.com>
References: <4u7hi6$s2b@nntp.seflin.lib.fl.us> <4uo74j$95p@ns.broadvision.com> <4v1r2a$gh6@falcon.le.ac.uk> <840224235snz@genesis.demon.co.uk> <32159698.39D9@xenon.chromatic.com> <3215cc56.142847293@nntp.ix.netcom.com> <4v9a47$gm3@goanna.cs.rmit.edu.au> <32199bba.19972158@nntp.ix.netcom.com> <4vg8v8$9i2@goanna.cs.rmit.edu.au>
Organization: Netcom
Newsgroups: comp.lang.c,comp.lang.c++,comp.unix.programmer,comp.lang.ada,comp.os.msdos.programmer

ok@goanna.cs.rmit.edu.au (Richard A. O'Keefe) wrote:

> miker3@ix.netcom.com (Mike Rubenstein) writes:
> >No.  The assumption that O(N^2) performance is unlikely assumes that
> >quicksort is properly implemented.  The distribution of the data has
> >nothing to do with it.
>
> The point I was trying to make is that the allegedly "unlikely"
> behaviour is already *known* to have occurred much more often than
> the theorems you see in data structures books would have you
> believe.  Given any non-randomising version of quicksort, I can
> construct a pessimal input for that version.  If the distribution
> of the input I supply is such that pessimal problems occur with
> probability 1, I will get so-called "unlikely" behaviour with
> probability 1.  So *in the absence of built-in randomisation*
> the distribution has a lot to do with it.
>
> Again, some of the proofs of the unlikelihood of bad behaviour
> tacitly assume that all elements are distinct.  Even randomisation
> won't help unless you use a fat pivot or some other technique that
> handles repeated elements well.
>
> So if Mike Rubenstein means by "properly implemented" a quicksort
> using randomisation and a fat pivot, then I am forced to agree with
> him.  On the other hand, while the quicksort I use (the "Engineered"
> one by Bentley & McIlroy) uses a fat pivot, it doesn't use randomisation,
> so for a recent version of quicksort designed and tested by experts
> and published in a refereed journal, which gets impressive performance
> in practice (it is the first quicksort I've ever come across which can
> routinely compete with a well implemented merge sort), it is possible
> to find an input distribution for which "Engineered" qsort will behave
> badly with probability 1.  So either Mike Rubenstein is wrong, or
> Bentley & McIlroy's qsort is not "properly implemented".

I certainly mean that a properly implemented quicksort (or any other
algorithm) is one that is also properly applied and is implemented with
due consideration to the problem at hand.  For a general purpose
quicksort, this certainly means randomization and a fat pivot (or some
other technique for handling repeated elements).
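To make that concrete, here is a rough C sketch of the kind of thing I
mean: a random pivot plus a three-way ("fat") partition.  It is only an
illustration (the names are mine, it sorts ints, and it uses rand(),
which no real library routine should settle for); it is not Bentley &
McIlroy's code or anything out of a library.

    #include <stdlib.h>

    static void swap_int(int *a, int *b)
    {
        int t = *a;
        *a = *b;
        *b = t;
    }

    /* Quicksort with a randomly chosen pivot and a three-way ("fat")
       partition: elements equal to the pivot are gathered in the
       middle and never looked at again, so many duplicates cost
       nothing extra. */
    static void fat_quicksort(int *v, size_t n)
    {
        while (n > 1) {
            size_t lt = 0, i = 0, gt = n;
            int p = v[(size_t) rand() % n];  /* crude random pivot */

            while (i < gt) {
                if (v[i] < p)
                    swap_int(&v[i++], &v[lt++]);
                else if (v[i] > p)
                    swap_int(&v[i], &v[--gt]);
                else
                    i++;
            }
            /* Now v[0..lt) < p, v[lt..gt) == p, v[gt..n) > p.
               Recurse on the smaller piece and iterate on the larger
               so the stack depth stays O(log n). */
            if (lt < n - gt) {
                fat_quicksort(v, lt);
                v += gt;
                n -= gt;
            } else {
                fat_quicksort(v + gt, n - gt);
                n = lt;
            }
        }
    }

With all elements equal, the first partition sweeps everything into the
middle band and the sort stops after one pass; the random pivot choice
is what keeps any fixed input ordering from forcing the O(N^2) case.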
> >> The likelihood of quicksort doing badly on *your* data depends on the
> >> actual probability distribution of *your* data.
>
> >Again, not if quicksort is implemented properly.
>
> Which either means "using randomisation and a fat pivot" or is false.
>
> >There is no need to permute the array.  Just follow Hoare's suggestion
> >for avoiding O(N^2) performance.
>
> >Hoare suggested choosing the pivot randomly.  This has the same effect
> >on overall performance as permuting the data.
>
> Permuting the whole array is simpler and faster.  There is a phase where
> the random number generator is in the cache and the sorting code is not,
> then there is a phase when the sorting code is in the cache and the
> random number generator is not.  What's more, depending on the random
> number generator, it may be possible to vectorise the random number
> generation and fuse that loop with the permutation loop.  The separated
> approach *looks* as though it ought to be easier to make fast.  (For
> example, if using AS183, the three integers it maintains can be kept in
> registers during the permutation phase instead of being repeatedly
> stored back into memory and reloaded.)

What nonsense.  If you mean the instruction cache, it's unlikely that
the code for quicksort will fit in the cache.  If you mean high speed
memory (e.g., the typical memory cache on an 80x86 or Pentium), both
will almost certainly fit.

Using a random pivot requires no movement of the data.  Rearranging the
data requires moving the data or constructing an array of pointers or
indices, which will slow down the comparisons.

> There is also the practical point that if you want to use someone else's
> quicksort and don't fully trust it, you can permute the array yourself
> before calling qsort, without being able to modify the qsort code.

If you are using untrusted code when timing is critical, you deserve
everything that happens to you.  If timing is really critical, functions
like C's qsort() are unsuitable anyway, since they rely on inefficient
comparison and, possibly, swapping code.  I've never seen a general sort
program that is always suitable.

> >Choosing the median of the first, middle, and last elements for the
> >pivot also works well in practice.
>
> Ahem.  This is the approach that has yielded bad behaviour IN PRACTICE.
>
> >In particular, it will prevent
> >O(N^2) performance if the data is nearly in order or reverse order.
>
> Only if (a) repeated elements are rare or (b) you use a fat pivot.

Actually, repeated elements don't have to be rare.  What matters is
whether a few values each account for a large share of the elements.
For example, if every element appears twice, repeated elements are
common, but performance is still O(n log n).

If some items are likely to be repeated many times, a fat pivot should
certainly be used.  My experience is that this is not a very common
situation, but your experience may differ.

> The only ways to make bad runtime cost unlikely for any algorithm are
> (a) use an algorithm for which bad cost can *never* happen
> (b) randomise so that you *know* what the distribution is.

More nonsense.  In practice one often knows a great deal about the
input.  Randomising may be the worst thing to do.  It's much better to
think than to blindly follow rules.

Elsewhere, I've mentioned a problem I ran into of doing a lot of sorts
on almost-in-order data.  Insertion sort, despite its terrible worst
case and average case performance, works quite well in such a situation.
Randomizing and using someone else's quicksort would almost certainly
have been disastrous.
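For what it's worth, the sort I have in mind there is nothing more than
the textbook insertion sort.  A bare-bones C version (again just a
sketch of my own, for ints) looks something like this:

    #include <stddef.h>

    /* Straight insertion sort.  On data that are already almost in
       order each element has to move only a short distance, so the
       inner loop exits almost immediately and the whole sort runs in
       close to linear time, in spite of the O(N^2) worst and average
       cases. */
    static void insertion_sort(int *v, size_t n)
    {
        size_t i, j;

        for (i = 1; i < n; i++) {
            int x = v[i];

            for (j = i; j > 0 && v[j - 1] > x; j--)
                v[j] = v[j - 1];
            v[j] = x;
        }
    }

That near-linear behaviour on nearly ordered input, not the textbook
average case figure, is what made it the right choice for that job.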
Michael M Rubenstein