From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,INVALID_MSGID
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,50880f040eb869b4
X-Google-Attributes: gid103376,public
From: mfb@mbunix.mitre.org (Michael F Brenner)
Subject: Re: Anyone help develop an algorythm?
Date: 1997/04/22
Message-ID: <5jidul$mru@top.mitre.org>#1/1
X-Deja-AN: 236593776
References: <5jddg7$uf0@newssvr01-int.news.prodigy.com>
 <335AF137.54D7@bix.com>
Organization: The MITRE Corporation, Bedford Mass.
Newsgroups: comp.lang.ada
Summary: Use a finite state machine
Date: 1997-04-22T00:00:00+00:00
List-Id: <comp.lang.ada>


Several of the fastest ways to sort your array of strings include not
sorting them at all. In a prior post I gave code for partially balanced
tree searches which do a partial sort in less time than true sorts, 
balancing (pardon the pun) the requirements for small sort time versus
the time for small lookup time. Three faster methods include an address
finite state machine, an calculation sort, and a hash code sort. Hash
codes are just imperfect address calculation sorts, and will not be
discussed here. Address calculations are becoming more and more possible
as machine addresses get larger, and virtual hardware with translation
lookaside buffers become the rule rather than the exception on modern
risc CPUs. However, I will assume that you have a tiny PC such as
a Pentium like I have, with a pitiful few dozen megs of RAM to fit
the arrays and the workspace. Therefore, we come to the FSA possibility.
FSAs at first glance seem to take K**(L+1) words of memory, where
K is the size of the alphabet used for the lookup strings, and L is
the number of lookup strings. However, this matrix is extremely
sparse. First, we notice that K can be reduces from 256 down to 
approximately 27 by collapsing all special characters and making
everything lower case, if the words are in English. Second, we notice
that most of the potential states are unused, so we only need to store
the matrix for m states, where m is the sum of the lengths of the 
lookup strings. So we now have a reduced matrix of 27m words length.
This can be implemented directly in code instead of in matrix form, using
such constructs as CASE statements or GOTO statements. One last 
possibility to mention is the dream of perfect hash codes. If you
happen to come across a fast-to-compute formula that happens to
compute the lookup function of your list of words, then by all means
use it. It is not always possible to derive such a magic formula
for all possible inputs, however.