From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me X-Spam-Level: * X-Spam-Status: No, score=1.3 required=5.0 tests=BAYES_00,INVALID_MSGID, MSGID_RANDY autolearn=no autolearn_force=no version=3.4.4 X-Google-Language: ENGLISH,ASCII-7-bit X-Google-Thread: 103376,a1868a27625c21e7 X-Google-Attributes: gid103376,public From: Robert Dewar Subject: Re: Looking for keyed file package Date: 1999/09/26 Message-ID: <7sju61$pdv$1@nnrp1.deja.com>#1/1 X-Deja-AN: 529578867 References: <37E817C6.80ED41E0@easystreet.com> <7saii8$5bl$1@nnrp1.deja.com> <7scrba$bdu$1@clnews.edf.fr> <7semeb$69u$1@nnrp1.deja.com> <7seobv$7ib$1@nnrp1.deja.com> <1999Sep24.130829.1@eisner> <7sijv6$tc8$1@nnrp1.deja.com> <1999Sep25.114101.1@eisner> X-Http-Proxy: 1.0 x40.deja.com:80 (Squid/1.1.22) for client 166.72.81.93 Organization: Deja.com - Before you buy. X-Article-Creation-Date: Sun Sep 26 01:51:00 1999 GMT X-MyDeja-Info: XMYDJUIDrobert_dewar Newsgroups: comp.lang.ada X-Http-User-Agent: Mozilla/4.04 [en] (OS/2; I) Date: 1999-09-26T00:00:00+00:00 List-Id: In article <1999Sep25.114101.1@eisner>, Kilgallen@eisner.decus.org.nospam wrote: > Since you say "in main memory as possible (and ...in cache", > I gather you are talking about processor memory cache, whereas > what I meant was having the index in the caches created by the > indexed file system in main memory. Yes, exactly > Presuming the actual data will likely require a disk read, > I think the difference between index data being in processor > cache and being in the file system cache in main memory is > not of consequence. That's not necessarily the case, there can be substantial processing involved in accessing the index, and processor caching is a factor. > The data I have been dealing with this month have a maximum > of 16 bytes of key data for a record length of 600 bytes, > so the standard recommendation to set the cache size large > enough to hold the entire index tree in memory seems > reasonable. Sure if you have a small file where the indexes can fit in main memory even if not compressed, that is a simple case to deal with. > Memory is cheap enough, however, that even not having the > keys compressed would be acceptable in this setting. That > would be quite different with 584 bytes of key data for > a record length of 600 bytes. Of course particular key > data can be more or less susceptible to compression. Actually that's wrong, with modern key compression techniques you can get down to less than 16 bits per key on average quite regardless of the distribution of key data. This is a rather specialized area, and the techniques involved are rather tricky. To get an intuitive feeling for why the above is true, consider the following extremes 1. Keys are very very different, in this case you only need a little bit of the key to differentiate them (this is trailing key compression in action, discarding trailing data in the key that is not significant). 2. Keys are very very similar, in this case leading key compression becomes effective. The combination is effective in all situations. As I mentioned earlier, the technique used in the Realia COBOL compiler gets down to about 16 bits per key, including control information, pretty much regardless of key distribution. Yes, of course for a small file, there is no issue, but the point is that key compression greatly extends the size of files that can be handled efficiently. Memory is cheap, but files are in practice getting bigger at least as fast as main memory size is increasing. Consider as an example, the ebay files, do some simple calculations here (remembering that you can do full searches on titles and expect instant response). You will see that files like this get pretty big, pretty quickly :-) Sent via Deja.com http://www.deja.com/ Before you buy.