Newsgroups: comp.lang.ada
Subject: Re: Hash Type Size
From: Peter Brooks
Date: Mon, 2 Sep 2013 19:31:56 -0700 (PDT)

On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt wrote:
>
> So my answer to "what's a good way to hash a private type/access
> value/system.address?" is that no such way exists. The best you can do
> is convert the data to a stream-element array and hash that somehow,
> but that most likely would be a dubious hash (there is a lot of unused
> bits in a typical record, and it would be really easy for those bits to
> end up being most of the result). The whole point is that a hash
> function needs to be tailored to the data that it is hashing.
>
This looks like a good topic for somebody's Ph.D!

I'd imagine that certain algorithms could have fairly broad applicability. In the case of triplestores, the ones used by Wikipedia contain mainly English words and phrases, with an increasing number of entries in other languages. It'd be useful to know whether a hash function that works well for Wikipedia data in English is more or less effective when used on German or Korean.

The large triplestores of genetic data are clearly a different matter, and I can see that tailoring the algorithm there would be important - what, though, do you actually need to know about the data in order to design a hash algorithm?

With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is in encoding the different parts of the URI. The triples are of the form:

subject - predicate - object

Clearly the '' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.
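Something like the following is what I have in mind for that two-way mapping - only a sketch, using nested instances of the standard ordered containers (tree-based in the usual implementations), with plain URI strings as keys; the package and variable names are all invented:

   with Ada.Containers.Indefinite_Ordered_Maps;
   with Ada.Containers.Indefinite_Ordered_Sets;

   package Triple_Index is
      --  Objects (or subjects) reachable through one predicate
      package Uri_Sets is new Ada.Containers.Indefinite_Ordered_Sets
        (Element_Type => String);

      --  predicate -> set of objects (or subjects)
      package Predicate_Maps is new Ada.Containers.Indefinite_Ordered_Maps
        (Key_Type     => String,
         Element_Type => Uri_Sets.Set,
         "="          => Uri_Sets."=");

      --  One tree per direction of the mapping
      package Node_Maps is new Ada.Containers.Indefinite_Ordered_Maps
        (Key_Type     => String,
         Element_Type => Predicate_Maps.Map,
         "="          => Predicate_Maps."=");

      Forward  : Node_Maps.Map;   --  subject -> predicate -> object
      Backward : Node_Maps.Map;   --  object  -> predicate -> subject
   end Triple_Index;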
The predicates will be a much smaller set, so an associative array will probably be best - probably a three-way associative array: one keyed on the source of the predicate, 'www.w3.org/1999/02/22-rdf-syntax-ns', another on the particular predicate, 'type', and the third on the whole URI - but with the same hash key for each, as they're all the same relation (see the sketch at the end of this post).

This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.
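For that predicate table, roughly this is what I mean, sketched with the standard hashed containers and Ada.Strings.Hash (the names are invented, and I've used a shared id where I said 'same hash key' - each map still hashes its own key string):

   with Ada.Containers.Indefinite_Hashed_Maps;
   with Ada.Strings.Hash;

   procedure Predicate_Table_Sketch is
      type Predicate_Id is new Natural;   --  one id per relation

      package Predicate_Maps is new Ada.Containers.Indefinite_Hashed_Maps
        (Key_Type        => String,
         Element_Type    => Predicate_Id,
         Hash            => Ada.Strings.Hash,
         Equivalent_Keys => "=");

      By_Namespace : Predicate_Maps.Map;   --  source of the predicate
      By_Name      : Predicate_Maps.Map;   --  the particular predicate
      By_Uri       : Predicate_Maps.Map;   --  the whole URI

      Rdf_Type : constant Predicate_Id := 1;
   begin
      --  Three ways in, one id out: all three entries name the same relation.
      By_Namespace.Insert ("www.w3.org/1999/02/22-rdf-syntax-ns", Rdf_Type);
      By_Name.Insert      ("type", Rdf_Type);
      By_Uri.Insert       ("www.w3.org/1999/02/22-rdf-syntax-ns#type", Rdf_Type);
   end Predicate_Table_Sketch;

Whether hashing the three key forms separately, or deriving a single hash shared by all three, pays off is exactly the tuning question, I suppose.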