From: Georg Bauhaus
Date: Fri, 02 Dec 2011 10:30:11 +0100
Newsgroups: comp.lang.ada
Subject: Re: Ann: Natools.Chunked_Strings, beta 1
References: <4ed4fc37$0$2537$ba4acef3@reader.news.orange.fr> <7nz692j39hkt$.146ba4w7yczck$.dlg@40tude.net>
In-Reply-To: <7nz692j39hkt$.146ba4w7yczck$.dlg@40tude.net>
Message-ID: <4ed89aa8$0$7616$9b4e6d93@newsspool1.arcor-online.net>
Organization: Arcor

On 02.12.11 09:27, Dmitry A. Kazakov wrote:
> On Fri, 02 Dec 2011 00:26:29 +0100, Vinzent Hoefler wrote:
>
>> Dmitry A. Kazakov wrote:
>>
>>> This is very likely. But my concern was not performance, rather the idea of
>>> having long strings.
>>> Since long text strings do not exist in "nature"
>>> (:-)), nobody should like to have them.
>>
>> Hmm. What's the average length of a DNA-string? ;)
>
> Yes, this is what I had in mind when specifically indicated strings as
> *text* ones.
>
> DNA chain is not a text string. Furthermore it would likely have some
> specific operations and a representation tailored substring search.

I have tried this once. The sequence information is given and uses just a handful of characters. I mapped those to some 4-bit type, and even tried fewer bits. I added lots of purportedly smart unchecked conversions and some shifting, and made my head spin thinking about which combinations of "characters" might allow clever additions, rather than shifts and the like, for obtaining information about substrings or single "characters" (having noted that addition is faster than shifting or logical operations on the processor), etc. I also tried specializing the search.

If there is a solution, it seems tricky. Perhaps it is to be found by someone with more than ordinary combinatorial skills. In my case this effort produced only minuscule advantages, sometimes the opposite, and the cost was a large number of specialized subprograms. Then I stopped and went back to a comparatively stupid subtype of String.

There might still be an algorithm that uses the standard String type and some fast String search, but on rearranged data: such that not just one DNA_Character'(x) maps to the same Character'(x) of the subject string, but, if there are no more than 16 different DNA characters, such that a pair of DNA characters from an enumeration type ('A', 'C', 'G', 'T', ...) maps to a single ordinary Character, where 'Pos gives the usual sequence 0, 1, 2, ... So that, for example, String'('G', 'C') becomes Character'Val (2 * 2**4 + 1) = '!'.

Or store a few DNA characters in exact floats to be processed in parallel with SIMD instructions of SSE on Intelian processors, maybe. Isn't this an interesting challenge in some circles?
If someone achieves any of this using plain Ada, the solution should create some interest in higher order languages.
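
To make the pair-packing idea above concrete, here is a minimal sketch, not a tested or tuned solution. The names DNA_Character, DNA_String, Pack_Pair and Pack_Pairs are made up for illustration, the alphabet is cut down to four literals, and the input is assumed to have even length (an odd tail would need some padding convention).

```ada
with Ada.Text_IO;

procedure Pack_DNA_Demo is

   --  Hypothetical alphabet; the scheme allows up to 16 literals,
   --  since each one must fit into 4 bits.
   type DNA_Character is ('A', 'C', 'G', 'T');
   type DNA_String is array (Positive range <>) of DNA_Character;

   --  First character goes into the high 4 bits, second into the low 4 bits.
   function Pack_Pair (Hi, Lo : DNA_Character) return Character is
   begin
      return Character'Val
        (DNA_Character'Pos (Hi) * 2**4 + DNA_Character'Pos (Lo));
   end Pack_Pair;

   --  Assumption: Source has even length.
   function Pack_Pairs (Source : DNA_String) return String is
      Result : String (1 .. Source'Length / 2);
   begin
      if Source'Length mod 2 /= 0 then
         raise Constraint_Error with "odd-length DNA string";
      end if;
      for K in Result'Range loop
         Result (K) := Pack_Pair
           (Source (Source'First + 2 * (K - 1)),
            Source (Source'First + 2 * (K - 1) + 1));
      end loop;
      return Result;
   end Pack_Pairs;

   Packed : constant String := Pack_Pairs ("GCAT");

begin
   --  'G' has position 2 and 'C' position 1, so the first packed
   --  Character is Character'Val (2 * 16 + 1), which is '!'.
   Ada.Text_IO.Put_Line (Packed (1 .. 1));
end Pack_DNA_Demo;
```

The packed String can then be handed to any ordinary String search. One caveat: such a search only finds occurrences that start at an odd index of the original sequence, that is, on a pair boundary. A pattern starting in the middle of a pair would need a second pass over data packed with an offset of one, or some other treatment of the boundary characters.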