From: Dmitry A. Kazakov
Newsgroups: comp.lang.ada
Subject: Re: FAQ and string functions
Date: Mon, 05 Aug 2002 13:50:38 +0200
References: <20020730093206.A8550@videoproject.kiev.ua>
 <20020731182308.K1083@videoproject.kiev.ua>
 <20020801161052.M1080@videoproject.kiev.ua>
 <20020802193535.N1101@videoproject.kiev.ua>

On Fri, 2 Aug 2002 19:35:35 +0300, Oleg Goodyckov wrote:

>On Sat, Aug 03, 2002 at 01:29:23AM +0200, Dmitry A. Kazakov wrote:
>>
>> My implementation (for parsing unit expressions) is about 0.5K lines long.
>> Is that much?
>
>500 bytes?

How big is the run-time library then?

>It is not right (as for me) to process EVERY error in the input data. As for
>me it is more effective to process only correct data (which is reliably
>recognized) and to simply drop anything else, nuffig.

Ah, that practice, which makes HTML a disaster: browsers silently ignore
what they do not understand. The results are known.

>> > Difference is like the difference between RANDOM and SEQUENTIAL accesses
>> > to data.
>>
>> This is a good point.
>> There is also a technical term for that. There are
>> global and local methods of processing texts, images etc. Global methods
>> (split is one) work well only for small amounts of data.
>
>What are global and local methods here for? For making the conclusion
>"global methods almost never work well", so they are nuffig, not needed?

The problem with all global methods is that the parameters they need cannot
be optimal in a large context. Split is an example: it requires a separator
and a notion of a token, both of which may vary from point to point, making
the approach useless.

>Config files of applications - are they a small amount of data? Yes. But one
>exists in every application. And splitting a string into several independent
>fields is a much more effective and convenient way to parse it than some
>sequential syntactic analysis.

I remember a project with a config file about 2 MB big (it was a Windows
registry folder). I wonder how much time it would take to parse it using the
split technique.

>> that as the complexity of the syntax increases, it becomes almost
>> impossible at some point to write a correct pattern and prove that it is
>> correct.
>
>Which nuffig "complexity of syntax"? The syntax could not be simpler: fields
>with separators (of one type) between them. It is not a real syntax.
>Take a record, split it by separators and enjoy.

Well, how long is a record allowed to be?

>No! Give me a syntax...

An argument in a call of a subroutine in C++.

>> First, the example is not realistic but illustrative. A real-life example
>> would take into account different spellings, typos, proper nouns,
>> multi-word tokens etc. It would probably work with a database, and it would
>> surely avoid unbounded strings (heap allocation) and so on and so forth. I
>> doubt that a Perl implementation of all that would be simpler or shorter
>> than one in Ada.
>
>Really? Empty words. Try and show me. In the skipped example I've seen one
>attempt. Show me another - better.
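[Not part of the original exchange - a minimal sketch of the point about
split, written in Python purely for brevity rather than in Ada or Perl. It
shows a record where the separator's meaning depends on context (a comma
inside a quoted field): one global split() misparses it, while a local,
sequential scan that tracks context does not.]

```python
# A comma-separated record with a quoted field containing a comma.
line = 'name,"Kazakov, Dmitry",www.dmitry-kazakov.de'

# Global method: one split over the whole string. The comma inside
# the quoted field is wrongly treated as a separator.
assert line.split(',') == ['name', '"Kazakov', ' Dmitry"',
                           'www.dmitry-kazakov.de']

# Local (sequential) method: scan left to right, tracking whether we
# are inside quotes, so the same character is interpreted per context.
def scan(s):
    fields, buf, in_quotes = [], [], False
    for ch in s:
        if ch == '"':
            in_quotes = not in_quotes      # toggle the local context
        elif ch == ',' and not in_quotes:
            fields.append(''.join(buf))    # separator only outside quotes
            buf = []
        else:
            buf.append(ch)
    fields.append(''.join(buf))
    return fields

assert scan(line) == ['name', 'Kazakov, Dmitry', 'www.dmitry-kazakov.de']
```

The scanner is only a sketch; a real one would also handle escaped quotes,
but even this tiny case is already beyond what a single split can express.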
>The task solved in the skipped example has a name - building a histogram of
>words. Why do you call this task not realistic?

Because a histogram is also a global method (used, I suppose, for some sort
of clustering), which also has great limitations and is by no means the end
product of a program.

>> Second, 80% of the example code deals with s/w components like containers
>> etc. This has nothing to do with text processing. What is really dedicated
>> to parsing is quite short and transparent.
>
>So, if that 80% of the code were thrown out, would the program still work?
>Or is it necessary after all?

Not for text processing; I supposed that the program does something more
than only that. Generally, if you have a problem to solve, you must first
decompose it into subproblems, and you should do that properly. Surely one
could use eigenvalues and eigenvectors to invert a matrix, but that would be
a *bad* idea. Decomposing a text-analysis problem into a bunch of split
operations is also a *bad* idea. This is my point.

>> You might argue that Ada should have standard components standard (:-)).
>> That is questionable, but as you see (Ada Standard Component Library),
>> work is going on in the direction of having those components, though maybe
>> not as a part of the standard.
>
>So, my words make sense? Why then do you argue?

Because I doubt that split should be a part of any standard library. As I
said, I consider it useless.

---
Regards,
Dmitry Kazakov
www.dmitry-kazakov.de