Announce: OpenToken 2.0 released

comp.lang.ada

* Announce: OpenToken 2.0 released
@ 2000-01-27  0:00 Ted Dennison
  2000-01-28  0:00 ` Jürgen Pfeifer
  2000-01-31  0:00 ` Hyman Rosen
  0 siblings, 2 replies; 33+ messages in thread
From: Ted Dennison @ 2000-01-27  0:00 UTC (permalink / raw)


Release 2.0 of OpenToken has now been placed on the website (
http://www.telepath.com/dennison/Ted/OpenToken/OpenToken.html ).
Highlights of this new version include an LALR(1) parser and an HTML
analyzer submitted by the ever-helpful Christoph Green. This is the
first version to include parsing capability. The existing packages
underwent a major reorganization to accommodate the new
functionality. As some of the restructuring that was done is
incompatible with old code, the major revision has been bumped up to 2.
A partial list of changes is below:


   * Renamed the top level of the hierarchy from Token to OpenToken.
   * Moved the analyzer underneath the new OpenToken.Token hierarchy.
   * Renamed the Token recognizers from Token.* to
     OpenToken.Recognizer.*
   * Changed the text feeder procedure pointer into a text feeder
     object. This will allow full re-entrancy in analyzers that was
     thwarted by those global text feeders previously.
   * Updated the SLOC counter to read a list of files to process from a
     file. It also handles files with errors in them a bit better.
   * Added LALR(1) parsing capability and numerous packages to support
     it. A structure is in place to build other parsers as well.
   * Created a package hierarchy to support parse tokens. The word
     "Token" in OpenToken now refers to objects of this type, rather
     than to token recognizers.
   * An HTML lexer has been added to the language lexers.
   * .Recognizer.Bracketed_Comment now works properly with
     single-character terminators.
   * Rewrote the text feeder and analyzer to minimize data copying.

With this release OpenToken now gains status as a viable replacement for
lex/yacc. In many ways it is more powerful, and there are plans to add
even more power to it. For those of you not already familiar with
OpenToken, I encourage you to visit the website and look around. But in
the meantime, here's a blurb about it from the readme file:

     The OpenToken package is a facility for performing token
     analysis and parsing within the Ada language. It is designed
     to provide all the functionality of a traditional lexical
     analyzer/parser generator, such as lex/yacc. But due to the
     magic of inheritance and runtime polymorphism it is
     implemented entirely in Ada as withed-in code. No
     precompilation step is required, and no messy tool-generated
     source code is created.

     Additionally, the technique of using classes of recognizers
     promises to make most token specifications as simple as making
     an easy to read procedure call. The most error prone part of
     generating analyzers, the token pattern matching, has been
     taken from the typical user's hands and placed into reusable
     classes. Over time I hope to see the addition of enough
     reusable recognizer classes that very few users will ever need
     to write a custom one. Parse tokens themselves also use this
     technique, so they ought to be just as reusable in principle,
     although there currently aren't a lot of predefined parse
     tokens included in OpenToken.

     Ada's type safety features should also make misbehaving
     analyzers and parsers easier to debug. All this will hopefully
     add up to token analyzers and parsers that are much simpler
     and faster to create, easier to get working properly, and
     easier to understand.

--
T.E.D.

Home - mailto:dennison@telepath.com  Work - mailto:dennison@ssd.fsi.com
WWW  - http://www.telepath.com/dennison/Ted/TED.html  ICQ  - 10545591







* Re: Announce: OpenToken 2.0 released
  2000-01-27  0:00 Announce: OpenToken 2.0 released Ted Dennison
@ 2000-01-28  0:00 ` Jürgen Pfeifer
  2000-01-28  0:00   ` Ted Dennison
  2000-01-31  0:00 ` Hyman Rosen
  1 sibling, 1 reply; 33+ messages in thread
From: Jürgen Pfeifer @ 2000-01-28  0:00 UTC (permalink / raw)


The RPMs for Red Hat and SuSE GNU/Linux are available at http://www.gnuada.org

For glibc-2.1 based systems (RH 6.x, SuSE 6.{2,3}):
http://www.gnuada.org/rpms312p.html#OPENTOKEN

For glibc-2.0 based systems (RH 5.x, SuSE 6.{0,1}):
http://www.gnuada.org/rpms312p_0.html#OPENTOKEN

Cheers
Jürgen





* Re: Announce: OpenToken 2.0 released
  2000-01-28  0:00 ` Jürgen Pfeifer
@ 2000-01-28  0:00   ` Ted Dennison
  0 siblings, 0 replies; 33+ messages in thread
From: Ted Dennison @ 2000-01-28  0:00 UTC (permalink / raw)


Jürgen Pfeifer wrote:

> The RPMs for Red Hat and SuSE GNU/Linux are available at http://www.gnuada.org
>

Cool. I'll make note of that on the website. Thanks Jürgen.

--
T.E.D.

Home - mailto:dennison@telepath.com  Work - mailto:dennison@ssd.fsi.com
WWW  - http://www.telepath.com/dennison/Ted/TED.html  ICQ  - 10545591







* Re: Announce: OpenToken 2.0 released
  2000-01-27  0:00 Announce: OpenToken 2.0 released Ted Dennison
  2000-01-28  0:00 ` Jürgen Pfeifer
@ 2000-01-31  0:00 ` Hyman Rosen
  2000-02-01  0:00   ` Ted Dennison
  1 sibling, 1 reply; 33+ messages in thread
From: Hyman Rosen @ 2000-01-31  0:00 UTC (permalink / raw)


Ted Dennison <dennison@telepath.com> writes:
> Release 2.0 of OpenToken has now been placed on the website

From a quick look at opentoken.ads, I see a declaration for an
EOF_Character, set to Ada.Characters.Latin_1.EOT. Does this mean
that OpenToken cannot parse binary files that happen to contain
this character? It's a rather odd choice in any case, given that
no system that I know of uses EOT as an end-of-file marker.





* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00           ` Hyman Rosen
@ 2000-02-01  0:00             ` Brian Rogoff
  2000-02-02  0:00               ` Hyman Rosen
  2000-02-02  0:00             ` Vladimir Olensky
  2000-02-02  0:00             ` Jeff Carter
  2 siblings, 1 reply; 33+ messages in thread
From: Brian Rogoff @ 2000-02-01  0:00 UTC (permalink / raw)

On 1 Feb 2000, Hyman Rosen wrote:

> Brian Rogoff <bpr@shell5.ba.best.com> writes:
> > (1) Exceptions: raise a Not_Found when input is exhausted. Some people 
> >     hate this because "Exceptions are only for error handling, not
> >     control flow!". OCaml (and SML too I think) use exceptions for this, 
> >     and Ada sometimes does (try reading a file stream without using
> >     File_Type...)
> 
> I don't like this much.
> Exceptions are for error handling, not control flow :-)

De gustibus non est disputandum. I've gotten used to this technique, and
lived to talk about it.

> > (2) Provide a query on the sequence, like in Java, so you have code like 
> >     while Has_More_Elements(Seq) loop 
> >         Char := Get_Next_Element(Seq);
> >         ...
> >     end loop;
> >     I find this very readable.
> 
> Unfortunately, when it comes to input, it is impossible on most systems
> to divorce a test for end of input from the attempt to read the input.

Exactly true, and as you say unfortunate too. 

> > (3) Provide an option type like in (OCa|S)ML which wraps returned elements 
> >     and forces the reader to unwrap them, like this
> 
> This is the integer/character thing dressed up in high-falutin' clothes.

True, though the high falutin thing is more general and much less prone to 
error since it expresses the intent clearly. It's also easily expressible
in C++ (your favorite language?) and other languages which have some form
of parametric polymorphism and variant types. I suppose you can do it in
Eiffel too but faking variants (tagged unions) with classes is an extra
level of ugliness IMO. 

I prefer using exceptions here, but I bet most Ada programmers agree with 
you and would like a procedure with two out params.

-- Brian 


* Re: Announce: OpenToken 2.0 released
  2000-01-31  0:00 ` Hyman Rosen
@ 2000-02-01  0:00   ` Ted Dennison
  2000-02-01  0:00     ` Hyman Rosen
  0 siblings, 1 reply; 33+ messages in thread
From: Ted Dennison @ 2000-02-01  0:00 UTC (permalink / raw)

In article <t7n1plq56s.fsf@calumny.jyacc.com>,
  Hyman Rosen <hymie@prolifics.com> wrote:
> Ted Dennison <dennison@telepath.com> writes:
> > Release 2.0 of OpenToken has now been placed on the website
>
> From a quick look at opentoken.ads, I see a declaration for an
> EOF_Character, set to Ada.Characters.Latin_1.EOT. Does this mean
> that OpenToken cannot parse binary files that happen to contain
> this character? It's a rather odd choice in any case, given that
> no system that I know of uses EOT as an end-of-file marker.

That's the marker that the OpenToken text feeders agree to put on text to
indicate that there is no more text to read. If you have to parse text
which contains an EOT, it's a simple matter to change EOF_Character to
something else.

As for parsing binaries, to my knowledge OT has not been used that way
before. However, I see only one real impediment. EOF_Character is used
in OpenToken:
   o  In the line comment recognizer (line comments make no sense in
      binaries anyway).
   o  In the Text_IO-based text feeder. Using this feeder also makes no
      sense in binaries. You'd want to write one based on Sequential_IO
      or something.
   o  In the End_Of_File token recognizer. This also makes no sense for
      binaries, as a sentinel character which can be tokenized clearly
      won't do the job.
   o  By you the user to make sure you don't attempt to read past the
      end of the file after a token analysis or parse returns. In this
      case, no problem for binaries exists. You just use a different
      method to prevent reading past the end of the file.
   o  In the analyzer to prevent reading past the end of file when
      matching a token. This *would* be a problem for you, unless none
      of your "binary" tokens span an EOT.

My suggestions for working around this problem are as follows:
Modify EOF_Character to be a variable so that it can be set by your
custom text feeder. Set it to some good terminating value normally. This
would be a byte value that cannot be anywhere in a token except at the
end. But when you read the last character from the file, set it to
that character's value instead.

A better option with a bit more work would be the following:
Modify the root text_feeder package to have a primitive operation for
returning whether we are at the end of the input. Implement that routine
in your custom text feeder (as well as any others that you may use).
Modify the one line in the Analyzer that checks EOF_Character to instead
call that routine on its text feeder.
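
A minimal sketch of that second option, under stated assumptions: the
names Feeders, Instance, Get, End_Of_Text, String_Feeder and Dump are
hypothetical and are not OpenToken's actual text feeder API. It only
illustrates the shape of the idea, where the consumer asks the feeder
whether input remains instead of comparing against a sentinel.

```ada
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Feeder_Demo is

   package Feeders is
      --  Hypothetical root type; not OpenToken's real declarations.
      type Instance is abstract tagged null record;

      --  Hand out the next character of input.
      procedure Get (Feeder : in out Instance; Item : out Character)
         is abstract;

      --  True once every character has been handed out.
      function End_Of_Text (Feeder : Instance) return Boolean is abstract;

      --  A feeder over a small in-memory buffer.
      type String_Feeder is new Instance with record
         Data : String (1 .. 16) := (others => ' ');
         Last : Natural := 0;
         Next : Positive := 1;
      end record;

      procedure Get (Feeder : in out String_Feeder; Item : out Character);
      function End_Of_Text (Feeder : String_Feeder) return Boolean;
   end Feeders;

   package body Feeders is
      procedure Get (Feeder : in out String_Feeder; Item : out Character) is
      begin
         Item := Feeder.Data (Feeder.Next);
         Feeder.Next := Feeder.Next + 1;
      end Get;

      function End_Of_Text (Feeder : String_Feeder) return Boolean is
      begin
         return Feeder.Next > Feeder.Last;
      end End_Of_Text;
   end Feeders;

   --  Stand-in for the analyzer: it dispatches to whatever feeder it
   --  was given and never inspects the data to find the end.
   procedure Dump (Feeder : in out Feeders.Instance'Class) is
      Item : Character;
   begin
      while not Feeders.End_Of_Text (Feeder) loop
         Feeders.Get (Feeder, Item);
         Ada.Text_IO.Put (Integer'Image (Character'Pos (Item)));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;

   Source : Feeders.String_Feeder;

begin
   --  The input deliberately contains EOT in the middle.
   Source.Data (1 .. 5) := "ab" & Ada.Characters.Latin_1.EOT & "cd";
   Source.Last := 5;
   Dump (Source);
end Feeder_Demo;
```

With this shape an EOT byte in the middle of the data is just another
character, since the end of input is reported by the feeder itself.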

Proper binary support is not in OT because it has just never come up
before. But as you can see, it could be modified fairly easily to
support parsing binaries. But using a sentinel character for the end of
file has always seemed like a nice simplification. So what are the uses
of parsing binaries? I kinda thought that binaries are, by their very
nature, already parsed.

--
T.E.D.

http://www.telepath.com/~dennison/Ted/TED.html

Sent via Deja.com http://www.deja.com/
Before you buy.


* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00     ` Hyman Rosen
@ 2000-02-01  0:00       ` David Starner
  2000-02-01  0:00         ` Brian Rogoff
  2000-02-02  0:00       ` Ted Dennison
  2000-02-04  0:00       ` Florian Weimer
  2 siblings, 1 reply; 33+ messages in thread
From: David Starner @ 2000-02-01  0:00 UTC (permalink / raw)


On 01 Feb 2000 13:16:14 -0500, Hyman Rosen <hymie@prolifics.com> wrote:
>By the way, the normal C/C++ style for handling EOF is to have the
>return type of the character reader be such that it can hold any
>value of the character set, plus an out-of-band value representing
>EOF. The usual is '#define EOF -1' and 'int getchar()'.

Which means that c: Character; c := getchar(); is illegal (at least
in Ada; in C it would get silently truncated.) It's a major pain
in C, and a well known source of bugs. How about making it like
procedure GetChar (EOF: in out boolean; Char: in out character);?
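
A small sketch of that procedure style, assuming an in-memory string as
a stand-in for the real input. Get_Char, End_Of_Input and Input are
illustrative names, and the parameters are given mode "out" here rather
than "in out".

```ada
with Ada.Text_IO;

procedure Get_Char_Demo is

   Input : constant String := "Ada";   --  stand-in for a real input source
   Next  : Positive := Input'First;
   Done  : Boolean;
   Char  : Character;

   --  Status and data travel separately, so every Character value
   --  remains available as data.
   procedure Get_Char (End_Of_Input : out Boolean; Item : out Character) is
   begin
      if Next > Input'Last then
         End_Of_Input := True;
         Item := Character'First;   --  placeholder; not meaningful at end
      else
         End_Of_Input := False;
         Item := Input (Next);
         Next := Next + 1;
      end if;
   end Get_Char;

begin
   loop
      Get_Char (Done, Char);
      exit when Done;
      Ada.Text_IO.Put (Char);
   end loop;
   Ada.Text_IO.New_Line;
end Get_Char_Demo;
```

The caller has to test the flag before using the character, but nothing
in the language forces that order.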

-- 
David Starner - dstarner98@aasaa.ofe.org
If you wish to strive for peace of soul then believe; 
if you wish to be a devotee of truth, then inquire.
   -- Friedrich Nietzsche





* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00       ` David Starner
@ 2000-02-01  0:00         ` Brian Rogoff
  2000-02-01  0:00           ` Hyman Rosen
  0 siblings, 1 reply; 33+ messages in thread
From: Brian Rogoff @ 2000-02-01  0:00 UTC (permalink / raw)


On 1 Feb 2000, David Starner wrote:
> On 01 Feb 2000 13:16:14 -0500, Hyman Rosen <hymie@prolifics.com> wrote:
> >By the way, the normal C/C++ style for handling EOF is to have the
> >return type of the character reader be such that it can hold any
> >value of the character set, plus an out-of-band value representing
> >EOF. The usual is '#define EOF -1' and 'int getchar()'.
> 
> Which means that c: Character; c := getchar(); is illegal (at least
> in Ada; in C it would get silently truncated.) It's a major pain
> in C, and a well known source of bugs. How about making it like
> procedure GetChar (EOF: in out boolean; Char: in out character);?

There are a few other approaches I can think of to this issue:

(1) Exceptions: raise a Not_Found when input is exhausted. Some people 
    hate this because "Exceptions are only for error handling, not
    control flow!". OCaml (and SML too I think) use exceptions for this, 
    and Ada sometimes does (try reading a file stream without using
    File_Type...)

(2) Provide a query on the sequence, like in Java, so you have code like 

    while Has_More_Elements(Seq) loop 
        Char := Get_Next_Element(Seq);
        ...
    end loop;

    I find this very readable.

(3) Provide an option type like in (OCa|S)ML which wraps returned elements 
    and forces the reader to unwrap them, like this 

    type Option_T is (Some, None);

    generic
        type Element_T is private;
    package Options is
        --  The default discriminant lets a variable be declared
        --  unconstrained, so Elem below can switch between Some and None.
        type Optional_T (Option : Option_T := None) is record
            case Option is
                when Some =>
                    Data : Element_T;
                when None =>
                    null;
            end case;
        end record;
    end Options;

    loop
        Elem := Get_Next_Element(Seq);
        case Elem.Option is
            when Some => ...
            when None => exit;
        end case;
    end loop;

This is too inefficient for reading chars and is very verbose in Ada; much
less so in ML.
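
For anyone who wants to try approach (3), here is one way it might be
filled out into a complete program. The input string, the Char_Options
instantiation and the parameterless Get_Next_Element are assumptions
made for the sketch; the discriminant is given a default so that Elem
can be reassigned.

```ada
with Ada.Text_IO;

procedure Option_Demo is

   type Option_T is (Some, None);

   generic
      type Element_T is private;
   package Options is
      type Optional_T (Option : Option_T := None) is record
         case Option is
            when Some =>
               Data : Element_T;
            when None =>
               null;
         end case;
      end record;
   end Options;

   package Char_Options is new Options (Character);

   Input : constant String := "abc";   --  hypothetical input source
   Next  : Positive := Input'First;
   Elem  : Char_Options.Optional_T;

   --  Wrap each character in Some; report exhaustion as None.
   function Get_Next_Element return Char_Options.Optional_T is
   begin
      if Next > Input'Last then
         return (Option => None);
      end if;
      Next := Next + 1;
      return (Option => Some, Data => Input (Next - 1));
   end Get_Next_Element;

begin
   loop
      Elem := Get_Next_Element;
      case Elem.Option is
         when Some => Ada.Text_IO.Put (Elem.Data);
         when None => exit;
      end case;
   end loop;
   Ada.Text_IO.New_Line;
end Option_Demo;
```

Reading Elem.Data while the discriminant is None raises
Constraint_Error, which is what forces the reader to unwrap.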

-- Brian







* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00   ` Ted Dennison
@ 2000-02-01  0:00     ` Hyman Rosen
  2000-02-01  0:00       ` David Starner
                         ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-01  0:00 UTC (permalink / raw)

Ted Dennison <dennison@telepath.com> writes:
> Proper binary support is not in OT because it has just never come up
> before. But as you can see, it could be modified fairly easily to
> support parsing binaries. But using a sentinel character for the end of
> file has always seemed like a nice simplification. So what are the uses
> of parsing binaries? I kinda thought that binaries are, by their very
> nature, already parsed.

Well, at one point I was writing code to parse Adobe PDF files.
They have a binary format, where arbitrary 8-bit bytes can appear,
and a structure which I think lends itself well to syntax-oriented
parsing.

In general, I like to avoid arbitrary restrictions in tools. Before
GNU, most classic UNIX utilities had arbitrary limits, especially
on line size. This led to unexpected and sometimes silent breakage
when the tools were fed files with lines which were too large. And
the tool reporting the problem isn't of much help, when I still have
that file I need to process and the tool won't work.

By the way, the normal C/C++ style for handling EOF is to have the
return type of the character reader be such that it can hold any
value of the character set, plus an out-of-band value representing
EOF. The usual is '#define EOF -1' and 'int getchar()'.
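
Transcribed into Ada for comparison, the same idea might look like the
sketch below. Char_Or_EOF, EOF and Get_Char are illustrative names, and
the input is an in-memory string rather than a real stream.

```ada
with Ada.Text_IO;

procedure Out_Of_Band_Demo is

   --  One value more than Character'Pos can produce, mirroring an
   --  int-returning reader with EOF defined as -1.
   subtype Char_Or_EOF is Integer range -1 .. Character'Pos (Character'Last);
   EOF : constant Char_Or_EOF := -1;

   --  Includes EOT (4) and 255 to show that both pass through as data.
   Input : constant String := "hi" & Character'Val (4) & Character'Val (255);
   Next  : Positive := Input'First;
   Code  : Char_Or_EOF;
   Char  : Character;

   function Get_Char return Char_Or_EOF is
   begin
      if Next > Input'Last then
         return EOF;
      end if;
      Next := Next + 1;
      return Character'Pos (Input (Next - 1));
   end Get_Char;

begin
   loop
      Code := Get_Char;
      exit when Code = EOF;
      --  Only after the EOF test is the conversion back safe.
      Char := Character'Val (Code);
      Ada.Text_IO.Put (Integer'Image (Character'Pos (Char)));
   end loop;
   Ada.Text_IO.New_Line;
end Out_Of_Band_Demo;
```

The result cannot be assigned to a Character without an explicit
conversion, so the EOF test is hard to forget by accident.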


* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00         ` Brian Rogoff
@ 2000-02-01  0:00           ` Hyman Rosen
  2000-02-01  0:00             ` Brian Rogoff
                               ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-01  0:00 UTC (permalink / raw)

Brian Rogoff <bpr@shell5.ba.best.com> writes:
> (1) Exceptions: raise a Not_Found when input is exhausted. Some people 
>     hate this because "Exceptions are only for error handling, not
>     control flow!". OCaml (and SML too I think) use exceptions for this, 
>     and Ada sometimes does (try reading a file stream without using
>     File_Type...)

I don't like this much.
Exceptions are for error handling, not control flow :-)

> (2) Provide a query on the sequence, like in Java, so you have code like 
>     while Has_More_Elements(Seq) loop 
>         Char := Get_Next_Element(Seq);
>         ...
>     end loop;
>     I find this very readable.

Unfortunately, when it comes to input, it is impossible on most systems
to divorce a test for end of input from the attempt to read the input.
This is the classic Pascal file input problem that made I/O in that
language so despised.

> (3) Provide an option type like in (OCa|S)ML which wraps returned elements 
>     and forces the reader to unwrap them, like this

This is the integer/character thing dressed up in high-falutin' clothes.
It probably adds more overhead than people would want. But it's a fine
technique.

I think the approach best suited for Ada is a procedure with two out
parameters, a boolean for end-of-file, and a character for the data.


* Re: Announce: OpenToken 2.0 released
  2000-02-02  0:00             ` Vladimir Olensky
@ 2000-02-01  0:00               ` Hyman Rosen
  0 siblings, 0 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-01  0:00 UTC (permalink / raw)

"Vladimir Olensky" <vladimir_olensky@yahoo.com> writes:
> If a language has a well-defined and well-constructed exception
> mechanism without much overhead, then there is nothing
> wrong with using exceptions as condition signals or events.

Because of the way exceptions work, unwinding the stack and calling
finalizers on controlled objects along the way, it's rarely the case
that they are implemented "without much overhead". I understand that
some Ada compilers will watch for exceptions which are thrown and
caught locally, and turn them into efficient code, but that's not
something to rely on for portability. The general implementation
strategy for exceptions is to make their use as cheap as possible
as long as they are not actually thrown, but to allow considerable
overhead once they are thrown.
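
For reference, the exception-based style under discussion looks roughly
like the sketch below, with the raise and the handler in the same
program unit. Not_Found follows the name used earlier in the thread;
the input string and Get_Next_Element are assumptions. How cheap the
raise turns out to be depends on the compiler.

```ada
with Ada.Text_IO;

procedure Exception_Demo is

   Not_Found : exception;

   Input : constant String := "abc";   --  hypothetical input source
   Next  : Positive := Input'First;
   Count : Natural := 0;

   --  Return the next character, or raise Not_Found when none is left.
   function Get_Next_Element return Character is
   begin
      if Next > Input'Last then
         raise Not_Found;
      end if;
      Next := Next + 1;
      return Input (Next - 1);
   end Get_Next_Element;

begin
   begin
      loop
         Ada.Text_IO.Put (Get_Next_Element);
         Count := Count + 1;
      end loop;
   exception
      when Not_Found =>
         null;   --  normal end of input, not an error
   end;
   Ada.Text_IO.New_Line;
   Ada.Text_IO.Put_Line ("Characters read:" & Natural'Image (Count));
end Exception_Demo;
```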


* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00           ` Hyman Rosen
  2000-02-01  0:00             ` Brian Rogoff
@ 2000-02-02  0:00             ` Vladimir Olensky
  2000-02-01  0:00               ` Hyman Rosen
  2000-02-02  0:00             ` Jeff Carter
  2 siblings, 1 reply; 33+ messages in thread
From: Vladimir Olensky @ 2000-02-02  0:00 UTC (permalink / raw)

Hyman Rosen wrote in message ...
>Brian Rogoff <bpr@shell5.ba.best.com> writes:
>> (1) Exceptions: raise a Not_Found when input is exhausted. Some people
>>     hate this because "Exceptions are only for error handling, not
>>     control flow!". OCaml (and SML too I think) use exceptions for this,
>>     and Ada sometimes does (try reading a file stream without using
>>     File_Type...)
>
>I don't like this much.
>Exceptions are for error handling, not control flow :-)

It seems to me that this is a somewhat narrow view.

   More generally, exceptions can be viewed as a mechanism that
lets the user signal that some condition holds, and lets that
signal be handled at a place that is not known in advance.
   Or an exception may be viewed as a program event. As in an
event-driven system, we can choose the level/scope at which the
event is handled.

If a language has a well-defined and well-constructed exception
mechanism without much overhead, then there is nothing
wrong with using exceptions as condition signals or events.

Regards,
Vladimir Olensky


* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00           ` Hyman Rosen
  2000-02-01  0:00             ` Brian Rogoff
  2000-02-02  0:00             ` Vladimir Olensky
@ 2000-02-02  0:00             ` Jeff Carter
  2 siblings, 0 replies; 33+ messages in thread
From: Jeff Carter @ 2000-02-02  0:00 UTC (permalink / raw)


Hyman Rosen wrote:
> 
> Brian Rogoff <bpr@shell5.ba.best.com> writes:
> > (1) Exceptions: raise a Not_Found when input is exhausted. Some people
> >     hate this because "Exceptions are only for error handling, not
> >     control flow!". OCaml (and SML too I think) use exceptions for this,
> >     and Ada sometimes does (try reading a file stream without using
> >     File_Type...)
> 
> I don't like this much.
> Exceptions are for error handling, not control flow :-)

The Ada design team expressed a preference for shorter names when
possible ("task" rather than "process"), so why didn't they use "error"
instead of "exception"? The answer is that exceptions are for handling
exceptional situations; not all exceptional situations are errors.

-- 
Jeff Carter
"We call your door-opening request a silly thing."
Monty Python & the Holy Grail





* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00     ` Hyman Rosen
  2000-02-01  0:00       ` David Starner
@ 2000-02-02  0:00       ` Ted Dennison
  2000-02-04  0:00         ` Ted Dennison
  2000-02-04  0:00       ` Florian Weimer
  2 siblings, 1 reply; 33+ messages in thread
From: Ted Dennison @ 2000-02-02  0:00 UTC (permalink / raw)


In article <t790146b69.fsf@calumny.jyacc.com>,
  Hyman Rosen <hymie@prolifics.com> wrote:
> Well, at one point I was writing code to parse Adobe PDF files.
> They have a binary format, where arbitrary 8-bit bytes can appear,
> and a structure which I think lends itself well to syntax-oriented
> parsing.

You along with some emailers have convinced me. I'll make the change to
the analyzer I mentioned in the previous message. That should be
sufficient to allow binaries to be parsed.

--
T.E.D.

http://www.telepath.com/~dennison/Ted/TED.html







* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00             ` Brian Rogoff
@ 2000-02-02  0:00               ` Hyman Rosen
  0 siblings, 0 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-02  0:00 UTC (permalink / raw)


Brian Rogoff <bpr@shell5.ba.best.com> writes:
> True, though the high falutin thing is more general and much less prone to 
> error since it expresses the intent clearly. It's also easily expressible
> in C++ (your favorite language?) and other languages which have some form
> of parametric polymorphism and variant types. I suppose you can do it in
> Eiffel too but faking variants (tagged unions) with classes is an extra
> level of ugliness IMO. 

Yup (C++). I've seen it referred to as "Fallible<T>". I've also seen
an amusing variant which forces you to test error return codes from
functions. The function returns a "MustRead<T>" object, which will
throw an exception if it is destructed before the value it holds is
extracted.





* Re: Announce: OpenToken 2.0 released
  2000-02-02  0:00       ` Ted Dennison
@ 2000-02-04  0:00         ` Ted Dennison
  2000-02-05  0:00           ` Ehud Lamm
  0 siblings, 1 reply; 33+ messages in thread
From: Ted Dennison @ 2000-02-04  0:00 UTC (permalink / raw)

In article <879fc4$eod$1@nnrp1.deja.com>,
  Ted Dennison <dennison@telepath.com> wrote:
> In article <t790146b69.fsf@calumny.jyacc.com>,
>   Hyman Rosen <hymie@prolifics.com> wrote:
> > Well, at one point I was writing code to parse Adobe PDF files.
> > They have a binary format, where arbitrary 8-bit bytes can appear,
> > and a structure which I think lends itself well to syntax-oriented
> > parsing.
>
> You along with some emailers have convinced me. I'll make the change
> to the analyzer I mentioned in the previous message. That should be
> sufficient to allow binaries to be parsed.

All references to EOF_Character have now been removed from the analyzer.
This has been integrated and tested. Thus the next version of OpenToken
should be suitable for use in parsing binaries.

As for the protracted discussion here on the best method for determining
when the end of the text has been reached, I sort of cheated. I left
that problem to the implementors of the text feeders.

--
T.E.D.

http://www.telepath.com/~dennison/Ted/TED.html



* Re: Announce: OpenToken 2.0 released
  2000-02-01  0:00     ` Hyman Rosen
  2000-02-01  0:00       ` David Starner
  2000-02-02  0:00       ` Ted Dennison
@ 2000-02-04  0:00       ` Florian Weimer
  2000-02-07  0:00         ` Hyman Rosen
  2 siblings, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2000-02-04  0:00 UTC (permalink / raw)


Hyman Rosen <hymie@prolifics.com> writes:

> By the way, the normal C/C++ style for handling EOF is to have the
> return type of the character reader be such that it can hold any
> value of the character set, plus an out-of-band value representing
> EOF. The usual is '#define EOF -1' and 'int getchar()'.

Unfortunately, this becomes inband signalling on some platforms (where
sizeof(char) == sizeof(int)).  Broken as designed.





* Re: Announce: OpenToken 2.0 released
  2000-02-04  0:00         ` Ted Dennison
@ 2000-02-05  0:00           ` Ehud Lamm
  0 siblings, 0 replies; 33+ messages in thread
From: Ehud Lamm @ 2000-02-05  0:00 UTC (permalink / raw)


On Fri, 4 Feb 2000, Ted Dennison wrote:

|As for the protracted discussion here on the best method for determining
|when the end of the text has been reached, I sort of cheated. I left
|that problem to the implementors of the text feeders.
|

It is the best kind of "cheating." Good software always factors out those
things that may need to change.

Ehud Lamm mslamm@mscc.huji.ac.il
http://purl.oclc.org/NET/ehudlamm <== My home on the web 
Check it out and subscribe to the E-List- for interesting essays and more!








* Re: Announce: OpenToken 2.0 released
  2000-02-04  0:00       ` Florian Weimer
@ 2000-02-07  0:00         ` Hyman Rosen
  2000-02-07  0:00           ` Florian Weimer
  2000-02-09  0:00           ` Robert A Duff
  0 siblings, 2 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-07  0:00 UTC (permalink / raw)

Florian Weimer <someone@deneb.cygnus.argh.org> writes:
> Unfortunately, this becomes inband signalling on some platforms (where
> sizeof(char) == sizeof(int)).  Broken as designed.

No. It's only in-band if on those platforms, input characters do in fact
range over the full set of representable values of a char. Just because
the compiler internally represents char with the same range as int does
not mean that stream input is reading objects of that range from its
sources. Input could be octets, or 16-bit unicode chars, or 32-bit other
representation of input. It is usually the case that the wider character
formats have error values which explicitly represent no character within
the set, so that an out-of-band value may be supplied that way.


* Re: Announce: OpenToken 2.0 released
  2000-02-07  0:00         ` Hyman Rosen
@ 2000-02-07  0:00           ` Florian Weimer
  2000-02-07  0:00             ` Hyman Rosen
  2000-02-09  0:00           ` Robert A Duff
  1 sibling, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2000-02-07  0:00 UTC (permalink / raw)


Hyman Rosen <hymie@prolifics.com> writes:

> No. It's only in-band if on those platforms, input characters do in fact
> range over the full set of representable values of a char. Just because
> the compiler internally represents char with the same range as int does
> not mean that stream input is reading objects of that range from its
> sources. Input could be octets, or 16-bit unicode chars, or 32-bit other
> representation of input.

You are right for text files.  For binary files, there has to be
a bijection between the internal and external representation of
characters.  This means that the value EOF can occur in a binary stream
if sizeof(char) equals sizeof(int).





* Re: Announce: OpenToken 2.0 released
  2000-02-07  0:00           ` Florian Weimer
@ 2000-02-07  0:00             ` Hyman Rosen
  0 siblings, 0 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-07  0:00 UTC (permalink / raw)

Florian Weimer <someone@deneb.cygnus.argh.org> writes:
> You are right for text files.  For binary files, there has to be
> a bijection between the internal and external representation of
> characters.  This means that the value EOF can occur in a binary stream
> if sizeof(char) equals sizeof(int).

Agreed. When you are coding for such a case in C or C++, you must then use
one of the several input methods which separate content and status. In C,
you can use fread, or test with feof. C++ additionally allows the use of
istream.get(char &), which returns a status value.
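
On the Ada side, a comparable separation of content and status is
available in Ada.Streams.Stream_IO, where Read reports how much of the
buffer it filled through a separate Last parameter. The sketch below
assumes it may create and delete a scratch file in the current
directory; the file name is made up for the example.

```ada
with Ada.Streams.Stream_IO;
with Ada.Text_IO;

procedure Binary_Read_Demo is
   use Ada.Streams;

   Name : constant String := "binary_read_demo.tmp";   --  scratch file

   --  Six bytes, including 16#04# (EOT) and 16#FF#.
   Data : constant Stream_Element_Array (1 .. 6) :=
     (16#41#, 16#04#, 16#FF#, 16#00#, 16#04#, 16#42#);

   File   : Stream_IO.File_Type;
   Buffer : Stream_Element_Array (1 .. 4);
   Last   : Stream_Element_Offset;
   Total  : Natural := 0;
begin
   Stream_IO.Create (File, Stream_IO.Out_File, Name);
   Stream_IO.Write (File, Data);
   Stream_IO.Close (File);

   Stream_IO.Open (File, Stream_IO.In_File, Name);
   while not Stream_IO.End_Of_File (File) loop
      --  Last says how much of Buffer was filled; the bytes themselves
      --  are never inspected to detect the end.
      Stream_IO.Read (File, Buffer, Last);
      Total := Total + Natural (Last - Buffer'First + 1);
   end loop;
   Stream_IO.Delete (File);

   Ada.Text_IO.Put_Line ("Bytes read:" & Natural'Image (Total));
end Binary_Read_Demo;
```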


* Re: Announce: OpenToken 2.0 released
  2000-02-07  0:00         ` Hyman Rosen
  2000-02-07  0:00           ` Florian Weimer
@ 2000-02-09  0:00           ` Robert A Duff
  2000-02-09  0:00             ` Hyman Rosen
  1 sibling, 1 reply; 33+ messages in thread
From: Robert A Duff @ 2000-02-09  0:00 UTC (permalink / raw)


Hyman Rosen <hymie@prolifics.com> writes:

> Florian Weimer <someone@deneb.cygnus.argh.org> writes:
> > Unfortunately, this becomes inband signalling on some platforms (where
> > sizeof(char) == sizeof(int)).  Broken as designed.
> 
> No. It's only in-band if on those platforms, input characters do in fact
> range over the full set of representable values of a char.

But "char" in C is just the smallest integer type.  There's no reason to
believe that reading a "char" has anything to do with characters/text,
despite its misleading name.

- Bob





* Re: Announce: OpenToken 2.0 released
  2000-02-09  0:00             ` Hyman Rosen
@ 2000-02-09  0:00               ` Larry Kilgallen
  2000-02-17  0:00               ` Robert A Duff
  1 sibling, 0 replies; 33+ messages in thread
From: Larry Kilgallen @ 2000-02-09  0:00 UTC (permalink / raw)


In article <t7bt5qrsgj.fsf@calumny.jyacc.com>, Hyman Rosen <hymie@prolifics.com> writes:
> Robert A Duff <bobduff@world.std.com> writes:
>> But "char" in C is just the smallest integer type.  There's no reason to
>> believe that reading a "char" has anything to do with characters/text,
>> despite its misleading name.
> 
> The C/C++ stream interface concerns itself with reading characters
> from input streams. The char type is able to hold any character
> read from a stream. This discussion is about how input functions
> can provide end-of-file notification. One way is to have an
> out-of-band value returned by the single-character reader function.
> The C way of doing this is to have the return type of that function
> be int. It is then the case that if the implementation has int and
> char the same size, and the input format is such that every possible
> bit pattern for char may be present, then this out-of-band method
> will not work, and one of the available alternate methods must be
> used, such as fread or feof. This situation is rare, and may even be
> non-existent.
> 
> Do you feel that your comment has contributed anything to this
> discussion? To me it seems to be a snide and pointless attack on C.

I have no particular use for C, but I viewed the discussion more as
a useful reminder about the dangers of in-band signalling.  Phone
companies learned that years ago, and today they use the SS7 protocol.





* Re: Announce: OpenToken 2.0 released
  2000-02-09  0:00           ` Robert A Duff
@ 2000-02-09  0:00             ` Hyman Rosen
  2000-02-09  0:00               ` Larry Kilgallen
  2000-02-17  0:00               ` Robert A Duff
  0 siblings, 2 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-09  0:00 UTC (permalink / raw)

Robert A Duff <bobduff@world.std.com> writes:
> But "char" in C is just the smallest integer type.  There's no reason to
> believe that reading a "char" has anything to do with characters/text,
> despite its misleading name.

The C/C++ stream interface concerns itself with reading characters
from input streams. The char type is able to hold any character
read from a stream. This discussion is about how input functions
can provide end-of-file notification. One way is to have an
out-of-band value returned by the single-character reader function.
The C way of doing this is to have the return type of that function
be int. It is then the case that if the implementation has int and
char the same size, and the input format is such that every possible
bit pattern for char may be present, then this out-of-band method
will not work, and one of the available alternate methods must be
used, such as fread or feof. This situation is rare, and may even be
non-existent.

Do you feel that your comment has contributed anything to this
discussion? To me it seems to be a snide and pointless attack on C.


* Re: Announce: OpenToken 2.0 released
  2000-02-09  0:00             ` Hyman Rosen
  2000-02-09  0:00               ` Larry Kilgallen
@ 2000-02-17  0:00               ` Robert A Duff
  2000-02-17  0:00                 ` Hyman Rosen
  1 sibling, 1 reply; 33+ messages in thread
From: Robert A Duff @ 2000-02-17  0:00 UTC (permalink / raw)

Hyman Rosen <hymie@prolifics.com> writes:

> The C/C++ stream interface concerns itself with reading characters
> from input streams. The char type is able to hold any character
> read from a stream. This discussion is about how input functions
> can provide end-of-file notification. One way is to have an
> out-of-band value returned by the single-character reader function.
> The C way of doing this is to have the return type of that function
> be int. It is then the case that if the implementation has int and
> char the same size, and the input format is such that every possible
> bit pattern for char may be present, then this out-of-band method
> will not work, and one of the available alternate methods must be
> used, such as fread or feof. This situation is rare, and may even be
> non-existent.
> 
> Do you feel that your comment has contributed anything to this
> discussion? To me it seems to be a snide and pointless attack on C.

I'm sorry to offend you.

My point was simply that the C programming language has a design flaw
(namely, confusion between characters and integers) that contributes to
the poor design we were talking about (namely, assuming that int can
represent every char value, plus at least one more value, which is not
always the case).  In my opinion, of course.  Again, I apologize if my
criticism of these particular aspects of C has offended you.

- Bob

P.S. If it makes you feel better, I have a habit of criticizing certain
aspects of every programming language I've ever learned, including the
ones I've helped design.  ;-) ;-)


* Re: Announce: OpenToken 2.0 released
  2000-02-17  0:00                 ` Hyman Rosen
@ 2000-02-17  0:00                   ` Robert A Duff
  2000-02-17  0:00                   ` Hyman Rosen
       [not found]                   ` <88iuk2$s6d3@ftp.kvaerner.com>
  2 siblings, 0 replies; 33+ messages in thread
From: Robert A Duff @ 2000-02-17  0:00 UTC (permalink / raw)

Hyman Rosen <hymie@prolifics.com> writes:

> If you look at the subject of this thread, you will be reminded that
> it started because the author of OpenToken used exactly this approach
> in his Ada code, in a way even worse than C's approach - making a
> potentially legal character the end-of-file sentinel.

Good point.

The sentinel approach really is a good one in many cases, but C and Ada
both get in the way of doing that right (i.e., making sure the sentinel
doesn't conflict with anything).  Except for pointers, where null is
safe to use.

>... It's not at all
> unnatural to want to use this kind of approach.

Agreed.  You just have to make sure you do it right.

> C does not "confuse" characters and integers. It allows arithmetic on
> chars, ...

To me, that is a confusion.  For characters in real life (the ones you
see on your screen or on paper) there is no natural meaningful addition.
There is comparison (we've all memorized the alphabet in order).  But no
addition.  Or multiplication!

- Bob


* Re: Announce: OpenToken 2.0 released
  2000-02-17  0:00                   ` Hyman Rosen
@ 2000-02-17  0:00                     ` Robert A Duff
  2000-02-17  0:00                       ` Hyman Rosen
  0 siblings, 1 reply; 33+ messages in thread
From: Robert A Duff @ 2000-02-17  0:00 UTC (permalink / raw)

Hyman Rosen <hymie@prolifics.com> writes:

> Apropos of this, I recall more than one thread here on c.l.a where
> Robert Dewar says that one can legitimately make "ordinary platform"
> assumptions when programming in Ada. That is, one does not read the
> A.R.M. or treat the compiler as if they are the enemy and thwarting
> you at every turn.

True, but it's not always completely clear what assumptions are
reasonable.  For example, you might reasonably decide that your program
will be portable only to 8-bit-byte-addressable machines, whereas
standard libraries perhaps ought to be more portable than that.

If the *hardware* insists that the smallest integer type is 32 bits,
then is it unreasonable for the C implementation to read 32-bit
quantities?  Surely, if char and int are the same size (say, 32 bits), I
should be allowed to write out the number -1, or 2**31-1, or 1_000_000!
And surely if I can write it out, I should be able to read it back in.

>... Conceivably a compiler for any language can be
> implemented to stick to the letter of its Standard but still be as
> useless as possible, but then you have the choice of not using it.

- Bob


* Re: Announce: OpenToken 2.0 released
  2000-02-17  0:00               ` Robert A Duff
@ 2000-02-17  0:00                 ` Hyman Rosen
  2000-02-17  0:00                   ` Robert A Duff
                                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-17  0:00 UTC (permalink / raw)

Robert A Duff <bobduff@world.std.com> writes:
> My point was simply that the C programming language has a design flaw
> (namely, confusion between characters and integers) that contributes to
> the poor design we were talking about (namely, assuming that int can
> represent every char value, plus at least one more value, which is not
> always the case).

If you look at the subject of this thread, you will be reminded that
it started because the author of OpenToken used exactly this approach
in his Ada code, in a way even worse than C's approach - making a
potentially legal character the end-of-file sentinel. It's not at all
unnatural to want to use this kind of approach. The only "design flaw"
in the C approach is that it is possible to implement integers and
chars to have the same size, and to have streams that then allow the
full range of at least 32-bit values to appear on their input. I would
guess that the number of such adversarial platforms is small, and
would not be surprised at all if that number were zero. (I know that
the number of platforms where sizeof(int) == sizeof(char) is non-zero,
but do those platforms have 32-bit input from external sources?)

C does not "confuse" characters and integers. It allows arithmetic on
chars, and automatic conversions among arithmetic types. It is certainly
the case that this can lead to truncation errors and similar surprises
that would not occur in Ada. Many modern C compilers attempt to
compensate for this by generating warnings when truncation would occur,
but that's something of a band-aid.
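
One familiar instance of such a truncation surprise, offered here only as an illustration and not taken from the thread, is storing the int result of getc in a character-sized variable. The sketch below assumes an implementation where EOF is -1 and where converting an out-of-range value to signed char wraps around (the conversion is implementation-defined in C). Under those assumptions the data byte 0xFF becomes indistinguishable from EOF after narrowing. The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: narrow a getc-style result to signed char and
   report whether it now compares equal to EOF. */
int looks_like_eof_after_narrowing(int c)
{
    signed char narrowed;

    /* getc yields either EOF or an unsigned char value converted to int. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));

    /* Implementation-defined for values above SCHAR_MAX; commonly wraps. */
    narrowed = (signed char)c;

    /* `narrowed` is promoted back to int for the comparison. */
    return narrowed == EOF;
}
```

Kept as an int, 0xFF is 255 and compares unequal to EOF; the collision appears only after the narrowing conversion.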


* Re: Announce: OpenToken 2.0 released
  2000-02-17  0:00                 ` Hyman Rosen
  2000-02-17  0:00                   ` Robert A Duff
@ 2000-02-17  0:00                   ` Hyman Rosen
  2000-02-17  0:00                     ` Robert A Duff
       [not found]                   ` <88iuk2$s6d3@ftp.kvaerner.com>
  2 siblings, 1 reply; 33+ messages in thread
From: Hyman Rosen @ 2000-02-17  0:00 UTC (permalink / raw)

Hyman Rosen <hymie@prolifics.com> writes:
> I would guess that the number of such adversarial platforms is
> small, and would not be surprised at all if that number were zero.

Apropos of this, I recall more than one thread here on c.l.a where
Robert Dewar says that one can legitimately make "ordinary platform"
assumptions when programming in Ada. That is, one does not read the
A.R.M. or treat the compiler as if they are the enemy and thwarting
you at every turn. Conceivably a compiler for any language can be
implemented to stick to the letter of its Standard but still be as
useless as possible, but then you have the choice of not using it.


* Re: Announce: OpenToken 2.0 released
  2000-02-17  0:00                     ` Robert A Duff
@ 2000-02-17  0:00                       ` Hyman Rosen
  0 siblings, 0 replies; 33+ messages in thread
From: Hyman Rosen @ 2000-02-17  0:00 UTC (permalink / raw)

Robert A Duff <bobduff@world.std.com> writes:
> If the *hardware* insists that the smallest integer type is 32 bits,
> then is it unreasonable for the C implementation to read 32-bit
> quantities?  Surely, if char and int are the same size (say, 32 bits), I
> should be allowed to write out the number -1, or 2**31-1, or 1_000_000!
> And surely if I can write it out, I should be able to read it back in.

But for now, and for the foreseeable future, it's extremely unlikely
that file contents will have a granularity no smaller than 32 bits.
That means that stream input is still going to be reading octets,
whether or not the processor likes to deal with quantities 32, or 64,
bits at a time. I'm not saying that the situation you are describing
could never, in principle, occur. I'm just saying that I don't think
it will. If it does, then on that platform, you cannot write your
input code as 'int c; while ((c = getchar()) != EOF) { ... }', but C
already has other mechanisms to deal with this case. But I don't
think that it's necessary for the ordinary C programmer to worry
about portability of everyday code to such an oddball platform.
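
For readers who want to see one of those "other mechanisms" next to the everyday loop, here is a minimal sketch. It keeps the getc result in an int and, when the value equals EOF, consults feof and ferror before stopping, so it would keep going on a platform where a data value happened to equal EOF. The name copy_chars is an assumption for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy `in` to `out` one character at a time.
   Returns the number of characters copied, or -1 on an I/O error. */
long copy_chars(FILE *in, FILE *out)
{
    long copied = 0;

    assert(in != NULL && out != NULL);
    for (;;) {
        int c = getc(in);   /* int, not char, so EOF stays distinguishable */

        /* Stop only when the stream itself reports end of file or error;
           a data value that merely equals EOF would fall through. */
        if (c == EOF && (feof(in) || ferror(in))) {
            break;
        }
        putc(c, out);
        copied++;
    }
    return (ferror(in) || ferror(out)) ? -1L : copied;
}
```

On ordinary platforms the extra feof/ferror test never changes the outcome, which is why the shorter loop quoted above is the usual form.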


* [OT] C and in-band signalling (was: Re: Announce: OpenToken 2.0 released)
       [not found]                   ` <88iuk2$s6d3@ftp.kvaerner.com>
@ 2000-03-05  0:00                     ` Florian Weimer
  2000-03-06  0:00                       ` Tarjei T. Jensen
  0 siblings, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2000-03-05  0:00 UTC (permalink / raw)


"Tarjei T. Jensen" <tarjei.jensen@kvaerner.com> writes:

> Hyman Rosen wrote
> >(I know that
> >the number of platforms where sizeof(int) == sizeof(char) is non-zero,
> >but do those platforms have 32-bit input from external sources?)
> 
> Is that a problem on any platform? EOF is not a fixed value.

Yes, it is a preprocessor macro "which expands to an integer constant
expression, with type int and a negative value".

> It is traditionally -1. If, on a platform where sizeof(int) ==
> sizeof(char) there is a convention that characters are positive
> (assuming the sizeof(int) > 8) then a convention which uses negative
> numbers for signalling would still work.

No, for binary files, a conforming implementation has to be able to store
and retrieve all possible char type values, including the negative ones.
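
The two facts in play can be checked against the limits an implementation publishes. The sketch below asserts at compile time that EOF is negative and reports whether unsigned char values could exceed the range of int, which is the case where the in-band collision becomes possible. It assumes a C11 compiler for static_assert; the function name eof_can_collide is made up for this example.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stdio.h>

/* EOF must expand to a negative integer constant expression of type int. */
static_assert(EOF < 0, "EOF must be negative");

/* Hypothetical helper: returns 1 if some unsigned char value cannot be
   represented as a non-negative int, so that a value returned by getc
   could coincide with EOF; returns 0 otherwise. */
int eof_can_collide(void)
{
    return UCHAR_MAX > INT_MAX;
}
```

On platforms with 8-bit char and a wider int the function returns 0; it can only return 1 where char and int have the same width.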





* Re: [OT] C and in-band signalling (was: Re: Announce: OpenToken 2.0 released)
  2000-03-06  0:00                       ` Tarjei T. Jensen
@ 2000-03-06  0:00                         ` Keith Thompson
  0 siblings, 0 replies; 33+ messages in thread
From: Keith Thompson @ 2000-03-06  0:00 UTC (permalink / raw)


"Tarjei T. Jensen" <tarjei.jensen@kvaerner.com> writes:
[...]
> You would have the same problem on a platform with a signed char type. As far
> as I know EOF should not be used with binary data. That kind of data is best
> handled with read() and write() or a functional equivalent.

The C input routines are defined in terms of getc(), which returns the
next byte as an *unsigned* char converted to int.  As long as int is
big enough to hold all possible unsigned char values, plus EOF (which
is typically -1), you can safely use EOF with binary data.
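
A small sketch of that behaviour, assuming a hosted implementation where tmpfile succeeds: a byte is written to a temporary binary stream and read back with getc. Because getc returns the byte as an unsigned char converted to int, the byte 0xFF comes back as 255 and not as a negative value. The function name round_trip_byte is hypothetical, and assert stands in for real error handling.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write one byte to a temporary binary stream
   and return what getc reports when reading it back. */
int round_trip_byte(unsigned char byte)
{
    FILE *fp = tmpfile();   /* opened in binary update mode */
    int c;

    assert(fp != NULL);     /* brevity only; real code should handle failure */
    putc(byte, fp);
    rewind(fp);
    c = getc(fp);           /* unsigned char value converted to int */
    fclose(fp);
    return c;
}
```

Since EOF is negative and the returned value is not, the two cannot be confused as long as int is wider than char.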

Yes, this is *way* off-topic.

-- 
Keith Thompson (The_Other_Keith) kst@cts.com  <http://www.ghoti.net/~kst>
San Diego Supercomputer Center           <*>  <http://www.sdsc.edu/~kst>
Welcome to the last year of the 20th century.





* Re: [OT] C and in-band signalling (was: Re: Announce: OpenToken 2.0 released)
  2000-03-05  0:00                     ` [OT] C and in-band signalling (was: Re: Announce: OpenToken 2.0 released) Florian Weimer
@ 2000-03-06  0:00                       ` Tarjei T. Jensen
  2000-03-06  0:00                         ` Keith Thompson
  0 siblings, 1 reply; 33+ messages in thread
From: Tarjei T. Jensen @ 2000-03-06  0:00 UTC (permalink / raw)



Florian Weimer wrote
>"Tarjei T. Jensen" writes:
>> It is traditionally -1. If, on a platform where sizeof(int) ==
>> sizeof(char) there is a convention that characters are positive
>> (assuming the sizeof(int) > 8) then a convention which uses negative
>> numbers for signalling would still work.
>
>No, for binary files, a conforming implementation has to be able to store
>and retrieve all possible char type values, including the negative ones.

You would have the same problem on a platform with a signed char type. As far
as I know EOF should not be used with binary data. That kind of data is best
handled with read() and write() or a functional equivalent.

Greetings,








end of thread

Thread overview: 33+ messages
2000-01-27  0:00 Announce: OpenToken 2.0 released Ted Dennison
2000-01-28  0:00 ` Jürgen Pfeifer
2000-01-28  0:00   ` Ted Dennison
2000-01-31  0:00 ` Hyman Rosen
2000-02-01  0:00   ` Ted Dennison
2000-02-01  0:00     ` Hyman Rosen
2000-02-01  0:00       ` David Starner
2000-02-01  0:00         ` Brian Rogoff
2000-02-01  0:00           ` Hyman Rosen
2000-02-01  0:00             ` Brian Rogoff
2000-02-02  0:00               ` Hyman Rosen
2000-02-02  0:00             ` Vladimir Olensky
2000-02-01  0:00               ` Hyman Rosen
2000-02-02  0:00             ` Jeff Carter
2000-02-02  0:00       ` Ted Dennison
2000-02-04  0:00         ` Ted Dennison
2000-02-05  0:00           ` Ehud Lamm
2000-02-04  0:00       ` Florian Weimer
2000-02-07  0:00         ` Hyman Rosen
2000-02-07  0:00           ` Florian Weimer
2000-02-07  0:00             ` Hyman Rosen
2000-02-09  0:00           ` Robert A Duff
2000-02-09  0:00             ` Hyman Rosen
2000-02-09  0:00               ` Larry Kilgallen
2000-02-17  0:00               ` Robert A Duff
2000-02-17  0:00                 ` Hyman Rosen
2000-02-17  0:00                   ` Robert A Duff
2000-02-17  0:00                   ` Hyman Rosen
2000-02-17  0:00                     ` Robert A Duff
2000-02-17  0:00                       ` Hyman Rosen
     [not found]                   ` <88iuk2$s6d3@ftp.kvaerner.com>
2000-03-05  0:00                     ` [OT] C and in-band signalling (was: Re: Announce: OpenToken 2.0 released) Florian Weimer
2000-03-06  0:00                       ` Tarjei T. Jensen
2000-03-06  0:00                         ` Keith Thompson
