Reading "normal" text files with Wide_Text_IO in GNAT

comp.lang.ada

* Reading "normal" text files with Wide_Text_IO in GNAT
@ 2006-11-30 19:54 Adam Beneschan
  2006-12-03  1:22 ` Björn Persson
  0 siblings, 1 reply; 8+ messages in thread
From: Adam Beneschan @ 2006-11-30 19:54 UTC (permalink / raw)


I was looking in the GNAT reference manual at the description of the
WCEM Form parameter to Wide_Text_IO.Open.  It describes the different
ways that wide characters can be represented in text files that
Wide_Text_IO can interpret.

However, at first glance, I didn't see a way to get Wide_Text_IO to
read a UCS-1 text file.  This is the encoding where each byte in the
range  16#00#..16#FF# represents a character in the range
Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
is no way to represent wide characters from 16#0100# to 16#FFFF#.  In
other words, a boring old-fashioned 8-bit text file, maybe with Latin-1
characters or control characters in the 80..9F range.  Yes, I know that
a file like this could be read using Text_IO, but let's say that we
don't know what format the file is in until runtime.

Does GNAT's Wide_Text_IO have a way to read a file like this?
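
To make the scenario concrete, here is a minimal sketch in Python rather than Ada, so it says nothing about GNAT's actual API. It only illustrates the requirement: one reading routine, with the encoding supplied as a run-time value. The helper name `read_text` and the sample file are assumptions made up for this illustration.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file, decoding with an encoding chosen at run time."""
    # newline="" keeps line terminators as stored instead of translating them.
    with open(path, "r", encoding=encoding, newline="") as handle:
        return handle.read()


# Write some plain 8-bit bytes once, then decide how to interpret them later.
raw = bytes([0x63, 0x61, 0x66, 0xE9, 0x0A])  # "caf", byte 16#E9#, line feed
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    sample_path = tmp.name

# "latin-1" maps every byte to the character with the same code point.
as_latin1 = read_text(sample_path, "latin-1")
os.remove(sample_path)
```

The point is that the caller passes the encoding name as data, which is what a Form parameter value for plain 8-bit files would provide in Wide_Text_IO.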

                        -- thanks, Adam





* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-11-30 19:54 Reading "normal" text files with Wide_Text_IO in GNAT Adam Beneschan
@ 2006-12-03  1:22 ` Björn Persson
  2006-12-04 18:17   ` Adam Beneschan
  0 siblings, 1 reply; 8+ messages in thread
From: Björn Persson @ 2006-12-03  1:22 UTC (permalink / raw)


Adam Beneschan wrote:

> However, at first glance, I didn't see a way to get Wide_Text_IO to
> read a UCS-1 text file.

Hmm, I've never heard of UCS-1. Is such an encoding really defined?

> This is the encoding where each byte in the 
> range  16#00#..16#FF# represents a character in the range
> Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
> is no way to represent wide characters from 16#0100# to 16#FFFF#.

OK, so it's identical to ISO 8859-1.

> Does GNAT's Wide_Text_IO have a way to read a file like this?

It does indeed look like it can't. Gnat's approach to character encodings is
amazingly faulty.

Does EAstrings fill your needs? If not, would you like to join me in
finishing the implementation so we can get rid of these problems?

http://adacl.sourceforge.net/AdaBrowse/adacl-eastrings.html

-- 
Björn Persson                              PGP key A88682FD
                   omb jor ers @sv ge.
                   r o.b n.p son eri nu




* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-03  1:22 ` Björn Persson
@ 2006-12-04 18:17   ` Adam Beneschan
  2006-12-04 23:35     ` Manuel Collado
  0 siblings, 1 reply; 8+ messages in thread
From: Adam Beneschan @ 2006-12-04 18:17 UTC (permalink / raw)


Björn Persson wrote:
> Adam Beneschan wrote:
>
> > However, at first glance, I didn't see a way to get Wide_Text_IO to
> > read a UCS-1 text file.
>
> Hmm, I've never heard of UCS-1. Is such an encoding really defined?

I don't know if that's the correct name.  I have seen it referenced in
a few places.

> > This is the encoding where each byte in the
> > range  16#00#..16#FF# represents a character in the range
> > Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
> > is no way to represent wide characters from 16#0100# to 16#FFFF#.
>
> OK, so it's identical to ISO 8859-1.

Technically, I thought ISO-8859-1 was a mapping from a range of
integers to a set of characters, rather than a specification of how
characters are represented in bits in an actual file.  I could be
wrong.  The distinction gets blurry at times.


> > Does GNAT's Wide_Text_IO have a way to read a file like this?
>
> It does indeed look like it can't. Gnat's approach to character encodings is
> amazingly faulty.
>
> Does EAstrings fill your needs? If not, would you like to join me in
> finishing the implementation so we can get rid of these problems?
>
> http://adacl.sourceforge.net/AdaBrowse/adacl-eastrings.html

My question was more theoretical than anything---I was looking at that
section of the manual for other reasons, and happened to notice what
seemed like an omission.  But thanks for the pointer.  I'll take a look
at it.

                                 -- Adam





* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-04 18:17   ` Adam Beneschan
@ 2006-12-04 23:35     ` Manuel Collado
  2006-12-06 23:46       ` Björn Persson
  0 siblings, 1 reply; 8+ messages in thread
From: Manuel Collado @ 2006-12-04 23:35 UTC (permalink / raw)


Adam Beneschan wrote:
> Björn Persson wrote:
>> Adam Beneschan wrote:
>>
>>> However, at first glance, I didn't see a way to get Wide_Text_IO to
>>> read a UCS-1 text file.
>> Hmm, I've never heard of UCS-1. Is such an encoding really defined?
> 
> I don't know if that's the correct name.  I have seen it referenced in
> a few places.

To clarify things:
- Character set - mapping of characters to integers (the so called 
'codepoints')
- Character encoding - mapping of a sequence of codepoints to a sequence of 
bytes

UCS-1 means encoding each character (codepoint) as a single byte whose 
numerical value is just the codepoint. Can be used only for codepoints in 
the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit 
(and 7-bit) character sets.
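
A minimal sketch of that direct one-byte mapping, written in Python purely for illustration. It relies on Python's "latin-1" codec, which maps byte N to the character with code point N. The helper `encode_one_byte` is a hypothetical name, not part of any library.

```python
# Every possible byte value, 0..255.
all_bytes = bytes(range(256))

# Python's "latin-1" codec maps byte N to the character with code point N.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]


def encode_one_byte(text):
    """Encode text one byte per character; raises ValueError above U+00FF."""
    # bytes() rejects any integer outside 0..255, which enforces the range limit.
    return bytes(ord(ch) for ch in text)


round_trip = encode_one_byte(decoded)
```

Decoding and re-encoding all 256 byte values gives back the original bytes, while a character such as U+0100 cannot be encoded at all.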

> 
>>> This is the encoding where each byte in the
>>> range  16#00#..16#FF# represents a character in the range
>>> Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
>>> is no way to represent wide characters from 16#0100# to 16#FFFF#.

Yes, this is UCS-1.

>> OK, so it's identical to ISO 8859-1.
> 
> Technically, I thought ISO-8859-1 was a mapping from a range of
> integers to a set of characters, rather than a specification of how
> characters are represented in bits in an actual file.  I could be
> wrong.  The distinction gets blurry at times.

Quite true. Technically, ISO-8859-1 is a character set (not a character 
encoding). Usually encoded as UCS-1 (as well as a lot of other character sets).

Regrettably, the terms 'character set' and 'character encoding' are used as 
synonyms in a lot of places.

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado




* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-04 23:35     ` Manuel Collado
@ 2006-12-06 23:46       ` Björn Persson
  2006-12-07  2:02         ` Adam Beneschan
  0 siblings, 1 reply; 8+ messages in thread
From: Björn Persson @ 2006-12-06 23:46 UTC (permalink / raw)


Manuel Collado wrote:
> UCS-1 means encoding each character (codepoint) as a single byte whose
> numerical value is just the codepoint. Can be used only for codepoints in
> the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit
> (and 7-bit) character sets.

I'd still like to know where UCS-1 is defined, and by whom.
http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
but no UCS-1.

-- 
Björn Persson                              PGP key A88682FD
                   omb jor ers @sv ge.
                   r o.b n.p son eri nu




* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-06 23:46       ` Björn Persson
@ 2006-12-07  2:02         ` Adam Beneschan
  2006-12-09 20:43           ` Björn Persson
  2006-12-11 19:49           ` Manuel Collado
  0 siblings, 2 replies; 8+ messages in thread
From: Adam Beneschan @ 2006-12-07  2:02 UTC (permalink / raw)

Björn Persson wrote:
> Manuel Collado wrote:
> > UCS-1 means encoding each character (codepoint) as a single byte whose
> > numerical value is just the codepoint. Can be used only for codepoints in
> > the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit
> > (and 7-bit) character sets.
>
> I'd still like to know where UCS-1 is defined, and by whom.
> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
> but no UCS-1.

UCS-Basic may be the "official" name for what I'm talking about.
Unfortunately, I'm having trouble figuring it out.  The IANA website
you referred me to is titled "Character Sets", but some of the things
listed underneath are encoding standards (UTF-8, etc.) rather than
character sets; UCS-Basic is listed as a "subset of Unicode", however,
and Unicode is a character set (not an encoding; there are multiple
ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2).  So
this page just exemplifies the sort of confusion Manuel referred to.  A
quick Google search hasn't provided any further enlightenment on
exactly what UCS-Basic is.  Specifically, I can't tell whether it's a
character set or an encoding.

UCS-2 and UCS-4 are representations in which if an integer N maps to a
character, then that character is represented simply by a 2- or 4-byte
binary representation of N (byte ordering is an issue, though).  So it
would seem logical that UCS-1 would simply refer to a 1-byte binary
representation of a number.  That's how it seemed to me, and I did find
other references to this term, so I figured it was the correct term.
But maybe it isn't official.
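
A small sketch of that reasoning, in Python for illustration only. The function name `ucs_encode` and its parameters are hypothetical. It writes each code point as a fixed-width unsigned integer, so widths 2 and 4 correspond to the UCS-2 and UCS-4 idea described above, and width 1 is what this thread has been calling "UCS-1".

```python
def ucs_encode(text, width, byteorder="big"):
    """Encode each code point as a fixed-width unsigned integer of `width` bytes."""
    # int.to_bytes raises OverflowError when the code point does not fit.
    return b"".join(ord(ch).to_bytes(width, byteorder) for ch in text)


sample = "A\u00e9"  # code points 16#41# and 16#E9#

two_big = ucs_encode(sample, 2, "big")        # 00 41 00 E9
two_little = ucs_encode(sample, 2, "little")  # 41 00 E9 00
four_big = ucs_encode("A", 4, "big")          # 00 00 00 41
one_byte = ucs_encode(sample, 1)              # 41 E9, byte order is irrelevant
```

With a width of 1 there is nothing to reorder, which is why byte ordering only becomes an issue for the 2- and 4-byte forms.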

Sigh....

                               -- Adam


* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-07  2:02         ` Adam Beneschan
@ 2006-12-09 20:43           ` Björn Persson
  2006-12-11 19:49           ` Manuel Collado
  1 sibling, 0 replies; 8+ messages in thread
From: Björn Persson @ 2006-12-09 20:43 UTC (permalink / raw)

Adam Beneschan wrote:

> Björn Persson wrote:
>> Manuel Collado wrote:
>> > UCS-1 means encoding each character (codepoint) as a single byte whose
>> > numerical value is just the codepoint. Can be used only for codepoints
>> > in the range (0..255). UCS-1 is the natural, implicit encoding of all
>> > 8-bit (and 7-bit) character sets.
>>
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
> 
> UCS-Basic may be the "official" name for what I'm talking about.

No, it can't be. It's described as "ASCII subset of Unicode.  Basic Latin =
collection 1", so it only encodes 128 characters.

> Unfortunately, I'm having trouble figuring it out.  The IANA website
> you referred me to is titled "Character Sets", but some of the things
> listed underneath are encoding standards (UTF-8, etc.) rather than
> character sets; UCS-Basic is listed as a "subset of Unicode", however,
> and Unicode is a character set (not an encoding; there are multiple
> ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2).  So
> this page just exemplifies the sort of confusion Manuel referred to.

The names listed there are used in Internet protocols and data formats, such
as MIME, HTTP and XML. They're also widely used in Unix-like systems to
keep track of how text is encoded. They are for example used by Iconv, the
transcoding library. In these applications there's never a need to specify
both an encoding and a character set separately. One parameter is enough to
specify how to translate an octet stream to a sequence of characters.
Programs may do the translation in several steps, but that's each program's
own business. The IETF isn't concerned with how software represents
characters internally.

From a pragmatic viewpoint, there's no need to deal with different character
sets now that we have Unicode. The way I think of it, there is only one
character set (the universal character set) and a plethora of character
encodings. Some encodings can encode any character in the UCS, but most
only deal with some subset. Thus, the ISO 8859 series, the IBM codepages,
the Windows character sets and all the other old eight-bit character sets
can now be considered character encodings that each define both a subset of
the UCS and a way to encode those characters as eight-bit numbers.
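
As an illustration of this view, here is a short Python sketch (Python's codec names stand in for the Iconv-style names, and the variable names are made up for this example). The same octet is decoded under four eight-bit encodings, and each one yields a code point in the single universal character set.

```python
octet = b"\xe9"

# One name per encoding is enough to go from octets to UCS characters.
encoding_names = ("iso-8859-1", "cp1252", "cp437", "iso-8859-5")
decoded = {name: octet.decode(name) for name in encoding_names}

# The resulting code points all live in the same universal character set.
code_points = {name: ord(ch) for name, ch in decoded.items()}

for name in encoding_names:
    print(f"{name:12} 16#E9# -> U+{code_points[name]:04X} {decoded[name]}")
```

Only for ISO 8859-1 does the code point always equal the byte value; the other encodings pick different subsets of the UCS and assign the byte values differently.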

-- 
Björn Persson                              PGP key A88682FD
                   omb jor ers @sv ge.
                   r o.b n.p son eri nu


* Re: Reading "normal" text files with Wide_Text_IO in GNAT
  2006-12-07  2:02         ` Adam Beneschan
  2006-12-09 20:43           ` Björn Persson
@ 2006-12-11 19:49           ` Manuel Collado
  1 sibling, 0 replies; 8+ messages in thread
From: Manuel Collado @ 2006-12-11 19:49 UTC (permalink / raw)

Adam Beneschan wrote:
> Björn Persson wrote:
>> ...
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
> ...
> UCS-2 and UCS-4 are representations in which if an integer N maps to a
> character, then that character is represented simply by a 2- or 4-byte
> binary representation of N (byte ordering is an issue, though).  So it
> would seem logical that UCS-1 would simply refer to a 1-byte binary
> representation of a number.  That's how it seemed to me, and I did find
> other references to this term, so I figured it was the correct term.
> But maybe it isn't official.

Well, it seems that there are no official names for simple, direct 
encodings (not tied to a given character set). In fact UCS-2 and UCS-4 are 
specific names for Unicode stuff (UCS means Universal Character Set).

Character encoding concepts are precisely defined in:

     http://en.wikipedia.org/wiki/Character_encoding

As you can see, the encoding issue is composed of two separate ideas: the 
CEF (character encoding form) and the CES (character encoding scheme). Some 
of the latter have explicit names. But the direct CEFs are so simple 
that they don't need explicit names (just the size of the code value).

If we take UCS-2 and UCS-4 out of the Unicode world and use them as general 
names for direct CEFs with 16-bit and 32-bit code values, then UCS-1 
becomes the natural name for the direct CEF with 8-bit code values, whether 
it is official or not.

Regards.
-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

