comp.lang.ada
From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: Reading "normal" text files with Wide_Text_IO in GNAT
Date: Sat, 09 Dec 2006 20:43:12 GMT
Message-ID: <AdFeh.25956$E02.10562@newsb.telia.net>
In-Reply-To: 1165456975.595248.177740@l12g2000cwl.googlegroups.com

Adam Beneschan wrote:

> Björn Persson wrote:
>> Manuel Collado wrote:
>> > UCS-1 means encoding each character (codepoint) as a single byte whose
>> > numerical value is just the codepoint. Can be used only for codepoints
>> > in the range (0..255). UCS-1 is the natural, implicit encoding of all
>> > 8-bit (and 7-bit) character sets.
>>
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
> 
> UCS-Basic may be the "official" name for what I'm talking about.

No, it can't be. It's described as "ASCII subset of Unicode.  Basic Latin =
collection 1", so it only encodes 128 characters.

> Unfortunately, I'm having trouble figuring it out.  The IANA website
> you referred me to is titled "Character Sets", but some of the things
> listed underneath are encoding standards (UTF-8, etc.) rather than
> character sets; UCS-Basic is listed as a "subset of Unicode", however,
> and Unicode is a character set (not an encoding; there are multiple
> ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2).  So
> this page just exemplifies the sort of confusion Manuel referred to.

The names listed there are used in Internet protocols and data formats, such
as MIME, HTTP and XML. They're also widely used in Unix-like systems to
keep track of how text is encoded. They are for example used by Iconv, the
transcoding library. In these applications there's never a need to specify
both an encoding and a character set separately. One parameter is enough to
specify how to translate an octet stream to a sequence of characters.
Programs may do the translation in several steps, but that's each program's
own business. The IETF isn't concerned with how software represents
characters internally.
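
As a rough illustration of the "one parameter is enough" point, here is a minimal sketch in Python (my choice for illustration only, not something from this thread). It relies on Python's built-in codecs, which accept the IANA-style names; the helper name octets_to_text and the sample octets are hypothetical. The same octets become different character sequences depending on nothing but the charset label.

```python
def octets_to_text(octets: bytes, charset: str) -> str:
    """Translate an octet stream to characters using one charset label.

    Hypothetical helper: it only wraps bytes.decode, which looks the
    label up among Python's registered codecs.
    """
    return octets.decode(charset)


# Sample octets, as they might arrive in a MIME body or an HTTP response.
raw = b"Bj\xc3\xb6rn"

# The label alone decides the result; no separate "character set" parameter.
as_utf8 = octets_to_text(raw, "utf-8")         # 5 characters
as_latin1 = octets_to_text(raw, "iso-8859-1")  # 6 characters

print(len(as_utf8), as_utf8)
print(len(as_latin1), as_latin1)
```

Under UTF-8 the two octets C3 B6 form one character (U+00F6), while under ISO-8859-1 each octet is its own character (U+00C3 and U+00B6), so a wrong label silently yields different text rather than an error.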

From a pragmatic viewpoint, there's no need to deal with different character
sets now that we have Unicode. The way I think of it, there is only one
character set (the universal character set) and a plethora of character
encodings. Some encodings can encode any character in the UCS, but most
only deal with some subset. Thus, the ISO 8859 series, the IBM codepages,
the Windows character sets and all the other old eight-bit character sets
can now be considered character encodings that each define both a subset of
the UCS and a way to encode those characters as eight-bit numbers.
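
To make that view concrete, here is a small Python sketch (again only an illustration under my own assumptions; can_encode is a hypothetical helper and the codec names are the ones Python's standard library happens to register). It shows that each eight-bit encoding assigns its own UCS code point to a given octet, and that each covers only a subset of the UCS.

```python
def can_encode(char: str, charset: str) -> bool:
    """Return True if the character lies in the subset the encoding covers."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# In ISO 8859-1 every octet 0..255 maps to the code point of the same value.
latin1_identity = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# The same octet stands for different UCS characters in different encodings.
octet = b"\xa4"
meanings = {
    name: octet.decode(name)
    for name in ("iso-8859-1", "iso-8859-15", "cp437")
}

for name, char in meanings.items():
    print(f"{name}: U+{ord(char):04X}")

# Each encoding covers only some subset of the UCS.
euro = "\u20ac"
print(can_encode(euro, "iso-8859-15"), can_encode(euro, "iso-8859-1"))
print(can_encode("\u00f6", "ascii"))
```

The last line also shows the limit of a 128-character ASCII subset: a letter such as U+00F6 simply has no representation in it.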

-- 
Björn Persson                              PGP key A88682FD
                   omb jor ers @sv ge.
                   r o.b n.p son eri nu




