From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,5d4095813b818c7d
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII
Path: 
 g2news2.google.com!news3.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!nx01.iad01.newshosting.com!newshosting.com!newsfeed.icl.net!newsfeed.fjserv.net!news.tele.dk!news.tele.dk!small.news.tele.dk!newspeer2.se.telia.net!se.telia.net!masternews.telia.net.!newsb.telia.net.POSTED!not-for-mail
From: =?ISO-8859-1?Q?Bj=F6rn?= Persson <spam-away@nowhere.nil>
Subject: Re: Reading "normal" text files with Wide_Text_IO in GNAT
Newsgroups: comp.lang.ada
References: <1164916470.648544.256710@n67g2000cwd.googlegroups.com>
 <kFpch.25227$E02.10276@newsb.telia.net>
 <1165256255.486012.132810@l12g2000cwl.googlegroups.com>
 <4574b0c2@news.upm.es> <lDIdh.25626$E02.10478@newsb.telia.net>
 <1165456975.595248.177740@l12g2000cwl.googlegroups.com>
User-Agent: KNode/0.10.4
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8Bit
Message-ID: <AdFeh.25956$E02.10562@newsb.telia.net>
Date: Sat, 09 Dec 2006 20:43:12 GMT
NNTP-Posting-Host: 83.250.96.174
X-Complaints-To: abuse@telia.com
X-Trace: newsb.telia.net 1165696992 83.250.96.174 (Sat,
 09 Dec 2006 21:43:12 CET)
NNTP-Posting-Date: Sat, 09 Dec 2006 21:43:12 CET
Organization: Telia Internet
Xref: g2news2.google.com comp.lang.ada:7870
Date: 2006-12-09T20:43:12+00:00
List-Id: <comp.lang.ada>

Adam Beneschan wrote:

> Björn Persson wrote:
>> Manuel Collado wrote:
>> > UCS-1 means encoding each character (codepoint) as a single byte whose
>> > numerical value is just the codepoint. Can be used only for codepoints
>> > in the range (0..255). UCS-1 is the natural, implicit encoding of all
>> > 8-bit (and 7-bit) character sets.
>>
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
> 
> UCS-Basic may be the "official" name for what I'm talking about.

No, it can't be. It's described as "ASCII subset of Unicode.  Basic Latin =
collection 1", so it only encodes 128 characters.
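
To make the 128-versus-256 difference concrete, here is a small Python
sketch. It assumes Python's built-in "ascii" and "iso-8859-1" codecs as
stand-ins: the first for the ASCII repertoire that UCS-Basic is limited to,
the second for the "byte value equals codepoint" scheme Manuel described.
The variable names are made up for illustration.

```python
# Every possible octet value, 0..255.
all_bytes = bytes(range(256))

# ISO-8859-1 maps each octet to the code point with the same number.
latin1_text = all_bytes.decode("iso-8859-1")
identity = all(ord(ch) == b for ch, b in zip(latin1_text, all_bytes))

# ASCII only covers 0..127, so decoding the full range must fail.
try:
    all_bytes.decode("ascii")
    ascii_covers_all = True
except UnicodeDecodeError:
    ascii_covers_all = False

# The lower half, 0..127, decodes fine as ASCII.
ascii_text = all_bytes[:128].decode("ascii")

print(identity, ascii_covers_all, len(ascii_text))
```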

> Unfortunately, I'm having trouble figuring it out.  The IANA website
> you referred me to is titled "Character Sets", but some of the things
> listed underneath are encoding standards (UTF-8, etc.) rather than
> character sets; UCS-Basic is listed as a "subset of Unicode", however,
> and Unicode is a character set (not an encoding; there are multiple
> ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2).  So
> this page just exemplifies the sort of confusion Manuel referred to.

The names listed there are used in Internet protocols and data formats, such
as MIME, HTTP and XML. They're also widely used in Unix-like systems to
keep track of how text is encoded. They are for example used by Iconv, the
transcoding library. In these applications there's never a need to specify
both an encoding and a character set separately. One parameter is enough to
specify how to translate an octet stream to a sequence of characters.
Programs may do the translation in several steps, but that's each program's
own business. The IETF isn't concerned with how software represents
characters internally.
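
As a rough illustration of "one parameter is enough", here is a Python
sketch rather than an actual iconv call. It assumes Python's codec names,
which accept the IANA-style names "iso-8859-1" and "utf-8"; the sample
octets are simply my first name as it would appear in an ISO-8859-1 body.

```python
# An octet stream plus one charset name gives a sequence of characters.
octets = b"Bj\xf6rn"

# Decoding: the single name "iso-8859-1" says how to read the octets.
text = octets.decode("iso-8859-1")

# Transcoding, in the spirit of iconv: decode with one name, encode with
# another. How characters are held in between is the program's business.
as_utf8 = text.encode("utf-8")

print(text)
print(as_utf8)
```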

>From a pragmatic viewpoint, there's no need to deal with different character
sets now that we have Unicode. The way I think of it, there is only one
character set - the universal character set - and a plethora of character
encodings. Some encodings can encode any character in the UCS, but most
only deal with some subset. Thus, the ISO 8859 series, the IBM codepages,
the Windows character sets and all the other old eight-bit character sets
can now be considered character encodings that each define both a subset of
the UCS and a way to encode those characters as eight-bit numbers.
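
A short Python sketch of that view, assuming Python's codec names
"iso-8859-1", "iso-8859-15", "cp437" and "cp1252" as examples of the old
eight-bit sets: the same octet selects a different UCS character depending
on the encoding, and a character outside an encoding's subset cannot be
encoded at all.

```python
# One octet, four eight-bit encodings.
octet = b"\xa4"
names = ("iso-8859-1", "iso-8859-15", "cp437", "cp1252")

# Each encoding maps the octet to some character in the UCS.
decoded = {name: octet.decode(name) for name in names}
for name, ch in decoded.items():
    print(f"{name:12} 0xA4 -> U+{ord(ch):04X}")

# Each encoding also covers only a subset of the UCS: the euro sign is in
# the ISO-8859-15 subset but not in the ISO-8859-1 subset.
euro = "\u20ac"
euro_in_latin9 = euro.encode("iso-8859-15")
try:
    euro.encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```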

-- 
Björn Persson                              PGP key A88682FD
                   omb jor ers @sv ge.
                   r o.b n.p son eri nu