From: Björn Persson
Subject: Re: Reading "normal" text files with Wide_Text_IO in GNAT
Newsgroups: comp.lang.ada
Date: Sat, 09 Dec 2006 20:43:12 GMT

Adam Beneschan wrote:
> Björn Persson wrote:
>> Manuel Collado wrote:
>> > UCS-1 means encoding each character (codepoint) as a single byte whose
>> > numerical value is just the codepoint. Can be used only for codepoints
>> > in the range (0..255). UCS-1 is the natural, implicit encoding of all
>> > 8-bit (and 7-bits) character sets.
>>
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
>
> UCS-Basic may be the "official" name for what I'm talking about.

No, it can't be. It's described as "ASCII subset of Unicode. Basic Latin =
collection 1", so it only encodes 128 characters.

> Unfortunately, I'm having trouble figuring it out. The IANA website
> you referred me to is titled "Character Sets", but some of the things
> listed underneath are encoding standards (UTF-8, etc.) rather than
> character sets; UCS-Basic is listed as a "subset of Unicode", however,
> and Unicode is a character set (not an encoding; there are multiple
> ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2). So
> this page just exemplifies the sort of confusion Manuel referred to.

The names listed there are used in Internet protocols and data formats, such
as MIME, HTTP and XML. They're also widely used in Unix-like systems to keep
track of how text is encoded. They are for example used by Iconv, the
transcoding library. In these applications there's never a need to specify
both an encoding and a character set separately. One parameter is enough to
specify how to translate an octet stream to a sequence of characters.
Programs may do the translation in several steps, but that's each program's
own business. The IETF isn't concerned with how software represents
characters internally.

From a pragmatic viewpoint, there's no need to deal with different character
sets now that we have Unicode. The way I think of it, there is only one
character set, the universal character set, and a plethora of character
encodings. Some encodings can encode any character in the UCS, but most only
deal with some subset.
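
As a rough illustration of that view, here is a minimal sketch in Python
(chosen only for brevity; the sample octets, the codec names picked, and the
helper can_encode are assumptions made for this example, not anything taken
from the thread). It shows a single encoding name being enough to turn octets
into characters, and different encodings covering different subsets of the
UCS:

```python
# Illustrative sketch only: sample data and the helper name are hypothetical.

# Five octets that spell "Björn" when read as ISO 8859-1.
data = bytes([0x42, 0x6A, 0xF6, 0x72, 0x6E])

# One parameter, the encoding name, is enough to get characters from octets.
text = data.decode("iso-8859-1")

# In ISO 8859-1 each octet value equals the code point of its character.
code_points = [ord(ch) for ch in text]

# The same characters encoded differently: U+00F6 takes two octets in UTF-8.
utf8 = text.encode("utf-8")


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the named encoding covers the character ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Each encoding covers its own subset of the universal character set.
coverage = {
    name: (can_encode("\u00f6", name), can_encode("\u20ac", name))
    for name in ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")
}
```

Here coverage maps each encoding name to a pair saying whether it can
represent U+00F6 (o with diaeresis) and U+20AC (the euro sign).
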
Thus, the ISO 8859 series, the IBM codepages, the Windows character sets and
all the other old eight-bit character sets can now be considered character
encodings that each define both a subset of the UCS and a way to encode those
characters as eight-bit numbers.

-- 
Björn Persson                    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu