From: "J-P. Rosen"
Newsgroups: comp.lang.ada
Subject: Re: Why no Ada.Wide_Directories?
Date: Tue, 18 Oct 2011 19:27:37 +0200
Organization: A noiseless patient Spider
References: <9937871.172.1318575525468.JavaMail.geo-discussion-forums@prib32> <418b8140-fafb-442f-b91c-e22cc47f8adb@y22g2000pri.googlegroups.com> <7156122c-b63f-487e-ad1b-0edcc6694a7a@u10g2000prl.googlegroups.com> <1tggwi1yicf5z.1q3xra9r00oyb$.dlg@40tude.net>

On 18/10/2011 17:34, Adam Beneschan wrote:
> On Oct 18, 12:55 am, "Dmitry A. Kazakov" wrote:
>> On Mon, 17 Oct 2011 18:10:35 -0700 (PDT), Adam Beneschan wrote:
>>> I have a feeling you're fundamentally confused about what UTF-8 is, as
>>> compared to "Latin-1".
>>> Latin-1 is a character mapping. It defines,
>>> for all integers in the range 0..255, what character that integer
>>> represents (e.g. 77 represents 'M', etc.). Unicode is a character
>>> mapping that defines characters for a much larger integer range.
>>
>> No, Unicode is a standard describes character mappings. Both UTF-8 and
>> Latin-1 are encodings. Latin-1 as an encoding has a property that there is
>> 1-1 octet to code point correspondence, at the cost that some (most) of
>> code points cannot be represented by the encoding. UTF-8 lacks this
>> property, but is capable to represent all code points.
>
> Sigh... I guess you're right about the term "Latin-1". It appears to
> be *both* a character mapping *and* an encoding, based on a bit of
> Wikipedia research. The problem for me is this: what does that make
> Latin-2, Latin-3, KOI8-R, etc.? Those seem to describe the same
> encoding mechanism as Latin-1 (each code represented as one 8-bit
> byte), but with different meanings for the codes in the 16#A0#..16#FF#
> range. So the same encoding scheme seems to have multiple different
> names. That's very confusing to me.
>
Not 100% sure, but I think here is the picture.

1) Code points are always 31 bits (or maybe 30).

2) Below is the lower left corner of the BMP (use fixed fonts!):

|
|____________________
|         |         |
| Latin 1 | Latin 2 |
|_________|_________|_______

The lower halves of Latin-1 and Latin-2 are identical, i.e. the same
characters have two different code points, differing by 256.

When you use Latin-1 with 8-bit bytes, you can view this as an encoding
with the 24 upper bits being 16#00_00_00#.

When you use Latin-2 with 8-bit bytes, you can view this as an encoding
with the 24 upper bits being 16#00_00_01#.

So in a sense, Latin-1 and Latin-2 are both character sets and, when
represented on only 8 bits, encodings.

Does this make sense?

-- 
---------------------------------------------------------
           J-P. Rosen (rosen@adalog.fr)
Adalog a déménagé / Adalog has moved:
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
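
The picture above can be checked against real mapping tables. What follows is only a sketch, in Python rather than Ada, and it assumes that Python's built-in "iso8859-1" and "iso8859-2" codecs are an acceptable stand-in for the Latin-1 and Latin-2 tables. The helper name code_points is made up for this example. It decodes each byte in 16#A0#..16#FF# under both codecs and compares the resulting code points with the byte values.

```python
# Hypothetical check, not part of the original post: compare the code points
# that ISO-8859-1 (Latin-1) and ISO-8859-2 (Latin-2) assign to the same bytes.
# Assumption: Python's built-in codecs reflect the standard mapping tables.

def code_points(encoding, first=0xA0, last=0xFF):
    """Return {byte value: Unicode code point} for bytes first..last."""
    return {b: ord(bytes([b]).decode(encoding)) for b in range(first, last + 1)}

latin1 = code_points("iso8859-1")
latin2 = code_points("iso8859-2")

# Latin-1: does every byte value equal its code point (upper bits all zero)?
latin1_is_identity = all(cp == b for b, cp in latin1.items())

# Latin-2: collect the distinct differences between code point and byte value.
# A single offset of 256 would correspond to "upper 24 bits = 16#00_00_01#".
latin2_offsets = sorted({cp - b for b, cp in latin2.items()})

print("Latin-1 identity:", latin1_is_identity)
print("Latin-2 byte 0xA0 ->", hex(latin2[0xA0]))
print("Latin-2 byte 0xA1 ->", hex(latin2[0xA1]))
print("Latin-2 has a single offset of 256:", latin2_offsets == [256])
```

With these codecs the Latin-1 check succeeds, but the last line prints False: byte 16#A0# stays at code point 16#A0# in Latin-2, while byte 16#A1# maps to 16#0104#. So, at least by Python's tables, the Latin-2 upper half is spread over several code points and is not one block shifted by 256.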