From: ytomino
Newsgroups: comp.lang.ada
Subject: Re: Why no Ada.Wide_Directories?
Date: Mon, 17 Oct 2011 19:32:04 -0700 (PDT)
Organization: http://groups.google.com
Message-ID: <409c81ab-bd54-493b-beb4-a0cca99ec306@p27g2000prp.googlegroups.com>
References: <9937871.172.1318575525468.JavaMail.geo-discussion-forums@prib32>
 <418b8140-fafb-442f-b91c-e22cc47f8adb@y22g2000pri.googlegroups.com>
 <7156122c-b63f-487e-ad1b-0edcc6694a7a@u10g2000prl.googlegroups.com>

On Oct 18, 10:10 am, Adam Beneschan wrote:
> On Oct 17, 4:47 pm, ytomino wrote:
>
> > On Oct 18, 6:33 am, "Randy Brukardt" wrote:
>
> > > Say what?
>
> > > Ada.Strings.Encoding (new in Ada 2012) uses a subtype of String to store
> > > UTF-8 encoded strings. As such, I'd find it pretty surprising if doing so
> > > was "a violation of the standard".
>
> > > The intent has always been that Open, Ada.Directories, etc. take UTF-8
> > > strings as an option. Presumably the implementation would use a Form to
> > > specify that the file names in UTF-8 form rather than Latin-1. (I wasn't
> > > able to find a reference for this in a quick search, but I know it has been
> > > talked about on several occasions.)
>
> > > One of the primary reasons that Ada.Strings.Encoding uses a subtype of
> > > String rather than a separate type is so that it can be passed to Open and
> > > the like.
>
> > > It's probably true that we should standardize on the Form needed to use
> > > UTF-8 strings in these contexts, or at least come up with Implementation
> > > Advice on that point.
>
> > >                                        Randy.
>
> > Good news. Thanks for letting know.
> > My worry is decreased a little.
>
> > However, even if that is right, Form parameters are missing for many
> > subprograms.
> > Probably, All subprograms in Ada.Directories,
> > Ada.Directories.Hierarchical_File_Names, Ada.Command_Line,
> > Ada.Environment_Variables and other subprograms having Name parameter
> > or returning a file name should have Form parameter.
> > (For example, I do Open (X, Form => "UTF-8"). Which does Name (X)
> > returns UTF-8 or Latin-1?)
>
> > Moreover, in the future, we will always use I/O subprograms as UTF-8
> > mode if what you say is realized.
> > But other libraries in the standard are explicitly defined as Latin-1.
> > It's certain that Ada.Character.Handling.To_Upper breaks UTF-8.
>
> I have a feeling you're fundamentally confused about what UTF-8 is, as
> compared to "Latin-1".
> Latin-1 is a character mapping.  It defines,
> for all integers in the range 0..255, what character that integer
> represents (e.g. 77 represents 'M', etc.).  Unicode is a character
> mapping that defines characters for a much larger integer range.  For
> integers in the range 0..255, the character represented in Unicode is
> the same as that in Latin-1; higher integers represent characters in
> other alphabets, other symbols, etc.  Those mappings just tell you
> what symbols go with what numbers, and they don't say anything about
> how the numbers are supposed to be stored.
>
> UTF-8 is an encoding (representation).  It defines, for each
> non-negative integer up to a certain point, what bits are used to
> represent that integer.  The number of bits is not fixed.  So even if
> you're working with characters all in the 0..255 range, some of those
> characters will be represented in 8 bits (one byte) and some will take
> 16 bits (two bytes).
>
> Because of this, it is not feasible to work with strings or characters
> in UTF-8 encoding.  Suppose you declare a string
>
>    S : String (1 .. 100);
>
> but you want it to be a UTF-8 string.  How would that work?  If you
> want to look at S(50), the computer would have to start at the
> beginning of the string and figure out whether each character is
> represented as 1 or 2 bytes.  Nobody wants that.
>
> The only sane way to work with strings in memory is to use a format
> where every character is the same size (String if all your characters
> are in the 0..255 range, Wide_String for 0..65535, Wide_Wide_String
> for 0..2**32-1).  Then, if you have a string of bytes in UTF-8 format,
> you convert it to a regular (Wide_)(Wide_)String with routines in
> Ada.Strings.UTF_Encoding; and it also has routines for converting
> regular strings to UTF-8 format.  But you don't want to *keep* strings
> in memory and work with them in UTF-8 format.
> That's why it doesn't
> make sense to have string routines (like
> Ada.Strings.Equal_Case_Insensitive or Ada.Character_Handling.To_Upper)
> that work with UTF-8.
>
> Hope this solves your problem.
>
>                               -- Adam

I'm not confused. You misread me.

Of course, if an application always holds file names as
Wide_Wide_String and encodes them to UTF-8 only when (and every time)
it calls an I/O subprogram, as you say, then everything is very simple,
and perhaps that is the intended method. I understand that.

But where do these file names come from?
They usually come from the command line or from a configuration file
written by the user, and they are probably encoded in UTF-8 if the OS
locale is UTF-8.
So the subprograms in Ada.Command_Line need Form parameters too, and it
is natural to keep the names in UTF-8.

(Some file systems, such as those on Linux, accept a broken byte
sequence as a valid file name. Applications must not (cannot?) decode
or encode file names in that case. A "broken" file name may be the
right file name, depending on how the user sets the LANG variable.
The same is true of NTFS/NFS+: these file systems can accept broken
UTF-16.
Strictly speaking, an application should never encode or decode file
names. But Ada has decided that file names are stored in String (as far
as Randy says), so we have to give up on UTF-16 file systems.)

Also, in many other libraries and languages it is common for
text-processing functions to work on strings kept in their encoded
form.
I do not necessarily want to reject the Ada way, but I feel your
opinion is prejudiced. In fact it is not as difficult as you say.
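
The remark quoted above, that To_Upper breaks UTF-8, can be checked with
a small sketch. This is an illustration only, not code from the thread:
the procedure name UTF8_Upper_Demo, the helper Dump, and the choice of
U+3042 as input are all assumptions made for the example. It assumes an
Ada 2012 compiler, since it relies on Ada.Strings.UTF_Encoding and on
"for ... of" loops. The idea is that To_Upper treats each byte as a
Latin-1 character, so a lead byte that happens to be a Latin-1
lower-case letter gets changed and the sequence should no longer decode.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Upper_Demo is

   --  Illustrative input only: U+3042 (HIRAGANA LETTER A).
   Original : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#3042#));

   --  Its UTF-8 encoding is the three bytes E3 81 82.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Original);

   --  To_Upper sees three Latin-1 characters, not one code point.
   --  16#E3# is Latin-1 lower-case a-tilde, whose upper-case form
   --  is 16#C3#, so the lead byte is rewritten.
   Upper : constant String :=
     Ada.Characters.Handling.To_Upper (Encoded);

   --  Hypothetical helper: print the byte values of a string.
   procedure Dump (Label : String; Bytes : String) is
   begin
      Ada.Text_IO.Put (Label);
      for C of Bytes loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;

begin
   Dump ("before To_Upper:", Encoded);
   Dump ("after  To_Upper:", Upper);

   --  Try to decode the modified bytes as UTF-8 again.
   declare
      Decoded : constant Wide_Wide_String :=
        Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Upper);
   begin
      Ada.Text_IO.Put_Line
        ("decoded" & Integer'Image (Decoded'Length) & " character(s)");
   end;
exception
   when Ada.Strings.UTF_Encoding.Encoding_Error =>
      Ada.Text_IO.Put_Line ("result is no longer valid UTF-8");
end UTF8_Upper_Demo;
```

Two-byte sequences would probably survive this particular call, because
their lead bytes (16#C2# .. 16#DF#) have no distinct upper-case form in
Latin-1; the damage shows up with three- and four-byte sequences, whose
lead bytes fall in the Latin-1 lower-case range.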