From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Standard Library Interest?
Date: Sun, 05 Oct 2003 23:57:37 GMT
Message-ID: <3F80AFE4.4010901@comcast.net>
References: <3F7F760E.2020901@comcast.net>

Georg Bauhaus wrote:

> How can this be when Unicode has more than 65536 code positions?
> (Assuming I wanted to use full Unicode, I guess I will have to rely
> on Implementation Permissions to provide me with a corresponding
> character type?)

If you are that familiar with Unicode...

Ada Wide_Character corresponds to the ISO 10646 BMP, and to Unicode. ISO 10646 defines a 32-bit mapping for code points, broken into octets, and further into 16-bit (two octet) planes. It also defines three encoding mechanisms, UTF-8, UTF-16, and UTF-32.
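To make the three encoding mechanisms concrete, here is a small sketch (in Python rather than Ada, purely for brevity) that lists the code units one code point produces in each form. The helper name code_units and the three sample characters are my own choices for illustration, not anything taken from the standards.

```python
import struct


def code_units(ch: str) -> dict:
    """Return the code units of a single character in each encoding form.

    UTF-8 units are octets, UTF-16 units are 16-bit values and UTF-32
    units are 32-bit values. Big-endian is used so no byte order mark
    is emitted.
    """
    utf8 = list(ch.encode("utf-8"))
    raw16 = ch.encode("utf-16-be")
    utf16 = list(struct.unpack(">%dH" % (len(raw16) // 2), raw16))
    raw32 = ch.encode("utf-32-be")
    utf32 = list(struct.unpack(">%dI" % (len(raw32) // 4), raw32))
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": utf32}


# Hypothetical samples: a Latin letter, a Greek letter (both in the BMP)
# and MUSICAL SYMBOL G CLEF, which lies outside the BMP.
for ch in ("A", "\u03a0", "\U0001D11E"):
    units = code_units(ch)
    print("U+%04X" % ord(ch), {k: [hex(u) for u in v] for k, v in units.items()})
```

The two BMP characters each fit in a single 16-bit unit, while the code point outside the BMP needs two 16-bit units in UTF-16 and one 32-bit unit in UTF-32.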
Unicode corresponds to UTF-16, where most 16-bit encodings map to single code points, and encodings in the surrogates area are used to encode code points from other planes. These encodings consist of a high surrogate from the range 16#D800# to 16#DBFF# followed by a low surrogate from the range 16#DC00# to 16#DFFF#. Technically Ada encodes the BMP and will not damage any embedded surrogates, but surrogate pairs will not be counted as a single code point.

If anyone wants to use "full" Unicode in Ada, the more appropriate approach would be to add support for UTF-32 as Wide_Wide_Character. But in practice, there would be no difference between Ada's treatment of Wide_Character as the BMP and as an encoding using UTF-16, because of the way Unicode has defined the surrogate characters.

Most of the 'missing' Unicode support has to do with display rules that apply to printers, not to strings. If you want to write a subprogram to determine the length of a Wide_Character string in characters, you can't do it without adopting specific language rules on what is or is not a character. For example, Hangul (the Korean script) combines up to three code points into a single Hangul character, which represents a syllable. Or take Vietnamese, which can have several accent marks on a single (Latin) character.

It is certainly possible to have a (written language dependent) set of categorization routines that correctly sorts Wide_Character representations into appropriate categories for that language (character, symbol, numeric digit, etc.). But I would hesitate to even try to come up with a language independent mapping. For example, Pi is a mathematical symbol in English, but a capital letter in Greek.

-- 
Robert I. Eachus

"Quality is the Buddha. Quality is scientific reality. Quality is the goal of Art.
It remains to work these concepts into a practical, down-to-earth context, and for this there is nothing more practical or down-to-earth than what I have been talking about all along...the repair of an old motorcycle." -- from Zen and the Art of Motorcycle Maintenance by Robert Pirsig
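
P.S. A rough sketch of the counting problem discussed above, again in Python rather than Ada. The function count_code_points is a hypothetical helper of mine. It only pairs up surrogates, so it answers "how many code points?" and says nothing about how many characters a reader would see, which is the language dependent part.

```python
import unicodedata


def count_code_points(units):
    """Count code points in a sequence of 16-bit UTF-16 code units.

    A high surrogate (0xD800..0xDBFF) immediately followed by a low
    surrogate (0xDC00..0xDFFF) counts once. Anything else, including
    an unpaired surrogate, counts as one unit on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# Three 16-bit units, but only two code points: "A" plus one surrogate pair.
sample_units = [0x0041, 0xD834, 0xDD1E]
print(len(sample_units), count_code_points(sample_units))

# Three conjoining Hangul jamo (three code points) that make one syllable.
# Canonical composition (NFC) happens to merge them into a single code
# point, but that is a Unicode normalization rule, not a general answer
# to "what is a character".
jamo = "\u1112\u1161\u11ab"
syllable = unicodedata.normalize("NFC", jamo)
print(len(jamo), len(syllable))
```

Even with surrogate pairs handled, the Hangul case shows that a code point count and a character count can still disagree.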