Ada 2012 and Unicode package (UTF-nn encodings handling)

comp.lang.ada

From: Yannick Duchêne (Hibou57) @ 2010-08-20 21:38 UTC


Extracted from the thread “S-expression I/O in Ada”. Subtopic moved to a
separate thread for clarity.

On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> wrote:
> Slightly OT, but you (and others) might be interested to know that Ada
> 2012 will include string encoding packages to the various UTF-X
> encodings. These will be (are?) provided very soon by GNAT.
>
> See AI05-137-2
> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)

Time for my stupid question of the day :)

I noticed this addition in the latest amendment, because Unicode has
always been a concern of mine (I currently use my own implementation).

I could not avoid two questions: why no UTF-32? (It would not be an
implementation nightmare.) And why is the BOM handled for each string, when
the BOM is meant to be used at the stream/file level? (See XML or HTML files,
for example.) Or are these strings supposed to hold the whole content of a
file/stream?

Quote:
http://www.unicode.org/faq/utf_bom.html
> A: A byte order mark (BOM) consists of the character code U+FEFF at the  
> beginning of a data stream

This is from the FAQ at Unicode.org, but all the references (Unicode PDF
files, XML reference, HTML reference) say the same.

This matters, because the code point U+FEFF can stand for two different
things: Zero Width No-Break Space or Byte Order Mark. The only way to
distinguish the two uses is by where it appears.

If it appears as the first code point of a stream, it is a BOM
(heuristics may be applied to automatically switch encoding with an
analysis of the first byte of a stream, which is what I do); if it
appears anywhere else in a stream, it is a character code point.
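
The position rule above can be sketched in a few lines. This is only an illustration in Python, not the Ada 2012 package; the names `sniff_bom` and `decode_stream` and the UTF-8 default are assumptions made for the example.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so order matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(first_bytes, default="utf-8"):
    """Return (encoding name, BOM length) from the leading bytes of a stream.

    Hypothetical helper: falls back to `default` when no BOM is found.
    """
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_stream(data, default="utf-8"):
    """Decode a whole stream, treating U+FEFF as a BOM only at the very start."""
    name, skip = sniff_bom(data, default)
    # Only the leading mark is dropped; any later U+FEFF stays in the text
    # as an ordinary character (Zero Width No-Break Space).
    return data[skip:].decode(name)
```

With this rule, `decode_stream(b"\xef\xbb\xbfa\xef\xbb\xbfb")` gives `a`, U+FEFF, `b`: the leading mark is consumed as a BOM and the interior one survives as a character.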



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Yannick Duchêne (Hibou57) @ 2010-08-20 21:41 UTC


On Fri, 20 Aug 2010 23:38:20 +0200, Yannick Duchêne (Hibou57)
<yannick_duchene@yahoo.fr> wrote:
> (heuristics may be applied to automatically switch encoding with an  
> analysis of the first byte of a stream

Mistake: read “analysis of the first byteS” (plural)



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Dmitry A. Kazakov @ 2010-08-21  6:21 UTC


On Fri, 20 Aug 2010 23:38:20 +0200, Yannick Duchêne (Hibou57) wrote:

> I could not avoid two questions: why no UTF-32 ?

Is there anybody who would ever use it?

> Quote:
> http://www.unicode.org/faq/utf_bom.html
>> A: A byte order mark (BOM) consists of the character code U+FEFF at the  
>> beginning of a data stream
> 
> This is from the FAQ at Unicode.org, but all the references (Unicode PDF
> files, XML reference, HTML reference) say the same.

How the content is handled is entirely the OS's business. So if this belongs
anywhere, it belongs to stream I/O and directories.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: J-P. Rosen @ 2010-08-21  7:01 UTC


On 20/08/2010 23:38, Yannick Duchêne (Hibou57) wrote:
> Time for my stupid question of the day :)
A question is never stupid. Answers sometimes...

> I could not avoid two questions: why no UTF-32 ? (this would not be an
> implementation nightmare)
I still fail to see the benefit of encoding 31-bit values into 32-bit
values...
And even if implementation is not a nightmare, it always has a cost.
Implementers are reluctant to spend money on features that nobody will
use. (Wide_Wide_Character was forced on us by ISO).

> and why BOM handled for each string while BOM
> is to be used at stream/file level ? (see XML or HTML files for
> example).
A package provides functionality. It should not presume how it is
used. Since this package is clearly in the "string handling" class, it
makes sense to handle this with strings.

For files, the usage is to have a BOM on the first line of the file. The
way the functions are defined makes it easy to not process the first
line specially; see the use case in the AI.


-- 
---------------------------------------------------------
           J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Yannick Duchêne (Hibou57) @ 2010-08-21  8:12 UTC


> I still fail to see the benefit of encoding 31-bit values into 32-bit
> values...
UTF-32 is not formally an encoding format; it would be better described
as a matter of byte order. But this byte order is not system-dependent; it
depends on the data, which is cross-platform.
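
To make the byte-order point concrete, here is a small Python illustration (it uses Python's codec names, nothing from the Ada package, and `decode_utf32` is a hypothetical helper): the same code point serialised as UTF-32 gives the same four bytes in opposite orders, and the order is a property of the data, not of the machine reading it.

```python
# Serialise one code point (U+0041, "A") in both UTF-32 byte orders.
big = "A".encode("utf-32-be")      # bytes 00 00 00 41
little = "A".encode("utf-32-le")   # bytes 41 00 00 00

# Same four bytes, opposite order.
assert big == little[::-1]


def decode_utf32(data, byte_order):
    """Decode UTF-32 data whose byte order ("be" or "le") is known.

    The order has to come from the data itself (for example a BOM) or from
    its source; the byte order of the reading machine is irrelevant.
    """
    return data.decode("utf-32-" + byte_order)
```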

> And even if implementation is not a nightmare, it always has a cost.
> Implementers are reluctant to spend money for features that nobody will
> use. (Wide_Wide_Character was forced on us by ISO).
I suppose ISO forced the introduction of Wide_Wide_Character because
it is part of the Unicode standard and, as you know, conformance means
full conformance. There is no partial conformance, because as soon as
something is defined, it may actually occur.

Imagine a web crawler: it would have to be designed with this option in
mind. Designers could not say “We do not feel UTF-32 is useful, so our
crawler will not be able to handle such documents”.

I just thought this was a bit of a pity: anyone who wants to rely on the
standard packages' capabilities will only be able to do so partially.
It is a bit like providing a doubly linked list without a singly linked
one (or the opposite). A matter of completeness.

> A package provides functionality. It should not presume how it is
> used. Since this package is clearly in the "string handling" class, it
> makes sense to handle this with strings.
Right, this is defined in *String*_Encoding.

> For files, the usage is to have a BOM on the first line of the file. The
> way the functions are defined makes it easy to not process the first
> line specially; see the use case in the AI.
I just had another look at
http://www.ada-auth.org/standards/12aarm/html/AA-A-4-11.html
Only Encode has this capability (via Output_BOM : Boolean). Decode/Convert
have nothing similar and will always skip any 16#FEFF#, which will be
interpreted as a BOM rather than as a character (there is nothing like an
Interpret_BOM : Boolean).

But maybe I am missing something. I will take a closer look at it and at
the AI that comes with it (I saw that UTF-32 was at least “mentioned” during
the discussion).



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: J-P. Rosen @ 2010-08-22 18:51 UTC


On 21/08/2010 10:12, Yannick Duchêne (Hibou57) wrote:
> I just had a look back at
> http://www.ada-auth.org/standards/12aarm/html/AA-A-4-11.html
> Only Encode has this capability (via Output_BOM : Boolean).
> Decode/Convert has nothing similar and will always skip any 16#FEFF#
> which will be interpreted as a BOM instead of as a character (there is
> nothing like an Interpret_BOM : Boolean).
> 
> But may be I am missing something. Will have a deeper look at it and at
> the AI which come with it (I saw UTF-32 was at least “pronounced” during
> the talk).
I think you missed the "Encoding" function. The intended usage
(extracted from the !discussion section) is:
1) Read the first line. Call function Encoding on that line with an
   appropriate default to use if the line does not start with a
   BOM. Initialize the encoding scheme to the value returned by the
   function.

2) Decode all lines (including the first one) with the chosen encoding
   scheme. Since the BOM is ignored by Decode functions, it is not
   necessary to slice the first line specially.
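
The two steps can be sketched as follows. This is a Python illustration of the described usage, not the Ada API: `encoding_of` and `decode_line` are hypothetical stand-ins for the Encoding and Decode functions, and only the behaviour stated above is modelled (a default when no BOM is present, a leading BOM ignored on decode).

```python
import codecs

# BOM byte sequences for the three schemes discussed in the thread.
SCHEMES = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def encoding_of(item, default="utf-8"):
    """Hypothetical analogue of Encoding: scheme named by a leading BOM, else default."""
    for bom, name in SCHEMES:
        if item.startswith(bom):
            return name
    return default


def decode_line(item, scheme):
    """Hypothetical analogue of Decode: a leading BOM of the given scheme is skipped."""
    for bom, name in SCHEMES:
        if name == scheme and item.startswith(bom):
            item = item[len(bom):]
    return item.decode(scheme)


def decode_lines(lines, default="utf-8"):
    """Step 1: pick the scheme from the first line. Step 2: decode every line alike."""
    if not lines:
        return []
    scheme = encoding_of(lines[0], default)
    # The first line needs no special slicing because decode_line skips the BOM.
    return [decode_line(line, scheme) for line in lines]
```

Note that in this sketch any later line that happens to begin with U+FEFF loses it as well, which is the concern raised earlier in the thread.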



-- 
---------------------------------------------------------
           J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Georg Bauhaus @ 2010-08-22 19:48 UTC


On 8/22/10 8:51 PM, J-P. Rosen wrote:

> I think you missed the "Encoding" function. The intended usage
> (extracted from the !discussion section) is:
> 1) Read the first line. Call function Encoding on that line with an
>     appropriate default to use if the line does not start with a
>     BOM. Initialize the encoding scheme to the value returned by the
>     function.

Since Ada is an ISO language, is the name BOM for the non-UTF-8
thing used by Microsoft actually ISO? (I.e., has it become part of ISO 10646)?



Georg



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: J-P. Rosen @ 2010-08-22 20:40 UTC


On 22/08/2010 21:48, Georg Bauhaus wrote:
> On 8/22/10 8:51 PM, J-P. Rosen wrote:
> 
>> I think you missed the "Encoding" function. The intended usage
>> (extracted from the !discussion section) is:
>> 1) Read the first line. Call function Encoding on that line with an
>>     appropriate default to use if the line does not start with a
>>     BOM. Initialize the encoding scheme to the value returned by the
>>     function.
> 
> Since Ada is an ISO language, is the name BOM for the non-UTF-8
> thing used by Microsoft actually ISO? (I.e., has it become part of ISO
> 10646)?
> 
It's from Unicode. ISO 10646 defines only character encodings
(code-points). Unicode uses the same encodings, and in addition defines
UTF-8 and siblings.
-- 
---------------------------------------------------------
           J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr



* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Georg Bauhaus @ 2010-08-23 10:32 UTC

On 22.08.10 22:40, J-P. Rosen wrote:
> On 22/08/2010 21:48, Georg Bauhaus wrote:
>> On 8/22/10 8:51 PM, J-P. Rosen wrote:
>>
>>> I think you missed the "Encoding" function. The intended usage
>>> (extracted from the !discussion section) is:
>>> 1) Read the first line. Call function Encoding on that line with an
>>>     appropriate default to use if the line does not start with a
>>>     BOM. Initialize the encoding scheme to the value returned by the
>>>     function.
>>
>> Since Ada is an ISO language, is the name BOM for the non-UTF-8
>> thing used by Microsoft actually ISO? (I.e., has it become part of ISO
>> 10646)?
>>
> It's from Unicode. ISO 10646 defines only character encodings
> (code-points).

Uhm, a minor nitpick: ISO/IEC 10646:2003

"* specifies a multiple byte (one to four) byte transformation
   UTF-8 for use with ISO 646 (ASCII) byte-oriented environments;

"* specifies a two 16-bit form and associated transformation
   UTF-16 for supplementary characters;"

(and LRM A.4.11 seems to mention it too, IINM.)

Markus Kuhn explains why in POSIX environments UTF-8 files---that
never have a byte order issue---should *not* have a BOM "signature".
It is, therefore, a good thing that Convert/Encode turn off outputting a
"BOM used as signature" byte sequence, since that sequence works on recent
Windows(TM) platforms but creates problems on ISO standards-compliant
platforms.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

"It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
 as a signature to mark the beginning of a UTF-8 file. This practice
 should definitely not be used on POSIX systems for several reasons:

 ..."

Indeed, program source files that use "incorrect" Microsoft UTF-8
signatures do create problems with Eclipse when they are used
with both Windows and GNU/Linux editions of Eclipse.
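
For reference, the signature in question is just U+FEFF serialised as UTF-8. A short Python illustration (using Python's standard `utf-8` and `utf-8-sig` codecs; the sample text is made up) shows why a tool that does not expect the signature sees a stray character:

```python
import codecs

text = "with Ada.Text_IO;\n"        # made-up sample source line

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # Python's codec that prepends the signature

# The signature is the three bytes EF BB BF in front of the plain encoding.
assert signed == codecs.BOM_UTF8 + plain

# A plain UTF-8 decoder keeps U+FEFF as the first character of the text ...
as_plain = signed.decode("utf-8")
# ... while a signature-aware decoder drops it.
as_signed = signed.decode("utf-8-sig")
```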

Georg

* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
From: Randy Brukardt @ 2010-08-23 22:28 UTC


"Yannick Duchêne (Hibou57)" <yannick_duchene@yahoo.fr> wrote in message
news:op.vhrad6mjule2fv@garhos...
...
>> See AI05-137-2
>> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)
>
>Time for my stupid question of the day :)
>
>I could not avoid two questions: why no UTF-32 ? (this would not be an 
>implementation nightmare) and why BOM handled for each string while BOM is 
>to be used at stream/file level ? (see XML or HTML files for example). Or 
>are these strings supposed to hold the whole content of a file/stream ?

Did you read the AI? There is a reason that I put links to the AIs into 
these messages and links to the AIs from the AARM online. Each AI includes a 
!discussion section which typically includes some discussion of the design 
decisions. In this case, both of these questions are answered in the AI.

                    Randy.




