* Reading "normal" text files with Wide_Text_IO in GNAT
From: Adam Beneschan @ 2006-11-30 19:54 UTC

I was looking in the GNAT reference manual at the description of the WCEM Form parameter to Wide_Text_IO.Open. It describes the different ways that wide characters can be represented in text files that Wide_Text_IO can interpret.

However, at first glance, I didn't see a way to get Wide_Text_IO to read a UCS-1 text file. This is the encoding where each byte in the range 16#00#..16#FF# represents a character in the range Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there is no way to represent wide characters from 16#0100# to 16#FFFF#. In other words, a boring old-fashioned 8-bit text file, maybe with Latin-1 characters or control characters in the 80..9F range.

Yes, I know that a file like this could be read using Text_IO, but let's say that we don't know what format the file is in until runtime. Does GNAT's Wide_Text_IO have a way to read a file like this?

-- thanks, Adam

* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Björn Persson @ 2006-12-03 1:22 UTC

Adam Beneschan wrote:

> However, at first glance, I didn't see a way to get Wide_Text_IO to
> read a UCS-1 text file.

Hmm, I've never heard of UCS-1. Is such an encoding really defined?

> This is the encoding where each byte in the
> range 16#00#..16#FF# represents a character in the range
> Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
> is no way to represent wide characters from 16#0100# to 16#FFFF#.

OK, so it's identical to ISO 8859-1.

> Does GNAT's Wide_Text_IO have a way to read a file like this?

It does indeed look like it can't. Gnat's approach to character encodings is amazingly faulty.

Does EAstrings fill your needs? If not, would you like to join me in finishing the implementation so we can get rid of these problems?

http://adacl.sourceforge.net/AdaBrowse/adacl-eastrings.html

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge.
r o.b n.p son eri nu

* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Adam Beneschan @ 2006-12-04 18:17 UTC

Björn Persson wrote:
> Adam Beneschan wrote:
>
> > However, at first glance, I didn't see a way to get Wide_Text_IO to
> > read a UCS-1 text file.
>
> Hmm, I've never heard of UCS-1. Is such an encoding really defined?

I don't know if that's the correct name. I have seen it referenced in a few places.

> > This is the encoding where each byte in the
> > range 16#00#..16#FF# represents a character in the range
> > Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
> > is no way to represent wide characters from 16#0100# to 16#FFFF#.
>
> OK, so it's identical to ISO 8859-1.

Technically, I thought ISO-8859-1 was a mapping from a range of integers to a set of characters, rather than a specification of how characters are represented in bits in an actual file. I could be wrong. The distinction gets blurry at times.

> > Does GNAT's Wide_Text_IO have a way to read a file like this?
>
> It does indeed look like it can't. Gnat's approach to character encodings is
> amazingly faulty.
>
> Does EAstrings fill your needs? If not, would you like to join me in
> finishing the implementation so we can get rid of these problems?
>
> http://adacl.sourceforge.net/AdaBrowse/adacl-eastrings.html

My question was more theoretical than anything---I was looking at that section of the manual for other reasons, and happened to notice what seemed like an omission. But thanks for the pointer. I'll take a look at it.

-- Adam

* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Manuel Collado @ 2006-12-04 23:35 UTC

Adam Beneschan wrote:
> Björn Persson wrote:
>> Adam Beneschan wrote:
>>
>>> However, at first glance, I didn't see a way to get Wide_Text_IO to
>>> read a UCS-1 text file.
>> Hmm, I've never heard of UCS-1. Is such an encoding really defined?
>
> I don't know if that's the correct name. I have seen it referenced in
> a few places.

To clarify things:

- Character set - mapping of characters to integers (the so-called 'codepoints')
- Character encoding - mapping of a sequence of codepoints to a sequence of bytes

UCS-1 means encoding each character (codepoint) as a single byte whose numerical value is just the codepoint. It can be used only for codepoints in the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit (and 7-bit) character sets.

>>> This is the encoding where each byte in the
>>> range 16#00#..16#FF# represents a character in the range
>>> Wide_Character'Val(16#0000#) .. Wide_Character'Val(16#00FF#), and there
>>> is no way to represent wide characters from 16#0100# to 16#FFFF#.

Yes, this is UCS-1.

>> OK, so it's identical to ISO 8859-1.
>
> Technically, I thought ISO-8859-1 was a mapping from a range of
> integers to a set of characters, rather than a specification of how
> characters are represented in bits in an actual file. I could be
> wrong. The distinction gets blurry at times.

Quite true. Technically, ISO-8859-1 is a character set (not a character encoding). It is usually encoded as UCS-1 (as are a lot of other character sets).

Regrettably, the terms 'character set' and 'character encoding' are used as synonyms in a lot of places.

Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

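The "UCS-1" idea described above, one byte per character with the byte value used directly as the codepoint, can be sketched in a few lines. This is only an illustration in Python, not GNAT code; the function names `decode_direct_8bit` and `encode_direct_8bit` are made up for this sketch. In Ada terms the per-character step would presumably be `Wide_Character'Val (Character'Pos (C))`.

```python
def decode_direct_8bit(data: bytes) -> str:
    """Map each byte to the code point with the same numeric value."""
    return "".join(chr(b) for b in data)


def encode_direct_8bit(text: str) -> bytes:
    """Inverse mapping; only code points 0..255 fit in a single byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            raise ValueError(f"U+{cp:04X} does not fit in one byte")
        out.append(cp)
    return bytes(out)


# 'A', e-acute, a C1 control character (16#85#), y-diaeresis
raw = bytes([0x41, 0xE9, 0x85, 0xFF])
text = decode_direct_8bit(raw)
```

In Python the built-in `latin-1` codec performs the same byte-to-codepoint mapping, which matches the observation in the thread that this encoding coincides with ISO 8859-1 as it is usually stored in files.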
* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Björn Persson @ 2006-12-06 23:46 UTC

Manuel Collado wrote:

> UCS-1 means encoding each character (codepoint) as a single byte whose
> numerical value is just the codepoint. Can be used only for codepoints in
> the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit
> (and 7-bit) character sets.

I'd still like to know where UCS-1 is defined, and by whom. http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2, ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1. http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4, but no UCS-1.

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge.
r o.b n.p son eri nu

* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Adam Beneschan @ 2006-12-07 2:02 UTC

Björn Persson wrote:
> Manuel Collado wrote:
> > UCS-1 means encoding each character (codepoint) as a single byte whose
> > numerical value is just the codepoint. Can be used only for codepoints in
> > the range (0..255). UCS-1 is the natural, implicit encoding of all 8-bit
> > (and 7-bit) character sets.
>
> I'd still like to know where UCS-1 is defined, and by whom.
> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
> but no UCS-1.

UCS-Basic may be the "official" name for what I'm talking about. Unfortunately, I'm having trouble figuring it out. The IANA website you referred me to is titled "Character Sets", but some of the things listed underneath are encoding standards (UTF-8, etc.) rather than character sets; UCS-Basic is listed as a "subset of Unicode", however, and Unicode is a character set (not an encoding; there are multiple ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2). So this page just exemplifies the sort of confusion Manuel referred to. A quick Google search hasn't provided any further enlightenment on exactly what UCS-Basic is. Specifically, I can't tell whether it's a character set or an encoding.

UCS-2 and UCS-4 are representations in which if an integer N maps to a character, then that character is represented simply by a 2- or 4-byte binary representation of N (byte ordering is an issue, though). So it would seem logical that UCS-1 would simply refer to a 1-byte binary representation of a number. That's how it seemed to me, and I did find other references to this term, so I figured it was the correct term. But maybe it isn't official. Sigh....

-- Adam

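The fixed-width representation described above, and the byte-ordering caveat, can be made concrete with a small sketch. It is written in Python purely for illustration; `encode_fixed_width` is a hypothetical helper, not part of any library discussed in this thread, and it deliberately ignores surrogates and byte-order marks.

```python
def encode_fixed_width(text: str, width: int, byteorder: str) -> bytes:
    """Write each code point N as a `width`-byte binary integer.

    width=2 and width=4 correspond to the UCS-2 and UCS-4 style
    representations; width=1 is the one-byte case discussed here.
    int.to_bytes raises OverflowError if N does not fit in `width` bytes.
    """
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(width, byteorder)
    return bytes(out)


e_acute = "\u00e9"
one_byte = encode_fixed_width(e_acute, 1, "big")
two_big = encode_fixed_width(e_acute, 2, "big")
two_little = encode_fixed_width(e_acute, 2, "little")
four_big = encode_fixed_width(e_acute, 4, "big")
```

With one byte per character there is nothing to reorder, so byte order only becomes a question for the 2- and 4-byte forms, where the big-endian and little-endian results differ.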
* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Björn Persson @ 2006-12-09 20:43 UTC

Adam Beneschan wrote:
> Björn Persson wrote:
>> Manuel Collado wrote:
>> > UCS-1 means encoding each character (codepoint) as a single byte whose
>> > numerical value is just the codepoint. Can be used only for codepoints
>> > in the range (0..255). UCS-1 is the natural, implicit encoding of all
>> > 8-bit (and 7-bit) character sets.
>>
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
>
> UCS-Basic may be the "official" name for what I'm talking about.

No, it can't be. It's described as "ASCII subset of Unicode. Basic Latin = collection 1", so it only encodes 128 characters.

> Unfortunately, I'm having trouble figuring it out. The IANA website
> you referred me to is titled "Character Sets", but some of the things
> listed underneath are encoding standards (UTF-8, etc.) rather than
> character sets; UCS-Basic is listed as a "subset of Unicode", however,
> and Unicode is a character set (not an encoding; there are multiple
> ways to encode Unicode characters, including UTF-8, UTF-16, UCS-2). So
> this page just exemplifies the sort of confusion Manuel referred to.

The names listed there are used in Internet protocols and data formats, such as MIME, HTTP and XML. They're also widely used in Unix-like systems to keep track of how text is encoded. They are for example used by Iconv, the transcoding library. In these applications there's never a need to specify both an encoding and a character set separately. One parameter is enough to specify how to translate an octet stream to a sequence of characters. Programs may do the translation in several steps, but that's each program's own business. The IETF isn't concerned with how software represents characters internally.

From a pragmatic viewpoint, there's no need to deal with different character sets now that we have Unicode. The way I think of it, there is only one character set, the universal character set, and a plethora of character encodings. Some encodings can encode any character in the UCS, but most only deal with some subset. Thus, the ISO 8859 series, the IBM codepages, the Windows character sets and all the other old eight-bit character sets can now be considered character encodings that each define both a subset of the UCS and a way to encode those characters as eight-bit numbers.

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge.
r o.b n.p son eri nu

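The point that one name is enough to turn an octet stream into characters, and that each old eight-bit encoding picks its own subset of the UCS, can be illustrated with Python's standard codecs. This is a sketch only; the byte value 16#A4# and the three encoding names are chosen here for illustration and do not come from the thread.

```python
# The same octet, interpreted under three different eight-bit encodings.
raw = bytes([0xA4])

# One parameter (the encoding name) selects both the repertoire and the
# byte-to-character mapping.
decoded = {name: raw.decode(name)
           for name in ("iso-8859-1", "iso-8859-15", "cp437")}

# Code points in the universal character set that the octet maps to.
code_points = {name: f"U+{ord(ch):04X}" for name, ch in decoded.items()}
```

Only under ISO 8859-1 does the resulting code point have the same numeric value as the byte, which is the property the thread calls "UCS-1"; the other two encodings map the same octet to unrelated positions in the UCS.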
* Re: Reading "normal" text files with Wide_Text_IO in GNAT
From: Manuel Collado @ 2006-12-11 19:49 UTC

Adam Beneschan wrote:
> Björn Persson wrote:
>> ...
>> I'd still like to know where UCS-1 is defined, and by whom.
>> http://www.iana.org/assignments/character-sets lists ISO-10646-UCS-2,
>> ISO-10646-UCS-4 and ISO-10646-UCS-Basic, but no UCS-1.
>> http://www.unicode.org/glossary/#U also has entries for UCS-2 and UCS-4,
>> but no UCS-1.
> ...
> UCS-2 and UCS-4 are representations in which if an integer N maps to a
> character, then that character is represented simply by a 2- or 4-byte
> binary representation of N (byte ordering is an issue, though). So it
> would seem logical that UCS-1 would simply refer to a 1-byte binary
> representation of a number. That's how it seemed to me, and I did find
> other references to this term, so I figured it was the correct term.
> But maybe it isn't official.

Well, it seems that there are no official names for simple, direct encodings (not tied to a given character set). In fact UCS-2 and UCS-4 are specific names for Unicode stuff (UCS means Universal Character Set).

Character encoding concepts are precisely defined in:
http://en.wikipedia.org/wiki/Character_encoding

As you can see, the encoding issue is composed of two separate ideas: the CEF (character encoding form) and the CES (character encoding scheme). Some of the latter have explicit names. But the direct CEFs are so simple that they don't need explicit names (just the size of the code value).

If we take UCS-2 and UCS-4 out of the Unicode world and use them as general names for direct CEFs with 16-bit and 32-bit code values, then UCS-1 becomes the natural name for the direct CEF with 8-bit code values, whether it is official or not.

Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado