A few questions on parsing, sockets, UTF-8 strings

comp.lang.ada

* A few questions on parsing, sockets, UTF-8 strings
@ 2016-08-11 14:39 john
  2016-08-11 16:23 ` Dmitry A. Kazakov
  0 siblings, 1 reply; 7+ messages in thread
From: john @ 2016-08-11 14:39 UTC


Hi! For some non-standard interprocess communication, I need to:

1. Listen with a TCP socket on the local loopback interface and obtain the PORT suggested by the OS, i.e. like with bind() to port 0 and getsockname() in C. Is this possible with GNAT.Sockets?

2. Connect to the incoming host and parse input and send output delimited by LF line-by-line. What about buffering, can it be switched off or is there a line buffer mode already? (It needs to be compatible with LF instead of CR+LF as delimiter, though.) 

3. Convert back and forth between Base64-encoded UTF-8 strings and UTF-8 strings (which may also be represented in the source text). How do I do this?

Especially the last point is a bit mysterious to me. I cannot use UCS-2, it needs to be full UTF-8. Should I use a library for UTF-8 strings? Which one? 

Speed is not very important, but I'd like to avoid unnecessary conversions.



* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 14:39 A few questions on parsing, sockets, UTF-8 strings john
@ 2016-08-11 16:23 ` Dmitry A. Kazakov
  2016-08-11 17:40   ` john
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry A. Kazakov @ 2016-08-11 16:23 UTC


On 2016-08-11 16:39, john@peppermind.com wrote:
> Hi! For some non-standard interprocess communication, I need to:
>
> 1. Listen with a TCP socket to the local loopback interface and
> obtain the PORT suggested by the OS, i.e. like with bind() to port 0 and
> getsockname() in C. Is this possible with GNAT.Sockets?

Why not?
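
A minimal sketch of how this could look with GNAT.Sockets, assuming a reasonably recent GNAT. The procedure name Ephemeral_Port is made up for the example; the calls themselves are from the GNAT.Sockets package. Binding to Any_Port (port 0) lets the OS pick a free port, and Get_Socket_Name reads it back before any connection has been accepted:

```ada
with Ada.Text_IO;
with GNAT.Sockets; use GNAT.Sockets;

procedure Ephemeral_Port is
   Server  : Socket_Type;
   Address : Sock_Addr_Type;  --  defaults to Family_Inet (IPv4)
begin
   Create_Socket (Server);    --  defaults: Family_Inet, Socket_Stream (TCP)

   Address.Addr := Loopback_Inet_Addr;  --  127.0.0.1
   Address.Port := Any_Port;            --  0: let the OS choose a port
   Bind_Socket (Server, Address);
   Listen_Socket (Server);

   --  Equivalent of getsockname(): query the address actually bound
   Address := Get_Socket_Name (Server);
   Ada.Text_IO.Put_Line
     ("Listening on port" & Port_Type'Image (Address.Port));

   Close_Socket (Server);
end Ephemeral_Port;
```

In a real server, Accept_Socket would be called on Server after the port has been reported to the peer.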

> 2. Connect to the incoming host and parse input and send output
> delimited by LF line-by-line. What about buffering, can it be switched
> off or is there a line buffer mode already? (It needs to be compatible
> with LF instead of CR+LF as delimiter, though.)

The user input buffer is the buffer you provide. A TCP/IP stream ignores
line terminators. They are just bytes like any others. You read until a
complete line is in the buffer, then do whatever you have to with the
line.
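
A rough sketch of that idea, assuming the data arrives in arbitrary chunks of Ada.Streams.Stream_Element_Array, as it would from a socket read. The names Line_Splitter, Feed, Handle_Line and To_Chunk are invented for the example, and the chunks are faked so that it runs without a peer:

```ada
with Ada.Characters.Latin_1;
with Ada.Streams;           use Ada.Streams;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Line_Splitter is
   Line_Feed : constant Stream_Element := 10;
   NL        : constant Character := Ada.Characters.Latin_1.LF;

   Pending : Unbounded_String;  --  bytes of the current, incomplete line

   --  Hypothetical handler: a complete line (without LF) ends up here
   procedure Handle_Line (Line : String) is
   begin
      Ada.Text_IO.Put_Line ("line: [" & Line & "]");
   end Handle_Line;

   --  Consume one chunk; a line may span several chunks
   procedure Feed (Chunk : Stream_Element_Array) is
   begin
      for Byte of Chunk loop
         if Byte = Line_Feed then
            Handle_Line (To_String (Pending));
            Pending := Null_Unbounded_String;
         else
            Append (Pending, Character'Val (Byte));
         end if;
      end loop;
   end Feed;

   --  Test helper only: turn a String into a chunk of raw bytes
   function To_Chunk (S : String) return Stream_Element_Array is
      Result : Stream_Element_Array (1 .. S'Length);
   begin
      for I in S'Range loop
         Result (Stream_Element_Offset (I - S'First + 1)) :=
           Character'Pos (S (I));
      end loop;
      return Result;
   end To_Chunk;
begin
   Feed (To_Chunk ("hel"));
   Feed (To_Chunk ("lo" & NL & "wor"));
   Feed (To_Chunk ("ld" & NL));
end Line_Splitter;
```

This prints "line: [hello]" and "line: [world]". With a real connection, the chunk passed to Feed would be Buffer (Buffer'First .. Last) after a call to GNAT.Sockets.Receive_Socket (Client, Buffer, Last).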

> 3. Convert back and forth between Base64-encoded UTF-8 strings and
> UTF-8 strings (which may also be represented in the source text). How do
> I do this?
>
> Especially the last point is a bit mysterious to me. I cannot use
> UCS-2, it needs to be full UTF-8. Should I use a library for UTF-8
> strings? Which one?

I have one:

http://dmitry-kazakov.de/ada/strings_edit.htm#12

There must be others surely.

P.S. There is no point using Base64 encoding over TCP/IP unless a 
specific protocol requires it.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 16:23 ` Dmitry A. Kazakov
@ 2016-08-11 17:40   ` john
  2016-08-11 17:49     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 7+ messages in thread
From: john @ 2016-08-11 17:40 UTC

On Thursday, August 11, 2016 at 5:24:06 PM UTC+1, Dmitry A. Kazakov wrote:

> 
> The user input buffer is the buffer you provide. A TCP/IP stream ignores
> line terminators. They are just bytes like any others. You read until a
> complete line is in the buffer, then do whatever you have to with the
> line.

Thanks, that's what I thought. Just wanted to make sure there is no showstopper before I start. (I need to be able to get the port from the listening socket *before* a connection is made.)

> I have one:
> 
> http://dmitry-kazakov.de/ada/strings_edit.htm#12

Great, I'll take a look at it.

I'm still confused, though. Strings are really kind of a mess in Ada, IMHO. Do I really need your package? Or could I use String instead? Suppose the GPS source code option is set to UTF-8, so a string literal in the source code should be UTF-8 data (inside a fixed String type). Right? So if I Base64 encode this directly, do I have to care about UTF-8?

> P.S. There is no point using Base64 encoding over TCP/IP unless a 
> specific protocol requires it.

I agree, and it's my own protocol, but in this case I believe Base64 makes sense despite the inefficiency. The protocol should be as easy as possible to implement in basically any language that has strings and TCP sockets, and many languages have built-in Base64 encoders and decoders.

Base64 without LFs and CRs is a brute-force way of string escaping in this case. Originally I used string escapes like in Scheme and Lisp with an appropriate reader. That's nice for debugging, but way too complicated for parsing. Since Base64 does not contain spaces either, I can split strings on whitespace when a line is read, and many languages have built-in commands for this, too.
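
Standard Ada has no Base64 package, so for illustration here is a minimal encoder sketch using the standard alphabet with "=" padding and no line breaks. Base64_Demo and Encode are names invented for the example. It works on the raw bytes of a String, so it does not care whether those bytes happen to be UTF-8:

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Base64_Demo is
   Alphabet : constant String :=
     "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

   function Encode (Data : String) return String is
      --  Every group of up to 3 input bytes becomes 4 output characters
      Result : String (1 .. (Data'Length + 2) / 3 * 4);
      Pos    : Positive := 1;
      I      : Integer  := Data'First;
      Group  : Unsigned_32;
      Count  : Natural;  --  input bytes in the current group (1 .. 3)
   begin
      while I <= Data'Last loop
         Count := Integer'Min (3, Data'Last - I + 1);

         --  Pack the group into 24 bits, zero-filling missing bytes
         Group := 0;
         for K in 0 .. 2 loop
            Group := Shift_Left (Group, 8);
            if K < Count then
               Group := Group or Character'Pos (Data (I + K));
            end if;
         end loop;

         --  Emit four 6-bit digits, replacing unused ones by padding
         for K in 0 .. 3 loop
            if K <= Count then
               Result (Pos + K) := Alphabet
                 (Natural (Shift_Right (Group, 18 - 6 * K) and 63) + 1);
            else
               Result (Pos + K) := '=';
            end if;
         end loop;

         Pos := Pos + 4;
         I   := I + 3;
      end loop;
      return Result;
   end Encode;
begin
   Ada.Text_IO.Put_Line (Encode ("Man"));  --  TWFu
   Ada.Text_IO.Put_Line (Encode ("Ma"));   --  TWE=
   Ada.Text_IO.Put_Line (Encode ("M"));    --  TQ==
end Base64_Demo;
```

The output never contains spaces, CR or LF, which is the property the whitespace-splitting approach relies on.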


* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 17:40   ` john
@ 2016-08-11 17:49     ` Dmitry A. Kazakov
  2016-08-11 18:22       ` john
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry A. Kazakov @ 2016-08-11 17:49 UTC


On 2016-08-11 19:40, john@peppermind.com wrote:

> I'm still confused, though. Strings are really kind of a mess in
> Ada, IMHO. Do I really need your package? Or could I use String instead?

Instead of what?

> Suppose the GPS source code option is set to UTF-8, so a string literal
> in the source code should be UTF-8 data (inside a fixed String type).
> Right?

An ASCII string is a UTF-8 string. The reverse is false.

> So if I Base64 encode this directly, do I have to care about UTF-8?

No, if it is strictly ASCII. Yes, if you are going to use other Unicode 
code points.

>> P.S. There is no point using Base64 encoding over TCP/IP unless a
>> specific protocol requires it.
>
> I agree, and it's my own protocol, but in this case I believe Base64
> makes sense despite the inefficiency. The protocol should be as easy as
> possible to implement in basically any language that has strings and TCP
> sockets, and many languages have built-in Base64 encoders and decoders.

Then, what's the problem? Use no Base64 and no line terminators. Pass
the packet length, then the packet contents. No encoding, no recoding,
and you always know how many bytes to read next.
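
A small sketch of such framing. The wire format is an assumption made for the example (a 4-byte big-endian length followed by the payload bytes), as are the names Length_Prefix, Frame and Payload_Length:

```ada
with Ada.Streams; use Ada.Streams;
with Ada.Text_IO;

procedure Length_Prefix is
   --  Assumed header: payload length as 4 bytes, most significant first
   Header_Size : constant := 4;

   --  Build one packet: header followed by the payload bytes
   function Frame (Payload : String) return Stream_Element_Array is
      Result : Stream_Element_Array (1 .. Header_Size + Payload'Length);
      Length : Natural := Payload'Length;
      Index  : Stream_Element_Offset := Header_Size + 1;
   begin
      for I in reverse Stream_Element_Offset range 1 .. Header_Size loop
         Result (I) := Stream_Element (Length mod 256);
         Length := Length / 256;
      end loop;
      for C of Payload loop
         Result (Index) := Character'Pos (C);
         Index := Index + 1;
      end loop;
      return Result;
   end Frame;

   --  Decode the header; raises Constraint_Error above Natural'Last
   function Payload_Length (Header : Stream_Element_Array) return Natural is
      Length : Natural := 0;
   begin
      for Byte of Header loop
         Length := Length * 256 + Natural (Byte);
      end loop;
      return Length;
   end Payload_Length;

   Packet : constant Stream_Element_Array := Frame ("hello");
begin
   Ada.Text_IO.Put_Line
     ("packet bytes: " & Stream_Element_Offset'Image (Packet'Length));
   Ada.Text_IO.Put_Line
     ("payload bytes:"
      & Natural'Image (Payload_Length (Packet (1 .. Header_Size))));
end Length_Prefix;
```

The receiver first reads exactly Header_Size bytes, decodes the length, and then reads exactly that many payload bytes. For "hello" the packet is 9 bytes long and the decoded payload length is 5.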

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 17:49     ` Dmitry A. Kazakov
@ 2016-08-11 18:22       ` john
  2016-08-11 19:09         ` gautier_niouzes
  2016-08-11 21:10         ` Dmitry A. Kazakov
  0 siblings, 2 replies; 7+ messages in thread
From: john @ 2016-08-11 18:22 UTC

On Thursday, August 11, 2016 at 6:49:33 PM UTC+1, Dmitry A. Kazakov wrote:

> An ASCII string is a UTF-8 string. The reverse is false.

You're right, ASCII uses only 0..127 as code points. But I thought that Ada fixed strings hold one byte per character, meaning that I can store UTF-8 in them? Am I mistaken about that?

> > So if I Base64 encode this directly, do I have to care about UTF-8?
> 
> No, if it is strictly ASCII. Yes, if you are going to use other Unicode 
> code points.

Sorry for being such a noob, but I still don't get it. If GNAT GPS is set to UTF-8 (-gnatW8 for gnatmake and source encoding in GPS preferences), doesn't that mean that if I enter a Unicode character into a fixed string literal (just String, not Wide_String or Wide_Wide_String) that the string will contain this character in the form of as many bytes as the Unicode code point requires? So if it's a two-byte UTF-8 code point, then the string will contain two bytes?

In that case, as long as I don't need to access single characters ever, could I stick with fixed strings?


* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 18:22       ` john
@ 2016-08-11 19:09         ` gautier_niouzes
  2016-08-11 21:10         ` Dmitry A. Kazakov
  1 sibling, 0 replies; 7+ messages in thread
From: gautier_niouzes @ 2016-08-11 19:09 UTC


> In that case, as long as I don't need to access single characters ever, could I stick with fixed strings?

Exactly. String is just an array of (8-bit) Character, so you can have UTF-8 strings stored there (or ASCII, or other things...), but a single "Unicode character" will take one *or more* Characters in a String.
As a reminder, you can define "subtype UTF_8_String is String;", just to be aware that taking a single Character from your String can be meaningless.
But wait, the package Ada.Strings.UTF_Encoding does it for you, and also provides conversion functions.
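
A short sketch of those conversions with the standard Ada 2012 packages. Only the procedure name UTF8_Demo and the local renaming Conv are invented here. The arrow is built from its code point so that the source file stays pure ASCII:

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF8_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  U+2190 LEFTWARDS ARROW as a one-element Wide_Wide_String
   Arrow : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#2190#));

   --  UTF_8_String is a subtype of String: one Character per byte
   Encoded : constant UTF_8_String := Conv.Encode (Arrow);
begin
   Ada.Text_IO.Put_Line ("code points:" & Natural'Image (Arrow'Length));
   Ada.Text_IO.Put_Line ("UTF-8 bytes:" & Natural'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("round trip:  " & Boolean'Image (Conv.Decode (Encoded) = Arrow));
end UTF8_Demo;
```

This reports 1 code point, 3 UTF-8 bytes, and a successful round trip, which shows why indexing a single Character of such a String is not the same as taking a Unicode character.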
_________________________ 
Gautier's Ada programming 
http://sf.net/users/gdemont/


* Re: A few questions on parsing, sockets, UTF-8 strings
  2016-08-11 18:22       ` john
  2016-08-11 19:09         ` gautier_niouzes
@ 2016-08-11 21:10         ` Dmitry A. Kazakov
  1 sibling, 0 replies; 7+ messages in thread
From: Dmitry A. Kazakov @ 2016-08-11 21:10 UTC

On 2016-08-11 20:22, john@peppermind.com wrote:
> On Thursday, August 11, 2016 at 6:49:33 PM UTC+1, Dmitry A. Kazakov wrote:
>
> You're right, ASCII uses only 0..127 as code points. But I thought
> that Ada fixed strings hold one byte per character, meaning that I can
> store UTF-8 in them?

You can. Formally you should not, because RM 3.5.2 defines Character as 
Latin-1, but in practice nobody cares.

> Sorry for being such a noob, but I still don't get it. If GNAT GPS
> is set to UTF-8 (-gnatW8 for gnatmake and source encoding in GPS
> preferences), doesn't that mean that if I enter a Unicode character into
> a fixed string literal (just String, not Wide_String or
> Wide_Wide_String) that the string will contain this character in the
> form of as many bytes as the Unicode code point requires?

Source encoding is not the encoding of program strings. You had better
not use non-ASCII literals if you want your sources to be portable.

If you need some Unicode code points, use explicit conversions. With this:

    http://dmitry-kazakov.de/ada/strings_edit.htm#7.1

    Left_Arrow : constant String := Strings_Edit.UTF8.Image (16#2190#);

> So if it's a
> two-byte UTF-8 code point, then the string will contain two bytes?

The numeric value of a code point can need more than 2 bytes. The range
is 0..16#10FFFF#. The length of the UTF-8 representation depends on the
code point value. It can be longer than 2 bytes, up to 4. The maximal
length of a UCS-2 character is 3 bytes in UTF-8.
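
The dependence of the encoded length on the code point value can be written down directly. This is a sketch with invented names (UTF8_Length_Demo, Code_Point, UTF8_Length, Show). It ignores the fact that the surrogate range 16#D800#..16#DFFF# does not hold valid characters:

```ada
with Ada.Text_IO;

procedure UTF8_Length_Demo is
   subtype Code_Point is Natural range 0 .. 16#10FFFF#;

   --  Number of bytes the UTF-8 encoding of Code occupies
   function UTF8_Length (Code : Code_Point) return Positive is
   begin
      case Code is
         when 0 .. 16#7F#               => return 1;  --  ASCII
         when 16#80# .. 16#7FF#         => return 2;
         when 16#800# .. 16#FFFF#       => return 3;  --  rest of UCS-2
         when 16#10000# .. 16#10FFFF#   => return 4;  --  beyond UCS-2
      end case;
   end UTF8_Length;

   procedure Show (Code : Code_Point) is
   begin
      Ada.Text_IO.Put_Line
        (Code_Point'Image (Code) & " ->"
         & Positive'Image (UTF8_Length (Code)));
   end Show;
begin
   Show (16#41#);     --  Latin capital letter A
   Show (16#E9#);     --  Latin small letter e with acute
   Show (16#2190#);   --  leftwards arrow
   Show (16#1F600#);  --  a code point outside the UCS-2 range
end UTF8_Length_Demo;
```

The four calls give lengths of 1, 2, 3 and 4 bytes respectively.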

> In that case, as long as I don't need to access single characters
> ever, could I stick with fixed strings?

Yes. But there is no problem accessing single code points either. UTF-8 
was designed for easy forward and backward navigation. In the package 
above Get takes code points moving forward and Get_Backwards does it 
backwards.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

