Forum: War Ensemble BBS

utf-8 support in gnu cobol (3.0, under linux)

From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 09:03:09 2018

From Newsgroup: comp.lang.cobol

Hello:
I don't quite understand the issue of utf-8 support in gnu cobol:
- When I use unicode characters in literal strings, all seems to work:
DISPLAY 'שלום לכולם! ﷽'.
Prints out some unicode characters to the screen to everyone's delight...
But when I try to see what integer value these characters have, e.g., using REDEFINES, I just get strings that contain the digits of the number:
    WORKING-STORAGE SECTION.
    01 unicode-char.
       02 integer-value pic 9(4).
       02 utf8-value redefines integer-value pic x(4).
And now when I do:
    move 1488 to integer-value.
    display utf8-value.
I get 1488 printed back at me. So I try further: I add
    77 ws-utf8-value pic x(4).
and when I try to do a move to it:
    move 1488 to integer-value.
    move utf8-value to ws-utf8-value.
    display ws-utf8-value.
I still get 1488... as a string.
So can someone please tell me how I can get at the integer value of a utf-8 character within gnu cobol? What am I missing?
Thanks,
Mayer
--- Synchronet 3.20a-Linux NewsLink 1.114

From Richard@riplin@azonic.co.nz to comp.lang.cobol on Thu Sep 27 12:12:32 2018

From Newsgroup: comp.lang.cobol

UTF-8 is a variable-length encoding built from 8-bit bytes. The basic characters are the same as ASCII and are held in one byte each. All other characters use two or more bytes, each byte of which lies outside the ASCII range of 00 to 7F.
Your 'integer-value' is PIC 9(4), which is display numeric: each digit is stored as an ASCII digit character, so 1488 is held as the 4-byte string '1488'.
To see the binary value of each byte, redefine the characters as one-byte binary items, such as BINARY-CHAR UNSIGNED:

    01 unicode-char.
       02 utf8-value pic x(4).
       02 utf8-codes redefines utf8-value.
          03 utf8-code usage binary-char unsigned occurs 4.
Move each UTF-8 character to utf8-value and read the value of each byte from utf8-code(n). Characters other than a-z, A-Z, 0-9 and the ASCII punctuation and control characters occupy more than one byte position.
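
A minimal sketch of how that layout could be exercised, assuming GnuCOBOL in its default fixed format on an ASCII-based Linux system. The program name and the fields idx and byte-shown are hypothetical additions, and the hex literal X"D790" stands in for a typed alef so the source does not depend on the editor's encoding:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. utf8bytes.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  unicode-char.
           02  utf8-value      PIC X(4).
           02  utf8-codes      REDEFINES utf8-value.
               03  utf8-code   USAGE BINARY-CHAR UNSIGNED
                               OCCURS 4 TIMES.
      * Hypothetical helpers, not part of the layout above.
       01  idx                 PIC 9.
       01  byte-shown          PIC 9(3).
       PROCEDURE DIVISION.
      * X"D790" is the two-byte UTF-8 form of U+05D0 (alef).
      * The MOVE pads the remaining two bytes with spaces.
           MOVE X"D790" TO utf8-value
           PERFORM VARYING idx FROM 1 BY 1 UNTIL idx > 4
      *        Copy the raw byte into a display field to print it.
               MOVE utf8-code (idx) TO byte-shown
               DISPLAY "byte " idx " = " byte-shown
           END-PERFORM
           STOP RUN.
```

Built with cobc -x, this should report 215 and 144 for the first two bytes and 032 (the ASCII space used as padding) for the last two.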


From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 14:58:14 2018

From Newsgroup: comp.lang.cobol

Is this really that easy?? I entered the letter alef as a literal, which is 1488 in utf-8. Out came \215\144 in octal (why octal??). This is weird, because these are not the bytes that correspond to 1488, but anyway: When I tried entering the respective literals 141 and 100 into utf8-code(1), utf8-code(2), and printing utf8-value I didn't get alef. I thought the whole point of REDEFINES is that things are supposed to go in both directions (!). I can't have
symbolic characters alef 1488.
because the number needs to be between 0..255. So:
(1) Can I go both ways between utf8 and its integer representation?
(2) Can I use binary-short instead of binary-char to represent a single two-byte value?
(3) Can I define symbolic character names for utf8 characters?
Thanks,
Mayer

From Robert Wessel@robertwessel2@yahoo.com to comp.lang.cobol on Thu Sep 27 18:24:39 2018

From Newsgroup: comp.lang.cobol

\215\144 would appear to be a decimal representation of the character's
bytes. U+05D0 (ALEF) encodes in UTF-8 as 0xD7, 0x90, which in decimal
is 215 and 144. In C syntax the escapes would be octal, so the string
would be "\327\220".
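
The arithmetic behind those two bytes can be sketched in COBOL as well. This is only an illustration under stated assumptions: it handles two-byte characters only (code points 128 through 2047), it assumes GnuCOBOL in fixed format on Linux, and the names code-point, lead-part and tail-part are hypothetical. For such a code point the lead byte is 192 plus the quotient of the code point divided by 64, and the trailing byte is 128 plus the remainder:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. cp2utf8.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  unicode-char.
           02  utf8-value      PIC X(4).
           02  utf8-codes      REDEFINES utf8-value.
               03  utf8-code   USAGE BINARY-CHAR UNSIGNED
                               OCCURS 4 TIMES.
      * Hypothetical work fields for the arithmetic.
       01  code-point          PIC 9(4) VALUE 1488.
       01  lead-part           PIC 9(4).
       01  tail-part           PIC 9(4).
       PROCEDURE DIVISION.
      * Only valid for code points 128 through 2047 (two bytes).
           DIVIDE code-point BY 64
               GIVING lead-part REMAINDER tail-part
           ADD 192 TO lead-part
           ADD 128 TO tail-part
           MOVE SPACES TO utf8-value
      *    Store the numbers as raw bytes through the REDEFINES.
           MOVE lead-part TO utf8-code (1)
           MOVE tail-part TO utf8-code (2)
           DISPLAY "bytes: " lead-part " " tail-part
           DISPLAY "glyph: " utf8-value (1:2)
           STOP RUN.
```

For 1488 the quotient is 23 and the remainder 16, so the bytes come out as 215 and 144, matching 0xD7 0x90. On a UTF-8 terminal the second DISPLAY should show alef, which also shows the REDEFINES working in the number-to-character direction.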


From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 16:31:44 2018

From Newsgroup: comp.lang.cobol

You're right. Sorry. I forgot how characters are encoded in utf8. Thanks!
