Hello:
I don't quite understand the issue of utf-8 support in gnu cobol:
- When I use unicode characters in literal strings, all seems to work:
DISPLAY 'שלום לכולם! ﷽'.
Prints out some unicode characters to the screen to everyone's delight...
But when I try to see what integer value these characters have, e.g., using REDEFINES, I get strings that contain the numerical value:
WORKING-STORAGE SECTION.
01 unicode-char.
02 integer-value pic 9(4).
02 utf8-value redefines integer-value pic x(4).
And now when I do:
move 1488 to integer-value.
display utf8-value.
I get 1488 printed back at me. So I try further: I add
77 ws-utf8-value pic x(4).
and when I try to do a move to it:
move 1488 to integer-value.
move utf8-value to ws-utf8-value.
display ws-utf8-value.
I still get 1488... as a string.
So can someone please tell me how can I get at the integer value of a utf-8 character within gnu cobol? What am I missing?
Thanks,
Mayer
On Friday, September 28, 2018 at 4:03:10 AM UTC+12, Mayer Goldberg wrote:
[...]
UTF-8 is a variable-length encoding built from 8-bit bytes. The basic characters are the same as ASCII and are held in one byte. Extended characters use two to four bytes, each with a value outside the ASCII range of 00 to 7F.
Your 'integer-value' is PIC 9(4), which is display numeric. Each digit is represented by an ASCII numeric character, so '1488' is a 4-byte string.
In order to see the binary code value of a byte, you need to redefine each character as a one-byte binary item, such as BINARY-CHAR UNSIGNED.
01 unicode-char.
02 utf8-value pic x(4).
02 utf8-codes redefines utf8-value.
03 utf8-code usage binary-char unsigned occurs 4.
Move each UTF-8 character to utf8-value and read the value of each byte from utf8-code(n). Characters other than a-z, A-Z, 0-9 and the ASCII special characters occupy more than one byte position.
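As a minimal sketch of that layout in a complete program (assuming GnuCOBOL and fixed-format source; the program name utf8bytes and the counter idx are made up for illustration), the following moves the two UTF-8 bytes of alef into utf8-value as a hex literal and displays each byte code:

```cobol
       identification division.
       program-id. utf8bytes.

       data division.
       working-storage section.
       *> Same layout as above: four bytes seen as text and as codes.
       01 unicode-char.
          02 utf8-value pic x(4).
          02 utf8-codes redefines utf8-value.
             03 utf8-code usage binary-char unsigned occurs 4.
       *> idx is a hypothetical loop counter, not from the thread.
       01 idx pic 9.

       procedure division.
           *> X"D790" is the UTF-8 encoding of alef (U+05D0).
           *> The MOVE pads the remaining two bytes with spaces.
           move x"D790" to utf8-value
           perform varying idx from 1 by 1 until idx > 4
               display "byte " idx " = " utf8-code (idx)
           end-perform
           stop run.
```

On an ASCII platform this should report the values 215, 144, 32 and 32: the two bytes of alef followed by two padding spaces. The exact DISPLAY formatting of the numbers may differ between compilers.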
Is this really that easy?? I entered the letter alef as a literal, which is 1488 in utf-8. Out came \215\144 in octal (why octal??). This is weird, because these are not the bytes that correspond to 1488, but anyway: When I tried entering the respective literals 141 and 100 into utf8-code(1), utf8-code(2), and printing utf8-value I didn't get alef. I thought the whole point of REDEFINES is that things are supposed to go in both directions (!). I can't have
symbolic characters alef 1488.
because the number needs to be between 0..255. So:
(1) Can I go both ways between utf8 and its integer representation?
(2) Can I use binary-short instead of binary-char to represent a single two-byte value?
(3) Can I define symbolic character names for utf8 characters?
Thanks,
Mayer
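A possible approach to questions (1) and (3), sketched under some assumptions: alef is U+05D0 (1488 decimal), whose UTF-8 encoding is the byte pair X"D7" X"90"; the arithmetic below only covers two-byte sequences (U+0080 to U+07FF); and the names alef, code-point, lead-bits and tail-bits are invented for the example. Because SYMBOLIC CHARACTERS names a single character position, an ordinary data item with a hex VALUE stands in for the named character here.

```cobol
       identification division.
       program-id. alefcode.

       data division.
       working-storage section.
       *> A named two-byte item in place of a symbolic character.
       01 alef pic x(2) value x"D790".
       01 utf8-pair.
          02 utf8-text pic x(2).
          02 utf8-bytes redefines utf8-text.
             03 utf8-byte usage binary-char unsigned occurs 2.
       *> code-point and the two work fields are hypothetical names.
       01 code-point pic 9(4).
       01 lead-bits  pic 9(4).
       01 tail-bits  pic 9(4).

       procedure division.
           *> Bytes to code point: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
           move alef to utf8-text
           compute code-point =
               (utf8-byte (1) - 192) * 64 + (utf8-byte (2) - 128)
           display "code point: " code-point

           *> Code point to bytes, valid only for U+0080 to U+07FF.
           move spaces to utf8-text
           move 1488 to code-point
           divide code-point by 64
               giving lead-bits remainder tail-bits
           compute utf8-byte (1) = 192 + lead-bits
           compute utf8-byte (2) = 128 + tail-bits
           display "bytes: " utf8-byte (1) " " utf8-byte (2)
           display "text: " utf8-text
           stop run.
```

If this works as intended, it should report 1488 for the code point and 215 and 144 for the bytes, and the last DISPLAY should show alef on a UTF-8 terminal. On question (2), a BINARY-SHORT redefinition would give the two bytes as one number, but not the code point, because the UTF-8 marker bits are still part of that number.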
On Thursday, September 27, 2018 at 10:12:33 PM UTC+3, Richard wrote:
[...]
\215\144 would appear to be a decimal representation of the bytes, not octal.
1488 is the code point itself (0x5D0 in hex), not its UTF-8 encoding.
U+05D0 (ALEF) encodes in UTF-8 as 0xD7, 0x90, which in decimal
is 215 and 144. In C syntax the escapes would be octal, so the string
would be "\327\220".
On Thu, 27 Sep 2018 14:58:14 -0700 (PDT), Mayer Goldberg <mayer.goldberg@gmail.com> wrote:
[...]