• utf-8 support in gnu cobol (3.0, under linux)

    From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 09:03:09 2018
    From Newsgroup: comp.lang.cobol

    Hello:
    I don't quite understand the issue of utf-8 support in gnu cobol:
    - When I use unicode characters in literal strings, all seems to work:
    DISPLAY 'שלום לכולם! ﷽'.
    Prints out some unicode characters to the screen to everyone's delight...
    But when I try to see what integer value these characters have, e.g., using REDEFINES, I just get strings containing the digits of the number:
    WORKING-STORAGE SECTION.
    01 unicode-char.
       02 integer-value pic 9(4).
       02 utf8-value redefines integer-value pic x(4).
    And now when I do:
    move 1488 to integer-value.
    display utf8-value.
    I get 1488 printed back at me. So I try further: I add
    77 ws-utf8-value pic x(4).
    and when I try to do a move to it:
    move 1488 to integer-value.
    move utf8-value to ws-utf8-value.
    display ws-utf8-value.
    I still get 1488... as a string.
    So can someone please tell me how I can get at the integer value of a utf-8 character within gnu cobol? What am I missing?
    Thanks,
    Mayer
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Richard@riplin@azonic.co.nz to comp.lang.cobol on Thu Sep 27 12:12:32 2018
    From Newsgroup: comp.lang.cobol

    UTF-8 is a variable-length encoding: each Unicode character is stored as one to four 8-bit bytes. The basic characters are the same as ASCII and are held in one byte. All other characters use two to four bytes, each with a value outside the ASCII range of 00 to 7F.
    Your 'integer-value' is PIC 9(4), which is display numeric: each digit is represented by an ASCII numeric character, so '1488' is a 4-byte string.
    In order to see the binary value of a byte you need to redefine each character position as a one-byte binary item, such as BINARY-CHAR UNSIGNED.

    01 unicode-char.
       02 utf8-value pic x(4).
       02 utf8-codes redefines utf8-value.
          03 utf8-code usage binary-char unsigned occurs 4.
    Move each UTF-8 character to utf8-value and read the value of each byte from utf8-code(n). Characters outside the ASCII range (a-z, A-Z, 0-9 and some special characters) use more than one character position.
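
    For what it's worth, a complete program around that layout might look like the sketch below. The program name and the loop counter ws-idx are made up for the example. It assumes the source file is saved as UTF-8 and compiled in free format (for instance with cobc -x -free).

    ```cobol
    identification division.
    program-id. utf8bytes.
    data division.
    working-storage section.
    01 unicode-char.
       02 utf8-value pic x(4).
       02 utf8-codes redefines utf8-value.
          03 utf8-code usage binary-char unsigned occurs 4.
    *> ws-idx is a hypothetical loop counter, not part of the layout
    01 ws-idx pic 9.
    procedure division.
        *> alef takes two bytes in UTF-8; MOVE pads the rest with spaces
        move 'א' to utf8-value
        perform varying ws-idx from 1 by 1 until ws-idx > 4
            display 'byte ' ws-idx ': ' utf8-code(ws-idx)
        end-perform
        stop run.
    ```

    With alef in the literal, the first two bytes should come out as 215 and 144, and the remaining two as 32, the ASCII space used for padding.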
  • From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 14:58:14 2018
    From Newsgroup: comp.lang.cobol

    Is this really that easy?? I entered the letter alef as a literal, which is 1488 in utf-8. Out came \215\144 in octal (why octal??). This is weird, because these are not the bytes that correspond to 1488, but anyway: When I tried entering the respective literals 141 and 100 into utf8-code(1), utf8-code(2), and printing utf8-value I didn't get alef. I thought the whole point of REDEFINES is that things are supposed to go in both directions (!). I can't have
    symbolic characters alef 1488.
    because the number needs to be between 0..255. So:
    (1) Can I go both ways between utf8 and its integer representation?
    (2) Can I use binary-short instead of binary-char to represent a single two-byte value?
    (3) Can I define symbolic character names for utf8 characters?
    Thanks,
    Mayer
  • From Robert Wessel@robertwessel2@yahoo.com to comp.lang.cobol on Thu Sep 27 18:24:39 2018
    From Newsgroup: comp.lang.cobol

    \215\144 would appear to be a decimal representation of the bytes of the character.
    U+05D0 (ALEF) encodes in UTF-8 as the two bytes 0xD7, 0x90, which in decimal
    are 215 and 144. In C syntax the escapes would be octal, so the string would be
    "\327\220".
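
    On going both ways: for a two-byte sequence the first byte is 192 plus the code point divided by 64, and the second is 128 plus the remainder, so 1488 gives 192 + 23 = 215 and 128 + 16 = 144. A minimal sketch of that arithmetic follows, assuming free-format source (cobc -x -free) and a UTF-8 terminal. The names code-point, lead-part and tail-part are hypothetical, and the formula only holds for code points 128 through 2047.

    ```cobol
    identification division.
    program-id. utf8both.
    data division.
    working-storage section.
    01 unicode-char.
       02 utf8-value pic x(2).
       02 utf8-codes redefines utf8-value.
          03 utf8-code usage binary-char unsigned occurs 2.
    *> hypothetical work fields for the example
    01 code-point pic 9(4).
    01 lead-part  pic 9(4).
    01 tail-part  pic 9(4).
    procedure division.
        *> encode: code point 1488 (U+05D0) into two UTF-8 bytes
        move 1488 to code-point
        divide code-point by 64 giving lead-part remainder tail-part
        compute utf8-code(1) = 192 + lead-part
        compute utf8-code(2) = 128 + tail-part
        display utf8-code(1) ' ' utf8-code(2) ' ' utf8-value
        *> decode: strip the marker bits and recombine
        compute code-point = (utf8-code(1) - 192) * 64
                           + (utf8-code(2) - 128)
        display code-point
        stop run.
    ```

    The first DISPLAY should show 215 and 144 followed by the alef itself, and the second should show 1488 again. Three-byte and four-byte characters need the same idea with more continuation bytes.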


  • From Mayer Goldberg@mayer.goldberg@gmail.com to comp.lang.cobol on Thu Sep 27 16:31:44 2018
    From Newsgroup: comp.lang.cobol

    You're right. Sorry. I forgot how characters are encoded in utf8. Thanks!