• Tcl9: source files are interpreted as utf-8 by default

    From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Fri Dec 13 16:02:51 2024
    From Newsgroup: comp.lang.tcl

    Folks,

    Is it possible in Tcl 9 to get the old (8.x) behaviour back,
    so that Tcl files are read with the system encoding instead of
    utf-8?
    Is there e.g. an environment variable or a configure
    switch to change this?

    I found:

    --with-encoding encoding for configuration values (default: utf-8)

    but what is meant by "configuration values"?

    My problem is that almost all of my sources contain some
    umlauts and are, for legacy reasons, in iso8859-1. If it's
    not possible to get the old behavior back, I have to
    edit every pkgIndex.tcl and every tclsh <script> call
    before I'm able to migrate.

    Thanks in advance
    Uwe

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Fri Dec 13 16:17:48 2024
    From Newsgroup: comp.lang.tcl

    Am 13.12.2024 um 16:02 schrieb Uwe Schmitz:
    Folks,

    is it possible in Tcl9 to get the old (8.x) behaviour back,
    that tcl files are read with system encoding instead of
    utf-8?
    Is there e.g. an environment variable or a configure
    switch to change this?

    I found:

    --with-encoding         encoding for configuration values (default: utf-8)

    but, what is meant with "configuration values"?

    I've the problem that almost all of my sources contain some
    umlauts and are, for legacy reasons, in iso8859-1. If it's
    not possible to get the old behavior back, I have to
    edit every pkgIndex.tcl and every tclsh <script> call
    before I'm able to migrate.

    Thanks in advance
    Uwe


    Hi Uwe,
    source -encoding iso8859-1 $file

    I put these in the package index files, which source the package files.
    Note that msgcat message files are always read as utf-8.

    The great Tcl 9 migration helper by Ashok tries to find the affected files.
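    A pkgIndex.tcl entry along these lines could look like this (the
    package name, version, and file name are made up for illustration):

```tcl
# pkgIndex.tcl -- load the package file with an explicit encoding.
# "mypkg" and mypkg.tcl are placeholders for the real package.
package ifneeded mypkg 1.0 \
    [list source -encoding iso8859-1 [file join $dir mypkg.tcl]]
```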

    Take care,
    Harald
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Fri Dec 13 16:41:45 2024
    From Newsgroup: comp.lang.tcl

    Hi Harald,

    source -encoding iso8859-1 $file

    I put those in the package index files which source the package files.
    Remark that msgcat message files are always in utf-8.

    I was worried that I would have to add this to all
    my packages. However, that seems manageable.

    I'm more worried that I also have to add the encoding
    to all my ksh/bash scripts that call Tcl via
    tclsh <scriptfile>.

    I've already made a note of Ashok's migration helper.
    I wanted to use it in the next step. I didn't expect
    to run into migration problems during installation
    (where some simple Tcl scripts are used, e.g. for
    documentation purposes).

    Thanks for your answer!
    Uwe

    Am 13.12.2024 um 16:17 schrieb Harald Oehlmann:
    Am 13.12.2024 um 16:02 schrieb Uwe Schmitz:
    Folks,

    is it possible in Tcl9 to get the old (8.x) behaviour back,
    that tcl files are read with system encoding instead of
    utf-8?
    Is there e.g. an environment variable or a configure
    switch to change this?

    I found:

    --with-encoding         encoding for configuration values (default: utf-8)

    but, what is meant with "configuration values"?

    I've the problem that almost all of my sources contain some
    umlauts and are, for legacy reasons, in iso8859-1. If it's
    not possible to get the old behavior back, I have to
    edit every pkgIndex.tcl and every tclsh <script> call
    before I'm able to migrate.

    Thanks in advance
    Uwe


    Hi Uwe,
    source -encoding iso8859-1 $file

    I put those in the package index files which source the package files.
    Remark that msgcat message files are always in utf-8.

    The great TCL9 migration helper by Ashok tries to find the related files.

    Take care,
    Harald

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Tue Jan 7 18:00:15 2025
    From Newsgroup: comp.lang.tcl

    Sorry that I have to come back to this issue.

    As stated before, we use iso8859-1 as the system encoding.
    With Tcl 9 we now get errors reading source files containing e.g. umlauts,
    because Tcl 9 interprets all sources as utf-8 by
    default. That means we have to add "-encoding iso8859-1"
    to ALL source calls and ALL tclsh calls in ALL scripts.
    So far, so good (or bad?).

    What initially seemed quite doable looks more and more scary
    to me. First, if we ever switch the encoding to utf-8, we
    have to alter all those lines again: either we change them
    to utf-8, or we remove the -encoding option and are back
    in the state before Tcl 9.

    Another point: we have MANY scripts used only for development.
    Coded quick'n'dirty for code generation, documentation, packaging, etc.
    Most of them are called as "tclsh helperScript.tcl ..." (they have no
    shebang or anything). They now have to be called as
    "tclsh -encoding iso8859-1 helperScript.tcl ..."

    That's a lot more typing.

    Some of them have a usage message like:
    usage: tclsh helperScript.tcl arg1 arg2
    ...

    Do we now have to change it to:
    usage: tclsh -encoding iso8859-1 helperScript.tcl arg1 arg2
    ...
    ?

    Side note: The open command, which opens a file with the
    system encoding by default, has thankfully not changed
    in the same manner as source and tclsh :-).
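    (For comparison, a channel opened with open still defaults to the
    system encoding and can be overridden per channel; a quick sketch,
    with a made-up file name:)

```tcl
# open uses the system encoding by default; override it per
# channel if the file is known to be iso8859-1. data.txt is
# a placeholder name.
set ch [open data.txt r]
chan configure $ch -encoding iso8859-1
set text [read $ch]
close $ch
```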

    Now my suggestion:
    Wouldn't it be convenient for Tcl 9 to have a global switch
    (e.g. an environment variable) to get the Tcl 8 encoding
    behaviour back?
    Or wouldn't it be best to simply keep the old encoding behaviour in Tcl 9?
    I see no advantage in the new behaviour. Even if you have
    all sources in utf-8, you might just as well have chosen utf-8 as
    the system encoding.

    Do we have (or can we have) a magic comment or something else
    with which we can choose the encoding of a source file in
    the file itself?
    Python e.g. has https://peps.python.org/pep-0263/.

    Maybe I’m missing something crucial...

    Thanks in advance
    Uwe


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Tue Jan 7 18:20:08 2025
    From Newsgroup: comp.lang.tcl

    Am 07.01.2025 um 18:00 schrieb Uwe Schmitz:
    Sorry that I have to come back to this issue.

    As stated before, we use iso8859-1 as system encoding.
    With Tcl9 we now got errors reading source files with e.g. umlauts,
    because Tcl9 interprets all sources as utf-8 by
    default. That means we have to add "-encoding iso8859-1"
    to ALL source and ALL tclsh calls in ALL scripts.
    So far, so good(or bad?).

    What initially seems quite doable, looks more and more scary
    to me. First, if we ever may switch encoding to utf-8 we
    have to alter all those lines again. Either we switch them to utf-8 or
    we remove the -encoding and went back
    to the state before Tcl9.

    Another point: we have MANY scripts only for development needs.
    Coded quickNdirty for code generation, documentation, packaging, etc.
    Most of them called by "tclsh helperScript.tcl ..." (they have no
    shebang or
    whatever). They now have to be called by
    "tclsh -encoding iso8859-1 helperScript.tcl ..."

    Thats a lot more typing.

    Some of them have a usage message like:
    usage: tclsh helperScript.tcl arg1 arg2
    ...

    Do we now have to change it to:
    usage: tclsh -encoding iso8859-1 helperScript.tcl arg1 arg2
    ...
    ?

    Side note: The open command, which opens a file with the
    system encoding by default, has thankfully not changed
    in the same manner as source and tclsh :-).

    Now my suggestion:
    Wouldn't it be convenient for Tcl9 to have a global switch
    (e.g. Environment variable) to get back the Tcl8 encoding
    behaviour?
    Or, isn't it best to keep the old encoding mimik in Tcl9.
    I see no advantage in the new behavior. Even if you have
    all sources in utf-8, you might also have choosen utf-8 as
    system encoding.

    Do we have (or can we have) a magic comment or something else
    with which we can choose the encoding of a source file in
    the file itself?
    Python e.g. has https://peps.python.org/pep-0263/.

    Maybe I’m missing something crucial...

    Thanks in advance
    Uwe



    Thanks, Uwe.
    Sorry for the inconvenience.

    For me, this is a big step forward.

    With Tcl 8.6, I always had to type:
    source -encoding utf-8 script.tcl
    as I don't know the system encoding.
    It is not settable for me.
    So this change is a big advantage, as now I can type:
    source script.tcl


    Sorry,
    Harald
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Tue Jan 7 16:08:10 2025
    From Newsgroup: comp.lang.tcl

    On Tue, 7 Jan 2025 18:00:15 +0100, Uwe Schmitz wrote:

    They now have to be called by
    "tclsh -encoding iso8859-1 helperScript.tcl ..."

    Thats a lot more typing.

    **************************

    The lot more typing problem can be solved with a shell alias.

    In Tcl, using 'source,' you can create an alias too.

    Not exactly what you wanted, but readily available here and now.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Wed Jan 8 11:35:10 2025
    From Newsgroup: comp.lang.tcl

    Harald,

    THanks, Uwe.
    Sorry, for the inconvenience.

    For me, this is a big step forward.

    With tcl 8.6, I always have to type:
    source -encoding utf-8 script.tcl
    as I don't know the system encoding.
    It is not setable for me.
    So, this change is a big advantage, as now, I can type:
    source script.tcl.

    On all the systems I've worked on so far, I've been able
    to set the system encoding, even as a normal user.
    Users of our in-house software stack (similar to BAWT, but
    Linux-only) are advised to set iso8859-1 encoding before running
    any programs.

    Anyhow, what we should have, at the very least, is a magic comment as
    described in my other post. This would give you the option of placing
    the encoding where it really belongs, and it would avoid
    having to include the encoding with every source/tclsh call.
    If you ever change the encoding, you have to find all these places
    and correct them. Good luck finding them all...

    To summarize, I am more and more of the opinion
    that Tcl 9 forces developers to encode their source code in utf-8.
    Otherwise you end up in an encoding nightmare.

    Best regards,
    Uwe

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Wed Jan 8 11:54:34 2025
    From Newsgroup: comp.lang.tcl


    Thanks for your suggestions!

    I know that there are ways to work around the typing effort.

    Aliases are a good start, but you have to deploy them to
    other users/developers. And it gets more complicated if
    you have heterogeneous operating systems (think of Windows).

    Redefining the source command is another option. In the past, I was
    always very careful when it came to overriding built-ins. In small
    applications this is usually manageable, but when the
    applications get bigger, it's easy to lose track and
    shoot yourself in the foot.

    However, before I consider any of the above options,
    I try to solve problems as close to the actual cause as possible.

    Best regards,
    Uwe


    Am 07.01.2025 um 20:08 schrieb Luc:
    On Tue, 7 Jan 2025 18:00:15 +0100, Uwe Schmitz wrote:

    They now have to be called by
    "tclsh -encoding iso8859-1 helperScript.tcl ..."

    Thats a lot more typing.

    **************************

    The lot more typing problem can be solved with a shell alias.

    In Tcl, using 'source,' you can create an alias too.

    Not exactly what you wanted, but readily available here and now.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Wed Jan 8 11:58:08 2025
    From Newsgroup: comp.lang.tcl

    Am 08.01.2025 um 11:35 schrieb Uwe Schmitz:
    Harald,

    THanks, Uwe.
    Sorry, for the inconvenience.

    For me, this is a big step forward.

    With tcl 8.6, I always have to type:
    source -encoding utf-8 script.tcl
    as I don't know the system encoding.
    It is not setable for me.
    So, this change is a big advantage, as now, I can type:
    source script.tcl.

    on all the systems I've worked on so far, I've been able to
    to set the system encoding, even as a normal user.
    Users of our in-house software stack (similar to BAWT, but only
    for Linux) are advised to set iso8859-1 encoding, before running
    any programs.

    Anyhow, what we should have at least is a magic comment as described
    in my other post. This would give you the option of placing
    the encoding where it really belongs. And this would avoid
    having to include the encoding with every source/tclsh call.
    If you ever change the encoding, you have to find all this places
    and correct them. Good luck to find them all...

    To summarize, I am more and more getting to the opinion,
    that Tcl9 forces developers to encode their source codes in utf-8.
    Otherwise you end up in an encoding nightmare.

    Best regards,
    Uwe

    Sorry, I am MS-Windows only.
    I can only set the system encoding system-wide (well, the answer is more
    complicated - it depends on the application manifest and the system-wide
    system encoding).
    As I distribute my software worldwide, I am not in control of the system
    encoding.

    Sorry, different use-case, different answer.

    If you want this feature, please file a bug report at the bug tracker.

    Take care,
    Harald

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Wed Jan 8 13:01:54 2025
    From Newsgroup: comp.lang.tcl

    Am 08.01.2025 um 11:58 schrieb Harald Oehlmann:
    Am 08.01.2025 um 11:35 schrieb Uwe Schmitz:
    Harald,

    THanks, Uwe.
    Sorry, for the inconvenience.

    For me, this is a big step forward.

    With tcl 8.6, I always have to type:
    source -encoding utf-8 script.tcl
    as I don't know the system encoding.
    It is not setable for me.
    So, this change is a big advantage, as now, I can type:
    source script.tcl.

    on all the systems I've worked on so far, I've been able to
    to set the system encoding, even as a normal user.
    Users of our in-house software stack (similar to BAWT, but only
    for Linux) are advised to set iso8859-1 encoding, before running
    any programs.

    Anyhow, what we should have at least is a magic comment as described
    in my other post. This would give you the option of placing
    the encoding where it really belongs. And this would avoid
    having to include the encoding with every source/tclsh call.
    If you ever change the encoding, you have to find all this places
    and correct them. Good luck to find them all...

    To summarize, I am more and more getting to the opinion,
    that Tcl9 forces developers to encode their source codes in utf-8.
    Otherwise you end up in an encoding nightmare.

    Best regards,
    Uwe

    Sorry, I am MS-Windows only.
    I can only set the system encoding system wide (well, the answer is more complicated - it depends on the application manifest and the system wide system encoding).
    As I distribute my software worldwide, I am not in control of the system encoding.

    Sorry, different use-case, different answer.

    If you want this feature, please file a bug report at the bug tracker.

    Take care,
    Harald


    And you may change the behaviour in the automatically sourced startup
    file in your home folder or wherever; Linux folks may help.

    rename source source2
    proc source args {
        source2 -encoding iso8859-1 {*}$args
    }

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 11:35:19 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 11:54:34 +0100, Uwe Schmitz wrote:

    However, before I consider any of the above options,
    I try to solve problems as close to the actual cause as possible. **************************

    I can get my encoding in the ::env array on Linux. Can you on Windows?
    If so, maybe you can use that to set the encoding in the beginning
    of all scripts and never have to change it because it will adjust automatically.
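    A sketch of that idea, assuming a POSIX-style LANG value such as
    de_DE.ISO8859-1 (the locale parsing here is deliberately minimal and
    only maps two common spellings):

```tcl
# Derive the encoding from the locale in ::env and make it
# the system encoding. Real locale strings vary considerably;
# extend the switch as needed.
if {[info exists ::env(LANG)]
        && [regexp {\.(.+)$} $::env(LANG) -> enc]} {
    switch [string tolower [string map {- "" _ ""} $enc]] {
        utf8     { encoding system utf-8 }
        iso88591 { encoding system iso8859-1 }
    }
}
```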
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 11:42:54 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 11:35:19 -0300, Luc wrote:

    I can get my encoding in the ::env array on Linux. Can you on Windows?
    If so, maybe you can use that to set the encoding in the beginning
    of all scripts and never have to change it because it will adjust
    automatically.
    **************************

    Another idea: force all scripts to source a set_encoding.tcl file
    stored somewhere. If you ever have to change, you change the one file
    and move on. You could even make it blank if convenient or necessary.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Wed Jan 8 15:56:15 2025
    From Newsgroup: comp.lang.tcl


    And you may change the behaviour in the automatically sourced startup file in your home folder or wherever, Linux folks may help.

    rename source source2
    proc source args {
    source2 -encoding iso8859-1 {*}$args
    }

    Yes, but this only works for interactive tclsh sessions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Wed Jan 8 16:04:26 2025
    From Newsgroup: comp.lang.tcl

    Am 08.01.2025 um 15:42 schrieb Luc:
    On Wed, 8 Jan 2025 11:35:19 -0300, Luc wrote:

    I can get my encoding in the ::env array on Linux. Can you on Windows?
    If so, maybe you can use that to set the encoding in the beginning
    of all scripts and never have to change it because it will adjust
    automatically.
    **************************

    Another idea: force all scripts to source a set_encoding.tcl file
    stored somewhere. If you ever have to change, you change the one file
    and move on. You could even make it blank if convenient or necessary.


    Nice try, but I don't think it's possible to set the encoding within the file. And that for one simple reason: the file has already been read.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 12:40:55 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 16:04:26 +0100, Uwe Schmitz wrote:

    Another idea: force all scripts to source a set_encoding.tcl file
    stored somewhere. If you ever have to change, you change the one file
    and move on. You could even make it blank if convenient or necessary.


    Nice try, but I don't think it's possible to set the encoding within the
    file. And that for one simple reason: the file has already been read.
    **************************

    That doesn't sound quite true to me. Why is there an 'encoding' command
    then? Is it useless because whenever you use it it's too late because
    the file has already been read? Unlikely.

    Source the set_encoding.tcl file before anything else, before you even
    try to read anything. If you can set the encoding on the command line,
    you can set it on the first line of the script that command line is
    supposed to run.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Wed Jan 8 17:04:26 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Wed, 8 Jan 2025 16:04:26 +0100, Uwe Schmitz wrote:

    Another idea: force all scripts to source a set_encoding.tcl file
    stored somewhere. If you ever have to change, you change the one file
    and move on. You could even make it blank if convenient or necessary.


    Nice try, but I don't think it's possible to set the encoding within the
    file. And that for one simple reason: the file has already been read.
    **************************

    That doesn't sound quite true to me. Why is there an 'encoding' command
    then? Is it useless because whenever you use it it's too late because
    the file has already been read? Unlikely.

    Source the set_encoding.tcl file before anything else, before you even
    try to read anything. If you can set the encoding on the command line,
    you can set it on the first line of the script that command line is
    supposed to run.

    Uwe's issue is two part:

    1) encoding for scripts his 'main' script itself sources. Your
    suggestion for renaming 'source' early would avoid having to change
    every [source] invocation from that point forward in the main script
    or in any script it sources.

    2) encoding for the 'main' script itself (the very first one loaded
    when his application is started). This one is "sourced" by the main
    Tcl interpreter, and is read in and parsed using the default
    character encoding the interpreter is using, before any commands in
    that script are run. So this situation creates a chicken-or-the-egg
    situation. If the script is iso-8859 encoded, but Tcl's default
    parsing reads it as UTF-8, then all of the iso-8859 characters
    inside are already corrupted *before* even the first command in the
    script is executed. So there's no way to "source" a
    "set_encoding.tcl" /in the main script itself/, that would adjust
    the encoding before the main script is parsed using the wrong
    encoding.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 16:23:39 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 17:04:26 -0000 (UTC), Rich wrote:

    situation. If the script is iso-8859 encoded, but Tcl's default
    parsing reads it as UTF-8, then all of the iso-8859 characters
    inside are already corrupted *before* even the first command in the
    script is executed. So there's no way to "source" a
    "set_encoding.tcl" /in the main script itself/, that would adjust
    the encoding before the main script is parsed using the wrong
    encoding.

    **************************

    I see.

    How about trading places?

    Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
    'encoding' command then sources main.tcl. Basically, a wrapper.
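    A minimal version of such a wrapper might be (the file names are
    invented):

```tcl
#!/usr/bin/env tclsh
# starter.tcl -- a pure-ASCII wrapper, safe to parse as utf-8.
# It sets the encoding explicitly when handing off to the
# real, iso8859-1 encoded script.
source -encoding iso8859-1 main.tcl
```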

    Do we have a cigar?

    Another option is to run 'iconv' recursively on all those source files.

    I did something like that some 15 years ago. But my case involved a
    migration. I had a ton of legacy iso-8859 files on a system-wide
    utf-8 Linux system. That caused me problems too, but iconv fixed it.
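    Driven from Tcl itself, such a one-time conversion could be sketched
    like this (it assumes GNU iconv with its -o option is on PATH; run it
    on a copy of the tree first):

```tcl
# Recursively re-encode every .tcl file under $root from
# iso8859-1 to utf-8 using the external iconv tool.
proc convert {root} {
    foreach f [glob -nocomplain -directory $root *] {
        if {[file isdirectory $f]} {
            convert $f
        } elseif {[file extension $f] eq ".tcl"} {
            exec iconv -f ISO-8859-1 -t UTF-8 -o $f.new $f
            file rename -force $f.new $f
        }
    }
}
convert .
```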
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Wed Jan 8 19:32:24 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Wed, 8 Jan 2025 17:04:26 -0000 (UTC), Rich wrote:

    situation. If the script is iso-8859 encoded, but Tcl's default
    parsing reads it as UTF-8, then all of the iso-8859 characters
    inside are already corrupted *before* even the first command in the
    script is executed. So there's no way to "source" a
    "set_encoding.tcl" /in the main script itself/, that would adjust
    the encoding before the main script is parsed using the wrong
    encoding.

    **************************

    I see.

    How about trading places?

    Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some 'encoding' command then sources main.tcl. Basically, a wrapper.

    Yes, that works. But then Uwe has to go and "wrapperize" all the
    various scripts, on all the various client systems. So he's back in
    the same boat of "major modifications need to be made now" as changing
    all the launching instances to launch with "-encoding iso8859-1".

    Another option is to run 'iconv' recursively on all those source files.

    I've resisted pointing this one out, but yes, updating all
    the scripts to be utf-8 encoded is the right long-term answer. But
    that doesn't remove the short-term effort involved in doing so.

    I did something like that some 15 years ago. But my case involved a migration. I had a ton of legacy iso-8859 files on a system-wide
    utf-8 Linux system. That caused me problems too, but iconv fixed it.

    In my case, I used the \uxxxx escapes for anything that was not plain
    ASCII, so all my scripts are both "basic 8859" and "utf-8" at the same
    time, and having Tcl 9 source them as utf-8 won't cause an issue. But
    it sounds like Uwe directly entered the extended 8859 characters into
    the scripts. Which very well may have made perfect sense if he had
    more than one or two of them per script.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 17:23:12 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 19:32:24 -0000 (UTC), Rich wrote:

    Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
    'encoding' command then sources main.tcl. Basically, a wrapper.

    Yes, that works. But then Uwe has to go and "wrapperize" all the
    various scripts, on all the various client systems. So he's back in
    the same boat of "major modifications need be made now" as changing all
    the launching instances to launch with "-encoding iso-8859".

    True, but he has considered that kind of effort. His words:


    "That means we have to add "-encoding iso8859-1"
    to ALL source and ALL tclsh calls in ALL scripts.
    So far, so good(or bad?)."

    "What initially seems quite doable, looks more and more scary
    to me. First, if we ever may switch encoding to utf-8 we
    have to alter all those lines again."


    So in my mind, the "customer" accepts (though grudgingly) making
    large scale changes, but is concerned with possible new changes
    in the future. A wrapper can handle the future quite gracefully.


    I've resisted pointing this one out, but long term, yes, updating all
    the scripts to be utf-8 encoded is the right, long term, answer. But
    that belies all the current, short term effort, involved in doing so.

    Actually, when I mentioned my migration case, I was also thinking that
    I could afford to do it because I was migrating to Linux and utf-8 was
    not even the future anymore, it was pretty much the present. But maybe
    running iconv wouldn't be acceptable because Uwe is (I assume) on
    Windows. Does a Windows user want to convert his files to utf-8?
    Won't that cause problems if the system is iso-8859-1? Windows still
    uses iso-8859-1, right?

    So yes, I guess Tcl9 causes trouble to 8859-1 users. Yes, sounds like
    it needs some fixing.

    More suggestions: how about not using Tcl 9 just yet? I'm still on 8.6
    and the water is fine. Early adopters tend to pay a price. In my case,
    absent packages.

    I have my own special case, I use Debian 9 which only ships 8.6.6 so
    I had to build 8.6.15 from source because I really need Unicode.
    But for some time I used Freewrap as a single-file, batteries-included
    Tcl/Tk interpreter. So maybe Uwe should just use a different interpreter,
    likely a slightly older version of Tcl/Tk, and embrace Tcl 9 later.

    I wonder if one can hack the encoding issue on the Tcl9 source and
    rebuild it.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From ted@loft.tnolan.com (Ted Nolan@tednolan to comp.lang.tcl on Wed Jan 8 22:06:24 2025
    From Newsgroup: comp.lang.tcl

    In article <20250108172312.253b829c@lud1.home>, Luc <luc@sep.invalid> wrote:
    On Wed, 8 Jan 2025 19:32:24 -0000 (UTC), Rich wrote:

    Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
    'encoding' command then sources main.tcl. Basically, a wrapper.

    Yes, that works. But then Uwe has to go and "wrapperize" all the
    various scripts, on all the various client systems. So he's back in
    the same boat of "major modifications need be made now" as changing all
    the launching instances to launch with "-encoding iso-8859".

    True, but he has considered that kind of effort. His words:


    "That means we have to add "-encoding iso8859-1"
    to ALL source and ALL tclsh calls in ALL scripts.
    So far, so good(or bad?)."

    "What initially seems quite doable, looks more and more scary
    to me. First, if we ever may switch encoding to utf-8 we
    have to alter all those lines again."


    So in my mind, the "customer" accepts (though grudgingly) making
    large scale changes, but is concerned with possible new changes
    in the future. A wrapper can handle the future quite gracefully.


    I've resisted pointing this one out, but long term, yes, updating all
    the scripts to be utf-8 encoded is the right, long term, answer. But
    that belies all the current, short term effort, involved in doing so.

    Actually, when I mentioned my migration case, I was also thinking that
    I could afford to do it because I was migrating to Linux and utf-8 was
    not even the future anymore, it was pretty much the present. But maybe
    running iconv wouldn't be acceptable because Uwe is (I assume) on
    Windows. Does a Windows user want to convert his files to utf-8?
    Won't that cause problems if the system is iso-8859-1? Windows still
    uses iso-8859-1, right?

    So yes, I guess Tcl9 causes trouble to 8859-1 users. Yes, sounds like
    it needs some fixing.

    More suggestions: how about not using Tcl9 just yet? I'm stil on 8.6
    and the water is fine. Early adopters tend to pay a price. In my case,
    absent packages.

    I have my own special case, I use Debian 9 which only ships 8.6.6 so
    I had to build 8.6.15 from source because I really need Unicode.
    But for some time I used Freewrap as a single-file batteries included
    Tcl/Tk interpreter. So maybe Uwe should just use a different interpreter,
    likely just a slightly older version of Tcl/Tk and embrace Tcl9 later.

    I wonder if one can hack the encoding issue on the Tcl9 source and
    rebuild it.


    --
    Luc



    FWIW, you could check whether a source file is utf-8 easily enough. I
    wrote a command to do that based on some code from the web a while ago
    and it seemed to work OK for what I needed it for.

    So read your suspect file in binary mode, call "string_is_utf" on it and
    if it is, you're good to source it.

    (If it isn't you can probably apply some more heuristics on the
    string to guess what it actually is).
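    (A script-level variant of the same check is also possible, assuming
    Tcl 8.7/9's -profile option of encoding convertfrom:)

```tcl
# Return 1 if the file's bytes are valid utf-8, else 0.
# A strict decode throws an error on any invalid sequence.
proc is_utf8_file {path} {
    set ch [open $path rb]      ;# binary mode: raw bytes
    set data [read $ch]
    close $ch
    expr {![catch {encoding convertfrom -profile strict utf-8 $data}]}
}
```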

    ==
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    #include <tcl.h>

    #ifdef WIN32
    #include <io.h>
    #define TCL_API __declspec(dllexport)
    #else
    #include <unistd.h>
    #define TCL_API
    #endif

    #ifdef WIN32
    #define dup _dup
    #define fileno _fileno
    #define fdopen _fdopen
    #define close _close
    #endif

    static char rcsid[] = "$Id$ TN";


    /*
    * Function prototypes
    */
    TCL_API int Isutf_Init(Tcl_Interp *interp);

    static int isutf_string_is_utf(ClientData clientData, Tcl_Interp *interp,
    int objc, Tcl_Obj *CONST objv[]);


    /*
    * This decoder by Bjoern Hoermann is the simplest I've found. It also works
    * by feeding it a single byte, as well as keeping a state. The state is
    * very useful for parsing UTF8 coming in in chunks over the network.
    *
    * http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
    *
    */


    // Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
    // See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

    #define UTF8_ACCEPT 0
    #define UTF8_REJECT 1

    static const uint8_t utf8d[] = {

    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
    8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
    0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
    0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
    0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
    1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
    1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
    1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
    };

    #if 0
    static uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
        uint32_t type = utf8d[byte];

        *codep = (*state != UTF8_ACCEPT) ?
            (byte & 0x3fu) | (*codep << 6) :
            (0xff >> type) & (byte);

        *state = utf8d[256 + *state*16 + type];

        return *state;
    }
    #endif

    /*
    *
    * A simple validator/detector doesn't need the code point,
    * so it could be written like this (Initial state is set to UTF8_ACCEPT):
    *
    */
    static uint32_t validate_utf8(uint32_t *state, unsigned char *str, size_t len) {
        size_t i;
        uint32_t type;

        for (i = 0; i < len; i++) {
            // We don't care about the codepoint, so this is
            // a simplified version of the decode function.
            type = utf8d[(uint8_t)str[i]];
            *state = utf8d[256 + (*state) * 16 + type];

            if (*state == UTF8_REJECT)
                break;
        }

        return *state;
    }

    /*
    * If the text is valid utf8, UTF8_ACCEPT is returned. If it's
    * invalid, UTF8_REJECT is returned. If more data is needed, some
    * other integer is returned.
    *
    */


    /*
    * Init everything
    */

    int Isutf_Init(Tcl_Interp *interp)
    {
    #ifdef USE_TCL_STUBS
        if (Tcl_InitStubs(interp, "8.6", 0) == NULL) {
            return TCL_ERROR;
        }
    #endif
        Tcl_CreateObjCommand(interp, "string_is_utf",
            isutf_string_is_utf, (ClientData)NULL,
            (Tcl_CmdDeleteProc *) NULL);

        Tcl_PkgProvide(interp, "Isutf", "1.0");

        return TCL_OK;
    }





    static int isutf_string_is_utf(ClientData clientData,
        Tcl_Interp *interp, int objc, Tcl_Obj *CONST objv[])
    {
        unsigned char *bytes;
        int bytelen;   /* with Tcl 9 headers this should be Tcl_Size */
        uint32_t state = UTF8_ACCEPT;

        if (objc != 2) {
            Tcl_WrongNumArgs(interp, 1, objv, "binary_string");
            return TCL_ERROR;
        }

        bytes = Tcl_GetByteArrayFromObj(objv[1], &bytelen);

        /* Result is 1 if the data is valid UTF-8, 0 otherwise. */
        if (validate_utf8(&state, bytes, bytelen) == UTF8_REJECT) {
            Tcl_SetObjResult(interp, Tcl_NewIntObj(0));
        } else {
            Tcl_SetObjResult(interp, Tcl_NewIntObj(1));
        }
        return TCL_OK;
    }
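
    From the Tcl side, usage could look like the sketch below. This is
    hedged: it assumes the code above has been compiled into a loadable
    library, and the library file name and the "Isutf" prefix are
    illustrative, not fixed by anything in the post.

    ```tcl
    # Sketch: load the (hypothetical) compiled extension, read the
    # suspect script as raw bytes, and pick an encoding for [source].
    load ./libisutf[info sharedlibextension] Isutf

    set path suspect.tcl
    set f [open $path rb]       ;# rb = binary mode, raw bytes
    set data [read $f]
    close $f

    if {[string_is_utf $data]} {
        source -encoding utf-8 $path
    } else {
        source -encoding iso8859-1 $path   ;# legacy fallback
    }
    ```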
    --
    columbiaclosings.com
    What's not in Columbia anymore..
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From saito@saitology9@gmail.com to comp.lang.tcl on Wed Jan 8 17:36:28 2025
    From Newsgroup: comp.lang.tcl

    On 1/8/2025 2:32 PM, Rich wrote:

    I did something like that some 15 years ago. But my case involved a
    migration. I had a ton of legacy iso-8859 files on a system-wide
    utf-8 Linux system. That caused me problems too, but iconv fixed it.

    In my case, I used the \uxxxx escapes for anything that was not plain
    ASCII, so all my scripts are both "basic 8859" and "utf-8" at the same
    time, and having Tcl 9 source them as utf-8 won't cause an issue. But
    it sounds like Uwe directly entered the extended 8859 characters into
    the scripts. Which very well may have made perfect sense if he had
    more than one or two of them per script.

    Interesting thread.

    Is there a way to check a script file for such incompatibilities ahead of time?

    Would this work as a solution? You build your own Tcl/Tk and add or
    duplicate the source command from an earlier version that you are happy
    with. Then you start up your app as you do currently, and once it is
    loaded, you switch the source command to the new version and change the
    system encoding.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Wed Jan 8 22:53:40 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Wed, 8 Jan 2025 19:32:24 -0000 (UTC), Rich wrote:

    Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
    'encoding' command then sources main.tcl. Basically, a wrapper.

    Yes, that works. But then Uwe has to go and "wrapperize" all the
    various scripts, on all the various client systems. So he's back in
    the same boat of "major modifications need be made now" as changing all
    the launching instances to launch with "-encoding iso-8859".

    True, but he has considered that kind of effort. His words:


    "That means we have to add "-encoding iso8859-1"
    to ALL source and ALL tclsh calls in ALL scripts.
    So far, so good(or bad?)."

    "What initially seems quite doable, looks more and more scary
    to me. First, if we ever may switch encoding to utf-8 we
    have to alter all those lines again."


    So in my mind, the "customer" accepts (though grudgingly) making
    large scale changes, but is concerned with possible new changes
    in the future. A wrapper can handle the future quite gracefully.

    Uwe's reality is likely that at some point a "mass migration" may very
    well have to be done. There's at least two possibilities:

    1) Tcl9 remains as it is today, loading all scripts as UTF-8 unless
    told otherwise by a user provided option. Either all iso-8859 scripts
    have to be modified to become:
    1a) UTF-8 encoded;
    1b) modified to pass the -encoding parameter to [source];
    1c) a wrapper deployed that 'adjusts' things such that the main
    script, and all sourced scripts, use -encoding to source as iso-8859

    All appear to be substantial work based on Uwe's statements so far, and
    all have a risk of overlooking one or more that should have been
    modified.
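
    Option 1c need not be large. Here is a sketch of such a wrapper,
    untested, and assuming nothing else in the application depends on the
    exact identity of the built-in [source] command:

    ```tcl
    # starter.tcl (sketch): make [source] default to iso8859-1
    # unless the caller passes -encoding explicitly.
    rename source ::tcl::Source_orig
    proc source {args} {
        if {"-encoding" ni $args} {
            set args [linsert $args 0 -encoding iso8859-1]
        }
        uplevel 1 [list ::tcl::Source_orig {*}$args]
    }

    # Everything main.tcl sources, directly or indirectly,
    # now defaults to iso8859-1.
    source main.tcl
    ```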

    2) Tcl9 patch X reverts to using "system encoding" (and the users of
    these scripts are on systems where "system encoding" is presently
    returning iso-8859). So things work again, with no changes, for the
    moment. But then Windows version 1Y.Z changes things such that it now
    uses a system encoding of UTF-8. Suddenly, the same problem from 1
    returns unless the users have the ability to adjust their system
    encoding back (and if 'system encoding' is an "administrator
    controlled" setting for these users, then this option is not available).

    So my two cents, for what it is worth, given that I suspect this change
    will eventually 'force itself' no matter what Tcl9 patch level X might
    do, would be to begin the process of migrating all of these scripts to
    UTF-8 encoding. It will be hard, but once done, it likely will be
    stable again for the future.

    I've resisted pointing this one out, but long term, yes, updating all
    the scripts to be utf-8 encoded is the right, long term, answer. But
    that belies all the current, short term effort, involved in doing so.

    Actually, when I mentioned my migration case, I was also thinking that
    I could afford to do it because I was migrating to Linux and utf-8 was
    not even the future anymore, it was pretty much the present. But maybe
    running iconv wouldn't be acceptable because Uwe is (I assume) on
    Windows.

    From his posts on this thread, we can assume that his scripts are being
    used on windows systems. That does not imply much about where Uwe
    develops those same scripts. I have lots of my own scripts that I use
    on $work's windows machine, but all of them are written on Linux.

    Does a Windows user want to convert his files to utf-8?

    The average/median windows user does not even know what UTF-8 means nor
    why it is significant. They just expect that when they launch "icon X"
    the expected program X appears, and that the text inside is as
    expected. So it is much more likely the work/effort of "convert to
    utf-8" will fall on Uwe, as it is very likely the windows users know
    nothing of any of this (or if they 'know' anything, it is something
    simple for them, such as: "set this selection box in this windows
    config pane to say Y" and that ends their knowledge).

    Won't that cause problems if the system is iso-8859-1?

    Only if windows tries to interpret the UTF-8 data as iso-8859
    characters. But as far as the Tcl scripts go, once the scripts are
    UTF-8, and [source] is using UTF-8 to read them, the fact that windows
    system might be iso-8859 is irrelevant.

    Windows still uses iso-8859-1, right?

    Honestly I have no idea. The *only* windows machine I use is $work's
    windows machine, and the 'administrator' controls most of it so I can
    only adjust things in a very narrow band (very irritating at times, but
    their machine, their rules).

    So yes, I guess Tcl9 causes trouble to 8859-1 users.

    Only if they directly entered any codepoints that were beyond plain
    ASCII. Code points 0 through 127 are identical between 8859 and UTF-8.
    If the files used plain ASCII, and the \uXXXX escapes, there would be
    no trouble at all. Of course if one is using a lot of non-English
    characters for non-English languages, seeing the actual characters in
    the scripts vs. walls of \u00b0 \u00a0 \u2324 everywhere makes for an
    easier development effort.
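
    To make the equivalence concrete, a minimal sketch (the word is just
    an example):

    ```tcl
    # Both literals denote the same string. The first is pure ASCII
    # in the source file, so it survives any source-file encoding;
    # the second stores the byte(s) for "ü" directly and only reads
    # back correctly if [source] decodes the file with the right
    # encoding.
    set a "\u00fcber"
    set b "über"
    puts [expr {$a eq $b}]   ;# 1 when the file was read correctly
    ```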

    Yes, sounds like it needs some fixing.

    Agreed. Uwe may be able to put off the fixing for some more time, but
    this change is going to arrive one day. He will likely have to make it
    at some point.

    I have my own special case, I use Debian 9 which only ships 8.6.6 so
    I had to build 8.6.15 from source because I really need Unicode.

    8.6.6 handled Unicode fine. In fact, 8.5 handled Unicode (so long as
    one stuck to the BMP) just fine.

    But for some time I used Freewrap as a single-file batteries-included
    Tcl/Tk interpreter. So maybe Uwe should just use a different
    interpreter, likely just a slightly older version of Tcl/Tk, and
    embrace Tcl9 later.

    That is another option, a custom build that defaults to iso-8859.

    I wonder if one can hack the encoding issue on the Tcl9 source and
    rebuild it.

    The answer is likely a "yes". But I've not looked at the code to know
    that for sure. But this just feels like a "one line change" followed
    by a recompile. But now one has to also deliver that custom runtime
    as well as the scripts that go with it.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Wed Jan 8 23:00:09 2025
    From Newsgroup: comp.lang.tcl

    saito <saitology9@gmail.com> wrote:
    On 1/8/2025 2:32 PM, Rich wrote:

    I did something like that some 15 years ago. But my case involved a
    migration. I had a ton of legacy iso-8859 files on a system-wide
    utf-8 Linux system. That caused me problems too, but iconv fixed it.

    In my case, I used the \uxxxx escapes for anything that was not plain
    ASCII, so all my scripts are both "basic 8859" and "utf-8" at the same
    time, and having Tcl 9 source them as utf-8 won't cause an issue. But
    it sounds like Uwe directly entered the extended 8859 characters into
    the scripts. Which very well may have made perfect sense if he had
    more than one or two of them per script.

    Interesting thread.

    Is there a way check a script file for such incompatibilities ahead
    of time?

    Detecting 'character encodings' reliably is a fickle business. One
    could run a UTF-8 validator (which it appears Ted posted in another
    post) that validates that all the bytes in a script are valid UTF-8
    encodings. That implies it is UTF-8, and is likely mostly reliable.
    But I suppose if one wanted to do so one could come up with a chimera
    file that is valid UTF-8 but also valid for some other character
    encoding. Then "UTF-8" may not be the correct interpretation for the
    file.
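
    For what it's worth, recent Tcl can do the validity check without a C
    extension, since [encoding convertfrom] accepts a -profile option in
    Tcl 8.7/9. A sketch (it detects invalid byte sequences, not chimera
    files):

    ```tcl
    # Sketch: return 1 if a file's bytes decode cleanly as UTF-8.
    # -profile strict makes the decoder raise an error on invalid
    # input instead of substituting replacement characters.
    proc isUtf8File {path} {
        set f [open $path rb]
        set bytes [read $f]
        close $f
        expr {![catch {encoding convertfrom -profile strict utf-8 $bytes}]}
    }
    ```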

    Would this work as a solution? You build your own Tcl/Tk and add or
    duplicate the source command from an earlier version that you are
    happy with. Then you start up your app as you do currently, and once
    it is loaded, you switch the source command to the new version and
    change the system encoding.

    That would likely work -- but would add the burden of distributing and
    maintaining that custom patched version into the future, which Uwe may
    not want to take on.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Wed Jan 8 20:28:51 2025
    From Newsgroup: comp.lang.tcl

    On Wed, 8 Jan 2025 22:53:40 -0000 (UTC), Rich wrote:

    Won't that cause problems if the system is iso-8859-1?

    Only if windows tries to interpret the UTF-8 data as iso-8859
    characters. But as far as the Tcl scripts go, once the scripts are
    UTF-8, and [source] is using UTF-8 to read them, the fact that windows
    system might be iso-8859 is irrelevant.

    I was thinking that if the Windows user edits the file on Windows,
    maybe Windows will write it as iso-8859. I honestly don't know.


    8.6.6 handled Unicode fine. In fact, 8.5 handled Unicode (so long as
    one stuck to the BMP) just fine.

    I am positive that 8.6.6 only partially supports Unicode. I found many
    characters that would not display correctly on a text widget and would
    be saved as garbled content if captured in the widget and written to
    file. I even had problems with glob and other commands when applied to
    some file names. For example, some html page I had downloaded from
    somewhere had something to do with countries and the page title had
    Unicode flags in the title, so the title and the flags carried over
    to the file name when I saved it. The complete implementation of
    Unicode begins in 8.6.10 or 8.6.13, I can't remember which, I think
    it's 8.6.13.

    I know that is specifically mentioned in a wikit page, I can't
    remember which one but that is not terribly relevant right now.
    --
    Luc



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Thu Jan 9 03:57:14 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Wed, 8 Jan 2025 22:53:40 -0000 (UTC), Rich wrote:

    Won't that cause problems if the system is iso-8859-1?

    Only if windows tries to interpret the UTF-8 data as iso-8859
    characters. But as far as the Tcl scripts go, once the scripts are
    UTF-8, and [source] is using UTF-8 to read them, the fact that windows
    system might be iso-8859 is irrelevant.

    I was thinking that if the Windows user edits the file on Windows,
    maybe Windows will write it as iso-8859. I honestly don't know.

    I don't know what windows does either. A /reasonable/ approach (which
    likely means windows deliberately does not do this) is to write it as
    the "system encoding" unless the user explicitly says to use something
    else, or unless something in the file indicated it was originally some
    other encoding.

    8.6.6 handled Unicode fine. In fact, 8.5 handled Unicode (so long as
    one stuck to the BMP) just fine.

    I am positive that 8.6.6 only partially supports Unicode.

    What were the codepoint values at issue? 8.6.6 worked fine with the
    BMP (code points 0000 to FFFF) characters.

    I found many characters that would not display correctly on a text
    widget

    Display depends upon whether your font being used had a glyph for the codepoint - no glyph in the font, no display in the text widget (even
    though 8.6.6 likely transparently handled the code point properly,
    assuming it was within the BMP).

    and would be saved as garbled content if captured in the widget and
    written to file.

    That also depends upon what your system encoding was set to, and
    whether you forced a specific encoding when writing the file. If the
    code points were in the BMP, and you explicitly set utf-8 encoding
    before writing to the file, then the file's contents were properly
    encoded even as far back as 8.5 (I know this one because I processed
    millions of utf-8 files with only BMP code points through 8.5 for $work
    with zero utf-8 encoding issues).

    I even had problems with glob and other commands when applied to
    some file names. For example, some html page I had downloaded from
    somewhere had something to do with countries and the page title had
    Unicode flags in the title, so the title and the flags carried over
    to the file name when I saved it.

    Country flags are very likely characters that are beyond the BMP, and
    yes, 8.6.6 likely did not handle those properly.

    The complete implementation of Unicode begins in 8.6.10 or 8.6.13, I
    can't remember which, I think it's 8.6.13.

    That is probably when support for the extended Unicode characters
    (planes beyond the BMP) started to be added.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Thu Jan 9 01:15:16 2025
    From Newsgroup: comp.lang.tcl

    On Thu, 9 Jan 2025 03:57:14 -0000 (UTC), Rich wrote:

    Display depends upon whether your font being used had a glyph for the
    codepoint - no glyph in the font, no display in the text widget

    That also depends upon what your system encoding was set to, and

    That is probably when support for the extended Unicode characters
    (planes beyond the BMP) started to be added.

    **************************

    Nothing to do with fonts or encoding. The problem vanished as soon as
    I used 8.6.13, later 8.6.15. It was extended Unicode characters.

    You can see my discussion here, at the end of the page:

    https://wiki.tcl-lang.org/page/Unicode
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Thu Jan 9 10:12:25 2025
    From Newsgroup: comp.lang.tcl

    Folks,

    thanks for all your suggestions and discussions.

    I think this discussion has helped to bring the topic
    into focus. It may be that other Tcl users will be affected
    by this incompatibility when migrating to Tcl9.
    We have now examined the problem from many different
    angles and have come up with more or less elaborate solutions.

    My conclusion is: if migrating to Tcl9, all your source files
    have to be encoded in utf-8. Otherwise you will have much more
    maintenance effort (adding iso8859, redefining commands, rolling
    your own tcl, ...).

    This can be mitigated if we introduce one of the changes I have proposed:
    1. a switch (environment variable?) that restores the Tcl8 behavior and/or
    2. a magic comment within the source file that can be used to determine the encoding.
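
    Idea 2 can already be approximated in script. A sketch of a
    [source]-like helper that honours a magic comment; the comment syntax
    here is invented for illustration, not anything Tcl defines:

    ```tcl
    # Sketch: look for a hypothetical first-line comment such as
    #   # -*- tcl-encoding: iso8859-1 -*-
    # and source the file with that encoding, else with the default.
    proc sourceWithMagic {path} {
        set f [open $path rb]
        gets $f firstLine
        close $f
        if {[regexp {tcl-encoding:\s*([\w-]+)} $firstLine -> enc]} {
            uplevel 1 [list source -encoding $enc $path]
        } else {
            uplevel 1 [list source $path]
        }
    }
    ```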

    I will try to file a ticket about this.

    Nevertheless, this point should be noted under "Important Incompatibilities in Tcl 9.0"
    on the Tcl9 page:
    https://www.tcl.tk/software/tcltk/9.0.html

    Thanks again for your time!
    Happy Tcl'ing
    Uwe



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Thu Jan 9 10:27:47 2025
    From Newsgroup: comp.lang.tcl

    Rich,

    at first, thank you very much for explaining my situation very well.
    I couldn't have argued better ;-)

    Let me add a note on why characters outside the 7-bit ASCII range
    cannot always be replaced by the \uXXXX notation:
    Comments.
    If you like to write comments in your native language, it
    is not very readable to code e.g. German umlauts as \uXXXX.
    Especially if you extract the program documentation out of
    the source code in a kind of “literate programming” (which I often
    do), the use of \u notation is very cumbersome.

    Best wishes
    Uwe


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Thu Jan 9 10:40:44 2025
    From Newsgroup: comp.lang.tcl

    Am 09.01.2025 um 10:12 schrieb Uwe Schmitz:
    Nevertheless, this point should be noted under "Important
    Incompatibilities in Tcl 9.0"
    on the Tcl9 page:
    https://www.tcl.tk/software/tcltk/9.0.html

    Hi Uwe,
    thanks for all your contributions.

    Here is the wiki page for TCL script migration:

    https://core.tcl-lang.org/tcl/wiki?name=Migrating+scripts+to+Tcl+9&p

    Please look to section "Default encoding for scripts is UTF-8".

    The also mentioned migration tools by Ashok also check the codepage
    issue. You may consider to use those tools also to detect other
    incompatible changes.
    https://github.com/apnadkarni/tcl9-migrate

    I am happy to include any missing information to this page.

    Thank you and take care,
    Harald
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Thu Jan 9 15:37:22 2025
    From Newsgroup: comp.lang.tcl

    Uwe Schmitz <schmitzu@mail.de> wrote:
    Rich,

    at first, thank you very much for explaining my situation very well.
    I couldn't have argued better ;-)

    Let me add a note on why characters outside the 7-bit ASCII range
    cannot always be replaced by the \uXXXX notation:
    Comments.

    If you like to write comments in your native language, it
    is not very readable to code e.g. German umlauts as \uXXXX.
    Especially if you extract the program documentation out of
    the source code in a kind of “literate programming” (which I often
    do), the use of \u notation is very cumbersome.

    This was my suspicion. In my case, the non-ASCII characters are not
    part of the language (English in my case) script, they are extras (such
    as arrows/lines or the degree symbol, etc.) and so the script is 99.9%
    readable, with a few \uXXXX sometimes occurring.

    But writing a string out where every third character is \uXXXX makes
    for a very human unreadable string (be it a comment, or a string for
    the code to use).

    If you develop on Linux (or have a Linux machine available) you may
    wish to begin experimenting with using iconv to convert some scripts to
    UTF-8 encoding. If things work properly, it might be best to start
    that conversion (even if you do it slowly over time) sooner rather than
    later. It will be work, but it is work that you are likely going to
    have to perform at some point anyway.
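
    The same conversion iconv performs can also be scripted in Tcl itself,
    which works identically on Windows. A sketch (it rewrites files in
    place, so test it on copies first):

    ```tcl
    # Sketch: re-encode *.tcl files from iso8859-1 to utf-8,
    # roughly what "iconv -f ISO-8859-1 -t UTF-8" would do.
    proc toUtf8 {path} {
        set f [open $path r]
        fconfigure $f -encoding iso8859-1
        set text [read $f]
        close $f
        set f [open $path w]
        fconfigure $f -encoding utf-8
        puts -nonewline $f $text
        close $f
    }

    foreach file [glob -nocomplain *.tcl] {
        toUtf8 $file
    }
    ```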
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Thu Jan 9 15:41:39 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Thu, 9 Jan 2025 03:57:14 -0000 (UTC), Rich wrote:

    Display depends upon whether your font being used had a glyph for the
    codepoint - no glyph in the font, no display in the text widget

    That also depends upon what your system encoding was set to, and

    That is probably when support for the extended Unicode characters
    (planes beyond the BMP) started to be added.

    **************************

    Nothing to do with fonts or encoding. The problem vanished as soon as
    I used 8.6.13, later 8.6.15. It was extended Unicode characters.

    You can see my discussion here, at the end of the page:

    https://wiki.tcl-lang.org/page/Unicode

    The answer confirmed what I said earlier:

    "8.6 does not support characters above BMP without a little bit of
    hackery"

    And you were trying to make use of a 1F4C4 character, which is outside
    the BMP.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Thu Jan 9 14:38:36 2025
    From Newsgroup: comp.lang.tcl

    On Thu, 9 Jan 2025 15:37:22 -0000 (UTC), Rich wrote:

    If you develop on Linux (or have a Linux machine available) you may
    wish to begin experimenting with using iconv to convert some scripts to
    UTF-8 encoding.

    A quick search shows there is iconv for Windows.


    If things work properly, it might be best to start
    that conversion (even if you do it slowly over time) sooner rather than
    later. It will be work, but it is work that you are likely going to
    have to perform at some point anyway.

    Yes, but now I think that Tcl9 is wrong. Blanket imposition of any
    encoding is unfair.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Fri Jan 10 00:12:48 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Thu, 9 Jan 2025 15:37:22 -0000 (UTC), Rich wrote:

    If things work properly, it might be best to start that conversion
    (even if you do it slowly over time) sooner rather than later. It
    will be work, but it is work that you are likely going to have to
    perform at some point anyway.

    Yes, but now I think that Tcl9 is wrong. Blanket imposition of any
    encoding is unfair.

    Tcl has to do one of two things:

    1) Pick a default it will use.

    2) Use the "system encoding" (which is still 'imposing', just
    'imposing' whatever the OS itself imposes).

    But there always /is/ an encoding being used, because there is no way
    to process a textual file otherwise. Some "encoding" has to be chosen
    to use to decode the characters in text files.
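    In Tcl terms, the chosen default can be overridden per file, so a
    legacy script can still declare its own encoding explicitly (a sketch;
    the filename is hypothetical):

    ```tcl
    # Tcl 9: source decodes the file as utf-8 by default.
    source legacy.tcl

    # Legacy iso8859-1 file: state the encoding at the call site instead.
    source -encoding iso8859-1 legacy.tcl
    ```

    This is the same per-call-site fix suggested earlier in the thread for
    pkgIndex.tcl files.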

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Fri Jan 10 00:53:53 2025
    From Newsgroup: comp.lang.tcl

    On Fri, 10 Jan 2025 00:12:48 -0000 (UTC), Rich wrote:

    2) Use the "system encoding" (which is still 'imposing', just
    'imposing' whatever the OS itself imposes).

    Is the OS really imposing though? I honestly don't know about Windows,
    but Linux lets me choose the system-wide encoding. And whatever I
    chose, I must've chosen it for some reason. It's not Tcl's place
    to challenge my decision.

    And if the poor sorry Windows user really can't choose his encoding,
    then why should Tcl make the user's life even more difficult?

    The 8.6 way is wiser.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From eric.boudaillier@eric.boudaillier@gmail.com (eric) to comp.lang.tcl on Fri Jan 10 06:38:02 2025
    From Newsgroup: comp.lang.tcl

    On Fri, 10 Jan 2025 3:53:53 +0000, Luc wrote:

    On Fri, 10 Jan 2025 00:12:48 -0000 (UTC), Rich wrote:

    2) Use the "system encoding" (which is still 'imposing', just
    'imposing' whatever the OS itself imposes).

    Is the OS really imposing though? I honestly don't know about Windows,
    but Linux lets me choose the system-wide encoding. And whatever I
    chose, I must've chosen it for some reason. It's not Tcl's place
    to challenge my decision.

    And if the poor sorry Windows user really can't choose his encoding,
    then why should Tcl make the user's life even more difficult?

    The 8.6 way is wiser.

    Not sure... your choice of encoding may not be the one of your
    application users.
    In this case, your code may fail to load in the user's encoding choice.

    Eric

    --
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Fri Jan 10 12:55:00 2025
    From Newsgroup: comp.lang.tcl

    Rich,

    Yes, what you write in your last paragraph is the conclusion
    from the entire discussion here. But I wouldn't have thought
    of that at the start of the Tcl9 migration process.
    But this is definitely the path to follow.

    Best wishes
    Uwe

    Am 09.01.2025 um 16:37 schrieb Rich:
    Uwe Schmitz <schmitzu@mail.de> wrote:
    Rich,

    at first, thank you very much for explaining my situation very well.
    I couldn't have argued better ;-)

    Let me add a note on why characters outside the 7-bit ASCII range
    cannot always be replaced by the \uXXXX notation:
    Comments.

    If you like to write comments in your native language, it
    is not very readable to code e.g. German umlauts as \uXXXX.
    Especially if you extract the program documentation from
    the source code in a kind of “literate programming” (which I often
    do), the use of \u notation is very cumbersome.

    This was my suspicion. In my case, the non-ascii characters are not
    part of the language (English in my case) script, they are extras (such
    as arrows/lines or the degree symbol, etc.) and so the script is 99.9%
    readable, with a few \uXXXX sometimes occurring.

    But writing a string out where every third character is \uXXXX makes
    for a very human unreadable string (be it a comment, or a string for
    the code to use).

    If you develop on Linux (or have a Linux machine available) you may
    wish to begin experimenting with using iconv to convert some scripts to
    UTF-8 encoding. If things work properly, it might be best to start
    that conversion (even if you do it slowly over time) sooner rather than
    later. It will be work, but it is work that you are likely going to
    have to perform at some point anyway.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Uwe Schmitz@schmitzu@mail.de to comp.lang.tcl on Fri Jan 10 13:09:39 2025
    From Newsgroup: comp.lang.tcl

    Harald,

    thanks for the wiki page. It definitely has a lot of
    information.

    Maybe you (or I can do that too) could add to the "Default
    encoding..." paragraph that in the long term it's best to encode ALL
    tcl source files in utf-8 to get out of this "-encoding ..." hell.

    Another thing that hurts me and is off-topic here (sorry):
    The changed variable name resolution also affects itcl::class
    definitions. The following leads to an error:

    ::itcl::class A {
        public common tclVersion $tcl_version
    }

    Because the ::itcl::class command opens a namespace, the resolution
    of the global variable tcl_version doesn't succeed. You
    have to use the complete path $::tcl_version.
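    The qualified form then reads (same class as above, with only the
    variable reference changed):

    ```tcl
    ::itcl::class A {
        # Fully qualified, so resolution no longer depends on the
        # namespace opened by ::itcl::class.
        public common tclVersion $::tcl_version
    }
    ```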

    Best wishes
    Uwe


    Am 09.01.2025 um 10:40 schrieb Harald Oehlmann:
    Am 09.01.2025 um 10:12 schrieb Uwe Schmitz:
    Nevertheless, this point should be noted under "Important
    Incompatibilities in Tcl 9.0" on the Tcl9 page:
    https://www.tcl.tk/software/tcltk/9.0.html

    Hi Uwe,
    thanks for all your contributions.

    Here is the wiki page for TCL script migration:

    https://core.tcl-lang.org/tcl/wiki?name=Migrating+scripts+to+Tcl+9&p

    Please look to section "Default encoding for scripts is UTF-8".

    The previously mentioned migration tools by Ashok also check the
    codepage issue. You may consider using those tools to detect other
    incompatible changes as well.
    https://github.com/apnadkarni/tcl9-migrate

    I am happy to include any missing information to this page.

    Thank you and take care,
    Harald

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Harald Oehlmann@wortkarg3@yahoo.com to comp.lang.tcl on Fri Jan 10 17:42:25 2025
    From Newsgroup: comp.lang.tcl

    Am 10.01.2025 um 13:09 schrieb Uwe Schmitz:
    Another thing that hurts me and is off-topic here (sorry):
    The changed variable name resolution also affects itcl::class
    definitions. The following leads to an error:

    ::itcl::class A {
       public common tclVersion $tcl_version
    }

    Because the ::itcl::class command opens a namespace, the resolution
    of the global variable tcl_version doesn't succeed. You
    have to use the complete path $::tcl_version.

    Yes, that is intentional and important. And Ashok's migration tools
    also catch this.

    The issue in TCL 8.6 is:

    namespace eval test { set test A }

    Works as follows:
    - if test exists in the global namespace, the global is set.
    - if test does not exist in the global namespace, a namespace variable
    "test" is created.

    This behavior had the consequence that assignments intended to create
    namespace variables sometimes overwrote global variables instead.

    In TCL 9, this will always address a namespace variable.
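    The two cases can be made concrete like this (a sketch contrasting the
    8.6 and 9 resolution rules):

    ```tcl
    set test GLOBAL                     ;# variable in the global namespace

    namespace eval test { set test A }

    # Tcl 8.6: ::test already exists, so the global is overwritten ("A").
    # Tcl 9:   a namespace variable ::test::test is created and ::test
    #          keeps its value ("GLOBAL").
    puts $::test
    ```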

    Take care,
    Harald
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luc@luc@sep.invalid to comp.lang.tcl on Fri Jan 10 16:29:44 2025
    From Newsgroup: comp.lang.tcl

    On Fri, 10 Jan 2025 06:38:02 +0000, eric wrote:

    The 8.6 way is wiser.

    Not sure... your choice of encoding may not be the one of your
    application users.
    In this case, your code may fail to load in the user's encoding choice.

    **************************

    The user's choice is and always will be a point of uncertainty.

    Tcl9 introduces an additional uncertainty with the developer.
    --
    Luc


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Rich@rich@example.invalid to comp.lang.tcl on Fri Jan 10 20:13:26 2025
    From Newsgroup: comp.lang.tcl

    Luc <luc@sep.invalid> wrote:
    On Fri, 10 Jan 2025 06:38:02 +0000, eric wrote:

    The 8.6 way is wiser.

    Not sure... your choice of encoding may not be the one of your
    application users.
    In this case, your code may fail to load in the user's encoding choice.
    **************************

    The user's choice is and always will be a point of uncertainty.

    Tcl9 introduces an additional uncertainty with the developer.

    Actually, 9 /reduces/ uncertainty.

    Current method:

    Developer: sets his system encoding to ISO-8859. Writes Tcl script,
    and includes 8859 code points directly into the script. Everything
    works for the developer, on his system.

    User #1: sets his system encoding to CP437 (the original DOS character
    encoding -- I needed to pick 'something' other than 8859). Downloads
    the developer's script from github, and launches it.

    If Tcl interprets the script using the system encoding, it will
    interpret the ISO-8859 script as if it were DOS CP437 bytes. So many
    extended language letters will instead become lots of line draw
    characters (just one example). The user is disappointed, as the script
    does not work "out of the box" for him.

    New Tcl9 method:

    All Tcl scripts are UTF-8, no exceptions (not really true, but close
    enough).

    Developer: Must create the script with UTF-8 encoding. Note, the
    'developer' could continue to use 8859 for 'writing' things, they
    simply must use iconv (or similar) to convert to UTF-8 before feeding
    the script to the Tcl interpreter.

    User #1: Continues setting his/her system encoding to CP437. But now,
    they download developer's script from github, and when they launch it
    with Tcl9, it is always interpreted as UTF-8. It "just works", and
    User #1 sees the proper accented characters the developer put into the
    script for prompts or other strings. User #1 did not have to do
    anything, and the script worked "out of the box" for him/her.



    With the Tcl9 method of "all scripts must be UTF-8" there is less
    uncertainty, because the script will be interpreted using the same
    encoding everywhere, no matter what odd local system setting any given
    user may have chosen.

    --- Synchronet 3.20a-Linux NewsLink 1.114