• Buffer contents well-defined after fgets() reaches EOF ?

    From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 9 06:59:00 2025
    From Newsgroup: comp.lang.c

    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    But the man page does not say anything whether this is guaranteed;
    it says: "Reading stops after an EOF or a newline.", but it says
    nothing about [not] writing to or [not] resetting the buffer.

    Is that simple construct safe to get the last line of a text file?

    Thanks.

    Janis
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Sun Feb 9 06:23:33 2025
    From Newsgroup: comp.lang.c

    On 2025-02-09, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    Whenever fgets successfully reads one or more characters, and
    adds them to the array (followed by a null terminator), it
    returns the pointer it was given.

    fgets only returns null when:

    - it hits EOF when trying to obtain the first character.

    - it hits an I/O error.

    But the man page does not say anything whether this is guaranteed;
    it says: "Reading stops after an EOF or a newline.", but it says
    nothing about [not] writing to or [not] resetting the buffer.

    But of course, ISO C has the requirements nailed down. e.g. C99:

    "The fgets function returns s if successful. If end-of-file is
    encountered and no characters have been read into the array, the
    contents of the array remain unchanged and a null pointer is returned.
    If a read error occurs during the operation, the array contents are
    indeterminate and a null pointer is returned."

    Beware of man pages identifying themselves as "Linux Programmer's
    Manual". Their quality is all over the place, and rarely hits
    a high note.

    Is that simple construct safe to get the last line of a text file?

    While fgets returns a pointer, you have a good line of input.
    The terminating newline is included, unless it's the last line
    and the file is missing it.
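
    As a minimal sketch of that pattern (not code from this thread): the
    helper names `last_line` and `chomp` are hypothetical, and the sketch
    assumes every line fits in the buffer, since fgets would return a
    longer line in several pieces.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read fp to the end, leaving the last line in buf.
   Returns 1 if at least one line was read, 0 if the stream was empty,
   and -1 on a read error (buf is then indeterminate). */
int last_line(FILE *fp, char *buf, int size)
{
    int got_line = 0;

    while (fgets(buf, size, fp) != NULL)
        got_line = 1;   /* buf holds a null-terminated line */

    if (ferror(fp))
        return -1;

    /* The final fgets call hit end-of-file before reading anything,
       so it left buf unchanged: it still holds the last line read. */
    return got_line;
}

/* Hypothetical helper: remove the trailing new-line if there is one.
   Returns 1 if a new-line was removed, 0 if the line had none
   (i.e. the file's last line was unterminated). */
int chomp(char *s)
{
    size_t len = strlen(s);

    if (len > 0 && s[len - 1] == '\n') {
        s[len - 1] = '\0';
        return 1;
    }
    return 0;
}
```

    A caller would check the result of `last_line` before touching the
    buffer, then use `chomp` to learn whether the last line ended with
    a new-line.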

    Some C newbies make this mistake:

    while (!feof(stdin)) {
    fgets(...)
    /* process line */
    }

    their code ends up processing the last line twice. On the last byte
    of input, which is usually the terminating newline of the last line,
    fgets returns without having reached EOF. The loop spins around one more
    time. This time fgets returns NULL, not having read a single byte. The
    code doesn't check this and processes the buffer again.
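
    The difference can be sketched with two line counters. The function
    names are hypothetical, and incrementing a counter stands in for
    "process line".

```c
#include <stdio.h>

/* Anti-pattern: tests feof() before attempting to read. */
int count_lines_buggy(FILE *fp)
{
    char buf[256] = "";
    int n = 0;

    while (!feof(fp)) {
        /* The bug: the result of fgets is never checked. */
        char *ignored = fgets(buf, sizeof buf, fp);
        (void)ignored;
        n++;            /* "processes" buf even when nothing was read */
    }
    return n;
}

/* Usual idiom: let the return value of fgets drive the loop. */
int count_lines(FILE *fp)
{
    char buf[256];
    int n = 0;

    while (fgets(buf, sizeof buf, fp) != NULL)
        n++;
    return n;
}
```

    For a stream holding two new-line-terminated lines, the first
    function counts three and the second counts two: the end-of-file
    indicator is only set by the read attempt that finds nothing left.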
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sat Feb 8 22:23:49 2025
    From Newsgroup: comp.lang.c

    On Sat 2/8/2025 9:59 PM, Janis Papanagnou wrote:
    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    But the man page does not say anything whether this is guaranteed;
    it says: "Reading stops after an EOF or a newline.", but it says
    nothing about [not] writing to or [not] resetting the buffer.

    Is that simple construct safe to get the last line of a text file?


    What situation exactly are you talking about? When end-of-file is
    encountered _immediately_, before reading the very first character? Or
    when end-of-file is encountered after reading something (i.e. when the
    last line in the file does not end with new-line character)?

    The former situation is covered by the spec: "If end-of-file is
    encountered and no characters have been read into the array, the
    contents of the array remain unchanged and a null pointer is returned".

    The second situation does not need additional clarifications. Per
    general spec as many characters as available before the end-of-file will
    be read and then terminated with '\0'. In such case there will be no
    new-line character in the buffer.

    So, in both cases we are perfectly safe when reading the last line of a
    text file, if you don't forget to check the return value of `fgets`.

    (This is all under assumption that size limit does not kick in. I
    believe your question is not about that.)

    Note also that `fgets` is not permitted to assume that the limit value
    (the second parameter) correctly describes the accessible size of the
    buffer. E.g. for this reason it is not permitted to zero-out the buffer
    before reading. For example, this code is valid and has defined behavior

    char buffer[10];
    fgets(buffer, 1000, f);

    provided the current line of the file fits into `char[10]`. I.e. even
    though we "lied" to `fgets` about the limit, it is still required to
    work correctly if the actual data fits into the actual buffer.

    So, why do you care that "the previous contents of 'buf' are still
    existing"?
    --
    Best regards,
    Andrey
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sat Feb 8 23:12:44 2025
    From Newsgroup: comp.lang.c

    On Sat 2/8/2025 10:23 PM, Andrey Tarasevich wrote:

    Note also that `fgets` is not permitted to assume that the limit value
    (the second parameter) correctly describes the accessible size of the
    buffer. E.g. for this reason it is not permitted to zero-out the buffer
    before reading. For example, this code is valid and has defined behavior

      char buffer[10];
      fgets(buffer, 1000, f);

    provided the current line of the file fits into `char[10]`. I.e. even
    though we "lied" to `fgets` about the limit, it is still required to
    work correctly if the actual data fits into the actual buffer.

    ... and this part of specification effectively guarantees, that any
    [tail] portion of the buffer not overwritten by the characters obtained
    from file, will remain unchanged. If `fgets` reads 5 characters from the
    file, only first 6 characters of the buffer will be overwritten, while
    the rest is guaranteed to remain untouched. If `fgets` reads nothing
    (instant end-of-file), the entire buffer remains untouched.
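
    One way to observe this on a given implementation is sketched below.
    The helper name `untouched_tail` and the fill-character approach are
    assumptions made for illustration; the check only reports what the
    implementation running it does and proves nothing about the
    standard's wording.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fill buf with `fill`, make one fgets call, and
   return how many bytes at the end of buf still hold `fill`.
   `fill` should be a character that cannot end the data read. */
size_t untouched_tail(FILE *fp, char *buf, size_t size, char fill)
{
    size_t n = 0;

    memset(buf, fill, size);

    /* The return value is deliberately not used: only the state of
       the buffer after the call is of interest here. */
    char *r = fgets(buf, (int)size, fp);
    (void)r;

    /* Count fill bytes backwards from the end of the buffer. */
    while (n < size && buf[size - 1 - n] == fill)
        n++;
    return n;
}
```

    With a 16-byte buffer and a stream containing "abcd" plus a
    new-line, five characters and the terminating '\0' occupy the first
    six bytes, so the expectation described above is a tail of ten
    untouched bytes, and of all sixteen on a second call at end-of-file.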
    --
    Best regards,
    Andrey



  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 9 08:13:10 2025
    From Newsgroup: comp.lang.c

    First; thanks Kaz and Andrey for the replies. - As so often answering
    more than I asked or needed. :-)

    The provided C standard quote answers my question. - Thanks!


    On 09.02.2025 07:23, Andrey Tarasevich wrote:
    On Sat 2/8/2025 9:59 PM, Janis Papanagnou wrote:
    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    But the man page does not say anything whether this is guaranteed;
    it says: "Reading stops after an EOF or a newline.", but it says
    nothing about [not] writing to or [not] resetting the buffer.

    Is that simple construct safe to get the last line of a text file?


    What situation exactly are you talking about? When end-of-file is
    encountered _immediately_, before reading the very first character? Or
    when end-of-file is encountered after reading something (i.e. when the
    last line in the file does not end with new-line character)?

    I have a _coherent_ file, with a few NL terminated lines of text.

    Usually I use fgets() in contexts where I process every line, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    operate_on (buf);
    }
    // here the status of buf[] is usually not important any more

    My actual context was different, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    // buf[] contents are ignored here
    }
    operate_on (buf); // which I assumed contains last line


    The former situation is covered by the spec: "If end-of-file is
    encountered and no characters have been read into the array, the
    contents of the array remain unchanged and a null pointer is returned".

    The second situation does not need additional clarifications. Per
    general spec as many characters as available before the end-of-file will
    be read and then terminated with '\0'. In such case there will be no
    new-line character in the buffer.

    So, in both cases we are perfectly safe when reading the last line of a
    text file, if you don't forget to check the return value of `fgets`.

    I suppose you mean what I already had in my code above: ... != NULL


    (This is all under assumption that size limit does not kick in. I
    believe your question is not about that.)

    Yes, it was just the one posted question. (No incoherent text files,
    no error conditions, no signals, no buffer size mistakes, etc.)


    Note also that `fgets` is not permitted to assume that the limit value
    (the second parameter) correctly describes the accessible size of the
    buffer. E.g. for this reason it is not permitted to zero-out the buffer
    before reading. For example, this code is valid and has defined behavior

    char buffer[10];
    fgets(buffer, 1000, f);

    provided the current line of the file fits into `char[10]`. I.e. even
    though we "lied" to `fgets` about the limit, it is still required to
    work correctly if the actual data fits into the actual buffer.

    So, why do you care that "the previous contents of 'buf' are still
    existing"?

    I hope it got clear by the two code snippets I posted above...

    Usually I read and process the data that I got in buf from fgets()
    while there *is* data (fgets() != NULL), and I thus don't care any
    more about buffer contents validity after the loop (fgets() == NULL).

    But now I wanted to ignore all data that I got for fgets() != NULL
    in the loop. And I hoped that *after* the loop the last read data is
    still valid.

    Janis

  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sun Feb 9 12:50:46 2025
    From Newsgroup: comp.lang.c

    On Sun, 9 Feb 2025 08:13:10 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    First; thanks Kaz and Andrey for the replies. - As so often answering
    more than I asked or needed. :-)

    The provided C standard quote answers my question. - Thanks!


    On 09.02.2025 07:23, Andrey Tarasevich wrote:
    On Sat 2/8/2025 9:59 PM, Janis Papanagnou wrote:
    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    But the man page does not say anything whether this is guaranteed;
    it says: "Reading stops after an EOF or a newline.", but it says
    nothing about [not] writing to or [not] resetting the buffer.

    Is that simple construct safe to get the last line of a text file?


    What situation exactly are you talking about? When end-of-file is
    encountered _immediately_, before reading the very first character?
    Or when end-of-file is encountered after reading something (i.e.
    when the last line in the file does not end with new-line
    character)?

    I have a _coherent_ file, with a few NL terminated lines of text.


    I wonder what you mean by "coherent".

    Usually I use fgets() in contexts where I process every line, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    operate_on (buf);
    }
    // here the status of buf[] is usually not important any more

    My actual context was different, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    // buf[] contents are ignored here
    }
    operate_on (buf); // which I assumed contains last line


    It depends on definition of "last line".
    What do you consider "last line" of the file in which last character is
    not LF? The one before the last LF or one after? Your code would get
    the latter.

  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sun Feb 9 07:27:26 2025
    From Newsgroup: comp.lang.c

    On Sat 2/8/2025 11:13 PM, Janis Papanagnou wrote:
    But now I wanted to ignore all data that I got for fgets() != NULL
    in the loop. And I hoped that *after* the loop the last read data is
    still valid.

    As Michael already noted it depends on what you consider as the last
    piece of valid data in your file. Say, what do you want to see as "the
    last line" in a file that ends with

    abracadabra\n<EOF here>

    ?

    Is "abracadabra" the last line? Or is the last line supposed to be empty
    in this case?
    --
    Best regards,
    Andrey
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 9 18:29:01 2025
    From Newsgroup: comp.lang.c

    On 09.02.2025 11:50, Michael S wrote:
    On Sun, 9 Feb 2025 08:13:10 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    [...]

    I have a _coherent_ file, with a few NL terminated lines of text.

    I wonder what you mean by "coherent".

    A badly chosen word; I noticed it too late only after posting.

    I meant consistent with respect to the line terminators (i.e. none
    missing, each line [including the last one] has one, no mixture of
    LF, CR, CR-LF, etc.; and also no fixed-sized unterminated lines as
    we may still know from mainframes, just to be complete).


    Usually I use fgets() in contexts where I process every line, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    operate_on (buf);
    }
    // here the status of buf[] is usually not important any more

    My actual context was different, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
    // buf[] contents are ignored here
    }
    operate_on (buf); // which I assumed contains last line


    It depends on definition of "last line".
    What do you consider "last line" of the file in which last character is
    not LF?

    I consider missing newlines at the end of any text line as a bug.
    (And I'm not inclined to use a weaker word than "bug".) YMMV.

    The one before the last LF or one after? Your code would get
    the latter.

    It's a non-issue (for me), as should have got obvious.

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 9 18:34:55 2025
    From Newsgroup: comp.lang.c

    On 09.02.2025 16:27, Andrey Tarasevich wrote:
    On Sat 2/8/2025 11:13 PM, Janis Papanagnou wrote:
    But now I wanted to ignore all data that I got for fgets() != NULL
    in the loop. And I hoped that *after* the loop the last read data is
    still valid.

    As Michael already noted it depends on what you consider as the last
    piece of valid data in your file.

    I have a strong opinion of a text file concerning line terminators;
    I answered that in my reply to Michael.

    Say, what do you want to see as "the
    last line" in a file that ends with

    abracadabra\n<EOF here>

    ?

    Is "abracadabra" the last line? Or is the last line supposed to be empty
    in this case?

    If "\n" is a string literal (2 characters, '\' and 'n') then it's an
    incomplete line (as to my standards), if it's meant as a <LF> control
    character then it's complete. (Similar with <CR> on old Apple/Macs and
    <CR><LF> on DOS-alike systems.)

    Janis

  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Sun Feb 9 23:52:17 2025
    From Newsgroup: comp.lang.c

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Sun Feb 9 16:57:02 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 09.02.2025 16:27, Andrey Tarasevich wrote:
    [...]
    Say, what do you want to see as "the
    last line" in a file that ends with

    abracadabra\n<EOF here>

    ?

    Is "abracadabra" the last line? Or is the last line supposed to be empty
    in this case?

    If "\n" is a string literal (2 characters, '\' and 'n') then it's an
    incomplete line (as to my standards), if it's meant as a <LF> control
    character then it's complete. (Similar with <CR> on old Apple/Macs and
    <CR><LF> on DOS-alike systems.)

    It seems obvious to me that Andrey intended the \n to be a new-line
    character (which is almost always LF in modern C implementations).

    Here's (some of) what the C standard says about text streams:

    A text stream is an ordered sequence of characters composed into
    lines, each line consisting of zero or more characters plus a
    terminating new-line character. Whether the last line requires
    a terminating new-line character is implementation-defined.

    For an implementation that *doesn't* require a new-line on the
    last line, a stream without a trailing new-line is valid. For an
    implementation that *does* require it, such a stream is invalid,
    and a program that attempts to process it can have undefined behavior.

    Most modern implementations don't require that trailing new-line.
    For example, `echo -n hello > hello.txt` creates a valid text file.
    Of course a C program that deals with text files can impose any
    additional restrictions its author likes.

    The above describes how a text stream looks to a C program. The
    external representation can be quite different, with transformations
    to map between them. The most common such transformation is
    mapping the DOS/Windows CR-LF line terminator to LF on input, and
    vice versa on output. Or the external representation might store
    each line as a fixed-length character sequence padded with spaces.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sun Feb 9 17:06:02 2025
    From Newsgroup: comp.lang.c

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.
    --
    Best regards,
    Andrey
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sun Feb 9 17:22:43 2025
    From Newsgroup: comp.lang.c

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances is `fgets` expected to return an empty
    string (i.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to read
    0 characters.

    This is under the assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting an end-of-file or I/O error
    condition. One can probably do some nitpicking at the current
    wording... but I believe the above is the intent.
    --
    Best regards,
    Andrey

  • From Ben Bacarisse@ben@bsb.me.uk to comp.lang.c on Mon Feb 10 01:32:16 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    Something that has not yet come up (as far as I can see) is that you
    might need to handle an empty file. In such a case, nothing gets
    written and fgets returns NULL right away. Processing buf in this
    situation is then undefined.

    One way to handle this is to put into buf something that can't get read
    by fgets. Two newlines is a good candidate:

    char buf[BUFSIZ] = "\n\n";

    You can then test for that if need be, though of course it all depends
    on what your application is doing.
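
    A minimal sketch of that idea follows; the macro `SENTINEL` and the
    function name `read_last_line` are hypothetical. It relies on the
    fact that fgets stops after the first new-line, so a string it
    stores holds at most one new-line and can never equal the two-byte
    sentinel. It does not try to handle read errors, after which the
    buffer contents are indeterminate.

```c
#include <stdio.h>
#include <string.h>

#define SENTINEL "\n\n"   /* a string fgets can never produce */

/* Hypothetical helper: read fp to the end, leaving the last line in buf.
   Returns 1 if buf holds a line from the stream, 0 if nothing was read
   (empty stream), and -1 if buf is too small to hold the sentinel. */
int read_last_line(FILE *fp, char *buf, size_t size)
{
    if (size < sizeof SENTINEL)
        return -1;

    strcpy(buf, SENTINEL);

    while (fgets(buf, (int)size, fp) != NULL)
        ;   /* keep only the most recent line */

    /* If the sentinel survived, fgets never stored anything. */
    return strcmp(buf, SENTINEL) != 0;
}
```

    Note that a file consisting of two empty lines is still told apart
    from an empty file: its last line is read as a single new-line.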
    --
    Ben.
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Mon Feb 10 02:35:01 2025
    From Newsgroup: comp.lang.c

    On 10.02.2025 01:57, Keith Thompson wrote:
    [...]

    Here's (some of) what the C standard says about text streams:

    A text stream is an ordered sequence of characters composed into
    lines, each line consisting of zero or more characters plus a
    terminating new-line character. Whether the last line requires
    a terminating new-line character is implementation-defined.

    For an implementation that *doesn't* require a new-line on the
    last line, a stream without a trailing new-line is valid. For an
    implementation that *does* require it, such a stream is invalid,
    and a program that attempts to process it can have undefined behavior.

    This is what "C" accepts (or tolerates), yes.

    Given that some folks, with the aid of some fancy editors, manage to
    suppress (or not create) the final line ending - bytes are still
    expensive, it seems - I suppose it's a sensible requirement for "C"
    compilers to be tolerant here.


    Most modern implementations don't require that trailing new-line.
    For example, `echo -n hello > hello.txt` creates a valid text file.
    Of course a C program that deals with text files can impose any
    additional restrictions its author likes.

    And cat alpha.c beta.c > gamma.c will create inconsistent texts if
    there's no line terminator on the last lines of some files.


    The above describes how a text stream looks to a C program. The
    external representation can be quite different, with transformations
    to map between them.

    (Concerning this thread; I'm anyway operating on custom data files
    in plain text format, so I'm less concerned about how "C" compilers
    expect their "C" source.)

    The most common such transformation is
    mapping the DOS/Windows CR-LF line terminator to LF on input, and
    vice versa on output. Or the external representation might store
    each line as a fixed-length character sequence padded with spaces.

    I appreciate that the editor I use keeps data consistent but allows
    an explicit change between Unix and DOS text modes (where necessary
    or if desired).

    The most extreme context I had worked in was a company that allowed
    (for every employee) a free choice of used computer technology; that
    led to program text files that literally had all the inconsistencies.
    Since many files were edited by different folks there were all sorts
    of line terminators mixed even in the same one file, and there either
    were complete last lines or not. The (some?) IDEs used were tolerant
    WRT line terminators and their mixing. Other tools reacted sensibly.
    The first thing I've done was to write a "C" tool to detect and fix
    these sorts of inconsistencies.

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Mon Feb 10 02:40:28 2025
    From Newsgroup: comp.lang.c

    On 10.02.2025 02:32, Ben Bacarisse wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    To get the last line of a text file I'm using

    char buf[BUFSIZ];
    while (fgets (buf, BUFSIZ, fd) != NULL)
    ; // read to last line

    If the end of the file is reached my test shows that the previous
    contents of 'buf' are still existing (not cleared or overwritten).

    Something that has not yet come up (as far as I can see) is that you
    might need to handle an empty file. In such a case, nothing gets
    written and fgets returns NULL right away. Processing buf in this
    situation is then undefined.

    I haven't considered that at all because my context is very specific;
    it's just three text lines (a comment line, an empty separator line,
    and a line with the payload data that I'm interested in). If there's
    a file it will have exactly these three lines (and all correctly and
    consistently terminated).


    One way to handle this is to put into buf something that can't get read
    by fgets. Two newlines is a good candidate:

    char buf[BUFSIZ] = "\n\n";

    You can then test for that if need be, though of course it all depends
    on what your application is doing.

    Thanks for pointing it out and for the suggestion.

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Mon Feb 10 02:44:27 2025
    From Newsgroup: comp.lang.c

    On 10.02.2025 02:06, Andrey Tarasevich wrote:
    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    This was actually what I feared some implementation might do
    (unless it's specified by the "C" standard, which luckily is,
    as has been shown and got quoted in this thread).

    No. The buffer is not changed at all in such case.

    Which had been the good news.

    Janis

  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Mon Feb 10 03:28:01 2025
    From Newsgroup: comp.lang.c

    On Sun, 9 Feb 2025 17:06:02 -0800, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    From the man page <https://manpages.debian.org/fgets(3)>:

    fgets() reads in at most one less than size characters from stream and
    stores them into the buffer pointed to by s. Reading stops after an
    EOF or a newline. If a newline is read, it is stored into the buffer.
    A terminating null byte ('\0') is stored after the last character in
    the buffer.

    Note there is no qualification like “a terminating null byte is stored
    after the last character if EOF was not reached”. It’s clear the
    terminating null byte is *always* stored.
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Sun Feb 9 20:11:22 2025
    From Newsgroup: comp.lang.c

    On Sun 2/9/2025 7:28 PM, Lawrence D'Oliveiro wrote:

    From the man page <https://manpages.debian.org/fgets(3)>:

    fgets() reads in at most one less than size characters from stream and
    stores them into the buffer pointed to by s. Reading stops after an
    EOF or a newline. If a newline is read, it is stored into the buffer.
    A terminating null byte ('\0') is stored after the last character in
    the buffer.

    Note there is no qualification like “a terminating null byte is stored
    after the last character if EOF was not reached”. It’s clear the
    terminating null byte is *always* stored.

    Well, the language standard says differently.

    You are referring to a specific manpage that follows POSIX. When taken
    literally it seems to contradict the standard specification for `fgets`,
    but I highly doubt this was the intent. Apparently someone tried to
    re-word the spec for better readability, but managed to botch it.

    This manpage, for one example, is in full agreement with the standard

    https://www.man7.org/linux/man-pages/man3/fgets.3p.html

    A practical experiment demonstrates that [supposedly] POSIX-obeying
    implementations do not write '\0' into the buffer in "immediate
    end-of-file" situations:
    https://coliru.stacked-crooked.com/a/3e672e6718dd388b
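
    For readers who cannot follow the link, an experiment of this kind
    might look like the sketch below. It is not the linked program; the
    function name `eof_leaves_buffer_alone` and the 'X' fill byte are
    assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check: returns 1 if a single fgets call on fp hit
   end-of-file immediately, returned NULL, and left every byte of the
   buffer unchanged; returns 0 otherwise. */
int eof_leaves_buffer_alone(FILE *fp)
{
    char buf[16];
    char before[16];

    memset(buf, 'X', sizeof buf);        /* recognizable contents */
    memcpy(before, buf, sizeof buf);     /* snapshot for comparison */

    if (fgets(buf, sizeof buf, fp) != NULL)
        return 0;                        /* something was read */

    /* A '\0' written at buf[0] would make the comparison fail. */
    return feof(fp) && memcmp(buf, before, sizeof buf) == 0;
}
```

    Called on an empty stream, or on a stream whose lines have all been
    consumed, the function reports whether the implementation at hand
    behaves as the standard's wording requires.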
    --
    Best regards,
    Andrey

  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Sun Feb 9 20:37:34 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 10.02.2025 01:57, Keith Thompson wrote:
    [...]
    The above describes how a text stream looks to a C program. The
    external representation can be quite different, with transformations
    to map between them.

    (Concerning this thread; I'm anyway operating on custom data files
    in plain text format, so I'm less concerned about how "C" compilers
    expect their "C" source.)

    The requirements for text streams are distinct from the requirements
    for C source files. For example, you might have a cross-compiler
    where C source files follow the rules of the OS where the compiler
    runs, and text files processed via stdio follow the rules of
    the target system. And a C compiler might not use stdio to read
    source files. It might not even be implemented in C.

    In particular, the standard has this specific requirement for source
    files (this is from the "Translation phases" section):

    A source file that is not empty shall end in a new-line
    character, which shall not be immediately preceded by a backslash
    character before any such splicing takes place.

    (This is in translation phase 2; any new-line characters might be
    the result of a transformation during phase 1.)

    So a non-empty file not ending in a new-line character might be a
    valid text file for use with stdio, but is not a valid C source file.
    On the other hand, the mapping described in translation phase 1
    might add a new-line character to such a file, so a conforming
    compiler could accept such a source file without complaint.

    Of the compilers I've tried, gcc and tcc quietly accept a source
    file with no trailing newline, and clang rejects it with the right
    options (-std=c?? -pedantic-errors).

    [...]

    The most extreme context I had worked in was a company that allowed
    (for every employee) a free choice of used computer technology; that
    led to program text files that literally had all the inconsistencies.
    Since many files were edited by different folks there were all sorts
    of line terminators mixed even in the same one file, and there either
    were complete last lines or not. The (some?) IDEs used were tolerant
    WRT line terminators and their mixing. Other tools reacted sensibly.
    The first thing I've done was to write a "C" tool to detect and fix
    these sorts of inconsistencies.

    Been there, done that. There seems to be a tendency in the Windows
    world to create text files with no terminator on the last line.
    In some cases I've been able to translate the source files to a
    consistent format. In others, doing so would have created huge
    diffs in the source control system, so I left well enough alone.

    My preferred editor, vim, handles files with either LF or CRLF line
    endings gracefully, but if there's a mix it shows "^M" at the end of
    each line that has a Windows-style CRLF ending. I found a possible
    solution, but I haven't bothered using it since I'm not currently
    dealing with such files.

    <https://vi.stackexchange.com/q/39297/2380>

    This is already off-topic, so I won't even mention tabs vs. spaces.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Mon Feb 10 07:08:22 2025
    From Newsgroup: comp.lang.c

    On 10.02.2025 05:37, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    The most extreme context I had worked in was a company that allowed
    (for every employee) a free choice of used computer technology; that
    led to program text files that literally had all the inconsistencies.
    Since many files were edited by different folks there were all sorts
    of line terminators mixed even in the same one file, and there either
    were complete last lines or not. The (some?) IDEs used were tolerant
    WRT line terminators and their mixing. Other tools reacted sensibly.
    The first thing I've done was to write a "C" tool to detect and fix
    these sorts of inconsistencies.

    Been there, done that. There seems to be a tendency in the Windows
    world to create text files with no terminator on the last line.

    Yep.

    In some cases I've been able to translate the source files to a
    consistent format. In others, doing so would have created huge
    diffs in the source control system, so I left well enough alone.

    Yes, that is what you buy with a fix. But it pays, IME. What I had
    done was to provide scripts to automate the transformation, I did
    a short-term code freeze on a whole project, transformed the data,
    and since that point we had a consistent base. The good thing was
    that the single CR terminators (old Apple/Macs, pre OS-X) were only
    in older code. And the handling of LF vs. CR-LF was okay once
    the script had streamlined the data, either converting everything
    to LF or keeping one consistent variant (whether LF or CR-LF).
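
    A minimal sketch of what such a normalising pass could look like in C,
    assuming the goal is to convert everything to LF and to complete an
    unterminated last line. The name normalize_newlines and its return
    value are illustrative assumptions, not the tool described above.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out`, rewriting CR-LF and lone CR terminators as LF and
   appending a final LF if the last line lacks one.
   Returns the number of terminators that were rewritten or added. */
long normalize_newlines(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    long fixed = 0;
    int c;
    int last = '\n';   /* last character written; '\n' keeps empty input empty */

    while ((c = getc(in)) != EOF) {
        if (c == '\r') {
            int next = getc(in);
            if (next != '\n' && next != EOF)
                ungetc(next, in);   /* lone CR: keep the following character */
            c = '\n';
            fixed++;
        }
        putc(c, out);
        last = c;
    }
    if (last != '\n') {             /* incomplete last line */
        putc('\n', out);
        fixed++;
    }
    return fixed;
}
```

    Both streams are expected to be opened in binary mode, e.g. with
    fopen(path, "rb"), so that the C library does not translate line
    endings on its own.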


    My preferred editor, vim, handles files with either LF or CRLF line
    endings gracefully, but if there's a mix it shows "^M" at the end of
    each line that has a Windows-style CRLF ending.

    Yes. With the change I described we got rid of this issue.

    I found a possible
    solution, but I haven't bothered using it since I'm not currently
    dealing with such files.

    Same here. For me it was just a historic little episode.


    <https://vi.stackexchange.com/q/39297/2380>

    This is already off-topic, so I won't even mention tabs vs. spaces.

    But as Vim users we don't have any issues here; as long as the
    indentation is _visibly_ consistent we can fix any tab/space-mix
    on the fly and easily with Vim.

    Janis

  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Sun Feb 9 22:41:36 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 10.02.2025 05:37, Keith Thompson wrote:
    [...]
    This is already off-topic, so I won't even mention tabs vs. spaces.

    But as Vim users we don't have any issues here; as long as the
    indentation is _visibly_ consistent we can fix any tab/space-mix
    on the fly and easily with Vim.

    Yes, *if* the indentation is visibly consistent.

    At a previous job, I reviewed an update whose apparent meaning
    differed depending on whether the editor was configured with 4- or
    8-column tabstops. I don't remember the exact details, but the code
    looked like either:

    if (condition)
        statement1;
    statement2;

    or:

    if (condition)
        statement1;
        statement2;

    depending on the reader's settings. Of course they're semantically
    equivalent, but the first is the way the developer saw it, and the
    second is misleading and is the way it looked to me.

    This kind of thing is why I use only spaces for indentation and curly
    braces even when there's only one statement in the block (unless I'm
    working under a coding standard that says otherwise).
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Mon Feb 10 07:54:07 2025
    From Newsgroup: comp.lang.c

    On 10.02.2025 07:41, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 10.02.2025 05:37, Keith Thompson wrote:
    [...]
    This is already off-topic, so I won't even mention tabs vs. spaces.

    But as Vim users we don't have any issues here; as long as the
    indentation is _visibly_ consistent we can fix any tab/space-mix
    on the fly and easily with Vim.

    Yes, *if* the indentation is visibly consistent.

    At a previous job, I reviewed an update whose apparent meaning
    differed depending on whether the editor was configured with 4- or
    8-column tabstops. I don't remember the exact details, but the code
    looked like either:

    if (condition)
        statement1;
    statement2;

    or:

    if (condition)
        statement1;
        statement2;

    depending on the reader's settings. Of course they're semantically
    equivalent, but the first is the way the developer saw it, and the
    second is misleading and is the way it looked to me.

    Yeah, misleading code is a pain, especially if you have got the job
    to fix some error in these incoherent formatted modules. (I suppose
    that case is yet more than only misleading if you are programming in
    Python where indentation even carries semantics.)


    This kind of thing is why I use only spaces for indentation

    I think it's personal preference. Mine is to use only tabs (with a
    data type specific width) for the indentation.

    Janis

    and curly
    braces even when there's only one statement in the block (unless I'm
    working under a coding standard that says otherwise).


  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Mon Feb 10 07:21:11 2025
    From Newsgroup: comp.lang.c

    On Sun, 9 Feb 2025 20:11:22 -0800, Andrey Tarasevich wrote:

    This manpage, for one example, is in full agreement with the standard

    https://www.man7.org/linux/man-pages/man3/fgets.3p.html

    Notice these two sentences would seem to contradict one another:

    A null byte shall be written immediately after the last byte read
    into the array. If the end-of-file condition is encountered before
    any bytes are read, the contents of the array pointed to by s
    shall not be changed.

    A practical experiment demonstrates that [supposedly] POSIX-obeying
    implementations do not write '\0' into the buffer in "immediate
    end-of-file" situations:
    https://coliru.stacked-crooked.com/a/3e672e6718dd388b

    My test program does the same. I would say that settles it.
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Mon Feb 10 12:49:11 2025
    From Newsgroup: comp.lang.c

    On Sun, 9 Feb 2025 17:22:43 -0800
    Andrey Tarasevich <noone@noone.net> wrote:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances is `fgets` expected to return an empty
    string (i.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under the assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting an end-of-file or I/O error
    condition. One can probably do some nitpicking at the current
    wording... but I believe the above is the intent.


    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. It is not horrendous, like gets(), so
    I personally would not suggest deprecation. Instead, I would suggest
    adding another function with similar goals but a better thought-out
    API to the Standard library.

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Mon Feb 10 16:39:44 2025
    From Newsgroup: comp.lang.c

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Sun, 9 Feb 2025 20:11:22 -0800, Andrey Tarasevich wrote:

    This manpage, for one example, is in full agreement with the standard

    https://www.man7.org/linux/man-pages/man3/fgets.3p.html

    Notice these two sentences would seem to contradict one another:

    That manual page is not definitive, or a standard.

    The ISO C standard is definitive, and parroted here:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/fgets.html

  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Mon Feb 10 13:58:05 2025
    From Newsgroup: comp.lang.c

    On 2/10/25 02:21, Lawrence D'Oliveiro wrote:
    On Sun, 9 Feb 2025 20:11:22 -0800, Andrey Tarasevich wrote:

    This manpage, for one example, is in full agreement with the standard

    https://www.man7.org/linux/man-pages/man3/fgets.3p.html

    Notice these two sentences would seem to contradict one another:

    A null byte shall be written immediately after the last byte read
    into the array. If the end-of-file condition is encountered before
    any bytes are read, the contents of the array pointed to by s
    shall not be changed.

    Note: this wording is almost identical to relevant wording in the
    current C standard:

    "A null character is written immediately after the last character read
    into the array." (7.23.7.2p2).
    "If end-of-file is encountered and no characters have been read into the
    array, the contents of the array remain unchanged and a null pointer is
    returned." (7.23.7.2p3)

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.
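
    That reading can be checked with a small test, sketched here under the
    assumption that the call fails because of end-of-file and not because
    of a read error (in which case the array contents are unspecified).
    The helper name eof_preserves_buffer is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a failing fgets call at end-of-file left the buffer
   untouched, 0 if it modified the buffer, and -1 if the call did not
   fail at end-of-file at all. */
int eof_preserves_buffer(FILE *fp)
{
    assert(fp != NULL);
    char buf[16], before[16];

    while (getc(fp) != EOF)     /* consume whatever is left in the stream */
        ;
    memset(buf, 'x', sizeof buf);
    memcpy(before, buf, sizeof buf);

    if (fgets(buf, (int)sizeof buf, fp) != NULL || !feof(fp))
        return -1;
    return memcmp(buf, before, sizeof buf) == 0;
}
```

    On an implementation that follows the quoted wording this returns 1,
    for an empty stream as well as for one that has just been read to
    its end.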


  • From Mark Bourne@nntp.mbourne@spamgourmet.com to comp.lang.c on Mon Feb 10 21:57:29 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou wrote:
    On 09.02.2025 11:50, Michael S wrote:
    What do you consider "last line" of the file in which last character is
    not LF?

    I consider missing newlines at the end of any text line as a bug.
    (And I'm not inclined to use a weaker word than "bug".) YMMV.

    I think I once saw somewhere that utilities originating on Unix
    typically consider \n to be a line terminator, so include it at the end
    of every line including the last, whereas those originating on
    DOS/Windows typically consider \n to be a line separator, so don't
    include it at the end of the last line. So Unix-originated utilities
    might not behave as expected if the file doesn't end with \n, whereas
    Windows-originated utilities might treat the file as having an extra
    blank line at the end of the file if it does end with \n. Utilities
    ported from one system to the other sometimes continue following the
    convention of their origin, rather than the system they're running on.

    I'm not sure where I originally saw that, but for what it's worth the
    following Stack Overflow answer makes a similar claim:
    <https://stackoverflow.com/a/729795>. Most of the answer discusses
    POSIX, with a "line" defined as ending with a terminating newline, hence
    every line including the last ends with a newline, while a final
    footnote notes that doesn't necessarily apply to non-POSIX systems,
    particularly Windows.
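
    The terminator convention can be made concrete with a small counting
    routine. This is only a sketch; count_lines and its incomplete flag
    are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Counts lines in `fp` under the "terminator" view: every '\n' ends a
   line. A trailing fragment without '\n' is counted as one more line and
   reported through *incomplete (set to 1, otherwise 0). */
long count_lines(FILE *fp, int *incomplete)
{
    assert(fp != NULL && incomplete != NULL);
    long lines = 0;
    int c, last = '\n';

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    *incomplete = (last != '\n');
    return lines + *incomplete;
}
```

    A tool following the "separator" view would instead report one more
    line than the number of '\n' characters, which is where the extra
    blank line at the end comes from.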
    --
    Mark.
  • From Mark Bourne@nntp.mbourne@spamgourmet.com to comp.lang.c on Mon Feb 10 22:57:25 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou wrote:
    I have a _coherent_ file, with a few NL terminated lines of text.

    Usually I use fgets() in contexts where I process every line, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
        operate_on (buf);
    }
    // here the status of buf[] is usually not important any more

    My actual context was different, like

    while (fgets (buf, BUFSIZ, fd) != NULL) {
        // buf[] contents are ignored here
    }
    operate_on (buf); // which I assumed contains last line

    ...

    Usually I read and process the data that I got in buf from fgets()
    while there *is* data (fgets() != NULL), and I thus don't care any
    more about buffer contents validity after the loop (fgets() == NULL).

    But now I wanted to ignore all data that I got for fgets() != NULL
    in the loop. And I hoped that *after* the loop the last read data is
    still valid.

    What does fgets do if the file is completely empty? I may be wrong
    (more familiar with Python than C these days), but it doesn't look like
    that should be any different from any other end-of-file condition, so
    presumably the first call to fgets would return NULL, without ever
    modifying the buffer. Unless the buffer is initialised (e.g. to an
    empty string) before the while loop, that would result in an
    uninitialised buffer being passed to operate_on.
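
    One way to guard against that case, sketched under the assumption that
    the caller wants a NULL result when there is no line at all. last_line
    is a hypothetical helper; note that a line longer than n-1 characters
    would be delivered in pieces, so only its final piece would be kept.

```c
#include <assert.h>
#include <stdio.h>

/* Stores the last line of `fp` in buf (capacity n, at least 2) and
   returns buf, or returns NULL if the stream held no lines or a read
   error occurred. buf is set to "" up front so that it is never left
   uninitialised. */
char *last_line(FILE *fp, char *buf, int n)
{
    assert(fp != NULL && buf != NULL && n >= 2);
    int got_line = 0;

    buf[0] = '\0';
    while (fgets(buf, n, fp) != NULL)
        got_line = 1;

    if (ferror(fp) || !got_line)
        return NULL;   /* after a read error the contents are unspecified */
    return buf;
}
```

    The caller then tests the return value instead of relying on whatever
    happens to be in the buffer after the loop.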
    --
    Mark.
  • From Mark Bourne@nntp.mbourne@spamgourmet.com to comp.lang.c on Mon Feb 10 23:22:54 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou wrote:
    On 10.02.2025 07:41, Keith Thompson wrote:
    At a previous job, I reviewed an update whose apparent meaning
    differed depending on whether the editor was configured with 4- or
    8-column tabstops. I don't remember the exact details, but the code
    looked like either:

    if (condition)
        statement1;
    statement2;

    or:

    if (condition)
        statement1;
        statement2;

    depending on the reader's settings. Of course they're semantically
    equivalent, but the first is the way the developer saw it, and the
    second is misleading and is the way it looked to me.

    Yeah, misleading code is a pain, especially if you have got the job
    to fix some error in these incoherent formatted modules. (I suppose
    that case is yet more than only misleading if you are programming in
    Python where indentation even carries semantics.)

    That was an issue in Python 2 where, I think, a single tab was treated
    as equivalent to 8 spaces for the purposes of block scoping. Depending
    on editor settings, that may or may not match how it visually appears.
    For that reason, Python 3 makes it an error to mix tabs and spaces in
    ways that would be misleading, i.e. if the meaning would depend on the
    size of a tab. Even before Python 3, the issue was generally avoided by
    coding conventions, e.g. using only spaces and not tabs. Not intending
    to go into any more detail than that, this being comp.lang.c not
    comp.lang.python ;)
    --
    Mark.
  • From Mark Bourne@nntp.mbourne@spamgourmet.com to comp.lang.c on Mon Feb 10 23:24:51 2025
    From Newsgroup: comp.lang.c

    Mark Bourne wrote:
    Janis Papanagnou wrote:
    I have a _coherent_ file, with a few NL terminated lines of text.

    Usually I use fgets() in contexts where I process every line, like

         while (fgets (buf, BUFSIZ, fd) != NULL) {
             operate_on (buf);
         }
         // here the status of buf[] is usually not important any more

    My actual context was different, like

         while (fgets (buf, BUFSIZ, fd) != NULL) {
             // buf[] contents are ignored here
         }
         operate_on (buf);  // which I assumed contains last line

    ...

    Usually I read and process the data that I got in buf from fgets()
    while there *is* data (fgets() != NULL), and I thus don't care any
    more about buffer contents validity after the loop (fgets() == NULL).

    But now I wanted to ignore all data that I got for fgets() != NULL
    in the loop. And I hoped that *after* the loop the last read data is
    still valid.

    What does fgets do if the file is completely empty?  I may be wrong
    (more familiar with Python than C these days), but it doesn't look like
    that should be any different from any other end-of-file condition, so
    presumably the first call to fgets would return NULL, without ever
    modifying the buffer.  Unless the buffer is initialised (e.g. to an
    empty string) before the while loop, that would result in an
    uninitialised buffer being passed to operate_on.

    Ah, I see Ben covered that earlier today elsewhere in the thread.
    --
    Mark.
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Tue Feb 11 00:59:56 2025
    From Newsgroup: comp.lang.c

    On 2025-02-10, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 10.02.2025 07:41, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 10.02.2025 05:37, Keith Thompson wrote:
    [...]
    This is already off-topic, so I won't even mention tabs vs. spaces.

    But as Vim users we don't have any issues here; as long as the
    indentation is _visibly_ consistent we can fix any tab/space-mix
    on the fly and easily with Vim.

    Yes, *if* the indentation is visibly consistent.

    At a previous job, I reviewed an update whose apparent meaning
    differed depending on whether the editor was configured with 4- or
    8-column tabstops. I don't remember the exact details, but the code
    looked like either:

    if (condition)
        statement1;
    statement2;

    or:

    if (condition)
        statement1;
        statement2;

    depending on the reader's settings. Of course they're semantically
    equivalent, but the first is the way the developer saw it, and the
    second is misleading and is the way it looked to me.

    Yeah, misleading code is a pain, especially if you have got the job
    to fix some error in these incoherent formatted modules.

    Turning on gcc -Wmisleading-indentation could go a long way toward
    hunting down the trouble spots.

    Not sure how well that deals with inconsistent mixtures of tabs and
    spaces. The man page documentation (I realize there is also the real
    GCC manual) says that the amount of indentation is determined by
    the -ftabstop=N option, where N defaults to 8.

    So I'm guessing that you may have to compile the code at least
    two ways---with -ftabstop=4 and -ftabstop=8, and possibly other
    choices---to get all the bad spots to look like your second example and
    be diagnosed.

    They should support a nondeterministic behavior:

    -ftabstop=2,4,8

    fork reality into all those tabstops and diagnose in all of them
    in the same pass.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Tue Feb 11 01:03:21 2025
    From Newsgroup: comp.lang.c

    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Mon Feb 10 22:33:03 2025
    From Newsgroup: comp.lang.c

    On 2/10/25 20:03, Lawrence D'Oliveiro wrote:
    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?

    "The fgets function reads at most one less than the number of characters
    specified by n from the stream pointed to by stream into the array
    pointed to by s." (7.23.7.2p2)

    If the buffer length is 1, "at most one less than the number ...
    specified" is 0. Therefore, fgets() cannot read any characters into the
    buffer, no matter what the contents of the input stream are. Again,
    since there is no "last byte read into the array", there is no location
    where a null byte should be written.
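
    Since the behaviour for n == 1 is the disputed point, a probe like the
    following reports what a given implementation does rather than assuming
    either outcome. The struct and function names are invented for this
    sketch.

```c
#include <assert.h>
#include <stdio.h>

struct n1_result {
    int returned_null;   /* 1 if fgets(buf, 1, fp) returned NULL */
    int wrote_nul;       /* 1 if buf[0] was overwritten with '\0' */
    int next_char;       /* next character still available from the stream */
};

/* Calls fgets with n == 1 and records what this implementation does.
   Only next_char is predictable from the quoted wording: fgets may read
   at most n-1 == 0 characters, so the stream should not advance. */
struct n1_result probe_fgets_n1(FILE *fp)
{
    assert(fp != NULL);
    char buf[1] = { 'x' };          /* sentinel that is not '\0' */
    struct n1_result r;

    r.returned_null = (fgets(buf, 1, fp) == NULL);
    r.wrote_nul = (buf[0] == '\0');
    r.next_char = getc(fp);
    return r;
}
```

    Printing returned_null and wrote_nul on the platforms of interest shows
    which interpretation each library actually implements.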
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Tue Feb 11 03:42:03 2025
    From Newsgroup: comp.lang.c

    On 2025-02-11, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?

    "The fgets function reads at most one less than the number of characters
    specified by n from the stream pointed to by stream into the array
    pointed to by s." (7.23.7.2p2)

    If the buffer length is 1, "at most one less than the number ...
    specified" is 0. Therefore, fgets() cannot read any characters into the
    buffer, no matter what the contents of the input stream are. Again,
    since there is no "last byte read into the array", there is no location
    where a null byte should be written.

    If the array consists of two bytes, then it's possible to use the fgets
    function to carry out the job of processing input in a line-wise fashion,
    using fragments of lines that are one character wide. For instance
    "foo\n" may be read in four parts "f\0", "o\0", "o\0", "\n\0".

    If the buffer is one byte wide, then it's not possible for the loop
    around fgets to meaningfully process the file.

    Therefore, it might as well just return null on the first call
    and every subsequent one.

    It simply doesn't make sense to use an array of one byte.

    A one-byte area is too small, since it can only hold a string of zero
    length, and a non-zero-length file cannot be expressed as a catenation
    of strings of zero length.

    The fgets function /could/ null terminate always, even when returning
    null, and even in the one-byte-buffer case. But what would be the point;
    there is no need for code to rely on the buffer when null has been
    returned.

    When we use fgets, we can (and probably should) pretend that the buffer
    is just a work area or context buffer for the function, and the return
    value is the real data (which happens to point to the context buffer).
    When we get null, the operation yielded no data. We "got something"
    only when fgets returns a pointer to it.
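
    The two-byte case and the "return value is the data" view can be
    sketched together. count_fragments is a hypothetical helper written
    for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads `fp` with a deliberately tiny buffer and returns the number of
   fragments fgets delivered; *total_len receives their combined length.
   Only the returned pointer is treated as data; the buffer is never
   inspected after fgets has returned NULL. */
int count_fragments(FILE *fp, size_t *total_len)
{
    assert(fp != NULL && total_len != NULL);
    char buf[2];                 /* room for one character plus '\0' */
    const char *piece;
    int fragments = 0;

    *total_len = 0;
    while ((piece = fgets(buf, (int)sizeof buf, fp)) != NULL) {
        *total_len += strlen(piece);
        fragments++;
    }
    return fragments;
}
```

    For a stream containing "foo\n" this yields one fragment per
    character, and the loop ends on the null pointer alone.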
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Tue Feb 11 04:54:23 2025
    From Newsgroup: comp.lang.c

    On Mon, 10 Feb 2025 22:33:03 -0500, James Kuyper wrote:

    On 2/10/25 20:03, Lawrence D'Oliveiro wrote:

    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this
    case?

    "The fgets function reads at most one less than the number of characters
    specified by n from the stream pointed to by stream into the array
    pointed to by s." (7.23.7.2p2)

    If the buffer length is 1, "at most one less than the number ...
    specified" is 0. Therefore, fgets() cannot read any characters into the
    buffer, no matter what the contents of the input stream are. Again,
    since there is no "last byte read into the array", there is no location
    where a null byte should be written.

    Have you tried it? I have.
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Mon Feb 10 22:32:12 2025
    From Newsgroup: comp.lang.c

    On Mon 2/10/2025 5:03 PM, Lawrence D'Oliveiro wrote:
    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?

    If by that you mean, "what if the value of 1 is passed as second
    argument", then, as I stated in one of my previous messages:

    No attempt to read anything from the stream is made, which means that
    end-of-file or I/O error conditions do not arise (unless, perhaps, the
    stream was already in error condition) and the [0] byte of the buffer is
    simply set to '\0'.
    --
    Best regards,
    Andrey
  • From Andrey Tarasevich@noone@noone.net to comp.lang.c on Mon Feb 10 22:38:53 2025
    From Newsgroup: comp.lang.c

    On Mon 2/10/2025 10:32 PM, Andrey Tarasevich wrote:
    On Mon 2/10/2025 5:03 PM, Lawrence D'Oliveiro wrote:
    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?

    If by that you mean, "what if the value of 1 is passed as second
    argument", then, as I stated in one of my previous messages:

    No attempt to read anything from the stream is made, which means that
    end-of-file or I/O error conditions do not arise (unless, perhaps, the
    stream was already in error condition) and the [0] byte of the buffer is
    simply set to '\0'.


    ... and non-null pointer (pointer to the buffer) is returned.

    For what it is worth, an experiment shows that if the stream is already
    in end-of-file state at the moment of the call, `fgets` still behaves as
    if the call was successful - the buffer is modified as described above
    and a non-null pointer is returned:

    https://coliru.stacked-crooked.com/a/a68382afcf4ff155
    --
    Best regards,
    Andrey

  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Tue Feb 11 12:04:22 2025
    From Newsgroup: comp.lang.c

    On 2025-02-11, Andrey Tarasevich <noone@noone.net> wrote:
    If by that you mean, "what if the value of 1 is passed as second
    argument", then, as I stated in one of my previous messages:

    No attempt to read anything from the stream is made, which means that
    end-of-file or I/O error conditions do not arise (unless, perhaps, the
    stream was already in error condition) and the [0] byte of the buffer is
    simply set to '\0'.

    ISO C: "If a read error occurs during the operation, the members of the
    array have unspecified values and a null pointer is returned."

    (I think that stretches to the situation when the error has happened
    already, but clearerr(stream) has not been called to remove the
    condition.)

    Whenever fgets returns null due to not being able to read any characters
    into the array, it should not change the value of the elements of the
    array, even if the reason is that the array has no room.

    We can think about the possibility of fgets returning a pointer
    to a null string when an array of size 1 is used, without advancing
    the stream.

    I find it not so easy to argue that it would not be /conforming/. The
    behavior can be regarded as a straightforward special case of the
    ordinary behavior, when fgets adds one or more characters to the array,
    runs out of room, and then null terminates and exits.

    I find it easy to argue that it's nothing but a bad idea for fgets to
    ever return an empty string.

    The way fgets is defined, it provides a single clear termination signal
    for loops: the null pointer.

    If an implementation of fgets may return an empty string (only
    conceivably allowed in the size 1 array case), then that constitutes an
    additional new termination signal. A program not looking for this
    additional termination signal shall loop indefinitely over a finite
    stream.

    While in that situation, the implementation might be conforming, and be
    processing a strictly conforming program, even if so, the infinite
    looping is a needlessly poor situation which can be avoided by not
    taking that interpretation: i.e. if no characters are added to the array
    for any reason, then have fgets always return NULL, rather than an empty
    string.

    It would be a good idea to add the requirement "fgets shall not
    return a pointer to an empty string" to its description to codify that.
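
    A caller who wants that guarantee today could wrap the call. The
    wrapper below is a hypothetical sketch, not a proposed library
    function; the name fgets_nonempty is an assumption made for the
    example.

```c
#include <assert.h>
#include <stdio.h>

/* Like fgets, but with n < 2 (no room for any character) it returns NULL
   without touching buf or the stream, so a read loop always terminates
   on the null pointer. */
char *fgets_nonempty(char *buf, int n, FILE *fp)
{
    assert(buf != NULL && fp != NULL);
    if (n < 2)
        return NULL;
    return fgets(buf, n, fp);
}
```

    Input that contains embedded null characters can still produce a
    buffer whose first byte is '\0'; this sketch does not address that
    case.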
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Feb 11 13:07:53 2025
    From Newsgroup: comp.lang.c

    On 2/10/25 23:54, Lawrence D'Oliveiro wrote:
    On Mon, 10 Feb 2025 22:33:03 -0500, James Kuyper wrote:
    ...
    "The fgets function reads at most one less than the number of characters
    specified by n from the stream pointed to by stream into the array
    pointed to by s." (7.32.7.2p2)

    If the buffer length is 1, "at most one less than the number ...
    specified" is 0. Therefore, fgets() cannot read any characters into the
    buffer, no matter what the contents of the input stream are. Again,
    since there is no "last byte read into the array", there is no location
    where a null byte should be written.

    Have you tried it? I have.

    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard. That's not particularly surprising -
    calling fgets with useless arguments isn't something that I'd expect to
    be a high priority on their pre-delivery tests.
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Tue Feb 11 21:47:21 2025
    From Newsgroup: comp.lang.c

    On Tue, 11 Feb 2025 13:07:53 -0500, James Kuyper wrote:

    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard.

    GCC is, however, the closest thing we have to a de-facto standard for C.

    Is there another C compiler/runtime that behaves differently?
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Tue Feb 11 13:59:33 2025
    From Newsgroup: comp.lang.c

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    On 2/10/25 23:54, Lawrence D'Oliveiro wrote:
    On Mon, 10 Feb 2025 22:33:03 -0500, James Kuyper wrote:
    ...
    "The fgets function reads at most one less than the number of characters
    specified by n from the stream pointed to by stream into the array
    pointed to by s." (7.32.7.2p2)

    If the buffer length is 1, "at most one less than the number ...
    specified" is 0. Therefore, fgets() cannot read any characters into the
    buffer, no matter what the contents of the input stream are. Again,
    since there is no "last byte read into the array", there is no location
    where a null byte should be written.

    Have you tried it? I have.

    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard. That's not particularly surprising -
    calling fgets with useless arguments isn't something that I'd expect to
    be a high priority on their pre-delivery tests.

    As you know, gcc doesn't implement fgets(). Were you using GNU libc?
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Feb 11 17:44:30 2025
    From Newsgroup: comp.lang.c

    On 2/11/25 16:47, Lawrence D'Oliveiro wrote:
    On Tue, 11 Feb 2025 13:07:53 -0500, James Kuyper wrote:

    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard.

    GCC is, however, the closest thing we have to a de-facto standard for C.

    I've no interest in de-facto standards. I'm only interested in de-jure
    standards such as ISO/IEC 9899:2023. Feel free to have different
    preferences.
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Tue Feb 11 17:58:43 2025
    From Newsgroup: comp.lang.c

    On 2/11/25 16:59, Keith Thompson wrote:
    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    ...
    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard. That's not particularly surprising -
    calling fgets with useless arguments isn't something that I'd expect to
    be a high priority on their pre-delivery tests.

    As you know, gcc doesn't implement fgets(). Were you using GNU libc?

    Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.

    Here's my test code:

    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[])
    {
        char fill = 1;
        char buffer = fill;
        char *retval = NULL;
        FILE *infile;
        if(argc < 2)
            infile = stdin;
        else{
            infile = fopen(argv[1], "r");
            if(!infile)
            {
                perror(argv[1]);
                return EXIT_FAILURE;
            }
        }

        while((retval = fgets(&buffer, 1, infile)) == &buffer)
        {
            printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
            buffer = fill++;
        }
        if(ferror(infile))
            perror("fgets");

        printf("%p!=%p ferror:%d feof:%d '%c'\n",
            (void*)&buffer, (void*)retval,
            ferror(infile), feof(infile), buffer);
    }

    Note that if fgets() works as it should, that's an infinite loop, since
    no data is read in, and therefore there's no movement through the input
    file. I wrote code that executes after the infinite loop just to cover
    the possibility that it doesn't work that way.

  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Tue Feb 11 15:40:15 2025
    From Newsgroup: comp.lang.c

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    On 2/11/25 16:59, Keith Thompson wrote:
    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    ...
    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform to
    the requirements of the standard. That's not particularly surprising -
    calling fgets with useless arguments isn't something that I'd expect to
    be a high priority on their pre-delivery tests.

    As you know, gcc doesn't implement fgets(). Were you using GNU libc?

    Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.

    Here's my test code:

    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[])
    {
        char fill = 1;
        char buffer = fill;
        char *retval = NULL;
        FILE *infile;
        if(argc < 2)
            infile = stdin;
        else{
            infile = fopen(argv[1], "r");
            if(!infile)
            {
                perror(argv[1]);
                return EXIT_FAILURE;
            }
        }

        while((retval = fgets(&buffer, 1, infile)) == &buffer)
        {
            printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
            buffer = fill++;
        }
        if(ferror(infile))
            perror("fgets");

        printf("%p!=%p ferror:%d feof:%d '%c'\n",
            (void*)&buffer, (void*)retval,
            ferror(infile), feof(infile), buffer);
    }

    Note that if fgets() works as it should, that's an infinite loop, since
    no data is read in, and therefore there's no movement through the input
    file. I wrote code that executes after the infinite loop just to cover
    the possibility that it doesn't work that way.

    I get an infinite loop with both glibc and musl on Ubuntu, and under
    Termux on Android (Bionic library implementation):

    $ ./jk < /dev/null | head -n 3
    0:'0'
    0:'0'
    0:'0'
    $ echo hello | ./jk | head -n 3
    -1:'0'
    -1:'0'
    -1:'0'
    $

    With newlib on Cygwin, there is no infinite loop:

    $ ./jk.exe < /dev/null
    0x7ffffcc17!=0x0 ferror:0 feof:0 ''
    $ echo hello | ./jk.exe
    0x7ffffcc17!=0x0 ferror:0 feof:0 ''
    $
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Wed Feb 12 06:16:22 2025
    From Newsgroup: comp.lang.c

    On Tue, 11 Feb 2025 17:44:30 -0500, James Kuyper wrote:

    On 2/11/25 16:47, Lawrence D'Oliveiro wrote:

    On Tue, 11 Feb 2025 13:07:53 -0500, James Kuyper wrote:

    I just tried it, using gcc and found that fgets() does set the first
    byte of the buffer to a null character. Therefore, it doesn't conform
    to the requirements of the standard.

    GCC is, however, the closest thing we have to a de-facto standard for
    C.

    I've no interest in de-facto standards. I'm only interested in de-jure
    standards such as ISO/IEC 9899:2023.

    Where is there an implementation that conforms to that?
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Thu Feb 13 07:14:28 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 9 Feb 2025 17:22:43 -0800
    Andrey Tarasevich <noone@noone.net> wrote:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire buffer
    remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances `fgets` is expected to return an empty
    string? (I.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under assumption that asking `fgets` to read 0 characters is
    supposed to prevent it from detecting end-of-file condition or I/O
    error condition. One can probably do some nitpicking at the current
    wording... but I believe the above is the intent.

    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. [...]

    What about the fgets() function do you think is poorly defined?

    Second question: by "poorly defined" do you mean "defined
    wrongly" or "defined ambiguously" (or both)?
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Thu Feb 13 07:29:44 2025
    From Newsgroup: comp.lang.c

    Andrey Tarasevich <noone@noone.net> writes:

    On Mon 2/10/2025 5:03 PM, Lawrence D'Oliveiro wrote:

    On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:

    I don't see the contradiction. If "no characters are read into the
    array", there is no such thing as "the last byte read into the array",
    so a null byte has no location where it should be written. Therefore,
    there's no reason for changing the contents of the array.

    What if the array is only big enough for one byte? In this case, no
    characters can be read into it. Is a trailing null inserted in this case?

    If by that you mean, "what if the value of 1 is passed as second
    argument", then, as I stated in one of my previous messages:

    No attempt to read anything from the stream is made, which means that
    end-of-file or I/O error conditions do not arise (unless, perhaps, the
    stream was already in error condition) and the [0] byte of the buffer
    is simply set to '\0'.

    You're in good company. Doing a web search turned up this description --

    fgets() - Read a String from a Stream
    Last Updated: 2024-09-20

    #include <stdio.h>
    char *fgets(char *string, int n, FILE *stream);

    General Description

    Reads bytes from a stream pointed to by stream into an array
    pointed to by string, starting at the position indicated by the
    file position indicator. Reading continues until the number of
    characters read is equal to n-1, or until a new-line character
    (\n), or until the end of the stream, whichever comes first. The
    fgets() function stores the result in string and adds a null
    character (\0) to the end of the string. The string includes the
    new-line character, if read.

    The fgets() function is not supported for files opened with
    type=record.

    The fgets() function has the same restriction as any read
    operation for a read immediately following a write or a write
    immediately following a read. Between a write and a subsequent
    read, an intervening flush or reposition must occur. Between a
    read and a subsequent write, an intervening flush or reposition
    must also occur, unless an EOF has been reached.

    Returned Value

    If successful, fgets() returns a pointer to the string buffer.

    If unsuccessful, fgets() returns NULL.

    If n is less than or equal to 0, it indicates a domain error;
    errno is set to EDOM to indicate the cause of the failure.

    When n equals 1, it indicates a valid result. It means that the
    string buffer has only room for the null terminator; nothing is
    physically read from the file. (Such an operation is still
    considered a read operation, so it cannot immediately follow a
    write operation unless an intervening flush or reposition
    operation occurs first.)

    If n is greater than 1, fgets() will only fail if an I/O error
    occurs or if EOF is reached and no data is read from the file.

    Note: You should use ferror() and feof() to determine whether an
    error or an EOF condition occurred. An EOF is only reached when
    an attempt is made to read "past" the last byte of data. Reading
    up to and including the last byte of data does not turn on the EOF
    indicator.

    If EOF is reached after data has already been read into the string
    buffer, fgets() returns a pointer to the string buffer to indicate
    success. A subsequent call would return NULL since fgets() would
    reach EOF without reading any data.


    That description was found on this page:

    https://www.ibm.com/docs/en/zvm/7.4?topic=descriptions-fgets-read-string-from-stream
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Fri Feb 14 16:51:08 2025
    From Newsgroup: comp.lang.c

    On Thu, 13 Feb 2025 07:14:28 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 9 Feb 2025 17:22:43 -0800
    Andrey Tarasevich <noone@noone.net> wrote:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire
    buffer remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances `fgets` is expected to return an empty
    string? (I.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting end-of-file condition or
    I/O error condition. One can probably do some nitpicking at the
    current wording... but I believe the above is the intent.

    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. [...]

    What about the fgets() function do you think is poorly defined?

    Second question: by "poorly defined" do you mean "defined
    wrongly" or "defined ambiguously" (or both)?

    For starters, it looks like the designers of fgets() did not believe in
    their own motto about files being just streams of bytes.
    I don't know the history, so maybe the function was defined this way
    for portability with systems where text files have a special
    record-based structure?

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or failure.
    So why did they encode this information in a baroque way instead of
    something obvious, 0 and 1?
    Appending a zero at the end also feels like a hack, but it is necessary
    because of the main problem. And the main problem is: how is the user
    supposed to figure out how many bytes were read?
    In a well-designed API this question should be answered in O(1) time.
    With fgets(), it can be answered in O(N) time when the input is trusted
    to contain no zeros. When the input is arbitrary, finding the answer is
    even harder and requires quirks.
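
    One such quirk, sketched here purely as an illustration: pre-fill the
    buffer with a non-zero byte and, after fgets() succeeds, search backward
    for the last null byte. fgets_len is a hypothetical name, and the sketch
    relies on the assumption that fgets() leaves the bytes after the
    terminator untouched, which common libraries do but the thread does not
    establish as guaranteed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: number of bytes fgets() stored (embedded nulls
   included), or -1 if fgets() returned a null pointer.
   Assumption: bytes beyond the terminator keep the fill value. */
static long fgets_len(char *buf, int size, FILE *fp)
{
    if (size < 1)
        return -1;
    memset(buf, 0xFF, (size_t)size);      /* any non-zero fill works */
    if (fgets(buf, size, fp) == NULL)
        return -1;
    for (long i = size - 1; i >= 0; i--)  /* last null is the terminator */
        if (buf[i] == '\0')
            return i;
    return -1;                            /* not reached on success */
}
```

    The scan is still O(N) in the buffer size, so it addresses only the
    embedded-zero problem, not the cost.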

    What is my suggestion for an alternative?
    Without thinking too deeply, I'd suggest (ignoring issues of restrict for
    the sake of brevity) a function that gives the same answer as foo() below,
    but hopefully does it faster:

    char* foo(FILE* fp, char* dst, int count, char last_c)
    {
        while (count > 0) {
            int ch = fgetc(fp);
            if (ch == EOF) {
                if (ferror(fp))
                    dst = NULL;
                break;
            }
            *dst++ = ch;
            if (ch == last_c)
                break;
            --count;
        }
        return dst;
    }

    The function foo() is more generic than fgets(). For use instead of
    fgets() it should be accompanied by a standard constant EOL_CHAR.

    I am not completely satisfied with the proposed solution. The API is
    still less obvious than it could be. But it is much better than fgets().
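
    A possible way to use foo() for the O(1) length question is sketched
    below. foo() is repeated from above; read_len is a hypothetical wrapper
    added only for illustration, and '\n' stands in for the EOL_CHAR
    constant mentioned above.

```c
#include <stdio.h>

/* foo() as posted above; note that it stores no terminating null. */
static char *foo(FILE *fp, char *dst, int count, char last_c)
{
    while (count > 0) {
        int ch = fgetc(fp);
        if (ch == EOF) {
            if (ferror(fp))
                dst = NULL;
            break;
        }
        *dst++ = ch;
        if (ch == last_c)
            break;
        --count;
    }
    return dst;
}

/* Hypothetical wrapper: bytes stored for one line (delimiter included),
   0 at end-of-file, -1 on a read error. */
static long read_len(FILE *fp, char *buf, int size)
{
    char *end = foo(fp, buf, size, '\n');
    return end ? (long)(end - buf) : -1L;
}
```

    The length falls out of a pointer subtraction, with no scan of the data
    and no dependence on the absence of zero bytes.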

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Fri Feb 14 15:10:50 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Feb 2025 07:14:28 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. [...]

    What about the fgets() function do you think is poorly defined?

    Second question: by "poorly defined" do you mean "defined
    wrongly" or "defined ambiguously" (or both)?

    For starters, it looks like designers of fgets() did not believe in
    their own motto about files being just streams of bytes.

    How so? The 's' in fgets is for 'string'. File Get String (fgets).

    If there is no string, it returns NULL, just like other string
    functions. Seems quite logical to me.

    I don't know the history, so maybe the function was defined this way
    for portability with systems where text files have special record-based
    structure?

    No, fgets was defined long before C portability to anything other than
    unix was considered. I'd guess it was originally a convenience function
    used in several utilities before being moved to libc.


    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or failure.
    So why did they encode this information in a baroque way instead of
    something obvious, 0 and 1?

    Because it is a string function.

    Appending zero at the end also feels like a hack, but it is necessary

    Because it is a string function, and strings are terminated with a
    NUL byte.

    because of the main problem. And the main problem is: how the user is
    supposed to figure out how many bytes were read?

    Most simply (albeit less performant) by using strlen on the result.

    In a well-designed API this question should be answered in O(1) time.
    With fgets(), it can be answered in O(N) time when input is trusted to
    contain no zeros.

    Which is a prerequisite for using file-get-string (fgets) in the
    first place - strings cannot have embedded NUL-bytes.

    If you're reading non-string data use read/pread/mmap.
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Fri Feb 14 17:23:58 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 15:10:50 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:



    If you're reading non-string data use read/pread/mmap.


    I don't know about you, but in decades of practice I didn't yet
    encounter a situation when I can trust a file input with 100%
    certainty.


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Fri Feb 14 16:46:08 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 14 Feb 2025 15:10:50 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:



    If you're reading non-string data use read/pread/mmap.


    I don't know about you, but in decades of practice I didn't yet
    encounter a situation when I can trust a file input with 100%
    certainty.


    It may be a language difference, but I don't understand what
    you're saying here.

    If I read a data file with a particular format (e.g. ELF), I
    can trust that when I read the header, I'll read the header.

    I would never use fgets for that. fread, perhaps, but I'd
    more likely use mmap when processing a structured binary
    like an ELF or COFF file.

    The header may be garbage, but any file is allowed to have
    corrupt content. In the case of ELF, the magic numbers
    provide some assurance that the rest of the header is
    trustworthy.

  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Fri Feb 14 17:22:59 2025
    From Newsgroup: comp.lang.c

    On 2025-02-14, Michael S <already5chosen@yahoo.com> wrote:
    For starters, it looks like designers of fgets() did not believe in
    their own motto about files being just streams of bytes.

    They obviously did, which is exactly why they painstakingly preserved
    the annoying line terminators in the returned data.

    I don't know the history, so maybe the function was defined this way
    for portability with systems where text files have special record-based
    structure?

    You are sliding into muddled thinking here.

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or failure.

    Why would you assert a claim for which the standard library alone
    is replete with counterexamples: getchar, malloc, getenv, pow, sin?

    Did you mean /the/ return value (of fgets)?

    So why did they encode this information in a baroque way instead of
    something obvious, 0 and 1?

    Because you can express this concept:

    char work_area[SIZE];
    char *line;

    while ((line = fgets(work_area, sizeof work_area, stream)))
    {
        /* process line */
    }

    The work_area just provides storage for the operation: line is the
    returned line.

    The loop would work even if fgets sometimes returned pointers that
    are not to the first byte of work_area. It just so happens that
    they always are.

    It is meaningful to capture the returned value and work with
    it as if it were distinct from the buffer.

    Appending zero at the end also feels like a hack, but it is necessary
    because of the main problem.

    Appending zero is necessary so that the result meets the definition
    of a C character string, without which it cannot be passed into
    string-manipulating functions like strlen.

    Home-grown functions that resemble fgets, but forget to add a null
    byte sometimes, are the subjects of security CVEs.

    And the main problem is: how the user is
    supposed to figure out how many bytes were read?

    Yes, how are they, if you take away the null byte?

    In a well-designed API this question should be answered in O(1) time.

    In the context of C strings, that buys you almost nothing.
    Even if you know the length, it's going to get measured numerous
    more times.

    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;

    strcspn(line, "\n") calculates the length of the prefix of line
    which consists of non-newlines. That value is precisely the
    array index of the first newline, if there is one, or else
    of the terminating null, if there isn't a newline. Either
    way, you can clobber that with a null byte.
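
    A small sketch of the idiom in use. The helper name chomp and its
    return value are assumptions added for illustration: checking what sat
    at that index before overwriting it lets the caller tell a complete
    line from one that had no newline.

```c
#include <string.h>

/* Hypothetical helper: overwrites the first newline (or the existing
   terminator) with a null byte. Returns 1 if a newline was present,
   0 if the string had none. */
static int chomp(char *line)
{
    size_t i = strcspn(line, "\n");   /* index of '\n' or of the final null */
    int had_newline = (line[i] == '\n');

    line[i] = '\0';
    return had_newline;
}
```

    A return of 0 means either an unterminated last line or a line that did
    not fit in the buffer.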

    Once you see the above, you will never do this again:

    newline = strchr(line, '\n');
    if (newline)
        *newline = 0;

    With fgets(), it can be answered in O(N) time when input is trusted to
    contain no zeros.

    We have decided in the C world that text does not contain zeros.

    This has become so pervasive that the remaining naysayers can safely
    be regarded as part of a lunatic fringe.

    Software that tries to support the presence of raw nulls in text is
    actively harmful for security.

    For instance, a piece of text with embedded nulls might have valid
    overall syntax which makes it immune to an injection attack.

    But when it is sent to another piece of software which interprets
    the null as a terminator, the syntax is chopped in half, allowing
    it to be completed by a malicious actor.

    When input is arbitrary, finding out the answer is
    even harder and requires quirks.

    When input is arbitrary, don't use fgets? It's for text.

    The function foo() is more generic than fgets(). For use instead of
    fgets() it should be accompanied by standard constant EOL_CHAR.

    I am not completely satisfied with proposed solution. The API is
    still less obvious than it could be. But it is much better than fgets().

    If last_c is '\n', you're still writing the pesky newline that
    the caller will often want to remove.

    Adding a terminating null and returning a pointer to that null
    would be better.

    You could then call the operation again with the returned dst
    pointer, and it would continue extending the string,
    without obliterating the last character.
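
    A rough sketch of that variant, under assumptions of my own: the name
    read_until, the size_t size parameter (which counts room for the
    terminator), and the choice to keep storing the delimiter are all
    illustrative, not anything settled in this thread.

```c
#include <stdio.h>

/* Hypothetical variant of foo(): stores at most size-1 bytes, always
   null-terminates on success, and returns a pointer to that terminator.
   Returns NULL on a read error (the buffer is then left unterminated)
   or if size is 0. */
static char *read_until(FILE *fp, char *dst, size_t size, int last_c)
{
    if (size == 0)
        return NULL;            /* no room even for the terminator */
    while (size > 1) {
        int ch = fgetc(fp);
        if (ch == EOF) {
            if (ferror(fp))
                return NULL;
            break;
        }
        *dst++ = (char)ch;
        --size;
        if (ch == last_c)
            break;
    }
    *dst = '\0';
    return dst;
}
```

    Calling it again with the returned pointer and the remaining size
    appends to the same string, and the length so far is just the returned
    pointer minus the buffer start.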

    I'm sure I've seen a foo-like function in software before:
    reading delimited by an arbitrary byte, with length signaling.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Fri Feb 14 17:28:30 2025
    From Newsgroup: comp.lang.c

    On 2025-02-14, Michael S <already5chosen@yahoo.com> wrote:
    On Fri, 14 Feb 2025 15:10:50 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:



    If you're reading non-string data use read/pread/mmap.


    I don't know about you, but in decades of practice I didn't yet
    encounter a situation when I can trust a file input with 100%
    certainty.

    With a little care, you can cheerfully process a binary
    executable with fgets, if you open the stream in binary mode (or on
    Unixes where you don't have to).
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Fri Feb 14 11:03:20 2025
    From Newsgroup: comp.lang.c

    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    [...]
    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;
    [...]

    Then how do you detect a partial line? That can occur either if
    the last line doesn't have a terminating newline (on systems that
    permit it) or a line that's too long to fit in the array.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Fri Feb 14 20:23:50 2025
    From Newsgroup: comp.lang.c

    On 14.02.2025 15:51, Michael S wrote:
    [ fgets() poorly defined? ]

    [...]

    Just a comment on this:

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or failure.
    So why did they encode this information in a baroque way instead of
    something obvious, 0 and 1?

    I see it differently; it basically returns a pointer
    to work with on the data, and the special NULL pointer value is
    just the often seen hack where a special pointer value provides
    an error indication.

    Typical application (for me) is

    if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
        // handle error...
    else
        // process data

    Moreover, returning the pointer to the data makes it possible to
    (e.g.) nest string processing functions (including 'fgets') or
    to chain processing or immediate access/dereference the string
    contents.

    IMO the 'fgets' function matches the typical interface for such
    string functions (in C) allowing such programming language idioms
    like the two or three mentioned.

    I think it is generally arguable whether code patterns like
    if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
    can be considered clean code with clean syntax and a clean design.
    But not in a "C" language newsgroup where such things are typical
    (with this function design) as language specific code pattern.

    Janis

    [...]

  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Fri Feb 14 19:34:59 2025
    From Newsgroup: comp.lang.c

    On 2025-02-14, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    [...]
    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;
    [...]

    Then how do you detect a partial line? That can occur either if
    the last line doesn't have a terminating newline (on systems that
    permit it) or a line that's too long to fit in the array.

    Many programs I've seen like this simply don't care. They have some
    'char buf[4096]' and that's that.

    In a program not required or designed to handle arbitrarily
    long lines, you can do something very simple (prior to the
    above line[strcspn(line, "\n")] = 0 expression).

    - zero-initialize the buffer.

    - after every call to fgets, inspect the value of the second-to-last
    array element. If the value is neither zero, nor '\n', then somehow
    diagnose that a too-long line has been presented to the program,
    contrary to its documented limitations.

    This will yield a false positive on an unterminated last line. That
    issue can be added as a documented limitation, or else the buffer can be
    sized one greater than what the documented line length limit requires,
    so that the program allows inner lines to be one character longer than
    the documented limit, but is strict with regard to an unterminated last
    line.
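
    The check described above might look roughly like this. The names
    get_line_checked and line_status are hypothetical, and instead of
    zeroing the whole buffer once, this sketch re-zeroes only the
    second-to-last element before every call, which is my own
    simplification.

```c
#include <stdio.h>

enum line_status { LINE_EOF, LINE_OK, LINE_TOO_LONG };

/* Hypothetical helper: reads one line and reports whether it fit.
   Precondition: size >= 2. LINE_EOF also covers read errors here. */
static enum line_status get_line_checked(FILE *fp, char *buf, size_t size)
{
    if (size < 2)
        return LINE_EOF;        /* nothing useful can be read */
    buf[size - 2] = '\0';       /* sentinel in the second-to-last slot */
    if (fgets(buf, (int)size, fp) == NULL)
        return LINE_EOF;
    /* The sentinel is overwritten only when fgets filled the buffer.
       If it now holds something other than '\n', the line was cut off
       (or it is an unterminated last line of exactly size-1 bytes). */
    if (buf[size - 2] != '\0' && buf[size - 2] != '\n')
        return LINE_TOO_LONG;
    return LINE_OK;
}
```

    After LINE_TOO_LONG the rest of the long line is still in the stream,
    so a caller that continues will see it as a separate chunk on the next
    call.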
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Fri Feb 14 14:38:40 2025
    From Newsgroup: comp.lang.c

    On 2/14/25 14:23, Janis Papanagnou wrote:
    ...
    I see it differently; it basically returns a pointer
    to work with on the data, and the special NULL pointer value is
    just the often seen hack where a special pointer value provides
    an error indication.

    Typical application (for me) is

    if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
    // handle error...
    else
    // process data

    Moreover, returning the pointer to the data makes it possible to
    (e.g.) nest string processing functions (including 'fgets') or
    to chain processing or immediate access/dereference the string
    contents.

    IMO the 'fgets' function matches the typical interface for such
    string functions (in C) allowing such programming language idioms
    like the two or three mentioned.

    I think it is generally arguable whether code patterns like
    if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
    can be considered clean code with clean syntax and a clean design.
    But not in a "C" language newsgroup where such things are typical
    (with this function design) as language specific code pattern.

    As with several of the string processing functions, I think fgets()
    would be better if it returned a pointer to the end of the data that was
    read in, rather than to the beginning. The chaining you talk about does
    not, in general, work properly if the return value from fgets() is NULL,
    or the entire buffer was filled without writing a null character.
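
    For illustration, a wrapper with that end-pointer behaviour might look
    like the sketch below. fgets_end is a made-up name, not a standard
    function, and because it sits on top of fgets() it has to use strlen(),
    so it costs O(N) and is fooled by embedded null characters; a library
    implementation would know the end position directly.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: like fgets(), but on success returns a pointer
   to the terminating null character instead of to the start of buf.
   Returns NULL whenever fgets() returns NULL. */
static char *fgets_end(char *buf, int size, FILE *fp)
{
    if (fgets(buf, size, fp) == NULL)
        return NULL;
    return buf + strlen(buf);
}
```

    With such a return value the length is simply end - buf, and the test
    for a complete line is end > buf && end[-1] == '\n'.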


  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Fri Feb 14 20:51:38 2025
    From Newsgroup: comp.lang.c

    On 14.02.2025 18:22, Kaz Kylheku wrote:
    [ ... ]

    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;

    This is nice.

    In the test code which was the base of this thread I'm relying
    on the existing '\n' and use buf[strlen(buf)-1] = '\0'; to
    remove the last character.
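
    The difference between the two approaches only shows up when the '\n'
    is missing. A small sketch (chop_last and chomp are illustrative names
    only):

```c
#include <string.h>

/* Removes the last character unconditionally.
   Only correct if s is non-empty and really ends in '\n'. */
static void chop_last(char *s)
{
    s[strlen(s) - 1] = '\0';
}

/* The strcspn idiom: overwrites the first '\n' if there is one,
   otherwise harmlessly rewrites the terminating null. */
static void chomp(char *s)
{
    s[strcspn(s, "\n")] = '\0';
}
```

    On an unterminated last line chop_last() eats a data character, and on
    an empty string it would index s[-1]; chomp() leaves both cases alone.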

    [...]

    We have decided in the C world that text does not contain zeros.

    This has become so pervasive that the remaining naysayers can safely
    be regarded as part of a lunatic fringe.

    Software that tries to support the presence of raw nulls in text is
    actively harmful for security.

    Actually, in the same code, I'm also using the strtok() function
    to iterate over the buffer to get pointers to the separate tokens;
    if I'm not mistaken, that function places '\0' characters in the
    buffer to separate the string tokens. This is very efficient and
    (since the original buffer data isn't necessary any more) there's
    no problems (here) with its data interspersed with '\0'; strings
    (the tokens) get accessed through the returned pointers, and the
    buffer is just the physical (now sort of "binary") storage.
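
    As a rough sketch of that usage (split_fields and the delimiter set are
    assumptions made for the example, not the actual test code):

```c
#include <string.h>

/* Hypothetical helper: splits buf in place on spaces and tabs and stores
   up to max token pointers in tokens[]. Returns the number stored.
   strtok() overwrites the delimiter that ends each token with '\0',
   so the tokens point into buf itself and nothing is copied. */
static size_t split_fields(char *buf, char *tokens[], size_t max)
{
    size_t n = 0;

    for (char *tok = strtok(buf, " \t");
         tok != NULL && n < max;
         tok = strtok(NULL, " \t"))
        tokens[n++] = tok;

    return n;
}
```

    After the call, buf is no longer one string but a sequence of
    null-terminated tokens, reachable only through the stored pointers.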

    Janis

    [...]


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Fri Feb 14 20:01:16 2025
    From Newsgroup: comp.lang.c

    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    [...]
    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;
    [...]

    Then how do you detect a partial line? That can occur either if
    the last line doesn't have a terminating newline (on systems that
    permit it) or a line that's too long to fit in the array.

    I terminate the line at the first newline, if present.

    while ((line = fgets(buffer, sizeof(buffer), f)) != NULL) {
        char *cp;

        if ((cp = strchr(line, '\n')) != NULL) {
            *cp = '\0';
        }

        history_set_pos(0);
        if (history_search(line, 1) == -1) {
            add_history(line);
        }

        commands.parse_and_execute(line);
    }


    Not the most efficient, perhaps, but better than using
    getchar to read the line.
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Fri Feb 14 21:02:19 2025
    From Newsgroup: comp.lang.c

    On 14.02.2025 20:38, James Kuyper wrote:

    As with several of the string processing functions, I think fgets()
    would be better if it returned a pointer to the end of the data that was
    read in, rather than to the beginning.

    Yes, that's another option. The language designers have to decide which
    behavior is more useful. There's pros and cons, IMO. On the minus side
    would be that the origin of the string gets lost that way. (Of course
    you can adjust your code then to keep a copy. But any way implemented,
    you need to adjust your code according to how the function is defined.)

    The chaining you talk about does
    not, in general, work properly if the return value from fgets() is NULL,

    Yes.

    or the entire buffer was filled without writing a null character.

    I read the man page as saying that this, at least, is guaranteed:

    "A terminating null byte ('\0') is stored
    after the last character in the buffer."

    (also in cases where no EOL or EOF is read).

    Janis

  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Sat Feb 15 08:12:33 2025
    From Newsgroup: comp.lang.c

    Andrey Tarasevich <noone@noone.net> writes:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire
    buffer remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances is `fgets` expected to return an empty
    string (i.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting end-of-file condition or
    I/O error condition. One can probably do some nitpicking at the
    current wording... but I believe the above is the intent.

    Clearly there are more than a few C implementors who agree with that.
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Sat Feb 15 08:37:20 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 13 Feb 2025 07:14:28 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 9 Feb 2025 17:22:43 -0800
    Andrey Tarasevich <noone@noone.net> wrote:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire
    buffer remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances is `fgets` expected to return an empty
    string (i.e. set the [0] entry of the buffer to '\0' and return
    non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting end-of-file condition or
    I/O error condition. One can probably do some nitpicking at the
    current wording... but I believe the above is the intent.

    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. [...]

    What about the fgets() function do you think is poorly defined?

    Second question: by "poorly defined" do you mean "defined
    wrongly" or "defined ambiguously" (or both)?

    For starters, it looks like the designers of fgets() did not believe in
    their own motto about files being just streams of bytes.
    I don't know the history, so maybe the function was defined this way
    for portability with systems where text files have a special
    record-based structure?

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or failure.
    So why did they encode this information in a baroque way instead of
    something obvious, 0 and 1?
    Appending zero at the end also feels like a hack, but it is necessary
    because of the main problem. And the main problem is: how is the user
    supposed to figure out how many bytes were read?
    In a well-designed API this question should be answered in O(1) time.
    With fgets(), it can be answered in O(N) time when input is trusted to
    contain no zeros. When input is arbitrary, finding out the answer is
    even harder and requires quirks.

    If I understand you correctly your complaint is that the existing
    semantics are not as useful as you would like them to be, even
    though the current definition does make the behavior well defined.
    Is that right?

    Clearly using fgets() is problematic when the input stream might
    contain null characters. To me it seems obvious that the original
    implementors expected that fgets() would not be used in such cases,
    perhaps with the less severe restriction that the presence of
    embedded nulls could be detected and simply rejected as bad input,
    much the same as overly long lines or a final line without a
    terminating newline character.
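
    One way such a rejection could be sketched, under the assumption that
    the buffer holds at least 2 characters and that fgets() has just
    returned non-null for it (chunk_looks_clean is a hypothetical name).
    The idea is that fgets() stops only at a newline, at a full buffer, or
    at end-of-file, so a string that ends early for none of those reasons
    must have been cut short by a null character in the input:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check, to be called right after a successful fgets().
   Returns 1 if the chunk in buf looks like ordinary text,
           0 if an embedded null character is suspected. */
static int chunk_looks_clean(const char *buf, size_t bufsize, FILE *fp)
{
    size_t len = strlen(buf);

    if (len == 0)
        return 0;               /* fgets read something, yet the string is empty */
    if (buf[len - 1] == '\n')
        return 1;               /* complete line */
    if (len == bufsize - 1)
        return 1;               /* buffer full: part of a long line */
    return feof(fp) != 0;       /* short and unterminated is fine only at EOF */
}
```

    The check is not complete: a null character inside an unterminated
    final line goes unnoticed, because that case is indistinguishable from
    a short last line.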
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 19:02:55 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 14.02.2025 18:22, Kaz Kylheku wrote:
    [ ... ]

    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;

    This is nice.

    In the test code which was the base of this thread I'm relying
    on the existing '\n' and use buf[strlen(buf)-1] = '\0'; to
    remove the last character.

    [...]

    We have decided in the C world that text does not contain zeros.

    This has become so pervasive that the remaining naysayers can safely
    be regarded as part of a lunatic fringe.

    Software that tries to support the presence of raw nulls in text is
    actively harmful for security.

    Actually, in the same code, I'm also using the strtok() function
    to iterate over the buffer to get pointers to the separate tokens;
    if I'm not mistaken, that function places '\0' characters in the
    buffer to separate the string tokens. This is very efficient and
    (since the original buffer data isn't necessary any more) there's
    no problems (here) with its data interspersed with '\0'; strings
    (the tokens) get accessed through the returned pointers, and the
    buffer is just the physical (now sort of "binary") storage.

    Janis

    [...]




  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 19:29:11 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.
    If you only care about POSIX targets, then I'd recommend avoiding strtok
    and using strtok_r().



  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 19:41:11 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 17:22:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:


    Did you mean /the/ return value (of fgets)?


    Yes, I did.

  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 19:53:10 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 21:02:19 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 14.02.2025 20:38, James Kuyper wrote:

    As with several of the string processing functions, I think fgets()
    would be better if it returned a pointer to the end of the data
    that was read in, rather than to the beginning.

    Yes, that's another option. The language designers have to decide which
    behavior is more useful. There's pros and cons, IMO.

    IMHO, there are no cons.
    Returning pointer to the end of data is very obviously superior.

    On the minus side
    would be that the origin of the string gets lost that way.

    Huh?
    How could you lose something you just passed to the function?
    In most typical code, it's not even a complex expression or pointer,
    but name of array.

    (Of course
    you can adjust your code then to keep a copy. But any way implemented,
    you need to adjust your code according to how the function is
    defined.)



  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 20:06:02 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 19:34:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-02-14, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    [...]
    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;
    [...]

    Then how do you detect a partial line? That can occur either if
    the last line doesn't have a terminating newline (on systems that
    permit it) or a line that's too long to fit in the array.

    I've seen that many programs like this don't care. They have some
    'char buf[4096]' and that's that.


    IMHO, even a program that is not designed to handle long lines should
    give an informative error diagnostic when it encounters one.
    Your trick described below is good for that, but one has to be a rather
    good programmer in order to invent such a trick. I believe that if one
    has to be good in order to use the API, it's a clear indication that the
    API is *not* good.

    Similarly, IMHO, programs not designed to handle the presence of null
    characters in the text should give an informative error diagnostic when
    they are encountered. And in that, fgets() is especially unhelpful.

    In a program not required or designed to handle arbitrarily
    long lines, you can do something very simple (prior to the
    above line[strcspn(line, "\n")] = 0 expression).

    - zero-initialize the buffer.

    - after every call to fgets, inspect the value of the second-to-last
    array element. If the value is neither zero, nor '\n', then somehow
    diagnose that a too-long line has been presented to the program,
    contrary to its documented limitations.

    This will yield a false positive on an unterminated last line. That
    issue can be added as a documented limitation, or else the buffer can
    be sized one greater than what the documented line length limit
    requires, so that the program allows inner lines to be one character
    longer than the documented limit, but is strict with regard to an unterminated last line.



  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 20:08:56 2025
    From Newsgroup: comp.lang.c

    On Sat, 15 Feb 2025 08:37:20 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 13 Feb 2025 07:14:28 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    On Sun, 9 Feb 2025 17:22:43 -0800
    Andrey Tarasevich <noone@noone.net> wrote:

    On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:

    On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:

    On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:

    If `fgets` reads nothing (instant end-of-file), the entire
    buffer remains untouched.

    You mean, only a single null byte gets written.

    No. The buffer is not changed at all in such case.

    ... which actually raises an interesting quiz/puzzle/question:

    Under what circumstances is `fgets` expected to return an
    empty string (i.e. set the [0] entry of the buffer to '\0' and
    return non-null)?

    The only answer I can see right away is:

    When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
    read 0 characters.

    This is under assumption that asking `fgets` to read 0 characters
    is supposed to prevent it from detecting end-of-file condition or
    I/O error condition. One can probably do some nitpicking at the
    current wording... but I believe the above is the intent.

    fgets() is one of many poorly defined standard library functions
    inherited from early UNIX days. [...]

    What about the fgets() function do you think is poorly defined?

    Second question: by "poorly defined" do you mean "defined
    wrongly" or "defined ambiguously" (or both)?

    For starter, it looks like designers of fgets() did not believe in
    their own motto about files being just streams of bytes.
    I don't know the history, so, may be, the function was defined this
    way for portability with systems where text files have special
    record-based structure?

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or
    failure. So why did they encode this information in baroque way
    instead of something obvious, 0 and 1?
    Appending zero at the end also feels like a hack, but it is
    necessary because of the main problem. And the main problem is:
    how the user is supposed to figure out how many bytes were read?
    In well-designed API this question should be answered in O(1) time.
    With fgets(), it can be answered in O(N) time when input is trusted
    to contain no zeros. When input is arbitrary, finding out the
    answer is even harder and requires quirks.

    If I understand you correctly your complaint is that the existing
    semantics are not as useful as you would like them to be, even
    though the current definition does make the behavior well defined.
    Is that right?


    Yes.

    Clearly using fgets() is problematic when the input stream might
    contain null characters. To me it seems obvious that the original
    implementors expected that fgets() would not be used in such cases,
    perhaps with the less severe restriction that the presence of
    embedded nulls could be detected and simply rejected as bad input,
    much the same as overly long lines or a final line without a
    terminating newline character.

    My impression is that they didn't spend much time thinking.





  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Feb 15 20:29:15 2025
    From Newsgroup: comp.lang.c

    On Fri, 14 Feb 2025 17:22:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-02-14, Michael S <already5chosen@yahoo.com> wrote:
    For starter, it looks like designers of fgets() did not believe in
    their own motto about files being just streams of bytes.

    They obviously did, which is exactly why they painstakingly preserved
    the annoying line terminators in the returned data.

    I don't know the history, so, may be, the function was defined this
    way for portability with systems where text files have special
    record-based structure?

    You are sliding into muddled thinking here.

    Then, everything about it feels inelegant.
    A return value carries just 1 bit of information, success or
    failure.

    Why would you assert a claim for which the standard library alone
    is replete with counterexamples: getchar, malloc, getenv, pow, sin.

    Did you mean /the/ return value (of fgets)?

    So why did they encode this information in baroque way instead of
    something obvious, 0 and 1?

    Because you can express this concept:

    char work_area[SIZE];
    char *line;

    while ((line = fgets(work_area, sizeof work_area, stream)))
    {
    /* process line */
    }

    The work_area just provides storage for the operation: line is the
    returned line.

    The loop would work even if fgets sometimes returned pointers that
    are not to the first byte of work_area. It just so happens that
    they always are.

    It is meaningful to capture the returned value and work with
    it as if it were distinct from the buffer.

    Appending zero at the end also feels like a hack, but it is
    necessary because of the main problem.

    Appending zero is necessary so that the result meets the definition
    of a C character string, without which it cannot be passed into
    string-manipulating functions like strlen.

    Home-grown functions that resemble fgets, but forget to add a null
    byte sometimes, are the subjects of security CVEs.

    And the main problem is: how the user is
    supposed to figure out how many bytes were read?

    Yes, how are they, if you take away the null byte?

    In well-designed API this question should be answered in O(1) time.


    In the context of C strings, that buys you almost nothing.
    Even if you know the length, it's going to get measured numerous
    more times.

    It would be good if fgets nuked the terminating newline.

    Many uses of fgets, after every operation, look for the newline
    and nuke it, before doing anything else.

    There is a nice idiom for that, by the way, which avoids a
    temporary variable and an if test:

    line[strcspn(line, "\n")] = 0;

    strcspn(line, "\n") calculates the length of the prefix of line
    which consists of non-newlines. That value is precisely the
    array index of the first newline, if there is one, or else
    of the terminating null, if there isn't a newline. Either
    way, you can clobber that with a null.

    Once you see the above, you will never do this again:

    newline = strchr(line, '\n');
    if (newline)
    *newline = 0;

    With fgets(), it can be answered in O(N) time when input is trusted
    to contain no zeros.

    We have decided in the C world that text does not contain zeros.


    Yes, for internal data.
    External inputs have to be sanitized.

    This has become so pervasive that the remaining naysayers can safely
    be regarded as part of a lunatic fringe.

    Software that tries to support the presence of raw nulls in text is
    actively harmful for security.

    For instance, a piece of text with embedded nulls might have valid
    overall syntax which makes it immune to an injection attack.

    But when it is sent to another piece of software which interprets
    the null as a terminator, the syntax is chopped in half, allowing
    it to be completed by a malicious actor.


    I don't quite understand. In particular, I don't understand if you
    argue in favor of fgets() or against it.

    When input is arbitrary, finding out the answer is
    even harder and requires quirks.

    When input is arbitrary, don't use fgets? It's for text.

    The function foo() is more generic than fgets(). For use instead of
    fgets() it should be accompanied by standard constant EOL_CHAR.

    I am not completely satisfied with proposed solution. The API is
    still less obvious than it could be. But it is much better than
    fgets().

    If last_c is '\n', you're still writing the pesky newline that
    the caller will often want to remove.

    Adding a terminating null and returning a pointer to that null
    would be better.


    If the caller wants it, it can easily do it by itself.
    OTOH, if we follow your proposal, we lose information about the
    presence/absence of EOL at the end of the file. I think that for a generic
    function it's better not to lose any information, even
    information that is not useful for 99.99% of the callers.

    You could then call the operation again with the returned dst
    pointer, and it would continue extending the string,
    without obliterating the last character.

    I'm sure I've seen a foo-like function in software before:
    reading delimited by an arbitrary byte, with length signaling.


    I certainly do not pretend that I invented anything new here.
    Nor did I pretend that it's the best possible.
    Moreover, I'd like it to be even more mundane. I just can't figure out how to
    do it without adding one more [pointer] parameter.

    One obvious possibility is to return # of characters read instead of
    pointer. Then 0 can mean EOF and negative values can mean I/O errors.
    But that is also not sufficiently boring.
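
    That last possibility could be sketched roughly like this. The name
    read_line_counted, the long return type and the decision to drop a
    partial line on error are all assumptions made for the example, and the
    buffer is assumed to hold at least 2 characters so that a return of 0
    can only mean EOF:

```c
#include <stdio.h>

/* Hypothetical count-returning reader, not a standard function.
   Stores at most size-1 characters, stopping after a '\n', and always
   null-terminates. Returns the number of characters stored (embedded
   nulls are counted), 0 at end-of-file, or -1 on a read error. */
static long read_line_counted(char *buf, size_t size, FILE *fp)
{
    size_t n = 0;
    int c = 0;

    if (size == 0)
        return -1;

    while (n + 1 < size && (c = getc(fp)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;
    }
    buf[n] = '\0';

    if (c == EOF && ferror(fp))
        return -1;
    return (long)n;
}
```

    The length is then known in O(1), and buf[n - 1] == '\n' tells whether
    the line was complete.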









  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 04:29:20 2025
    From Newsgroup: comp.lang.c

    On 15.02.2025 18:29, Michael S wrote:
    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.

    I know that it's not thread-safe. (You can't miss that information
    if you look up the man page to inspect the function interface.)

    If you only care about POSIX targets, then I'd recommend avoiding strtok
    and using strtok_r().

    But since I don't use threads - neither here nor have I ever needed
    them generally in my "C" contexts - that's unnecessary. Isn't it?

    Moreover, I prefer functions with a simpler interface to functions
    with a more clumsy one (I mean the 'char **saveptr' part); so why
    use the complex one in the first place if it just complicates its
    use and reduces the code clarity unnecessarily.

    Re "more problematic functions in C library"...
    I had to chuckle on that; if you're coming from other languages
    most "C" functions - especially the low-level "C" functions that
    operate on memory with pointers - don't look "unproblematic". :-)

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 04:33:17 2025
    From Newsgroup: comp.lang.c

    On 15.02.2025 19:29, Michael S wrote:

    One obvious possibility is to return # of characters read instead of
    pointer. Then 0 can mean EOF and negative values can mean I/O errors.
    But that is also not sufficiently boring.

    But isn't that the (already existing) interface that 'fread()' had
    been designed for?

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 04:48:20 2025
    From Newsgroup: comp.lang.c

    On 15.02.2025 18:53, Michael S wrote:
    On Fri, 14 Feb 2025 21:02:19 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 14.02.2025 20:38, James Kuyper wrote:

    As with several of the string processing functions, I think fgets()
    would be better if it returned a pointer to the end of the data
    that was read in, rather than to the beginning.

    Yes, that's another option. The language designers have to decide which
    behavior is more useful. There's pros and cons, IMO.

    IMHO, there are no cons.

    If you think so.

    Returning pointer to the end of data is very obviously superior.

    I seem to recall to have had uses for both variants in the past. (Not
    that it would have made a big difference.)


    On the minus side
    would be that the origin of the string gets lost that way.

    Huh?
    How could you lose something you just passed to the function?

    For example if you use idioms like s = str...(t++, ...) .

    In most typical code, it's not even a complex expression or pointer,
    but name of array.

    I really don't want to argue with you about what is "The Best" design.

    Personally I took advantage of how it's actually in "C" defined, and I
    also occasionally missed the other design variant in some other cases.
    (No design variant would have prevented me from "working around" the
    effects of the missing one.)

    And the "C" lib designers are not stupid; I'd think they had considered
    what interface they implement. (Just speculating here of course.)

    Janis


    (Of course
    you can adjust your code then to keep a copy. But any way implemented,
    you need to adjust your code according to how the function is
    defined.)




  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.lang.c on Sun Feb 16 01:04:11 2025
    From Newsgroup: comp.lang.c

    On 2/15/25 22:29, Janis Papanagnou wrote:
    On 15.02.2025 18:29, Michael S wrote:
    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.

    I know that it's not thread-safe. (You can't miss that information
    if you look up the man page to inspect the function interface.)

    If you only care about POSIX targets, then I'd recommend avoiding strtok
    and using strtok_r().

    If you cannot assume POSIX, but can assume C2011 or later, you might be
    able to use strtok_s() instead. You need to add

    #ifdef __STDC_LIB_EXT1__
        #define __STDC_WANT_LIB_EXT1__ 1
        // strtok_s() will be declared in <string.h>
    #endif
    #include <string.h>

    But since I don't use threads - neither here nor have I ever needed
    them generally in my "C" contexts - that's unnecessary. Isn't it?

    No. What makes strtok() problematic can come up without any use of
    threads. Consider for the moment a bug I had to investigate. A function
    that was looping through strtok() calls to parse a string called a
    utility function during each pass through the loop. The utility function
    also called strtok() in a loop to parse an entirely different string for
    a different purpose. Exercise for the student: figure out what the
    consequences were.
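
    A sketch of the same nesting written with POSIX strtok_r(), where each
    loop keeps its own position (count_fields and count_words are
    hypothetical stand-ins for the two functions; the strings are made up).
    With plain strtok() in both places, the inner loop would overwrite the
    single hidden position, and the outer loop would be expected to stop
    after its first token:

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Hypothetical utility: counts comma-separated fields in a scratch copy. */
static int count_fields(const char *csv)
{
    char copy[64];
    char *save;                 /* position of the inner loop only */
    int n = 0;

    strncpy(copy, csv, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    for (char *f = strtok_r(copy, ",", &save); f != NULL;
         f = strtok_r(NULL, ",", &save))
        n++;
    return n;
}

/* Outer loop: counts words and calls the utility on every pass. */
static int count_words(char *text)
{
    char *save;                 /* position of the outer loop only */
    int words = 0;

    for (char *w = strtok_r(text, " ", &save); w != NULL;
         w = strtok_r(NULL, " ", &save)) {
        (void)count_fields("x,y,z");
        words++;
    }
    return words;
}
```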

  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Sun Feb 16 07:32:23 2025
    From Newsgroup: comp.lang.c

    On 2025-02-15, Michael S <already5chosen@yahoo.com> wrote:
    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.

    The design of the strtok() API is not inherently unsafe against threads;
    but it requires thread-local storage to be safe.

    Since ISO C has threads now, it now takes the opportunity to
    explicitly remove any requirements for thread safety in strtok.

    However, it is possible for an implementation to step forward and
    make it thread safe. For instance, in a POSIX system, a thread-specific
    key can be allocated for strtok on library initialization,
    or the first use of strtok (via pthread_once).

    static pthread_key_t strtok_key;

    // ...

    if (pthread_key_create(&strtok_key, NULL))
    ...

    Then strtok does

    char *strtok (char * restrict str, const char * restrict delim)
    {
        if (str == NULL)
            str = pthread_getspecific(strtok_key);

        ...

        // all return paths do this, if str has changed:
        pthread_setspecific(strtok_key, str);
        return ...;
    }

    Only problem is that this will not perform anywhere near as well as
    strtok_r, which specifies an inexpensive location for the context
    pointer.
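
    For comparison, a sketch of the same idea using C11 _Thread_local
    instead of a pthread key. This swaps in a different mechanism from
    the one outlined above; tl_strtok is a hypothetical name, and the
    body is a plain strspn/strcspn tokenizer, not any library's actual
    implementation.

```c
#include <string.h>

/* Hypothetical strtok variant whose saved context is per-thread (C11). */
static char *tl_strtok(char *str, const char *delim)
{
    static _Thread_local char *saved;   /* one context per thread */

    if (str == NULL)
        str = saved;                    /* continue the previous scan */
    if (str == NULL)
        return NULL;                    /* nothing left to scan */

    str += strspn(str, delim);          /* skip leading delimiters */
    if (*str == '\0') {
        saved = NULL;
        return NULL;
    }

    char *end = str + strcspn(str, delim);
    if (*end != '\0') {
        *end = '\0';                    /* terminate the token */
        saved = end + 1;                /* resume after the delimiter */
    } else {
        saved = NULL;                   /* token ran to end of string */
    }
    return str;
}
```

    Each thread gets its own context this way, but nested use within
    one thread still shares that single context.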

    If you only care about POSIX targets, then I'd recommend avoiding
    strtok and using strtok_r().

    I would recommend learning about strspn and strcspn, and writing
    your own tokenizing loop:

    /* strtok-like loop: input variables are str and delim */

    for (;;) {
        /* skip delim chars to find start of tok */
        char *tok = str + strspn(str, delim);

        /* tokens must be nonempty */
        if (*tok == 0)
            break;

        /* OK; tok points to non-delim char.
           Find end of token: skip span of non-delim chars,
           measured from tok, not from str. */
        char *end = tok + strcspn(tok, delim);

        /* Record whether the end of the token is the end
           of the string. */
        char more = *end;

        /* null-terminate token */
        *end = 0;

        { /* process tok here */ }

        if (!more)
            break;

        /* If there is more material after the tok, point
           str there and continue */
        str = end + 1;
    }

    The strtok function is ill-suited to many situations. For instance,
    there are situations in which you do want empty tokens, like CSV, such
    that ",abc,def," shows four tokens, two of them empty.

    With the strspn and strcspn building blocks, you can easily whip up a
    custom tokenizing loop that has the right semantics for the situation.

    We can also write our loop such that it restores the original
    character that was overwritten in order to null-terminate the token,
    simply by adding *end = more. Thus when the loop ends, the string
    is restored to its original state.
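
    A sketch of such a variant, assuming CSV-like input where empty
    fields count. The names each_field and count_empty, and the callback
    interface, are hypothetical choices for illustration. Leaving out
    the strspn step is what keeps empty fields, and the *end = more
    assignment restores the string after each field.

```c
#include <string.h>

/* Hypothetical helper: walks the fields of `str` separated by any char
   in `delim`, keeping empty fields. Each field is temporarily
   null-terminated, handed to `visit` (if non-null), and then the
   overwritten character is restored. Returns the number of fields. */
static int each_field(char *str, const char *delim,
                      void (*visit)(const char *field, void *ctx), void *ctx)
{
    int count = 0;

    for (;;) {
        char *end = str + strcspn(str, delim); /* no strspn: empties count */
        char more = *end;                      /* 0 at end of string */

        *end = 0;                              /* null-terminate field */
        if (visit)
            visit(str, ctx);
        *end = more;                           /* restore original char */
        count++;

        if (!more)
            break;
        str = end + 1;
    }
    return count;
}

/* Example visitor: counts empty fields into the int that ctx points to. */
static void count_empty(const char *field, void *ctx)
{
    if (*field == '\0')
        ++*(int *)ctx;
}
```

    On a writable copy of ",abc,def," with delim ",", this visits four
    fields, two of them empty, and the string compares equal to the
    original afterwards.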

    I can understand code like that above without having to look up
    anything, but if I see strtok or strtok_r code after many years of not
    working with strtok, I will need a refresher on how exactly they define
    a token.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.c on Sun Feb 16 07:37:09 2025
    From Newsgroup: comp.lang.c

    On 2025-02-16, James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
    No. What makes strtok() problematic can come up without any use of
    threads. Consider for the moment a bug I had to investigate. A function
    that was looping through strtok() calls to parse a string called a
    utility function during each pass through the loop. The utility function
    also called strtok() in a loop to parse an entirely different string for
    a different purpose. Exercise for the student: figure out what the
    consequences were.

    Moreover, if strtok is thread-safe thanks to using thread-specific
    storage for the context, that will not make it recursion-safe. It will
    make the bug behave predictably for the same inputs, no matter how
    other threads use strtok.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sun Feb 16 10:48:44 2025
    From Newsgroup: comp.lang.c

    On Sun, 16 Feb 2025 04:29:20 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 15.02.2025 18:29, Michael S wrote:
    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.

    I know that it's not thread-safe. (You can't miss that information
    if you look up the man page to inspect the function interface.)

    If you only care about POSIX targets, then I'd recommend avoiding
    strtok and using strtok_r().

    But since I don't use threads - neither here nor did I ever need
    them generally in my "C" contexts - that's unnecessary. Isn't it?

    Moreover, I prefer functions with a simpler interface to functions
    with a more clumsy one (I mean the 'char **saveptr' part); so why
    use the complex one in the first place if it just complicates its
    use and reduces the code clarity unnecessarily.


    I don't see how an explicit context variable can be considered less
    clear than context hidden within the library in a non-obvious way (see
    Kaz's post, which points out that there are at least two options for
    how exactly it could be handled, with different semantics).

    Re "more problematic functions in C library"...
    I had to chuckle on that; if you're coming from other languages
    most "C" functions - especially the low-level "C" functions that
    operate on memory with pointers - don't look "unproblematic". :-)

    Janis


    I tend to have no problems with low-level C RTL functions, in
    particular those whose names start with 'mem'. More problems with some
    of those that try to be "higher level", for example, strcat(). Even more
    with those that their designers probably considered 'object-oriented',
    like strtok().

  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sun Feb 16 11:05:46 2025
    From Newsgroup: comp.lang.c

    On Sun, 16 Feb 2025 07:32:23 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-02-15, Michael S <already5chosen@yahoo.com> wrote:
    On Fri, 14 Feb 2025 20:51:38 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Actually, in the same code, I'm also using the strtok() function

    strtok() is one of the relatively small set of more problematic
    functions in the C library that are not thread-safe.

    The design of the strtok() API is not inherently unsafe against
    threads; but it requires thread-local storage to be safe.

    Since ISO C has threads now, it now takes the opportunity to
    explicitly remove any requirements for thread safety in strtok.

    However, it is possible for an implementation to step forward and
    make it thread safe. For instance, in a POSIX system, a
    thread-specific key can be allocated for strtok on library
    initialization, or the first use of strtok (via pthread_once).

    static pthread_key_t strtok_key;

    // ...

    if (pthread_key_create(&strtok_key, NULL))
    ...

    Then strtok does

    char *strtok (char * restrict str, const char * restrict delim)
    {
    if (str == NULL)
    str = pthread_getspecific(strtok_key);

    ...

    // all return paths do this, if str has changed:
    pthread_setspecific(strtok_key, str);
    return ...;
    }

    Only problem is that this will not perform anywhere near as well as
    strtok_r, which specifies an inexpensive location for the context
    pointer.

    If you only care about POSIX targets, then I'd recommend avoiding
    strtok and using strtok_r().

    I would recommend learning about strspn and strcspn, and writing
    your own tokenizing loop:

    /* strtok-like loop: input variables are str and delim */

    for (;;) {
    /* skip delim chars to find start of tok */
    char *tok = str + strspn(str, delim);

    /* tokens must be nonempty */
    if (*tok == 0)
    break;

    /* OK; tok points to non-delim char.
    Find end of token: skip span of non-delim chars. */
    char *end = tok + strcspn(tok, delim);

    /* Record whether the end of the token is the end
    of the string. */
    char more = *end;

    /* null-terminate token */
    *end = 0;

    { /* process tok here */ }

    if (!more)
    break;

    /* If there is more material after the tok, point
    str there and continue */
    str = end + 1;
    }

    The strtok function is ill-suited to many situations. For instance,
    there are situations in which you do want empty tokens, like CSV, such
    that ",abc,def," shows four tokens, two of them empty.

    With the strspn and strcspn building blocks, you can easily whip up a
    custom tokenizing loop that has the right semantics for the situation.

    We can also write our loop such that it restores the original
    character that was overwritten in order to null-terminate the token,
    simply by adding *end = more. Thus when the loop ends, the string
    is restored to its original state.

    I can understand code like that above without having to look up
    anything, but if I see strtok or strtok_r code after many years of not
    working with strtok, I will need a refresher on how exactly they
    define a token.


    For parsing something important and relatively well-defined, like
    CSV, I'd very seriously consider the option of not using the standard
    str* utilities at all, with the exception of those where coding your
    own requires special expertise, i.e. primarily strtod(). BTW, even
    strtod() can't be blindly relied on for .csv, because it accepts hex
    floats, while a standard CSV parser has to reject them.
    Most likely, avoiding fgets() is also a good idea in this case.
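
    One way to keep strtod() while rejecting hex floats is to pre-filter
    the field. The sketch below assumes the "C" locale (decimal point
    '.') and assumes that a numeric field must be pure decimal notation
    with no surrounding whitespace; parse_decimal_field is a hypothetical
    name, and real CSV dialects may need different rules.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses a whole CSV field as a plain decimal
   number. Returns 1 and stores the value on success, 0 otherwise. */
static int parse_decimal_field(const char *field, double *out)
{
    /* Allow only characters that can appear in decimal notation; this
       rejects hex floats ("0x1p3"), "inf", "nan" and leading spaces
       before strtod() ever sees them. */
    if (*field == '\0' || field[strspn(field, "+-0123456789.eE")] != '\0')
        return 0;

    char *end;
    errno = 0;
    double v = strtod(field, &end);

    /* Require that strtod consumed the entire field and stayed in range. */
    if (end == field || *end != '\0' || errno == ERANGE)
        return 0;

    *out = v;
    return 1;
}
```

    With this filter "1.5" and "-2e3" are accepted, while "0x1p3", the
    empty string and a dangling exponent such as "1e" are rejected.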


  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 18:59:31 2025
    From Newsgroup: comp.lang.c

    On 16.02.2025 07:04, James Kuyper wrote:
    On 2/15/25 22:29, Janis Papanagnou wrote:

    But since I don't use threads - neither here nor did I ever need
    them generally in my "C" contexts - that's unnecessary. Isn't it?

    No. What makes strtok() problematic can come up without any use of
    threads. Consider for the moment a bug I had to investigate. A function
    that was looping through strtok() calls to parse a string called a
    utility function during each pass through the loop. The utility function
    also called strtok() in a loop to parse an entirely different string for
    a different purpose. [...]

    You can construct any situations that don't apply to my application.

    All relevant things I can infer from strtok() is that it has to use
    static state information (which naturally doesn't support re-entrant
    code). I see that this obviously also consequently implies that it's
    not-thread safe and that you also obviously cannot nest calls as you
    depicted it above (and I think this is even documented for those to
    whom it may not be obvious).

    So again; if it's unnecessary here why should I prefer using a more
    clumsy code than necessary that makes the code less clear?

    If I'd write (for example) a library function to parse tokens then
    I'd certainly not use this function because I don't want conflicts
    and dependencies on the surrounding context of other code that uses
    this library function.

    But, again, in my application context it makes no sense.

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 19:14:31 2025
    From Newsgroup: comp.lang.c

    On 16.02.2025 09:48, Michael S wrote:
    On Sun, 16 Feb 2025 04:29:20 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    Moreover, I prefer functions with a simpler interface to functions
    with a more clumsy one (I mean the 'char **saveptr' part); so why
    use the complex one in the first place if it just complicates its
    use and reduces the code clarity unnecessarily.

    I don't see how an explicit context variable can be considered less
    clear than context hidden within the library in a non-obvious way [...]

    Explicitly maintaining unnecessary parameters and providing additional
    code for the logic to handle that unnecessarily is not obviously less
    clear to you? - Then I cannot help you, sorry.


    Re "more problematic functions in C library"...
    I had to chuckle on that; if you're coming from other languages
    most "C" functions - especially the low-level "C" functions that
    operate on memory with pointers - don't look "unproblematic". :-)

    I tend to have no problems with low-level C RTL functions, in
    particular those whose names start with 'mem'.

    *shrug*

    I recall (in early C++ days when there wasn't yet a string type) to
    have based a set of string functions on the mem...() type functions
    (as opposed to the str...() type functions); it wasn't more difficult.
    Rather the effects had been (a) that we could operate binary strings,
    (b) that it was (slightly) faster code, and (c) that some code could
    get even simpler.

    More problems with some
    of those that try to be "higher level", for example, strcat(). Even more
    with those that their designers probably considered 'object-oriented',
    like strtok().

    I don't consider strtok() being 'object-oriented', rather the opposite
    because of the globally static attribute it has. OO objects typically
    carry their own state (unless you deliberately implement a Singleton
    pattern).

    Janis

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 19:21:02 2025
    From Newsgroup: comp.lang.c

    On 16.02.2025 08:32, Kaz Kylheku wrote:

    I would recommend learning about strspn and strcspn, and writing
    your own tokenizing loop:

    Incidentally, in a recent toy project, I used it for parsing simple
    syntax.

    For the code of this thread the strtok() was simpler to use, though.


    The strtok function is ill-suited to many situations. For instance,
    there are situations in which you do want empty tokens, like CSV, such
    that ",abc,def," shows four tokens, two of them empty.

    Sure. (Always use an appropriate solution for any given task.)

    Janis

    [...]

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Sun Feb 16 19:25:46 2025
    From Newsgroup: comp.lang.c

    On 16.02.2025 10:05, Michael S wrote:
    On Sun, 16 Feb 2025 07:32:23 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    [...]
    The strtok function is ill-suited to many situations. For instance,
    there are situations in which you do want empty tokens, like CSV, such
    that ",abc,def," shows four tokens, two of them empty.

    With the strspn and strcspn building blocks, you can easily whip up a
    custom tokenizing loop that has the right semantics for the situation.

    We can also write our loop such that it restores the original
    character that was overwritten in order to null-terminate the token,
    simply by adding *end = more. Thus when the loop ends, the string
    is restored to its original state.

    I can understand code like that above without having to look up
    anything, but if I see strtok or strtok_r code after many years of not
    working with strtok, I will need a refresher on how exactly they
    define a token.

    For parsing something important and relatively well-defined, like
    CSV, I'd very seriously consider the option of not using the standard
    str* utilities at all, with the exception of those where coding your
    own requires special expertise, i.e. primarily strtod(). BTW, even
    strtod() can't be blindly relied on for .csv, because it accepts hex
    floats, while a standard CSV parser has to reject them.
    Most likely, avoiding fgets() is also a good idea in this case.

    I certainly wouldn't call the CSV format "well-defined". But CSV is
    certainly nasty enough to justify using some existing CSV library
    rather than re-inventing the wheel in the first place.

    Janis

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c on Sun Feb 16 20:26:53 2025
    From Newsgroup: comp.lang.c

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 16.02.2025 08:32, Kaz Kylheku wrote:

    I would recommend learning about strspn and strcspn, and writing
    your own tokenizing loop:

    Incidentally, in a recent toy project, I used it for parsing simple
    syntax.

    For the code of this thread the strtok() was simpler to use, though.

    lex/flex isn't that difficult to learn or to use; and quite a bit
    more flexible than hand-rolled tokenizers using str* functions.
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.lang.c on Mon Feb 17 02:50:45 2025
    From Newsgroup: comp.lang.c

    On Sun, 09 Feb 2025 22:41:36 -0800, Keith Thompson wrote:

    if (condition)
    statement1;
    statement2;

    This is why I got into the habit of writing it like

    if (condition)
    {
        statement1;
    } /*if*/
    statement2;

    By the way, the braces are mandatory in Perl. Wonder why?
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Mon Feb 17 11:54:24 2025
    From Newsgroup: comp.lang.c

    On Sun, 16 Feb 2025 19:14:31 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 16.02.2025 09:48, Michael S wrote:
    On Sun, 16 Feb 2025 04:29:20 +0100
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:


    *shrug*

    I recall (in early C++ days when there wasn't yet a string type) to
    have based a set of string functions on the mem...() type functions
    (as opposed to the str...() type functions); it wasn't more difficult.
    Rather the effects had been (a) that we could operate binary strings,
    (b) that it was (slightly) faster code, and (c) that some code could
    get even simpler.


    Janis


    Your first hand experience appears to match mine.
    Then, why *shrug*? Shouldn't you say *nod* or *noddle* ?

  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Mon Feb 17 21:11:29 2025
    From Newsgroup: comp.lang.c

    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:

    On 2/11/25 16:59, Keith Thompson wrote:

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:

    ...

    I just tried it, using gcc and found that fgets() does set the
    first byte of the buffer to a null character. Therefore, it
    doesn't conform to the requirements of the standard. That's not
    particularly surprising - calling fgets with useless arguments
    isn't something that I'd expect to be a high priority on their
    pre-delivery tests.

    As you know, gcc doesn't implement fgets(). Were you using GNU libc?

    Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.

    Here's my test code:

    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[])
    {
        char fill = 1;
        char buffer = fill;
        char *retval = NULL;
        FILE *infile;
        if(argc < 2)
            infile = stdin;
        else{
            infile = fopen(argv[1], "r");
            if(!infile)
            {
                perror(argv[1]);
                return EXIT_FAILURE;
            }
        }

        while((retval = fgets(&buffer, 1, infile)) == &buffer)
        {
            printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
            buffer = fill++;
        }
        if(ferror(infile))
            perror("fgets");

        printf("%p!=%p ferror:%d feof:%d '%c'\n",
            (void*)&buffer, (void*)retval,
            ferror(infile), feof(infile), buffer);
    }

    Note that if fgets() works as it should, that's an infinite loop,
    since no data is read in, and therefore there's no movement through
    the input file. I wrote code that executes after the infinite loop
    just to cover the possibility that it doesn't work that way.

    I get an infinite loop with both glibc and musl on Ubuntu, and under
    Termux on Android (Bionic library implementation):

    $ ./jk < /dev/null | head -n 3
    0:'0'
    0:'0'
    0:'0'
    $ echo hello | ./jk | head -n 3
    -1:'0'
    -1:'0'
    -1:'0'
    $

    With newlib on Cygwin, there is no infinite loop:

    $ ./jk.exe < /dev/null
    0x7ffffcc17!=0x0 ferror:0 feof:0 ''
    $ echo hello | ./jk.exe
    0x7ffffcc17!=0x0 ferror:0 feof:0 ''
    $

    I have an amusing footnote to these trials.

    I wrote a short program to test fgets() under varying length
    arguments. Compiling with gcc on Ubuntu, I was surprised to
    discover the behavior of fgets() with a length argument of 1
    depended on the optimization setting of the compiler -
    using -O0 gave a different result than -O1. Compiling with clang
    gave the same result under both optimization settings.
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Tue Feb 18 20:17:00 2025
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Sat, 15 Feb 2025 08:37:20 -0800
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
    [...]

    Clearly using fgets() is problematic when the input stream might
    contain null characters. To me it seems obvious that the original
    implementors expected that fgets() would not be used in such cases,
    perhaps with the less severe restriction that the presence of
    embedded nulls could be detected and simply rejected as bad input,
    much the same as overly long lines or a final line without a
    terminating newline character.

    My impression is that they didn't spend much time thinking.

    I have no idea how much time was spent designing the fgets()
    interface, nor do I think it's important to know. I understand the
    limitations of fgets() and don't mind using it in circumstances
    where it provides a net positive value.
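
    A sketch of the kind of checking mentioned above: classifying each
    fgets() result as a complete line, an overly long line, a final line
    without a newline, or a line with an embedded null. The enum and the
    name read_line are hypothetical, size is assumed to be at least 2,
    and the test is a strlen()-based heuristic, so a null byte inside an
    overlong or unterminated final line is reported under those
    categories instead.

```c
#include <stdio.h>
#include <string.h>

enum line_status {
    LINE_OK, LINE_TOO_LONG, LINE_NO_NEWLINE, LINE_HAS_NUL, LINE_EOF
};

/* Hypothetical wrapper around fgets() that classifies what was read. */
static enum line_status read_line(char *buf, int size, FILE *fp)
{
    if (fgets(buf, size, fp) == NULL)
        return LINE_EOF;              /* EOF before any char, or I/O error */

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return LINE_OK;               /* complete line */
    if (len == (size_t)size - 1)
        return LINE_TOO_LONG;         /* buffer filled before a newline */
    if (feof(fp))
        return LINE_NO_NEWLINE;       /* last line lacks '\n' */
    return LINE_HAS_NUL;              /* string ends early: embedded null */
}
```

    A caller can then reject anything other than LINE_OK as bad input,
    or treat LINE_NO_NEWLINE as acceptable, depending on the application.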