To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
On Sat 2/8/2025 9:59 PM, Janis Papanagnou wrote:
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
What situation exactly are you talking about? When end-of-file is
encountered _immediately_, before reading the very first character? Or
when end-of-file is encountered after reading something (i.e. when the
last line in the file does not end with a new-line character)?
The former situation is covered by the spec: "If end-of-file is
encountered and no characters have been read into the array, the
contents of the array remain unchanged and a null pointer is returned".
The second situation does not need additional clarifications. Per
general spec as many characters as available before the end-of-file will
be read and then terminated with '\0'. In such case there will be no
new-line character in the buffer.
So, in both cases we are perfectly safe when reading the last line of a
text file, if you don't forget to check the return value of `fgets`.
(This is all under assumption that size limit does not kick in. I
believe your question is not about that.)
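The second case (end-of-file after some characters were read) can be observed from the buffer itself: the string is null-terminated but carries no new-line. A minimal sketch, assuming a hypothetical helper `last_line_terminated` that is not part of the thread and assuming the data contains no embedded null characters:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the thread): consumes `in` up to
 * end-of-file and reports how the final line ended.
 * Returns  1 if the last line ended with '\n',
 *          0 if the stream ended without a final '\n',
 *         -1 if the stream was empty or a read error occurred. */
int last_line_terminated(FILE *in)
{
    char buf[BUFSIZ];
    int seen_any = 0;
    int terminated = 0;

    assert(in != NULL);
    while (fgets(buf, (int)sizeof buf, in) != NULL) {
        seen_any = 1;
        /* On success fgets always null-terminates; a '\n' is present
         * only if a new-line was actually read. */
        terminated = (strchr(buf, '\n') != NULL);
    }
    if (ferror(in) || !seen_any)
        return -1;
    return terminated;
}
```

For a stream holding "one\ntwo" this would report 0; for "abracadabra\n" it would report 1.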
Note also that `fgets` is not permitted to assume that the limit value
(the second parameter) correctly describes the accessible size of the
buffer. E.g. for this reason it is not permitted to zero-out the buffer
before reading. For example, this code is valid and has defined behavior
char buffer[10];
fgets(buffer, 1000, f);
provided the current line of the file fits into `char[10]`. I.e. even
though we "lied" to `fgets` about the limit, it is still required to
work correctly if the actual data fits into the actual buffer.
So, why do you care that "the previous contents of 'buf' are still
existing"?
First; thanks Kaz and Andrey for the replies. - As so often answering
more than I asked or needed. :-)
The provided C standard quote answers my question. - Thanks!
On 09.02.2025 07:23, Andrey Tarasevich wrote:
On Sat 2/8/2025 9:59 PM, Janis Papanagnou wrote:
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
But the man page does not say anything whether this is guaranteed;
it says: "Reading stops after an EOF or a newline.", but it says
nothing about [not] writing to or [not] resetting the buffer.
Is that simple construct safe to get the last line of a text file?
What situation exactly are you talking about? When end-of-file is
encountered _immediately_, before reading the very first character?
Or when end-of-file is encountered after reading something (i.e.
when the last line in the file does not end with new-line
character)?
I have a _coherent_ file, with a few NL terminated lines of text.
Usually I use fgets() in contexts where I process every line, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
operate_on (buf);
}
// here the status of buf[] is usually not important any more
My actual context was different, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
// buf[] contents are ignored here
}
operate_on (buf); // which I assumed contains last line
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
On Sun, 9 Feb 2025 08:13:10 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
I have a _coherent_ file, with a few NL terminated lines of text.
I wonder what you mean by "coherent".
Usually I use fgets() in contexts where I process every line, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
operate_on (buf);
}
// here the status of buf[] is usually not important any more
My actual context was different, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
// buf[] contents are ignored here
}
operate_on (buf); // which I assumed contains last line
It depends on definition of "last line".
What do you consider "last line" of the file in which last character is
not LF?
The one before the last LF or one after? Your code would get
the latter.
On Sat 2/8/2025 11:13 PM, Janis Papanagnou wrote:
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
As Michael already noted it depends on what you consider as the last
piece of valid data in your file.
Say, what do you want to see as "the
last line" in a file that ends with
abracadabra\n<EOF here>
?
Is "abracadabra" the last line? Or is the last line supposed to be empty
in this case?
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
On 09.02.2025 16:27, Andrey Tarasevich wrote:[...]
Say, what do you want to see as "the
last line" in a file that ends with
abracadabra\n<EOF here>
?
Is "abracadabra" the last line? Or is the last line supposed to be empty
in this case?
If "\n" is a string literal (2 characters, '\' and 'n') then it's an incomplete line (as to my standards), if it's meant as a <LF> control character then it's complete. (Similar with <CR> on old Apple/Macs and <CR><LF> on DOS-alike systems.)
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
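The claim can be checked with a small experiment. This is a sketch only, using a hypothetical function `eof_leaves_buffer_unchanged` (not from the thread) that expects a stream with no data left to read:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical check (not from the thread): `in` must have no data left.
 * Fills a scratch buffer with a known pattern, calls fgets, and returns 1
 * if fgets returned NULL and left every byte of the buffer unchanged. */
int eof_leaves_buffer_unchanged(FILE *in)
{
    char buf[16];
    char expected[16];

    assert(in != NULL);
    memset(buf, 'x', sizeof buf);
    memset(expected, 'x', sizeof expected);

    if (fgets(buf, (int)sizeof buf, in) != NULL)
        return 0;               /* something was read after all */
    if (ferror(in))
        return 0;               /* read error: contents are not guaranteed */
    return memcmp(buf, expected, sizeof buf) == 0;
}
```

On an implementation that follows the quoted wording of the standard, calling this on an empty stream should return 1.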
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
[...]
Here's (some of) what the C standard says about text streams:
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires
a terminating new-line character is implementation-defined.
For an implementation that *doesn't* require a new-line on the
last line, a stream without a trailing new-line is valid. For an implementation that *does* require it, such a stream is invalid,
and a program that attempts to process it can have undefined behavior.
Most modern implementations don't require that trailing new-line.
For example, `echo -n hello > hello.txt` creates a valid text file.
Of course a C program that deals with text files can impose any
additional restrictions its author likes.
The above describes how a text stream looks to a C program. The
external representation can be quite different, with transformations
to map between them.
The most common such transformation is
mapping the DOS/Windows CR-LF line terminator to LF on input, and
vice versa on output. Or the external representation might store
each line as a fixed-length character sequence padded with spaces.
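When a file with CR-LF terminators is read on a system whose text mode does no such mapping, the '\r' stays in the buffer that `fgets` fills. A minimal sketch of one way to cope, using a hypothetical helper `chomp` (the name and behavior are assumptions for illustration, not something the thread proposes):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): removes one trailing line
 * terminator, either "\n" or "\r\n", from `line` in place and returns
 * the new length. A lone trailing '\r' is left alone. */
size_t chomp(char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    if (len > 0 && line[len - 1] == '\n') {
        line[--len] = '\0';
        if (len > 0 && line[len - 1] == '\r')
            line[--len] = '\0';
    }
    return len;
}
```

A line without any terminator (the last line of a file lacking a final new-line) passes through unchanged.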
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
To get the last line of a text file I'm using
char buf[BUFSIZ];
while (fgets (buf, BUFSIZ, fd) != NULL)
; // read to last line
If the end of the file is reached my test shows that the previous
contents of 'buf' are still existing (not cleared or overwritten).
Something that has not yet come up (as far as I can see) is that you
might need to handle an empty file. In such a case, nothing gets
written and fgets returns NULL right away. Processing buf in this
situation is then undefined.
One way to handle this is to put into buf something that can't get read
by fgets. Two newlines is a good candidate:
char buf[BUFSIZ] = "\n\n";
You can then test for that if need be, though of course it all depends
on what your application is doing.
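The sentinel idea could look roughly like this. It is a sketch under assumptions: the helper name `last_line_with_sentinel` and the macro `NO_LINE` are made up for illustration, and the reset on read error is an extra precaution not mentioned in the post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* fgets stops after the first '\n', so it can never store two of them. */
#define NO_LINE "\n\n"

/* Hypothetical helper: reads `in` to end-of-file, leaving the last line in
 * `buf`. Returns 1 if a line was read, 0 if the sentinel is still in place
 * (empty stream or read error). `size` must be at least 3 so that the
 * sentinel fits. */
int last_line_with_sentinel(FILE *in, char *buf, int size)
{
    assert(in != NULL && buf != NULL && size >= 3);
    strcpy(buf, NO_LINE);
    while (fgets(buf, size, in) != NULL)
        ;                       /* keep only the most recent line */
    if (ferror(in)) {
        strcpy(buf, NO_LINE);   /* contents are not guaranteed after an error */
        return 0;
    }
    return strcmp(buf, NO_LINE) != 0;
}
```

The caller can then test the return value, or compare `buf` against the sentinel, before processing the line.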
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
From the man page <https://manpages.debian.org/fgets(3)>:
fgets() reads in at most one less than size characters from stream and
stores them into the buffer pointed to by s. Reading stops after an
EOF or a newline. If a newline is read, it is stored into the buffer.
A terminating null byte ('\0') is stored after the last character in
the buffer.
Note there is no qualification like “a terminating null byte is stored after the last character if EOF was not reached”. It’s clear the terminating null byte is *always* stored.
On 10.02.2025 01:57, Keith Thompson wrote:[...]
The above describes how a text stream looks to a C program. The
external representation can be quite different, with transformations
to map between them.
(Concerning this thread; I'm anyway operating on custom data files
in plain text format, so I'm less concerned about how "C" compilers
expect their "C" source.)
The most extreme context I had worked in was a company that allowed
(for every employee) a free choice of used computer technology; that
led to program text files that literally had all the inconsistencies.
Since many files were edited by different folks there were all sorts
of line terminators mixed even in the same one file, and there either
were complete last lines or not. The (some?) IDEs used were tolerant
WRT line terminators and their mixing. Other tools reacted sensibly.
The first thing I've done was to write a "C" tool to detect and fix
these sorts of inconsistencies.
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
The most extreme context I had worked in was a company that allowed
(for every employee) a free choice of used computer technology; that
led to program text files that literally had all the inconsistencies.
Since many files were edited by different folks there were all sorts
of line terminators mixed even in the same one file, and there either
were complete last lines or not. The (some?) IDEs used were tolerant
WRT line terminators and their mixing. Other tools reacted sensibly.
The first thing I've done was to write a "C" tool to detect and fix
these sorts of inconsistencies.
Been there, done that. There seems to be a tendency in the Windows
world to create text files with no terminator on the last line.
In some cases I've been able to translate the source files to a
consistent format. In others, doing so would have created huge
diffs in the source control system, so I left well enough alone.
My preferred editor, vim, handles files with either LF or CRLF line
endings gracefully, but if there's a mix it shows "^M" at the end of
each line that has a Windows-style CRLF ending.
I found a possible
solution, but I haven't bothered using it since I'm not currently
dealing with such files.
<https://vi.stackexchange.com/q/39297/2380>
This is already off-topic, so I won't even mention tabs vs. spaces.
On 10.02.2025 05:37, Keith Thompson wrote:[...]
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
On 10.02.2025 05:37, Keith Thompson wrote:[...]
This is already off-topic, so I won't even mention tabs vs. spaces.
But as Vim users we don't have any issues here; as long as the
indentation is _visibly_ consistent we can fix any tab/space-mix
on the fly and easily with Vim.
Yes, *if* the indentation is visibly consistent.
At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
looked like either:
    if (condition)
        statement1;
    statement2;
or:
    if (condition)
        statement1;
        statement2;
depending on the reader's settings. Of course they're semantically equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.
This kind of thing is why I use only spaces for indentation and curly
braces even when there's only one statement in the block (unless I'm
working under a coding standard that says otherwise).
This manpage, for one example, is in full agreement with the standard
https://www.man7.org/linux/man-pages/man3/fgets.3p.html
A practical experiment demonstrates that [supposedly] POSIX-obeying implementations do not write '\0' into the buffer in "immediate
end-of-file" situations: https://coliru.stacked-crooked.com/a/3e672e6718dd388b
On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:
Under what circumstances is `fgets` expected to return an empty
string (i.e. set the [0] entry of the buffer to '\0' and return
non-null)?
The only answer I can see right away is:
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters is supposed to prevent it from detecting end-of-file condition or I/O
error condition. One can probably do some nitpicking at the current wording... but I believe the above is the intent.
On Sun, 9 Feb 2025 20:11:22 -0800, Andrey Tarasevich wrote:
This manpage, for one example, is in full agreement with the standard
https://www.man7.org/linux/man-pages/man3/fgets.3p.html
Notice these two sentences would seem to contradict one another:
A null byte shall be written immediately after the last byte read
into the array. If the end-of-file condition is encountered before
any bytes are read, the contents of the array pointed to by s
shall not be changed.
On 09.02.2025 11:50, Michael S wrote:
What do you consider "last line" of the file in which last character is
not LF?
I consider missing newlines at the end of any text line as a bug.
(And I'm not inclined to use a weaker word than "bug".) YMMV.
I have a _coherent_ file, with a few NL terminated lines of text.
Usually I use fgets() in contexts where I process every line, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
operate_on (buf);
}
// here the status of buf[] is usually not important any more
My actual context was different, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
// buf[] contents are ignored here
}
operate_on (buf); // which I assumed contains last line
Usually I read and process the data that I got in buf from fgets()
while there *is* data (fgets() != NULL), and I thus don't care any
more about buffer contents validity after the loop (fgets() == NULL).
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
On 10.02.2025 07:41, Keith Thompson wrote:
At a previous job, I reviewed an update whose apparent meaning
differed depending on whether the editor was configured with 4- or
8-column tabstops. I don't remember the exact details, but the code
looked like either:
    if (condition)
        statement1;
    statement2;
or:
    if (condition)
        statement1;
        statement2;
depending on the reader's settings. Of course they're semantically
equivalent, but the first is the way the developer saw it, and the
second is misleading and is the way it looked to me.
Yeah, misleading code is a pain, especially if you have got the job
to fix some error in these incoherent formatted modules. (I suppose
that case is yet more than only misleading if you are programming in
Python where indentation even carries semantics.)
Janis Papanagnou wrote:
I have a _coherent_ file, with a few NL terminated lines of text.
Usually I use fgets() in contexts where I process every line, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
operate_on (buf);
}
// here the status of buf[] is usually not important any more
My actual context was different, like
while (fgets (buf, BUFSIZ, fd) != NULL) {
// buf[] contents are ignored here
}
operate_on (buf); // which I assumed contains last line
...
Usually I read and process the data that I got in buf from fgets()
while there *is* data (fgets() != NULL), and I thus don't care any
more about buffer contents validity after the loop (fgets() == NULL).
But now I wanted to ignore all data that I got for fgets() != NULL
in the loop. And I hoped that *after* the loop the last read data is
still valid.
What does fgets do if the file is completely empty? I may be wrong
(more familiar with Python than C these days), but it doesn't look like
that should be any different from any other end-of-file condition, so presumably the first call to fgets would return NULL, without ever
modifying the buffer. Unless the buffer is initialised (e.g. to an
empty string) before the while loop, that would result in an
uninitialised buffer being passed to operate_on.
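One way to avoid handing an uninitialised buffer to `operate_on` is to initialise it and track whether any call succeeded. A minimal sketch, assuming a hypothetical wrapper `last_line` (the name, the return codes and the empty-string default are illustrative choices, not from the thread):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: like the loop in the question, but reports whether
 * `buf` was ever filled. Returns 1 if at least one fgets call succeeded
 * (buf holds the last line), 0 for an empty stream (buf is ""), and
 * -1 on a read error (buf is reset to ""). */
int last_line(FILE *in, char *buf, int size)
{
    int have_line = 0;

    assert(in != NULL && buf != NULL && size >= 2);
    buf[0] = '\0';              /* defined contents even if nothing is read */
    while (fgets(buf, size, in) != NULL)
        have_line = 1;          /* buf now holds the most recent line */
    if (ferror(in)) {
        buf[0] = '\0';          /* contents are not guaranteed after an error */
        return -1;
    }
    return have_line;
}
```

The caller would then process the buffer only when the return value is 1.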
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no characters can be read into it. Is a trailing null inserted in this case?
On 2/10/25 20:03, Lawrence D'Oliveiro wrote:
On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this
case?
"The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
On Mon 2/10/2025 5:03 PM, Lawrence D'Oliveiro wrote:
On Mon, 10 Feb 2025 13:58:05 -0500, James Kuyper wrote:
I don't see the contradiction. If "no characters are read into the
array", there is no such thing as "the last byte read into the array",
so a null byte has no location where it should be written. Therefore,
there's no reason for changing the contents of the array.
What if the array is only big enough for one byte? In this case, no
characters can be read into it. Is a trailing null inserted in this case?
If by that you mean, "what if the value of 1 is passed as second
argument", then, as I stated in one of my previous messages:
No attempt to read anything from the stream is made, which means that end-of-file or I/O error conditions do not arise (unless, perhaps, the stream was already in error condition) and the [0] byte of the buffer is simply set to '\0'.
On Mon, 10 Feb 2025 22:33:03 -0500, James Kuyper wrote:...
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Have you tried it? I have.
On 2/10/25 23:54, Lawrence D'Oliveiro wrote:
On Mon, 10 Feb 2025 22:33:03 -0500, James Kuyper wrote:...
"The fgets function reads at most one less than the number of characters
specified by n from the stream pointed to by stream into the array
pointed to by s." (7.32.7.2p2)
If the buffer length is 1, "at most one less than the number ...
specified" is 0. Therefore, fgets() cannot read any characters into the
buffer, no matter what the contents of the input stream are. Again,
since there is no "last byte read into the array", there is no location
where a null byte should be written.
Have you tried it? I have.
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising - calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
On Tue, 11 Feb 2025 13:07:53 -0500, James Kuyper wrote:
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard.
GCC is, however, the closest thing we have to a de-facto standard for C.
James Kuyper <jameskuyper@alumni.caltech.edu> writes:...
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU lib
On 2/11/25 16:59, Keith Thompson wrote:
James Kuyper <jameskuyper@alumni.caltech.edu> writes:...
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform to
the requirements of the standard. That's not particularly surprising -
calling fgets with useless arguments isn't something that I'd expect to
be a high priority on their pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU lib
Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.
Here's my test code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    char fill = 1;
    char buffer = fill;
    char *retval = NULL;
    FILE *infile;

    if(argc < 2)
        infile = stdin;
    else{
        infile = fopen(argv[1], "r");
        if(!infile)
        {
            perror(argv[1]);
            return EXIT_FAILURE;
        }
    }

    while((retval = fgets(&buffer, 1, infile)) == &buffer)
    {
        printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
        buffer = fill++;
    }

    if(ferror(infile))
        perror("fgets");
    printf("%p!=%p ferror:%d feof:%d '%c'\n",
        (void*)&buffer, (void*)retval,
        ferror(infile), feof(infile), buffer);
}
Note that if fgets() works as it should, that's an infinite loop, since
no data is read in, and therefore there's no movement through the input
file. I wrote code that executes after the infinite loop just to cover
the possibility that it doesn't work that way.
On 2/11/25 16:47, Lawrence D'Oliveiro wrote:
On Tue, 11 Feb 2025 13:07:53 -0500, James Kuyper wrote:
I just tried it, using gcc and found that fgets() does set the first
byte of the buffer to a null character. Therefore, it doesn't conform
to the requirements of the standard.
GCC is, however, the closest thing we have to a de-facto standard for
C.
I've no interest in de-facto standards. I'm only interested in de-jure standards such as ISO/IEC 9899:2023.
On Sun, 9 Feb 2025 17:22:43 -0800
Andrey Tarasevich <noone@noone.net> wrote:
On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire buffer
remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
The only answer I can see right away is:
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters is
supposed to prevent it from detecting end-of-file condition or I/O
error condition. One can probably do some nitpicking at the current
wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
Michael S <already5chosen@yahoo.com> writes:
On Sun, 9 Feb 2025 17:22:43 -0800
Andrey Tarasevich <noone@noone.net> wrote:
On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
The only answer I can see right away is:
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
On Thu, 13 Feb 2025 07:14:28 -0800
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so, may be, the function was defined this way
for portability with systems where text files have special record-based structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how the user is supposed to figure out how many bytes were read?
In well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted to contain no zeros.
If you're reading non-string data use read/pread/mmap.
On Fri, 14 Feb 2025 15:10:50 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
If you're reading non-string data use read/pread/mmap.
I don't know about you, but in decades of practice I didn't yet
encounter a situation when I can trust a file input with 100%
certainty.
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so, may be, the function was defined this way
for portability with systems where text files have special record-based structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem.
And the main problem is: how the user is
supposed to figure out how many bytes were read?
In well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted to contain no zeros.
When input is arbitrary, finding out the answer is
even harder and requires quirks.
The function foo() is more generic than fgets(). For use instead of
fgets() it should be accompanied by standard constant EOL_CHAR.
I am not completely satisfied with proposed solution. The API is
still less obvious than it could be. But it is much better than fgets().
It would be good if fgets nuked the terminating newline.[...]
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
[ fgets() poorly defined? ]
[...]
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
[...]
Kaz Kylheku <643-408-1753@kylheku.com> writes:
[...]
It would be good if fgets nuked the terminating newline.[...]
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
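One way to keep that information is to look at the character found by strcspn before overwriting it. The sketch below is only an illustration: the helper name read_line and the enum are hypothetical, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line with fgets, strips the newline,
   and reports whether a newline was actually seen. */
enum line_status { LINE_EOF, LINE_COMPLETE, LINE_PARTIAL };

static enum line_status read_line(char *line, int size, FILE *stream)
{
    if (fgets(line, size, stream) == NULL)
        return LINE_EOF;                 /* end-of-file or read error */

    size_t len = strcspn(line, "\n");    /* index of '\n', else of the final '\0' */
    if (line[len] == '\n') {
        line[len] = '\0';                /* complete line: drop the newline */
        return LINE_COMPLETE;
    }
    return LINE_PARTIAL;                 /* too long, or unterminated last line */
}
```

A caller that gets LINE_PARTIAL cannot tell the two cases apart from this alone; calling again and seeing LINE_EOF would suggest an unterminated last line.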
I consider it to be differently; it basically returns a pointer
to work with on the data, and the special NULL pointer value is
just the often seen hack where a special pointer value provides
an error indication.
Typical application (for me) is
if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
// handle error...
else
// process data
Moreover, returning the pointer to the data makes it possible to
(e.g.) nest string processing functions (including 'fgets') or
to chain processing or immediate access/dereference the string
contents.
IMO the 'fgets' function matches the typical interface for such
string functions (in C) allowing such programming language idioms
like the two or three mentioned.
I think it is generally arguable whether code patterns like
if ((line = fgets (buf, BUFSIZ, fd)) == NULL)
can be considered clean code with clean syntax and a clean design.
But not in a "C" language newsgroup where such things are typical
(with this function design) as language specific code pattern.
[ ... ]
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
[...]
We have decided in the C world that text does not contain zeros.
This has become so pervasive that the remaining naysayers can safely
be regarded as part of a lunatic fringe.
Software that tries to support the presence of raw nulls in text is
actively harmful for security.
[...]
Kaz Kylheku <643-408-1753@kylheku.com> writes:
[...]
It would be good if fgets nuked the terminating newline.[...]
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data that was
read in, rather than to the beginning.
The chaining you talk about does
not, in general, work properly if the return value from fgets() is NULL,
or the entire buffer was filled without writing a null character.
On Thu, 13 Feb 2025 07:14:28 -0800
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 9 Feb 2025 17:22:43 -0800
Andrey Tarasevich <noone@noone.net> wrote:
On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:
Under what circumstances `fgets` is expected to return an empty
string? (I.e. set the [0] entry of the buffer to '\0' and return
non-null)?
The only answer I can see right away is:
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so, may be, the function was defined this way
for portability with systems where text files have special record-based structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or failure.
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Appending zero at the end also feels like a hack, but it is necessary
because of the main problem. And the main problem is: how the user is supposed to figure out how many bytes were read?
In well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted to contain no zeros. When input is arbitrary, finding out the answer is
even harder and requires quirks.
On 14.02.2025 18:22, Kaz Kylheku wrote:
[ ... ]
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
This is nice.
In the test code which was the base of this thread I'm relying
on the existing '\n' and use buf[strlen(buf)-1] = '\0'; to
remove the last character.
[...]
We have decided in the C world that text does not contain zeros.
This has become so pervasive that the remaining naysayers can safely be regarded as part of a lunatic fringe.
Software that tries to support the presence of raw nulls in text is actively harmful for security.
Actually, in the same code, I'm also using the strtok() function
to iterate over the buffer to get pointers to the separate tokens;
if I'm not mistaken, that function places '\0' characters in the
buffer to separate the string tokens. This is very efficient and
(since the original buffer data isn't necessary any more) there's
no problems (here) with its data interspersed with '\0'; strings
(the tokens) get accessed through the returned pointers, and the
buffer is just the physical (now sort of "binary") storage.
Janis
[...]
Actually, in the same code, I'm also using the strtok() function
Did you mean /the/ return value (of fgets)?
On 14.02.2025 20:38, James Kuyper wrote:
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data
that was read in, rather than to the beginning.
Yes, that's another option. The language designer have to decide which behavior is more useful. There's pros and cons, IMO.
On the minus side
would be that the origin of the string gets lost that way.
(Of course
you can adjust your code then to keep a copy. But any way implemented,
you need to adjust your code according to how the function is
defined.)
On 2025-02-14, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
[...]
It would be good if fgets nuked the terminating newline.[...]
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids an
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
Then how do you detect a partial line? That can occur either if
the last line doesn't have a terminating newline (on systems that
permit it) or a line that's too long to fit in the array.
I've seen many programs like this don't care. They have some
'char buf[4096]' and that's that.
In a program not required or designed to handle arbitrarily
long lines, you can do something very simple (prior to the
above line[strcspn(line, "\n")] = 0 expression).
- zero-initialize the buffer.
- after every call to fgets, inspect the value of the second-to-last
array element. If the value is neither zero, nor '\n', then somehow
diagnose that a too-long line has been presented to the program,
contrary to its documented limitations.
This will yield a false positive on an unterminated last line. That
issue can be added as a documented limitation, or else the buffer can
be sized one greater than what the documented line length limit
requires, so that the program allows inner lines to be one character
longer than the documented limit, but is strict with regard to an unterminated last line.
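A minimal sketch of that check, assuming a fixed buffer size (BUF_SIZE and read_bounded are hypothetical names). As a variation on the description above, the sketch re-clears the second-to-last element before every call rather than relying on one initial zero-fill, because a previous long line would otherwise leave a stale character there.

```c
#include <stdio.h>

enum { BUF_SIZE = 8 };   /* hypothetical: lines of up to 6 characters plus '\n' fit */

/* Returns 1 if the data read fit with room to spare or ended in '\n',
   0 if the second-to-last element was filled with something else,
   -1 at end-of-file or on a read error. */
static int read_bounded(char buf[BUF_SIZE], FILE *stream)
{
    buf[BUF_SIZE - 2] = '\0';            /* re-arm the sentinel before each call */
    if (fgets(buf, BUF_SIZE, stream) == NULL)
        return -1;
    char last = buf[BUF_SIZE - 2];       /* second-to-last array element */
    return (last == '\0' || last == '\n') ? 1 : 0;
}
```

After a 0 result the remainder of the long line is still in the stream, so the next call returns that tail as if it were a line; a real program would diagnose and stop, or skip to the next newline.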
Michael S <already5chosen@yahoo.com> writes:
On Thu, 13 Feb 2025 07:14:28 -0800
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 9 Feb 2025 17:22:43 -0800
Andrey Tarasevich <noone@noone.net> wrote:
On Sun 2/9/2025 5:06 PM, Andrey Tarasevich wrote:
On Sun 2/9/2025 3:52 PM, Lawrence D'Oliveiro wrote:
On Sat, 8 Feb 2025 23:12:44 -0800, Andrey Tarasevich wrote:
If `fgets` reads nothing (instant end-of-file), the entire
buffer remains untouched.
You mean, only a single null byte gets written.
No. The buffer is not changed at all in such case.
... which actually raises an interesting quiz/puzzle/question:
Under what circumstances `fgets` is expected to return an
empty string? (I.e. set the [0] entry of the buffer to '\0' and
return non-null)?
The only answer I can see right away is:
When one calls it as `fgets(buffer, 1, file)`, i.e. asks it to
read 0 characters.
This is under assumption that asking `fgets` to read 0 characters
is supposed to prevent it from detecting end-of-file condition or
I/O error condition. One can probably do some nitpicking at the
current wording... but I believe the above is the intent.
fgets() is one of many poorly defined standard library functions
inherited from early UNIX days. [...]
What about the fgets() function do you think is poorly defined?
Second question: by "poorly defined" do you mean "defined
wrongly" or "defined ambiguously" (or both)?
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
I don't know the history, so, may be, the function was defined this
way for portability with systems where text files have special
record-based structure?
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or
failure. So why did they encode this information in baroque way
instead of something obvious, 0 and 1?
Appending zero at the end also feels like a hack, but it is
necessary because of the main problem. And the main problem is:
how the user is supposed to figure out how many bytes were read?
In well-designed API this question should be answered in O(1) time.
With fgets(), it can be answered in O(N) time when input is trusted
to contain no zeros. When input is arbitrary, finding out the
answer is even harder and requires quirks.
If I understand you correctly your complaint is that the existing
semantics are not as useful as you would like them to be, even
though the current definition does make the behavior well defined.
Is that right?
Clearly using fgets() is problematic when the input stream might
contain null characters. To me it seems obvious that the original implementors expected that fgets() would not be used in such cases,
perhaps with the less severe restriction that the presence of
embedded nulls could be detected and simply rejected as bad input,
much the same as overly long lines or a final line without a
terminating newline character.
On 2025-02-14, Michael S <already5chosen@yahoo.com> wrote:
For starter, it looks like designers of fgets() did not believe in
their own motto about files being just streams of bytes.
They obviously did, which is exactly why they painstakingly preserved
the annoying line terminators in the returned data.
I don't know the history, so, may be, the function was defined this
way for portability with systems where text files have special
record-based structure?
You are sliding into muddled thinking here.
Then, everything about it feels inelegant.
A return value carries just 1 bit of information, success or
failure.
Why would you assert a claim for which the standard library alone
is replete with counterexamples: getchar, malloc, getenv, pow, sin.
Did you mean /the/ return value (of fgets)?
So why did they encode this information in baroque way instead of
something obvious, 0 and 1?
Because you can express this concept:
char work_area[SIZE];
char *line;
while ((line = fgets(work_area, sizeof work_area, stream)))
{
/* process line */
}
The work_area just provides storage for the operation: line is the
returned line.
The loop would work even if fgets sometimes returned pointers that
are not to the first byte of work_area. It just so happens that
they always are.
It is meaningful to capture the returned value and work with
it as if it were distinct from the buffer.
Appending zero at the end also feels like a hack, but it is
necessary because of the main problem.
Appending zero is necessary so that the result meets the definition
of a C character string, without which it cannot be passed into string-manipulating functions like strlen.
Home-grown functions that resemble fgets, but forget to add a null
byte sometimes, are the subjects of security CVEs.
And the main problem is: how the user is
supposed to figure out how many bytes were read?
Yes, how are they, if you take away the null byte?
In well-designed API this question should be answered in O(1) time.
In the context of C strings, that buys you almost nothing.
Even if you know the length, it's going to get measured numerous
more times.
It would be good if fgets nuked the terminating newline.
Many uses of fgets, after every operation, look for the newline
and nuke it, before doing anything else.
There is a nice idiom for that, by the way, which avoids a
temporary variable and if test:
line[strcspn(line, "\n")] = 0;
strcspn(line, "\n") calculates the length of the prefix of line
which consists of non-newlines. That value is precisely the
array index of the first newline, if there is one, or else
of the terminating null, if there isn't a newline. Either
way, you can clobber that with a null.
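Both cases can be seen in a small sketch. The wrapper name chomp is hypothetical; it only packages the one-line idiom and returns the resulting length.

```c
#include <string.h>

/* Hypothetical wrapper around the idiom; returns the resulting length. */
static size_t chomp(char *line)
{
    size_t n = strcspn(line, "\n");  /* index of first '\n', else of the '\0' */
    line[n] = '\0';                  /* overwrites the newline, or rewrites the terminator */
    return n;
}
```

With "hello\n" the newline at index 5 is overwritten; with "no newline" the existing terminator at index 10 is simply written again, so no test is needed.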
Once you see the above, you will never do this again:
newline = strchr(line, '\n');
if (newline)
*newline = 0;
With fgets(), it can be answered in O(N) time when input is trusted
to contain no zeros.
We have decided in the C world that text does not contain zeros.
This has become so pervasive that the remaining naysayers can safely
be regarded as part of a lunatic fringe.
Software that tries to support the presence of raw nulls in text is
actively harmful for security.
For instance, a piece of text with embedded nulls might have valid
overall syntax which makes it immune to an injection attack.
But when it is sent to another piece of software which interprets
the null as a terminator, the syntax is chopped in half, allowing
it to be completed by a malicious actor.
When input is arbitrary, finding out the answer is
even harder and requires quirks.
When input is arbitrary, don't use fgets? It's for text.
The function foo() is more generic than fgets(). For use instead of
fgets() it should be accompanied by standard constant EOL_CHAR.
I am not completely satisfied with proposed solution. The API is
still less obvious than it could be. But it is much better than
fgets().
If last_c is '\n', you're still writing the pesky newline that
the caller will often want to remove.
Adding a terminating null and returning a pointer to that null
would be better.
You could then call the operation again with the returned dst
pointer, and it would continue extending the string,
without obliterating the last character.
I'm sure I've seen a foo-like function in software before:
reading delimited by an arbitrary byte, with length signaling.
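The "return a pointer to the terminating null" idea can be approximated on top of the existing fgets. This is only a sketch: fgets_end is a hypothetical name, and it assumes the input contains no embedded null characters, since it finds the end with strlen.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: like fgets, but on success returns a pointer to the
   terminating null (NULL at end-of-file or on error), so that the next call
   can continue extending the string from there. */
static char *fgets_end(char *dst, int size, FILE *stream)
{
    if (fgets(dst, size, stream) == NULL)
        return NULL;
    return dst + strlen(dst);
}
```

A caller would pass the returned pointer, together with the remaining space, as the destination of the next call; the first call's length is then simply the pointer difference.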
On Fri, 14 Feb 2025 20:51:38 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
If you only care about POSIX targets, then I'd recommend avoiding strtok
and using strtok_r().
One obvious possibility is to return # of characters read instead of
pointer. Then 0 can mean EOF and negative values can mean I/O errors.
But that is also not sufficiently boring.
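For illustration, a count-returning interface of that kind could look roughly like the sketch below. The function read_line_count and its exact conventions are hypothetical; it is built on getc() and is not any existing library function.

```c
#include <stdio.h>

/* Hypothetical alternative interface: stores up to size-1 characters plus a
   terminating null and returns the number of characters stored, 0 at
   end-of-file, or -1 on a read error (or if size is 0).
   Caveat: with size == 1 nothing can be stored, so 0 is returned without
   reading, which is indistinguishable from end-of-file. */
static long read_line_count(char *dst, size_t size, FILE *stream)
{
    size_t n = 0;
    int c = 0;

    if (size == 0)
        return -1;                       /* no room even for the terminator */
    while (n + 1 < size && (c = getc(stream)) != EOF) {
        dst[n++] = (char)c;
        if (c == '\n')
            break;                       /* stop after the line terminator */
    }
    dst[n] = '\0';
    if (c == EOF && ferror(stream))
        return -1;
    return (long)n;
}
```

Because the length comes back directly, the caller does not depend on the data being free of zero bytes to learn how much was read.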
On Fri, 14 Feb 2025 21:02:19 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 14.02.2025 20:38, James Kuyper wrote:
As with several of the string processing functions, I think fgets()
would be better if it returned a pointer to the end of the data
that was read in, rather than to the beginning.
Yes, that's another option. The language designer have to decide which
behavior is more useful. There's pros and cons, IMO.
IMHO, there are no cons.
Returning pointer to the end of data is very obviously superior.
On the minus side
would be that the origin of the string gets lost that way.
Huh?
How could you lose something you just passed to the function?
In most typical code, it's not even a complex expression or pointer,
but name of array.
(Of course
you can adjust your code then to keep a copy. But any way implemented,
you need to adjust your code according to how the function is
defined.)
On Fri, 14 Feb 2025 20:51:38 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
If you only care about POSIX targets, then I'd recommend avoiding strtok
and using strtok_r().
No. What makes strtok() problematic can come up without any use of
threads. Consider for the moment a bug I had to investigate. A function
that was looping through strtok() calls to parse a string called a
utility function during each pass through the loop. The utility function
also called strtok() in a loop to parse an entirely different string for
a different purpose. Exercise for the student: figure out what the consequences were.
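A reduced version of that situation, as a sketch with hypothetical function names. The utility below finishes its own strtok loop, which leaves the single hidden position exhausted, so I would expect the outer loop to stop after its first token. The second variant keeps the outer position in an explicit variable via POSIX strtok_r.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r on strict compilers */
#include <string.h>

/* Hypothetical utility: counts comma-separated fields using strtok. */
static int count_fields(char *s)
{
    int n = 0;
    for (char *t = strtok(s, ","); t != NULL; t = strtok(NULL, ","))
        n++;
    return n;
}

/* Outer strtok loop that calls the utility on each pass. The inner calls
   overwrite strtok's hidden state, so the outer strtok(NULL, " ") no
   longer continues in s. */
static int count_words_broken(char *s, char *other)
{
    int n = 0;
    for (char *t = strtok(s, " "); t != NULL; t = strtok(NULL, " ")) {
        n++;
        count_fields(other);
    }
    return n;
}

/* Same loop with the outer position held in 'save', independent of strtok. */
static int count_words_fixed(char *s, char *other)
{
    int n = 0;
    char *save;
    for (char *t = strtok_r(s, " ", &save); t != NULL;
         t = strtok_r(NULL, " ", &save)) {
        n++;
        count_fields(other);
    }
    return n;
}
```

No threads are involved anywhere; the interference comes purely from two loops sharing one hidden position.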
On 15.02.2025 18:29, Michael S wrote:
On Fri, 14 Feb 2025 20:51:38 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
I know that it's not thread-safe. (You can't miss that information
if you look up the man page to inspect the function interface.)
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r().
But since I don't use threads - neither here nor have I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?
Moreover, I prefer functions with a simpler interface to functions
with a more clumsy one (I mean the 'char **saveptr' part); so why
use the complex one in the first place if it just complicates its
use and reduces the code clarity unnecessarily.
Re "more problematic functions in C library"...
I had to chuckle on that; if you're coming from other languages
most "C" functions - especially the low-level "C" functions that
operate on memory with pointers - don't look "unproblematic". :-)
Janis
On 2025-02-15, Michael S <already5chosen@yahoo.com> wrote:
On Fri, 14 Feb 2025 20:51:38 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
Actually, in the same code, I'm also using the strtok() function
strtok() is one of the relatively small set of more problematic
functions in C library that are not thread-safe.
The design of the strtok() API is not inherently unsafe against
threads; but it requires thread-local storage to be safe.
Since ISO C has threads now, it now takes the opportunity to
explicitly remove any requirements for thread safety in strtok.
However, it is possible for an implementation to step forward and
make it thread safe. For instance, in a POSIX system, a
thread-specific key can be allocated for strtok on library
initialization, or the first use of strtok (via pthread_once).
static pthread_key_t strtok_key;
// ...
if (pthread_key_create(&strtok_key, NULL))
...
Then strtok does
char *strtok (char * restrict str, const char * restrict delim)
{
if (str == NULL)
str = pthread_getspecific(strtok_key);
...
// all return paths do this, if str has changed:
pthread_setspecific(strtok_key, str);
return ...;
}
Only problem is that this will not perform anywhere near as well as
strtok_r, which specifies an inexpensive location for the context
pointer.
If you only care about POSIX targets, then I'd recommend avoiding
strtok and using strtok_r().
I would recommend learning about strspn and strcspn, and writing
your own tokenizing loop:
/* strtok-like loop: input variables are str and delim */
for (;;) {
/* skip delim chars to find start of tok */
char *tok = str + strspn(str, delim);
/* tokens must be nonempty */
if (*tok == 0)
break;
/* OK; tok points to non-delim char.
Find end of token: skip span of non-delim chars. */
char *end = tok + strcspn(tok, delim);
/* Record whether the end of the token is the end
of the string. */
char more = *end;
/* null-terminate token */
*end = 0;
{ /* process tok here */ }
if (!more)
break;
/* If there is more material after the tok, point
str there and continue */
str = end + 1;
}
The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
With the strspn and strcspn building blocks, you can easily whip up a
custom tokenizing loop that has the right semantics for the situation.
We can also write our loop such that it restores the original
character that was overwritten in order to null-terminate the token,
simply by adding *end = more. Thus when the loop ends, the string
is restored to its original state.
I can understand code like that above without having to look up
anything, but if I see strtok or strtok_r code after many years of not working with strtok, I will need a refresher on how exactly they
define a token.
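As an illustration of the CSV-like case, here is a sketch of a loop built on strcspn alone that keeps empty fields. The function split_keep_empty and its parameters are hypothetical; it splits in place and does not restore the overwritten delimiters, since the returned pointers rely on them.

```c
#include <string.h>

/* Hypothetical sketch: splits str in place on any character in delim,
   keeping empty fields, and stores up to max field pointers in out.
   Returns the number of fields found (which may exceed max). */
static size_t split_keep_empty(char *str, const char *delim,
                               char **out, size_t max)
{
    size_t n = 0;
    for (;;) {
        char *end = str + strcspn(str, delim);  /* end of this, possibly empty, field */
        char more = *end;                       /* 0 if this was the last field */
        *end = '\0';                            /* null-terminate the field */
        if (n < max)
            out[n] = str;
        n++;
        if (!more)
            break;
        str = end + 1;                          /* continue after the delimiter */
    }
    return n;
}
```

For the string ",abc,def," with delimiter "," this yields four fields, the first and last of them empty. Unlike the strtok-like loop, there is no strspn step, because skipping runs of delimiters is exactly what merges empty fields away.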
On 2/15/25 22:29, Janis Papanagnou wrote:
But since I don't use threads - neither here nor have I ever needed
them generally in my "C" contexts - that's unnecessary. Isn't it?
No. What makes strtok() problematic can come up without any use of
threads. Consider for the moment a bug I had to investigate. A function
that was looping through strtok() calls to parse a string called a
utility function during each pass through the loop. The utility function
also called strtok() in a loop to parse an entirely different string for
a different purpose. [...]
On Sun, 16 Feb 2025 04:29:20 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
Moreover, I prefer functions with a simpler interface to functions
with a more clumsy one (I mean the 'char **saveptr' part); so why
use the complex one in the first place if it just complicates its
use and reduces the code clarity unnecessarily.
I don't see how explicit context variable can be considered less clear
than context hidden within library in non-obvious way [...]
Re "more problematic functions in C library"...
I had to chuckle on that; if you're coming from other languages
most "C" functions - especially the low-level "C" functions that
operate on memory with pointers - don't look "unproblematic". :-)
I tend to have no problems with low-level C RTL functions, in
particular those with names start with 'mem'.
More problems with some
of those that try to be "higher level", for example, strcat(). Even more
with those that their designers probably considered 'object-oriented',
like strtok().
I would recommend learning about strspn and strcspn, and writing
your own tokenizing loop:
The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
[...]
On Sun, 16 Feb 2025 07:32:23 -0000 (UTC)
Kaz Kylheku <643-408-1753@kylheku.com> wrote:
[...]
The strtok function is ill-suited to many situations. For instance,
there are situations in which you do want empty tokens, like CSV, such
that ",abc,def," shows four tokens, two of them empty.
With the strspn and strcspn building blocks, you can easily whip up a
custom tokenizing loop that has the right semantics for the situation.
We can also write our loop such that it restores the original
character that was overwritten in order to null-terminate the token,
simply by adding *end = more. Thus when the loop ends, the string
is restored to its original state.
I can understand code like that above without having to look up
anything, but if I see strtok or strtok_r code after many years of not
working with strtok, I will need a refresher on how exactly they
define a token.
For parsing of something important and relatively well-defined, like
CSV, I'd very seriously consider option of not using standard str*
utilities at all, with exception of those, where coding your own
requires special expertise, i.e. primarily strtod(). BTW, even strtod()
can't be blindly relied on for .csv, because it accepts hex floats,
while standard CSV parser has to reject them.
Most likely, avoiding fgets() is also a good idea in this case.
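The hex-float point can be checked directly: since C99, strtod accepts input such as "0x1p3" and returns 8. The sketch below wraps strtod in a stricter check. The helper parse_decimal_field and its rejection rule (no 'x' or 'X' anywhere in the field) are assumptions for illustration only; a real CSV dialect may also need to reject "inf", "nan", leading whitespace and more.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical strict wrapper: returns 1 and stores the value if field is
   non-empty, contains no 'x'/'X' (which would allow a hex float), and is
   consumed completely by strtod; returns 0 otherwise. */
static int parse_decimal_field(const char *field, double *value)
{
    char *end;

    if (*field == '\0' || strpbrk(field, "xX") != NULL)
        return 0;                    /* empty, or looks like a hex float */
    *value = strtod(field, &end);
    return *end == '\0';             /* require that the whole field was used */
}
```

The check on the end pointer also rejects fields with trailing junk such as "12abc", which plain strtod would silently accept as 12.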
On 16.02.2025 08:32, Kaz Kylheku wrote:
I would recommend learning about strspn and strcspn, and writing
your own tokenizing loop:
Incidentally, in a recent toy project, I used it for parsing simple
syntax.
For the code of this thread the strtok() was simpler to use, though.
if (condition)
statement1;
statement2;
On 16.02.2025 09:48, Michael S wrote:
On Sun, 16 Feb 2025 04:29:20 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
*shrug*
I recall (in early C++ days when there wasn't yet a string type) to
have based a set of string functions on the mem...() type functions
(as opposed to the str...() type functions); it wasn't more difficult.
Rather the effects had been (a) that we could operate binary strings,
(b) that it was (slightly) faster code, and (c) that some code could
get even simpler.
Janis
James Kuyper <jameskuyper@alumni.caltech.edu> writes:
On 2/11/25 16:59, Keith Thompson wrote:
James Kuyper <jameskuyper@alumni.caltech.edu> writes:
...
I just tried it, using gcc and found that fgets() does set the
first byte of the buffer to a null character. Therefore, it
doesn't conform to the requirements of the standard. That's not
particularly surprising - calling fgets with useless arguments
isn't something that I'd expect to be a high priority on their
pre-delivery tests.
As you know, gcc doesn't implement fgets(). Were you using GNU libc?
Yes. To be specific, Ubuntu GLIBC 2.35-0ubuntu3.9.
Here's my test code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    char fill = 1;
    char buffer = fill;
    char *retval = NULL;
    FILE *infile;

    if(argc < 2)
        infile = stdin;
    else{
        infile = fopen(argv[1], "r");
        if(!infile)
        {
            perror(argv[1]);
            return EXIT_FAILURE;
        }
    }

    while((retval = fgets(&buffer, 1, infile)) == &buffer)
    {
        printf("%ld:'%u'\n", ftell(infile), (unsigned)buffer);
        buffer = fill++;
    }

    if(ferror(infile))
        perror("fgets");

    printf("%p!=%p ferror:%d feof:%d '%c'\n",
        (void*)&buffer, (void*)retval,
        ferror(infile), feof(infile), buffer);
}
Note that if fgets() works as it should, that's an infinite loop,
since no data is read in, and therefore there's no movement through
the input file. I wrote code that executes after the infinite loop
just to cover the possibility that it doesn't work that way.
I get an infinite loop with both glibc and musl on Ubuntu, and under
Termux on Android (Bionic library implementation):
$ ./jk < /dev/null | head -n 3
0:'0'
0:'0'
0:'0'
$ echo hello | ./jk | head -n 3
-1:'0'
-1:'0'
-1:'0'
$
With newlib on Cygwin, there is no infinite loop:
$ ./jk.exe < /dev/null
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$ echo hello | ./jk.exe
0x7ffffcc17!=0x0 ferror:0 feof:0 ''
$
On Sat, 15 Feb 2025 08:37:20 -0800[...]
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Clearly using fgets() is problematic when the input stream might
contain null characters. To me it seems obvious that the original
implementors expected that fgets() would not be used in such cases,
perhaps with the less severe restriction that the presence of
embedded nulls could be detected and simply rejected as bad input,
much the same as overly long lines or a final line without a
terminating newline character.
My impression is that they didn't spend much time thinking.
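For comparison, the three conditions listed above (embedded nulls, overly long lines, a final line without a newline) are easy to detect when reading with getc() instead of fgets(). This is only a sketch; `read_line` and the `line_status` codes are hypothetical names, not anything proposed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes for read_line(). */
enum line_status {
    LINE_OK,          /* complete line; the newline is not stored */
    LINE_NO_NEWLINE,  /* final line ended at EOF without '\n' */
    LINE_HAS_NUL,     /* line contained an embedded null character */
    LINE_TOO_LONG,    /* line did not fit; the excess was discarded */
    LINE_EOF          /* nothing left to read (check ferror() as well) */
};

/* Reads one line into buf (at most size - 1 characters plus '\0') and
 * stores the number of characters kept in *len. The whole line is always
 * consumed, so the next call starts at the next line. */
enum line_status read_line(FILE *in, char *buf, size_t size, size_t *len)
{
    assert(in != NULL && buf != NULL && size > 0 && len != NULL);

    size_t n = 0;
    int c;
    int saw_nul = 0, too_long = 0;

    while ((c = getc(in)) != EOF && c != '\n') {
        if (c == '\0')
            saw_nul = 1;
        if (n + 1 < size)
            buf[n++] = (char)c;
        else
            too_long = 1;       /* keep consuming to the end of the line */
    }
    buf[n] = '\0';
    *len = n;                   /* true stored length, even with nulls */

    if (c == EOF && n == 0 && !too_long)
        return LINE_EOF;
    if (too_long)
        return LINE_TOO_LONG;
    if (saw_nul)
        return LINE_HAS_NUL;
    return c == '\n' ? LINE_OK : LINE_NO_NEWLINE;
}
```

Since the stored length is returned separately, the caller never has to call strlen() on the buffer, which is where embedded nulls cause trouble with fgets().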