• Re: Help with Streaming and Chunk Processing for Large JSON Data (60GB) from Kenna API

    From Asif Ali Hirekumbi@asifali.ha@gmail.com to comp.lang.python on Mon Sep 30 12:11:30 2024
    From Newsgroup: comp.lang.python

    Thanks Abdur Rahmaan.
    I will give it a try !
    Thanks
    Asif
    On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer < arj.python@gmail.com> wrote:
    Idk if you tried Polars, but it seems to work well with JSON data

    import polars as pl
    pl.read_json("file.json")

    Kind Regards,

    Abdur-Rahmaan Janhangeer
    about <https://compileralchemy.github.io/> | blog <https://www.pythonkitchen.com>
    github <https://github.com/Abdur-RahmaanJ>
    Mauritius


    On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list < python-list@python.org> wrote:

    Dear Python Experts,

    I am working with the Kenna Application's API to retrieve vulnerability
    data. The API endpoint provides a single, massive JSON file in gzip
    format, approximately 60 GB in size. Handling such a large dataset in
    one go is proving to be quite challenging, especially in terms of
    memory management.

    I am looking for guidance on how to efficiently stream this data and
    process it in chunks using Python. Specifically, I am wondering if
    there’s a way to use the requests library or any other libraries that
    would allow us to pull data from the API endpoint in a memory-efficient
    manner.

    Here are the relevant API endpoints from Kenna:

    - Kenna API Documentation
    <https://apidocs.kennasecurity.com/reference/welcome>
    - Kenna Vulnerabilities Export
    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>

    If anyone has experience with similar use cases or can offer any
    advice, it would be greatly appreciated.

    Thank you in advance for your help!

    Best regards
    Asif Ali
    --
    https://mail.python.org/mailman/listinfo/python-list


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 10:41:44 2024
    From Newsgroup: comp.lang.python

    Whether and to what degree you can stream JSON depends on the JSON
    structure. In the most general case JSON cannot be streamed, but
    commonly it can be.

    Imagine a pathological case of this shape: 1... <60GB of digits>. This
    is still valid JSON (the format doesn't limit how many digits a number
    can have). And you cannot parse this number in a streaming way, because
    in order to do that you need to start with the least significant digit.

    Typically, however, JSON can be parsed incrementally. The format is
    conceptually very simple to write a parser for. There are plenty of
    parsers that do that, for example this one:
    https://pypi.org/project/json-stream/ . But I'd encourage you to do it
    yourself. It's fun, the resulting parser should end up under some 50
    LoC, and it lets you incorporate your desired output more closely into
    your parser.
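
    For a flavor of the do-it-yourself route, here is a minimal sketch
    that walks a top-level JSON array and yields one parsed element at a
    time by tracking bracket depth. It assumes well-formed input and is a
    sketch, not a validating parser:

    import json

    def iter_array_items(chunks):
        """Yield parsed elements of a top-level JSON array from a stream
        of text chunks, holding only one element in memory at a time."""
        depth, in_str, esc, buf = 0, False, False, []
        for chunk in chunks:
            for ch in chunk:
                if in_str:                      # inside a string literal
                    buf.append(ch)
                    if esc:
                        esc = False
                    elif ch == "\\":
                        esc = True
                    elif ch == '"':
                        in_str = False
                elif ch == '"':
                    in_str = True
                    buf.append(ch)
                elif ch in "[{":
                    depth += 1
                    if depth > 1:               # skip the outermost '['
                        buf.append(ch)
                elif ch in "]}":
                    if depth > 1:
                        buf.append(ch)
                    depth -= 1
                    if depth == 0 and "".join(buf).strip():
                        yield json.loads("".join(buf))  # last element
                        buf.clear()
                elif ch == "," and depth == 1:  # separator between elements
                    yield json.loads("".join(buf))
                    buf.clear()
                elif depth >= 1:
                    buf.append(ch)

    # list(iter_array_items(['[{"a": 1}, [2, ', '3], "x,y"]']))
    # -> [{'a': 1}, [2, 3], 'x,y']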
    On Mon, Sep 30, 2024 at 8:44 AM Asif Ali Hirekumbi via Python-list <python-list@python.org> wrote:

    Thanks Abdur Rahmaan.
    I will give it a try !

    Thanks
    Asif

    On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer < arj.python@gmail.com> wrote:

    Idk if you tried Polars, but it seems to work well with JSON data

    import polars as pl
    pl.read_json("file.json")

    Kind Regards,

    Abdur-Rahmaan Janhangeer
    about <https://compileralchemy.github.io/> | blog <https://www.pythonkitchen.com>
    github <https://github.com/Abdur-RahmaanJ>
    Mauritius


    On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list < python-list@python.org> wrote:

    Dear Python Experts,

    I am working with the Kenna Application's API to retrieve vulnerability
    data. The API endpoint provides a single, massive JSON file in gzip
    format, approximately 60 GB in size. Handling such a large dataset in
    one go is proving to be quite challenging, especially in terms of
    memory management.

    I am looking for guidance on how to efficiently stream this data and
    process it in chunks using Python. Specifically, I am wondering if
    there’s a way to use the requests library or any other libraries that
    would allow us to pull data from the API endpoint in a memory-efficient
    manner.

    Here are the relevant API endpoints from Kenna:

    - Kenna API Documentation
    <https://apidocs.kennasecurity.com/reference/welcome>
    - Kenna Vulnerabilities Export
    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>

    If anyone has experience with similar use cases or can offer any
    advice, it would be greatly appreciated.

    Thank you in advance for your help!

    Best regards
    Asif Ali
    --
    https://mail.python.org/mailman/listinfo/python-list


    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Barry@barry@barrys-emacs.org to comp.lang.python on Mon Sep 30 16:30:19 2024
    From Newsgroup: comp.lang.python


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")


    This is not going to work unless the computer has a lot more than 60 GiB of RAM. As later suggested, a streaming parser is required.
    Barry
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 12:11:46 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 14:28:33 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 11:44:50 -0400,
    Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
    Whether and to what degree you can stream JSON depends on the JSON
    structure. In the most general case JSON cannot be streamed, but
    commonly it can be.

    Imagine a pathological case of this shape: 1... <60GB of digits>. This
    is still valid JSON (the format doesn't limit how many digits a number
    can have). And you cannot parse this number in a streaming way, because
    in order to do that you need to start with the least significant digit.

    Which is how Arabic numerals were originally parsed, but when
    westerners adopted them from a R->L written language, they didn't flip
    them around to match the L->R written language into which they were
    being adopted.

    Interesting.

    So now long numbers can't be parsed as a stream in software. They
    should have anticipated this problem back in the 13th century and
    flipped the numbers around.

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?
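
    For illustration, that handwaved approach as a minimal Python sketch
    (plain unsigned decimal integers only; no sign, fraction, or exponent
    handling):

    def accumulate_int(chars):
        """Parse a decimal integer most-significant-digit first from an
        iterator of characters, stopping at the first non-digit."""
        value = 0
        for ch in chars:
            if not ch.isdigit():
                return value, ch               # hand back the terminator
            value = value * 10 + int(ch)       # shift left one decimal place
        return value, ""

    # accumulate_int(iter("12345,")) -> (12345, ",")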
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Tue Oct 1 04:46:35 2024
    From Newsgroup: comp.lang.python

    On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:

    But why do I need to start with the least
    significant digit?

    If you start from the most significant, you don't know anything about
    the number until you finish parsing it. There's almost nothing you can
    say about a number given that it starts with a particular sequence
    (since you don't know how MANY digits there are). However, if you know
    the LAST digits, you can make certain statements about it (trivial
    examples being whether it's odd or even).

    It's not very, well, significant. But there's something to it. And it
    extends nicely to p-adic numbers, which can have an infinite number of
    nonzero digits to the left of the decimal point:

    https://en.wikipedia.org/wiki/P-adic_number

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 13:57:05 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 1:00 PM, Chris Angelico via Python-list wrote:
    On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    Streaming gzip is perfectly possible. You may be thinking of PKZip,
    which has its EOCD (End of Central Directory record) at the end of the
    file (although it may still be possible to stream-decompress if you
    work at it).

    ChrisA

    You're right, that's what I was thinking of.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 21:30:06 2024
    From Newsgroup: comp.lang.python

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    GZip is specifically designed to be streamed. So that's not a problem
    (in principle), but you would need a streaming GZip parser; a quick
    search on PyPI revealed this package:
    https://pypi.org/project/gzip-stream/ .
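
    For what it's worth, streaming gunzip doesn't even need a third-party
    package; here is a minimal sketch with the standard library's zlib on
    top of requests (the URL is a placeholder, not the real Kenna
    endpoint):

    import zlib
    import requests  # third-party: pip install requests

    def iter_decompressed(url):
        """Yield decompressed chunks of a gzipped HTTP response without
        ever holding the whole body in memory."""
        decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB
                yield decomp.decompress(chunk)
            yield decomp.flush()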
    On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 14:05:36 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    There is also the json-stream library, on PyPI at

    https://pypi.org/project/json-stream/
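
    A minimal usage sketch (the file name is hypothetical, and the
    top-level value is assumed to be a JSON array):

    import json_stream  # third-party: pip install json-stream

    # Iterate a huge top-level JSON array one element at a time,
    # without materializing the whole document in memory.
    with open("vulns.json") as f:
        for item in json_stream.load(f):  # transient, single-pass view
            record = json_stream.to_standard_types(item)  # plain dict/list
            print(record)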


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 18:16:03 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 04:46:35 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:

    But why do I need to start with the least
    significant digit?

    If you start from the most significant, you don't know anything about
    the number until you finish parsing it. There's almost nothing you can
    say about a number given that it starts with a particular sequence
    (since you don't know how MANY digits there are). However, if you know
    the LAST digits, you can make certain statements about it (trivial
    examples being whether it's odd or even).

    But that wasn't the question. Sure, under certain circumstances and for specific use cases and/or requirements, there might be arguments to read potential numbers as strings and possibly not have to parse them
    completely before accepting or rejecting them.

    And if I start with the least significant digit and the number happens
    to be written in scientific notation and/or has a decimal point, then I
    can't tell whether it's odd or even until I further process the whole
    thing anyway.

    It's not very, well, significant. But there's something to it. And it
    extends nicely to p-adic numbers, which can have an infinite number of
    nonzero digits to the left of the decimal point:

    https://en.wikipedia.org/wiki/P-adic_number

    In Common Lisp, integers can be written in any integer base from two to
    thirty six, inclusive. So knowing the last digit doesn't tell you
    whether an integer is even or odd until you know the base anyway.

    Curiously, we agree: if you move the goal posts arbitrarily, then
    some algorithms that parse JSON numbers will fail.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Tue Oct 1 09:09:07 2024
    From Newsgroup: comp.lang.python

    On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    In Common Lisp, integers can be written in any integer base from two
    to thirty six, inclusive. So knowing the last digit doesn't tell
    you whether an integer is even or odd until you know the base
    anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.

    The only part I'm not clear on is what identifies the base. If you're
    going to write numbers little-endian, it's not that hard to also write
    them with a base indicator before the digits. But, whatever. This is a
    typical tangent and people are argumentative for no reason. I was just
    trying to add some explanatory notes to why little-endian does make
    more sense than big-endian.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 20:06:57 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 09:09:07 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    In Common Lisp, integers can be written in any integer base from two
    to thirty six, inclusive. So knowing the last digit doesn't tell
    you whether an integer is even or odd until you know the base
    anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.

    The only part I'm not clear on is what identifies the base. If you're
    going to write numbers little-endian, it's not that hard to also write
    them with a base indicator before the digits [...]

    In Common Lisp, you can write integers as #nnR[digits], where nn is the
    decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    You can also set or bind the global variable *read-base* (yes, the
    asterisks are part of the name) to an integer between 2 and 36, and
    then anything that looks like an integer in that base is interpreted as
    such (including literals in programs). The literals I described above
    are still handled correctly no matter the current value of *read-base*.
    So if the value of *read-base* is 16, then the input FFFF is read as
    the integer 65535 (as is the input #16rFFFF).
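
    For comparison, Python spells the same radix trick with int's base
    argument:

    # Python's analogue of Common Lisp's #16rFFFF: parse text in a given base.
    assert int("FFFF", 16) == 65535
    assert int("zz", 36) == 35 * 36 + 35  # bases 2..36 are accepted, like Lisp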

    (Pedants may point out details I omitted. I admit to omitting them.)

    IIRC, certain [old 8080 and Z-80?] assemblers used to put the base
    indicator at the end. So 10 meant, well, 10, but 10H meant 16 and 10b
    meant 2 (IDK; the capital H and the lower case b both look right to me).

    I don't recall numbers written from least significant digit to most
    significant digit (big and little endian *storage*, yes, but not the
    digits when presented to or read from a human).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 21:34:07 2024
    From Newsgroup: comp.lang.python

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).
    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.
    On Mon, Sep 30, 2024 at 9:30 PM Left Right <olegsivokon@gmail.com> wrote:

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    GZip is specifically designed to be streamed. So that's not a problem
    (in principle), but you would need a streaming GZip parser; a quick
    search on PyPI revealed this package:
    https://pypi.org/project/gzip-stream/ .

    On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 11:34:45 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 18:48:02 -0700,
    Keith Thompson via Python-list <python-list@python.org> wrote:

    2QdxY4RzWzUUiLuE@potatochowder.com writes:
    [...]
    In Common Lisp, you can write integers as #nnR[digits], where nn is the
    decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    Typo: You meant #16RFFFF, not #16fFFFF.

    Yep. Sorry.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 11:47:24 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 21:34:07 +0200,
    Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
    Left Right via Python-list <python-list@python.org> wrote:

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?

    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    How much state can a parser maintain (before it invokes an external
    function) and still be considered streaming? I fear that we may be
    getting hung up on terminology rather than solving the problem at hand.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Tue Oct 1 23:03:01 2024
    From Newsgroup: comp.lang.python

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.
    And what is that external function going to do with this information?
    The point is you didn't parse anything if you just sent the digit.
    You just delegated the parsing further. Parsing is only meaningful if
    you extracted some information, but your idea is, essentially "what if
    I do nothing?".
    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?
    Nobody says that parsing a number is the only pathological case. You,
    however, exaggerate by saying you cannot parse _anything_. You can
    parse booleans or null, for example. There's no problem there.
    Again, I think you misunderstand what streaming is for. Let me remind
    you: it's for processing information as it comes, potentially
    indefinitely. This has far more important implications than what you
    find in computer science. For example, some mathematicians use the
    same argument to show that real numbers are either fiction or useless:
    consider adding two real numbers (where real numbers are potentially
    infinite strings of decimal digits after the period) -- there's no way
    to prove that such an addition is possible, because you would need an
    infinite proof for that (because you need to start adding from the
    least significant digit).
    In principle, any language that has infinite words will have the same
    problem with streaming. If you ever pondered h/w or low-level
    protocols such as SCSI or IP, you'd see that they are specifically
    designed in such a way as to never have infinite words (because they
    must be amenable to streaming). Consider also an interesting
    consequence of SCSI not being able to have infinite words: this means,
    besides other things, that fsync() is nonsense! :) If you aren't
    familiar with the concept: the UNIX filesystem API suggests that it's
    possible to destage an arbitrarily large file (or a chunk of a file) to
    disk. But SCSI is built of finite "words", and to describe an
    arbitrarily large file you'd need to list all the blocks that
    constitute the file! And that's why fsync() and family are so hated by
    people who deal with storage: the only way to implement fsync() in
    compliance with the standard is to sync _everything_ (and it hurts!)
    On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list <python-list@python.org> wrote:

    On 2024-09-30 at 21:34:07 +0200,
    Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
    Left Right via Python-list <python-list@python.org> wrote:

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?

    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    How much state can a parser maintain (before it invokes an external
    function) and still be considered streaming? I fear that we may be
    getting hung up on terminology rather than solving the problem at hand.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 10:48:24 2024
    From Newsgroup: comp.lang.python

    On 1/10/24 8:34 am, Left Right wrote:
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet.

    By that definition of "streaming", no parser can ever be streaming,
    because there will be some constructs that must be read in their
    entirety before a suitably-structured piece of output can be
    emitted.

    The context of this discussion about integers is the claim that
    they *could* be parsed incrementally if they were written little
    endian instead of big endian, but the same argument applies either
    way.
    --
    Greg
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 11:07:41 2024
    From Newsgroup: comp.lang.python

    On 2/10/24 10:03 am, Left Right wrote:
    Consider also an interesting consequence of SCSI not being able to have
    infinite words: this means, besides other things, that fsync() is
    nonsense! :) If you aren't familiar with the concept: the UNIX
    filesystem API suggests that it's possible to destage an arbitrarily
    large file (or a chunk of a file) to disk. But SCSI is built of finite
    "words", and to describe an arbitrarily large file you'd need to list
    all the blocks that constitute the file!

    I don't follow. What fsync() does is ensure that any data buffered
    in the kernel relating to the file is sent to the storage device.
    It can send as many blocks of data over SCSI as required to
    achieve this. There's no requirement for it to be atomic at the
    level of the interface between the kernel and the hardware.

    Some devices do their own buffering in ways that are invisible to
    the software, so fsync() can't guarantee that the data is actually
    written to the storage medium. But that's a problem stemming from
    the design of the hardware, not the design of the protocol for
    communicating with the hardware.

    the only way to implement fsync() in compliance with the
    standard is to sync _everything_

    Again I'm not sure what you mean here. It may be difficult for the
    kernel to track down exactly what data is relevant to a particular file,
    and so the kernel programmers take the easy way out and just implement
    fsync() as sync(). But again that has nothing to do with the protocol.
    --
    Greg
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From avi.e.gross@avi.e.gross@gmail.com to comp.lang.python on Tue Oct 1 19:26:52 2024
    From Newsgroup: comp.lang.python

    This discussion has become less useful.

    We can all agree that in Computer Science, real infinities are avoided, and frankly, need not be taken seriously in any serious program.

    You can store all kinds of infinities quite compactly, as with a
    transcendental number that you can derive to as many decimal places as
    you like. Want 1/7 to a thousand decimal places? No problem. You can be
    given a digit 1 and a digit 7 and asked to do a division to as many
    digits as you wish in a deterministic manner. I can think of quite a
    few generators that could easily supply the next digit, or just keep
    giving the next element from 142857 each time in a circular loop.
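
    For instance, such a circular-loop digit generator is a two-liner with
    itertools:

    from itertools import cycle

    # 1/7 = 0.142857142857...; cycle() supplies the repeating block forever.
    sevenths = cycle("142857")
    print("0." + "".join(next(sevenths) for _ in range(12)))  # 0.142857142857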

    Sines, cosines, pi, e and so on can often be calculated to arbitrary
    precision by evaluating things like infinite Taylor series as many
    times as needed, up to the precision of the data type holding the
    number, as you move along.

    Similar ideas allow generators to give you as many primes as you want, and
    no more.

    So, if you can store arbitrary python code as part of your JSON, you can
    send quite a bit of somewhat compressed data.

    The real problem is how the JSON is set up. If you take umpteen data
    structures and wrap them all in something like a list, then it may be a
    tad hard to stream, as you may not necessarily be examining the
    contents till the list finishes gigabytes later. But if, instead, you
    send lots of smaller parts, such as perhaps sending each row of
    something like a data.frame individually, the other side can recombine
    them incrementally into a larger structure such as a data.frame and do
    some logic on it as it streams, such as keeping only some columns and
    discarding the rest, or applying filters that only keep rows you care
    about. And, of course, all rows could be appended to one or more .CSV
    files as well, so if you need multiple passes on the data, it can now
    be processed locally in various modes, including "streamed".
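
    As a sketch of that row-at-a-time idea, suppose the sender emits
    newline-delimited JSON, one record per line (the field names here are
    hypothetical):

    import json

    # Each line is one self-contained JSON record, so the receiver can
    # filter and keep only the columns it cares about as rows arrive.
    def filter_rows(lines, wanted_columns):
        for line in lines:
            row = json.loads(line)
            yield {k: row[k] for k in wanted_columns if k in row}

    # with open("export.ndjson") as f:
    #     for row in filter_rows(f, ["id", "severity"]):
    #         ...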

    I think that for some purposes, it makes some sense to not stream
    anything but results. I mean, consider any database that allows a
    remote login and SQL commands that only stream results. If I only want
    info on records about company X between July 1 and September 15 of a
    particular year, and only if the amount paid remains zero or is less
    than the amount owed, ...


    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Greg Ewing via Python-list
    Sent: Tuesday, October 1, 2024 5:48 PM
    To: python-list@python.org
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data
    (60 GB) from Kenna API

    On 1/10/24 8:34 am, Left Right wrote:
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet.

    By that definition of "streaming", no parser can ever be streaming,
    because there will be some constructs that must be read in their
    entirety before a suitably-structured piece of output can be
    emitted.

    The context of this discussion about integers is the claim that
    they *could* be parsed incrementally if they were written little
    endian instead of big endian, but the same argument applies either
    way.
    --
    Greg
    --
    https://mail.python.org/mailman/listinfo/python-list

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 20:20:59 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 23:03:01 +0200,
    Left Right <olegsivokon@gmail.com> wrote:

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    And what is that external function going to do with this information?
    The point is you didn't parse anything if you just sent the digit.
    You just delegated the parsing further. Parsing is only meaningful if
    you extracted some information, but your idea is, essentially "what if
    I do nothing?".

    If the parser detects the first digit of a number, then the parser can
    read digits one at a time (i.e., "streaming"), assimilate and
    accumulate the value of the number being parsed, and successfully
    finish parsing the number when it reads a non-digit. Whether the
    function that accumulates the value during the process is internal or
    external isn't relevant; the point is that it is possible to parse
    integers from most significant digit to least significant digit under a
    streaming model (and if you're sufficiently clever, you can even write
    partial results to external storage and/or another transmission
    protocol, thus allowing for numbers bigger (as measured by JSON or your
    internal representation) than your RAM).

    At most, the parser has to remember the non-digit character it read so
    that it (the parser) can begin to parse whatever comes after the number.
    Does that break your notion of "streaming"?

    Why do I have to start with the least significant digit?

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    Nobody says that parsing a number is the only pathological case. You,
    however, exaggerate by saying you cannot parse _anything_. You can
    parse booleans or null, for example. There's no problem there.

    My intent was only to repeat what you implied: that any parser that
    reads its input until it has parsed a value is not streaming.

    So how much information can the parser keep before you consider it not
    to be "streaming"?

    [...]

    In principle, any language that has infinite words will have the same
    problem with streaming [...]

    So what magic allows anyone to stream any JSON file over SCSI or IP?
    Let alone some kind of "live stream" that by definition is indefinite,
    even if it only lasts a few tenths of a second?

    [...] If you ever pondered h/w or low-level
    protocols s.a. SCSI or IP [...]

    I spent a good deal of my career designing and implementing all manner
    of communications protocols, from transmitting and receiving single
    bits over a wire all the way up to what are now known as session and
    presentation layers. Some imposed maximum lengths in certain places;
    some allowed for indefinite amounts of data to be transferred from one
    end to the other without stopping, resetting, or overflowing. And yet
    somehow, the universe never collapsed.

    If you believe that some implementation of fsync fails to meet a
    specification, or fails to work correctly on files containing JSON,
    then file a bug report.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 18:27:54 2024
    From Newsgroup: comp.lang.python

    On 2/10/24 12:26 pm, avi.e.gross@gmail.com wrote:
    The real problem is how the JSON is set up. If you take umpteen data
    structures and wrap them all in something like a list, then it may be a
    tad hard to stream, as you may not necessarily be examining the
    contents till the list finishes gigabytes later.

    Yes, if you want to process the items as they come in, you might
    be better off sending a series of separate JSON strings, rather than
    one JSON string containing a list.

    Or, use a specialised JSON parser that processes each item of the
    list as soon as it's finished parsing it, instead of collecting the
    whole list first.
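
    For example, a sketch with the third-party ijson library, which yields
    each element of a top-level array as soon as that element has been
    fully parsed (the file name and handler are hypothetical):

    import ijson  # third-party: pip install ijson

    # 'item' addresses each element of the top-level JSON array; ijson
    # parses incrementally, so only one element is in memory at a time.
    with open("export.json", "rb") as f:
        for obj in ijson.items(f, "item"):
            handle(obj)  # hypothetical per-record callback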
    --
    Greg

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Wed Oct 2 23:59:41 2024
    From Newsgroup: comp.lang.python

    On Wed, 2 Oct 2024 at 23:53, Left Right via Python-list <python-list@python.org> wrote:
    In the same email you replied to, I gave examples of languages for
    which parsers can be streaming (in general): SCSI or IP.

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Thu Oct 3 08:51:01 2024
    From Newsgroup: comp.lang.python

    On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to
    streaming if the words of the language are repetition of sequences of
    symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    One single IP packet is all you can parse. You're playing shenanigans
    with words the way Humpty Dumpty does. IP packets are not sequences,
    they are individuals.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Thu Oct 3 00:48:10 2024
    From Newsgroup: comp.lang.python

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
    justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to
    streaming if the words of the language are repetition of sequences of
    symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    So, the follow-up question from you to me should be: how come strictly
    context-free languages can still be parsed with streaming parsers? --
    And the answer to that is that it's possible to approximate
    context-free languages with regular languages. In fact, this is a very
    interesting subject, which unfortunately is usually overlooked in
    automata classes. It's interesting in the sense that it's very
    accessible to students who have already mastered the understanding of
    regular and context-free formalisms.

    So, streaming parsers (e.g. SAX) are written for a regular language
    that approximates XML. This is because in practice we will almost
    never encounter more than N nesting levels in an XML document, or more
    than N characters in an element name, etc. (for some large enough N) --
    something which allows us to create a regular language from a
    context-free one.

    NB. "Nonsensical" has a very precise meaning, when it comes to
    discussing the truth value of a proposition, which I think you also
    somehow didn't know about. You seem to use "nonsensical" as a synonym
    to "wrong". But, unbeknownst to you, you said something else. You
    actually implied that there's no way to tell if my notion of streaming
    is correct or not.

    But, for the future reference: my notion of streaming is correct, and
    you would do better learning some materials about it before jumping to conclusions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Thu Oct 3 00:56:36 2024
    From Newsgroup: comp.lang.python

    One single IP packet is all you can parse.
    I worked for an undisclosed company which manufactures h/w for ISPs
    (4- and 8-unit boxes you mount on a rack in a datacenter).
    Essentially, big-big routers. So, I had the pleasure of writing
    software that parses IP _protocol_, and let me tell you: you have no
    idea what you just wrote.
    But, like I wrote earlier: you don't understand the distinction
    between languages and words. And, in general, you are just being
    stubborn and rude because you are trying to prove a point to someone
    you don't like but, in reality, you just look more and more ridiculous.
    On Thu, Oct 3, 2024 at 12:51 AM Chris Angelico <rosuav@gmail.com> wrote:

    On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to streaming if the words of the language are repetition of sequences of symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    One single IP packet is all you can parse. You're playing shenanigans
    with words the way Humpty Dumpty does. IP packets are not sequences,
    they are individuals.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ethan Furman@ethan@stoneleaf.us to comp.lang.python on Wed Oct 2 18:57:51 2024
    From Newsgroup: comp.lang.python

    This thread is derailing.

    Please consider it closed.

    --
    ~Ethan~
    Moderator
    --- Synchronet 3.20a-Linux NewsLink 1.114