• Re: Help with Streaming and Chunk Processing for Large JSON Data (60GB) from Kenna API

    From Asif Ali Hirekumbi@asifali.ha@gmail.com to comp.lang.python on Mon Sep 30 12:11:30 2024
    From Newsgroup: comp.lang.python

    Thanks Abdur Rahmaan.
    I will give it a try !
    Thanks
    Asif
    On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer < arj.python@gmail.com> wrote:
    Idk if you tried Polars, but it seems to work well with JSON data

    import polars as pl
    pl.read_json("file.json")

    Kind Regards,

    Abdur-Rahmaan Janhangeer
    about <https://compileralchemy.github.io/> | blog <https://www.pythonkitchen.com>
    github <https://github.com/Abdur-RahmaanJ>
    Mauritius


    On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list < python-list@python.org> wrote:

    Dear Python Experts,

    I am working with the Kenna Application's API to retrieve vulnerability
    data. The API endpoint provides a single, massive JSON file in gzip
    format, approximately 60 GB in size. Handling such a large dataset in
    one go is proving to be quite challenging, especially in terms of
    memory management.

    I am looking for guidance on how to efficiently stream this data and
    process it in chunks using Python. Specifically, I am wondering if
    there’s a way to use the requests library or any other libraries that
    would allow us to pull data from the API endpoint in a memory-efficient
    manner.

    Here are the relevant API endpoints from Kenna:

    - Kenna API Documentation
    <https://apidocs.kennasecurity.com/reference/welcome>
    - Kenna Vulnerabilities Export
    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>

    If anyone has experience with similar use cases or can offer any
    advice, it would be greatly appreciated.

    Thank you in advance for your help!

    Best regards
    Asif Ali
    --
    https://mail.python.org/mailman/listinfo/python-list


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 10:41:44 2024
    From Newsgroup: comp.lang.python

    Whether and to what degree you can stream JSON depends on the JSON
    structure. In the most general case JSON cannot be streamed, but
    commonly it can be.

    Imagine a pathological case of this shape: 1... <60GB of digits>. This
    is still valid JSON (the format doesn't limit how many digits a number
    can have). And you cannot parse this number in a streaming way, because
    in order to do that you need to start with the least significant digit.

    Typically, however, JSON can be parsed incrementally. The format is
    conceptually very simple to write a parser for. There are plenty of
    parsers that do that, for example this one:
    https://pypi.org/project/json-stream/ . But I'd encourage you to do it
    yourself. It's fun, the resulting parser should end up under some 50
    LoC, and it lets you incorporate your desired output more closely into
    your parser.
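
    For a flavor of the do-it-yourself route, here is a minimal sketch
    that walks a top-level JSON array and yields one parsed element at a
    time by tracking bracket depth. It assumes well-formed input and is a
    sketch, not a validating parser:

    import json

    def iter_array_items(chunks):
        """Yield parsed elements of a top-level JSON array from a stream
        of text chunks, holding only one element in memory at a time."""
        depth, in_str, esc, buf = 0, False, False, []
        for chunk in chunks:
            for ch in chunk:
                if in_str:                      # inside a string literal
                    buf.append(ch)
                    if esc:
                        esc = False
                    elif ch == "\\":
                        esc = True
                    elif ch == '"':
                        in_str = False
                elif ch == '"':
                    in_str = True
                    buf.append(ch)
                elif ch in "[{":
                    depth += 1
                    if depth > 1:               # skip the outermost '['
                        buf.append(ch)
                elif ch in "]}":
                    if depth > 1:
                        buf.append(ch)
                    depth -= 1
                    if depth == 0 and "".join(buf).strip():
                        yield json.loads("".join(buf))  # last element
                        buf.clear()
                elif ch == "," and depth == 1:  # separator between elements
                    yield json.loads("".join(buf))
                    buf.clear()
                elif depth >= 1:
                    buf.append(ch)

    # list(iter_array_items(['[{"a": 1}, [2, ', '3], "x,y"]']))
    # -> [{'a': 1}, [2, 3], 'x,y']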
    On Mon, Sep 30, 2024 at 8:44 AM Asif Ali Hirekumbi via Python-list <python-list@python.org> wrote:

    Thanks Abdur Rahmaan.
    I will give it a try !

    Thanks
    Asif

    On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer < arj.python@gmail.com> wrote:

    Idk if you tried Polars, but it seems to work well with JSON data

    import polars as pl
    pl.read_json("file.json")

    Kind Regards,

    Abdur-Rahmaan Janhangeer
    about <https://compileralchemy.github.io/> | blog <https://www.pythonkitchen.com>
    github <https://github.com/Abdur-RahmaanJ>
    Mauritius


    On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list < python-list@python.org> wrote:

    Dear Python Experts,

    I am working with the Kenna Application's API to retrieve vulnerability
    data. The API endpoint provides a single, massive JSON file in gzip
    format, approximately 60 GB in size. Handling such a large dataset in
    one go is proving to be quite challenging, especially in terms of
    memory management.

    I am looking for guidance on how to efficiently stream this data and
    process it in chunks using Python. Specifically, I am wondering if
    there’s a way to use the requests library or any other libraries that
    would allow us to pull data from the API endpoint in a memory-efficient
    manner.

    Here are the relevant API endpoints from Kenna:

    - Kenna API Documentation
    <https://apidocs.kennasecurity.com/reference/welcome>
    - Kenna Vulnerabilities Export
    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>

    If anyone has experience with similar use cases or can offer any
    advice, it would be greatly appreciated.

    Thank you in advance for your help!

    Best regards
    Asif Ali
    --
    https://mail.python.org/mailman/listinfo/python-list


    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Barry@barry@barrys-emacs.org to comp.lang.python on Mon Sep 30 16:30:19 2024
    From Newsgroup: comp.lang.python


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")


    This is not going to work unless the computer has a lot more than 60 GiB of RAM. As later suggested, a streaming parser is required.
    Barry
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 12:11:46 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 14:28:33 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 11:44:50 -0400,
    Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
    Whether and to what degree you can stream JSON depends on the JSON
    structure. In the most general case JSON cannot be streamed, but
    commonly it can be.

    Imagine a pathological case of this shape: 1... <60GB of digits>. This
    is still valid JSON (the format doesn't limit how many digits a number
    can have). And you cannot parse this number in a streaming way, because
    in order to do that you need to start with the least significant digit.

    Which is how Arabic numerals were originally parsed, but when
    westerners adopted them from a R->L written language, they didn't flip
    them around to match the L->R written language into which they were
    being adopted.

    Interesting.

    So now long numbers can't be parsed as a stream in software. They
    should have anticipated this problem back in the 13th century and
    flipped the numbers around.

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?
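
    For illustration, that handwaved approach as a minimal Python sketch
    (plain unsigned decimal integers only; no sign, fraction, or exponent
    handling):

    def accumulate_int(chars):
        """Parse a decimal integer most-significant-digit first from an
        iterator of characters, stopping at the first non-digit."""
        value = 0
        for ch in chars:
            if not ch.isdigit():
                return value, ch               # hand back the terminator
            value = value * 10 + int(ch)       # shift left one decimal place
        return value, ""

    # accumulate_int(iter("12345,")) -> (12345, ",")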
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Tue Oct 1 04:46:35 2024
    From Newsgroup: comp.lang.python

    On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:

    But why do I need to start with the least
    significant digit?

    If you start from the most significant, you don't know anything about
    the number until you finish parsing it. There's almost nothing you can
    say about a number given that it starts with a particular sequence
    (since you don't know how MANY digits there are). However, if you know
    the LAST digits, you can make certain statements about it (trivial
    examples being whether it's odd or even).

    It's not very, well, significant. But there's something to it. And it
    extends nicely to p-adic numbers, which can have an infinite number of
    nonzero digits to the left of the decimal point:

    https://en.wikipedia.org/wiki/P-adic_number

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 13:57:05 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 1:00 PM, Chris Angelico via Python-list wrote:
    On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    Streaming gzip is perfectly possible. You may be thinking of PKZip,
    which has its EOCD (End of Central Directory record) at the end of the
    file (although it may still be possible to stream-decompress if you
    work at it).

    ChrisA

    You're right, that's what I was thinking of.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 21:30:06 2024
    From Newsgroup: comp.lang.python

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    GZip is specifically designed to be streamed. So that's not a problem
    (in principle), but you would need a streaming GZip parser; a quick
    search on PyPI revealed this package:
    https://pypi.org/project/gzip-stream/ .
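
    For what it's worth, streaming gunzip doesn't even need a third-party
    package; here is a minimal sketch with the standard library's zlib on
    top of requests (the URL is a placeholder, not the real Kenna
    endpoint):

    import zlib
    import requests  # third-party: pip install requests

    def iter_decompressed(url):
        """Yield decompressed chunks of a gzipped HTTP response without
        ever holding the whole body in memory."""
        decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB
                yield decomp.decompress(chunk)
            yield decomp.flush()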
    On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Passin@list1@tompassin.net to comp.lang.python on Mon Sep 30 14:05:36 2024
    From Newsgroup: comp.lang.python

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    There is also the json-stream library, on PyPI at

    https://pypi.org/project/json-stream/
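
    A minimal usage sketch (the file name is hypothetical, and the
    top-level value is assumed to be a JSON array):

    import json_stream  # third-party: pip install json-stream

    # Iterate a huge top-level JSON array one element at a time,
    # without materializing the whole document in memory.
    with open("vulns.json") as f:
        for item in json_stream.load(f):  # transient, single-pass view
            record = json_stream.to_standard_types(item)  # plain dict/list
            print(record)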


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 18:16:03 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 04:46:35 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list <python-list@python.org> wrote:

    But why do I need to start with the least
    significant digit?

    If you start from the most significant, you don't know anything about
    the number until you finish parsing it. There's almost nothing you can
    say about a number given that it starts with a particular sequence
    (since you don't know how MANY digits there are). However, if you know
    the LAST digits, you can make certain statements about it (trivial
    examples being whether it's odd or even).

    But that wasn't the question. Sure, under certain circumstances and for specific use cases and/or requirements, there might be arguments to read potential numbers as strings and possibly not have to parse them
    completely before accepting or rejecting them.

    And if I start with the least significant digit and the number happens
    to be written in scientific notation and/or has a decimal point, then I
    can't tell whether it's odd or even until I further process the whole
    thing anyway.

    It's not very, well, significant. But there's something to it. And it
    extends nicely to p-adic numbers, which can have an infinite number of
    nonzero digits to the left of the decimal point:

    https://en.wikipedia.org/wiki/P-adic_number

    In Common Lisp, integers can be written in any integer base from two to
    thirty six, inclusive. So knowing the last digit doesn't tell you
    whether an integer is even or odd until you know the base anyway.

    Curiously, we agree: if you move the goal posts arbitrarily, then
    some algorithms that parse JSON numbers will fail.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Tue Oct 1 09:09:07 2024
    From Newsgroup: comp.lang.python

    On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    In Common Lisp, integers can be written in any integer base from two
    to thirty six, inclusive. So knowing the last digit doesn't tell
    you whether an integer is even or odd until you know the base
    anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.

    The only part I'm not clear on is what identifies the base. If you're
    going to write numbers little-endian, it's not that hard to also write
    them with a base indicator before the digits. But, whatever. This is a
    typical tangent and people are argumentative for no reason. I was just
    trying to add some explanatory notes to why little-endian does make
    more sense than big-endian.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Mon Sep 30 20:06:57 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 09:09:07 +1000,
    Chris Angelico via Python-list <python-list@python.org> wrote:

    On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list <python-list@python.org> wrote:

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    In Common Lisp, integers can be written in any integer base from two
    to thirty six, inclusive. So knowing the last digit doesn't tell
    you whether an integer is even or odd until you know the base
    anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.

    The only part I'm not clear on is what identifies the base. If you're
    going to write numbers little-endian, it's not that hard to also write
    them with a base indicator before the digits [...]

    In Common Lisp, you can write integers as #nnR[digits], where nn is the
    decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    You can also set or bind the global variable *read-base* (yes, the
    asterisks are part of the name) to an integer between 2 and 36, and
    then anything that looks like an integer in that base is interpreted as
    such (including literals in programs). The literals I described above
    are still handled correctly no matter the current value of *read-base*.
    So if the value of *read-base* is 16, then the input FFFF is read as
    the integer 65535 (as is the input #16rFFFF).
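
    For comparison, Python spells the same radix trick with int's base
    argument:

    # Python's analogue of Common Lisp's #16rFFFF: parse text in a given base.
    assert int("FFFF", 16) == 65535
    assert int("zz", 36) == 35 * 36 + 35  # bases 2..36 are accepted, like Lisp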

    (Pedants may point out details I omitted. I admit to omitting them.)

    IIRC, certain [old 8080 and Z-80?] assemblers used to put the base
    indicator at the end. So 10 meant, well, 10, but 10H meant 16 and 10b
    meant 2 (IDK; the capital H and the lower case b both look right to me).

    I don't recall numbers written from least significant digit to most
    significant digit (big and little endian *storage*, yes, but not the
    digits when presented to or read from a human).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Mon Sep 30 21:34:07 2024
    From Newsgroup: comp.lang.python

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).
    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.
    On Mon, Sep 30, 2024 at 9:30 PM Left Right <olegsivokon@gmail.com> wrote:

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.

    GZip is specifically designed to be streamed. So that's not a problem
    (in principle), but you would need a streaming GZip parser; a quick
    search on PyPI revealed this package:
    https://pypi.org/project/gzip-stream/ .

    On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list <python-list@python.org> wrote:

    On 9/30/2024 11:30 AM, Barry via Python-list wrote:


    On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:


    import polars as pl
    pl.read_json("file.json")



    This is not going to work unless the computer has a lot more than 60 GiB of RAM.

    As later suggested, a streaming parser is required.

    Streaming won't work because the file is gzipped. You have to receive
    the whole thing before you can unzip it. Once unzipped it will be even
    larger, and all in memory.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 11:34:45 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 18:48:02 -0700,
    Keith Thompson via Python-list <python-list@python.org> wrote:

    2QdxY4RzWzUUiLuE@potatochowder.com writes:
    [...]
    In Common Lisp, you can write integers as #nnR[digits], where nn is the
    decimal representation of the base (possibly without a leading zero),
    the # and the R are literal characters, and the digits are written in
    the intended base. So the input #16fFFFF is read as the integer 65535.

    Typo: You meant #16RFFFF, not #16fFFFF.

    Yep. Sorry.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 11:47:24 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30 at 21:34:07 +0200,
    Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
    Left Right via Python-list <python-list@python.org> wrote:

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?

    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    How much state can a parser maintain (before it invokes an external
    function) and still be considered streaming? I fear that we may be
    getting hung up on terminology rather than solving the problem at hand.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Tue Oct 1 23:03:01 2024
    From Newsgroup: comp.lang.python

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.
    And what is that external function going to do with this information?
    The point is you didn't parse anything if you just sent the digit.
    You just delegated the parsing further. Parsing is only meaningful if
    you extracted some information, but your idea is, essentially "what if
    I do nothing?".
    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?
    Nobody says that parsing a number is the only pathological case. You,
    however, exaggerate by saying you cannot parse _anything_. You can
    parse booleans or null, for example. There's no problem there.
    Again, I think you misunderstand what streaming is for. Let me remind
    you: it's for processing information as it comes, potentially
    indefinitely. This has far more important implications than what you
    find in computer science. For example, some mathematicians use the
    same argument to show that real numbers are either fiction or useless:
    consider adding two real numbers (where real numbers are potentially
    infinite strings of decimal digits after the period) -- there's no way
    to prove that such an addition is possible, because you would need an
    infinite proof for that (because you need to start adding from the
    least significant digit).
    In principle, any language that has infinite words will have the same
    problem with streaming. If you ever pondered h/w or low-level
    protocols such as SCSI or IP, you'd see that they are specifically
    designed in such a way as to never have infinite words (because they
    must be amenable to streaming). Consider also an interesting
    consequence of SCSI not being able to have infinite words: this means,
    besides other things, that fsync() is nonsense! :) If you aren't
    familiar with the concept: the UNIX filesystem API suggests that it's
    possible to destage an arbitrarily large file (or a chunk of a file) to
    disk. But SCSI is built of finite "words", and to describe an
    arbitrarily large file you'd need to list all the blocks that
    constitute the file! And that's why fsync() and family are so hated by
    people who deal with storage: the only way to implement fsync() in
    compliance with the standard is to sync _everything_ (and it hurts!)
    On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list <python-list@python.org> wrote:

    On 2024-09-30 at 21:34:07 +0200,
    Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
    Left Right via Python-list <python-list@python.org> wrote:

    What am I missing? Handwavingly, start with the first digit, and as
    long as the next character is a digit, multiply the accumulated result
    by 10 (or the appropriate base) and add the next value. Oh, and handle
    scientific notation as a special case, and perhaps fail spectacularly
    instead of recovering gracefully in certain edge cases. And in the
    pathological case of a single number with 60 billion digits, run out of
    memory (and complain loudly to the person who claimed that the file
    contained a "dataset"). But why do I need to start with the least
    significant digit?

    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet. What about two digits? -- Same thing. You cannot leave
    the parser code until you know the magnitude (otherwise the information
    is useless to the external code).

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    So, even if you have enough memory and don't care about special cases
    like scientific notation: yes, you will be able to parse it, but it
    won't be a streaming parser.

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    How much state can a parser maintain (before it invokes an external
    function) and still be considered streaming? I fear that we may be
    getting hung up on terminology rather than solving the problem at hand.
    --
    https://mail.python.org/mailman/listinfo/python-list
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 10:48:24 2024
    From Newsgroup: comp.lang.python

    On 1/10/24 8:34 am, Left Right wrote:
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet.

    By that definition of "streaming", no parser can ever be streaming,
    because there will be some constructs that must be read in their
    entirety before a suitably-structured piece of output can be
    emitted.

    The context of this discussion about integers is the claim that
    they *could* be parsed incrementally if they were written little
    endian instead of big endian, but the same argument applies either
    way.
    --
    Greg
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 11:07:41 2024
    From Newsgroup: comp.lang.python

    On 2/10/24 10:03 am, Left Right wrote:
    Consider also an interesting consequence of SCSI not being able to have
    infinite words: this means, besides other things, that fsync() is
    nonsense! :) If you aren't familiar with the concept: the UNIX
    filesystem API suggests that it's possible to destage an arbitrarily
    large file (or a chunk of a file) to disk. But SCSI is built of finite
    "words", and to describe an arbitrarily large file you'd need to list
    all the blocks that constitute the file!

    I don't follow. What fsync() does is ensure that any data buffered
    in the kernel relating to the file is sent to the storage device.
    It can send as many blocks of data over SCSI as required to
    achieve this. There's no requirement for it to be atomic at the
    level of the interface between the kernel and the hardware.

    Some devices do their own buffering in ways that are invisible to
    the software, so fsync() can't guarantee that the data is actually
    written to the storage medium. But that's a problem stemming from
    the design of the hardware, not the design of the protocol for
    communicating with the hardware.

    the only way to implement fsync() in compliance with the
    standard is to sync _everything_

    Again I'm not sure what you mean here. It may be difficult for the
    kernel to track down exactly what data is relevant to a particular file,
    and so the kernel programmers take the easy way out and just implement
    fsync() as sync(). But again that has nothing to do with the protocol.
    --
    Greg
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From avi.e.gross@avi.e.gross@gmail.com to comp.lang.python on Tue Oct 1 19:26:52 2024
    From Newsgroup: comp.lang.python

    This discussion has become less useful.

    We can all agree that in Computer Science, real infinities are avoided, and frankly, need not be taken seriously in any serious program.

    You can store all kinds of infinities quite compactly, as with a
    transcendental number that you can derive to as many decimal places as
    you like. Want 1/7 to a thousand decimal places? No problem. You can be
    given a digit 1 and a digit 7 and asked to do a division to as many
    digits as you wish in a deterministic manner. I can think of quite a
    few generators that could easily supply the next digit, or just keep
    giving the next element from 142857 each time in a circular loop.
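
    For instance, such a circular-loop digit generator is a two-liner with
    itertools:

    from itertools import cycle

    # 1/7 = 0.142857142857...; cycle() supplies the repeating block forever.
    sevenths = cycle("142857")
    print("0." + "".join(next(sevenths) for _ in range(12)))  # 0.142857142857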

    Sines, cosines, pi, e and so on can often be calculated to arbitrary
    precision by evaluating things like infinite Taylor series as many
    times as needed, up to the precision of the data type holding the
    number, as you move along.

    Similar ideas allow generators to give you as many primes as you want, and
    no more.

    So, if you can store arbitrary python code as part of your JSON, you can
    send quite a bit of somewhat compressed data.

    The real problem is how the JSON is set up. If you take umpteen data
    structures and wrap them all in something like a list, then it may be a
    tad hard to stream, as you may not necessarily be examining the
    contents till the list finishes gigabytes later. But if, instead, you
    send lots of smaller parts, such as perhaps sending each row of
    something like a data.frame individually, the other side can recombine
    them incrementally into a larger structure such as a data.frame and do
    some logic on it as it streams, such as keeping only some columns and
    discarding the rest, or applying filters that only keep rows you care
    about. And, of course, all rows could be appended to one or more .CSV
    files as well, so if you need multiple passes on the data, it can now
    be processed locally in various modes, including "streamed".
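
    As a sketch of that row-at-a-time idea, suppose the sender emits
    newline-delimited JSON, one record per line (the field names here are
    hypothetical):

    import json

    # Each line is one self-contained JSON record, so the receiver can
    # filter and keep only the columns it cares about as rows arrive.
    def filter_rows(lines, wanted_columns):
        for line in lines:
            row = json.loads(line)
            yield {k: row[k] for k in wanted_columns if k in row}

    # with open("export.ndjson") as f:
    #     for row in filter_rows(f, ["id", "severity"]):
    #         ...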

    I think that for some purposes, it makes some sense to not stream
    anything but results. I mean, consider any database that allows a
    remote login and SQL commands that only stream results. If I only want
    info on records about company X between July 1 and September 15 of a
    particular year, and only if the amount paid remains zero or is less
    than the amount owed, ...


    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Greg Ewing via Python-list
    Sent: Tuesday, October 1, 2024 5:48 PM
    To: python-list@python.org
    Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data
    (60 GB) from Kenna API

    On 1/10/24 8:34 am, Left Right wrote:
    You probably forgot that it has to be _streaming_. Suppose you parse
    the first digit: can you hand this information over to an external
    function to process the parsed data? -- No! because you don't know the
    magnitude yet.

    By that definition of "streaming", no parser can ever be streaming,
    because there will be some constructs that must be read in their
    entirety before a suitably-structured piece of output can be
    emitted.

    The context of this discussion about integers is the claim that
    they *could* be parsed incrementally if they were written little
    endian instead of big endian, but the same argument applies either
    way.
    --
    Greg
    --
    https://mail.python.org/mailman/listinfo/python-list

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From 2QdxY4RzWzUUiLuE@2QdxY4RzWzUUiLuE@potatochowder.com to comp.lang.python on Tue Oct 1 20:20:59 2024
    From Newsgroup: comp.lang.python

    On 2024-10-01 at 23:03:01 +0200,
    Left Right <olegsivokon@gmail.com> wrote:

    If I recognize the first digit, then I *can* hand that over to an
    external function to accumulate the digits that follow.

    And what is that external function going to do with this information?
    The point is you didn't parse anything if you just sent the digit.
    You just delegated the parsing further. Parsing is only meaningful if
    you extracted some information, but your idea is, essentially "what if
    I do nothing?".

    If the parser detects the first digit of a number, then the parser can
    read digits one at a time (i.e., "streaming"), assimilate and
    accumulate the value of the number being parsed, and successfully
    finish parsing the number when it reads a non-digit. Whether the
    function that accumulates the value during the process is internal or
    external isn't relevant; the point is that it is possible to parse
    integers from most significant digit to least significant digit under a
    streaming model (and if you're sufficiently clever, you can even write
    partial results to external storage and/or another transmission
    protocol, thus allowing for numbers bigger (as measured by JSON or your
    internal representation) than your RAM).

    At most, the parser has to remember the non-digit character it read so
    that it (the parser) can begin to parse whatever comes after the number.
    Does that break your notion of "streaming"?

    Why do I have to start with the least significant digit?

    Under that constraint, I'm not sure I can parse anything. How can I
    parse a string (and hand it over to an external function) until I've
    found the closing quote?

    Nobody says that parsing a number is the only pathological case. You,
    however, exaggerate by saying you cannot parse _anything_. You can
    parse booleans or null, for example. There's no problem there.

    My intent was only to repeat what you implied: that any parser that
    reads its input until it has parsed a value is not streaming.

    So how much information can the parser keep before you consider it not
    to be "streaming"?

    [...]

    In principle, any language that has infinite words will have the same
    problem with streaming [...]

    So what magic allows anyone to stream any JSON file over SCSI or IP?
    Let alone some kind of "live stream" that by definition is indefinite,
    even if it only lasts a few tenths of a second?

    [...] If you ever pondered h/w or low-level
    protocols s.a. SCSI or IP [...]

    I spent a good deal of my career designing and implementing all manner
    of communications protocols, from transmitting and receiving single
    bits over a wire all the way up to what are now known as session and
    presentation layers. Some imposed maximum lengths in certain places;
    some allowed for indefinite amounts of data to be transferred from one
    end to the other without stopping, resetting, or overflowing. And yet
    somehow, the universe never collapsed.

    If you believe that some implementation of fsync fails to meet a
    specification, or fails to work correctly on files containing JSON,
    then file a bug report.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Greg Ewing@greg.ewing@canterbury.ac.nz to comp.lang.python on Wed Oct 2 18:27:54 2024
    From Newsgroup: comp.lang.python

    On 2/10/24 12:26 pm, avi.e.gross@gmail.com wrote:
    The real problem is how the JSON is set up. If you take umpteen data
    structures and wrap them all in something like a list, then it may be a
    tad hard to stream, as you may not necessarily be examining the
    contents till the list finishes gigabytes later.

    Yes, if you want to process the items as they come in, you might
    be better off sending a series of separate JSON strings, rather than
    one JSON string containing a list.

    Or, use a specialised JSON parser that processes each item of the
    list as soon as it's finished parsing it, instead of collecting the
    whole list first.
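
    For example, a sketch with the third-party ijson library, which yields
    each element of a top-level array as soon as that element has been
    fully parsed (the file name and handler are hypothetical):

    import ijson  # third-party: pip install ijson

    # 'item' addresses each element of the top-level JSON array; ijson
    # parses incrementally, so only one element is in memory at a time.
    with open("export.json", "rb") as f:
        for obj in ijson.items(f, "item"):
            handle(obj)  # hypothetical per-record callback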
    --
    Greg

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Wed Oct 2 23:59:41 2024
    From Newsgroup: comp.lang.python

    On Wed, 2 Oct 2024 at 23:53, Left Right via Python-list <python-list@python.org> wrote:
    In the same email you replied to, I gave examples of languages for
    which parsers can be streaming (in general): SCSI or IP.

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Thu Oct 3 08:51:01 2024
    From Newsgroup: comp.lang.python

    On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to
    streaming if the words of the language are repetition of sequences of
    symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    One single IP packet is all you can parse. You're playing shenanigans
    with words the way Humpty Dumpty does. IP packets are not sequences,
    they are individuals.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Thu Oct 3 00:48:10 2024
    From Newsgroup: comp.lang.python

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
    justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to
    streaming if the words of the language are repetition of sequences of
    symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    So, the follow-up question from you to me should be: how come strictly
    context-free languages can still be parsed with streaming parsers? --
    And the answer to that is that it's possible to approximate
    context-free languages with regular languages. In fact, this is a very
    interesting subject, which unfortunately is usually overlooked in
    automata classes. It's interesting in the sense that it's very
    accessible to students who have already mastered the understanding of
    regular and context-free formalisms.

    So, streaming parsers (e.g. SAX) are written for a regular language
    that approximates XML. This is because in practice we will almost
    never encounter more than N nesting levels in an XML document, or more
    than N characters in an element name, etc. (for some large enough N) --
    something which allows us to create a regular language from a
    context-free one.

    NB. "Nonsensical" has a very precise meaning, when it comes to
    discussing the truth value of a proposition, which I think you also
    somehow didn't know about. You seem to use "nonsensical" as a synonym
    to "wrong". But, unbeknownst to you, you said something else. You
    actually implied that there's no way to tell if my notion of streaming
    is correct or not.

    But, for the future reference: my notion of streaming is correct, and
    you would do better learning some materials about it before jumping to conclusions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Thu Oct 3 00:56:36 2024
    From Newsgroup: comp.lang.python

    One single IP packet is all you can parse.
    I worked for an undisclosed company which manufactures h/w for ISPs
    (4- and 8-unit boxes you mount on a rack in a datacenter).
    Essentially, big-big routers. So, I had the pleasure of writing
    software that parses IP _protocol_, and let me tell you: you have no
    idea what you just wrote.
    But, like I wrote earlier: you don't understand the distinction
    between languages and words. And, in general, you are just being
    stubborn and rude because you are trying to prove a point to someone
    you don't like but, in reality, you just look more and more ridiculous.
    On Thu, Oct 3, 2024 at 12:51 AM Chris Angelico <rosuav@gmail.com> wrote:

    On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:

    You can't validate an IP packet without having all of it. Your notion
    of "streaming" is nonsensical.

    Whoa, whoa, hold your horses! "nonsensical" needs a little bit of justification :)

    It seems you don't understand the difference between words and
    languages! In my examples, IP _protocol_ is the language, sequences of
    IP packets are the words in the language. A language is amenable to streaming if the words of the language are repetition of sequences of symbols of the alphabet of fixed length. This is, essentially, like
    saying that the words themselves are regular.

    One single IP packet is all you can parse. You're playing shenanigans
    with words the way Humpty Dumpty does. IP packets are not sequences,
    they are individuals.

    ChrisA
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ethan Furman@ethan@stoneleaf.us to comp.lang.python on Wed Oct 2 18:57:51 2024
    From Newsgroup: comp.lang.python

    This thread is derailing.

    Please consider it closed.

    --
    ~Ethan~
    Moderator
    --- Synchronet 3.20a-Linux NewsLink 1.114