• Constant Stack Canaries

    From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 30 08:16:52 2025
    From Newsgroup: comp.arch

    Just got to thinking about stack canaries. I was going to have a special purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 30 12:47:59 2025
    From Newsgroup: comp.arch

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
    Prolog stores the value;
    Epilog loads it and verifies that the value is intact.
    Using a magic number generated by the compiler.

    Nothing fancy needed in the assembly or link stages.
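
    Roughly, the compiler-inserted checks behave like the hand-written C
    sketch below. This is only an illustration of the idea; the magic
    constant, the variable placement, and the abort() on mismatch are
    placeholders rather than any particular compiler's actual codegen:

        #include <stdint.h>
        #include <stdlib.h>

        #define STACK_MAGIC 0x7C3Au   /* compile-time constant, e.g. 16-bit */

        void some_function(void)
        {
            uint16_t canary = STACK_MAGIC;   /* prolog: store the value */
            char buf[64];
            /* ... body that may overflow buf ... */
            (void)buf;
            if (canary != STACK_MAGIC)       /* epilog: verify it is intact */
                abort();                     /* trap / fault on mismatch */
        }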


    In my case, canary behavior is one of:
    Use them in functions with arrays or similar (default);
    Use them everywhere (optional);
    Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger values).

    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 30 20:14:53 2025
    From Newsgroup: comp.arch

    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
    Prolog stores the value;
    Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
    Use them in functions with arrays or similar (default);
    Use them everywhere (optional);
    Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 30 21:26:23 2025
    From Newsgroup: comp.arch

    On 2025-03-30 4:14 p.m., MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    Ah, okay. I had thought the stack canaries were defined at run-time.
    Much easier to handle with the compiler. But, what happens when multiple instances of a program are loaded? Would it not be better to have
    separate stack canaries? I had thought the stack canaries would be
    different for each run of a program, otherwise could not some bad
    software discover the canary value?




    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 01:34:14 2025
    From Newsgroup: comp.arch

    On 3/30/2025 3:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.


    It is possible that the magic number could have been generated by the
    CPU itself, or specified on the command-line by the user, or, ...

    Rather than, say, the compiler coming up with a magic number for each
    function (say, based on a hash function or "rand()" or something).


    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.


    Yes.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....



    ( Well, anyways, going off on a tangent here... )


    Meanwhile, in my own goings on... It took way too much effort to figure
    out the specific quirks in the RIFF/WAVE headers to get Audacity to
    accept IMA-ADPCM output from BGBCC's resource converter.

    It was like:
    Media Player Classic: Yeah, fine.
    VLC Media Player: Yeah, fine.
    Audacity: "I have no idea what this is...".

    Turns out Audacity is not happy unless:
    The size of the 'fmt ' is 20 bytes, cbSize is 2,
    with an additional 16 bit member specifying the samples per block.
    With a 'fact' chunk, specifying the overall length of the WAV in samples.

    Pretty much everything else accepted the 16-byte PCMWAVEFORMAT with no
    'fact' chunk (and calculating the samples per block based on nBlockAlign).
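
    For reference, the 20-byte 'fmt ' layout that Audacity wants is the
    usual WAVEFORMATEX fields plus the IMA-ADPCM extension word, roughly as
    in the C sketch below (field names follow the common Windows-header
    convention; the values in the comments are just examples):

        #include <stdint.h>

        #pragma pack(push, 1)           /* assume byte-exact packing */
        typedef struct {
            uint16_t wFormatTag;        /* 0x0011 = IMA/DVI ADPCM          */
            uint16_t nChannels;
            uint32_t nSamplesPerSec;
            uint32_t nAvgBytesPerSec;
            uint16_t nBlockAlign;       /* bytes per ADPCM block           */
            uint16_t wBitsPerSample;
            uint16_t cbSize;            /* 2: one extra 16-bit member      */
            uint16_t wSamplesPerBlock;  /* samples decoded from each block */
        } ImaAdpcmFmt;                  /* 20 bytes total                  */
        #pragma pack(pop)

    The accompanying 'fact' chunk then carries a single uint32_t giving the
    total sample count per channel.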

    ...


    Though, in this case, I am mostly poking at stuff for "Resource WADs", typically images/etc that are intended to be hidden inside EXE or DLL
    files (where size matters more than quality, and any sound effects are
    likely to be limited to under 1 second).

    Say, one has a sound effect that is, say:
    0.5 seconds;
    8kHz
    2 bits/sample

    This is roughly 1kB of audio data.
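
    As a sanity check on that figure: 0.5 s * 8000 samples/s * 2 bits/sample
    = 8000 bits = 1000 bytes, so roughly 1 kB before any header overhead.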




    I also defined a 2-bit ADPCM variant (ADLQ), and ended up using a
    customized simplified header for it (using a similar structure to the
    BMP format; where the full RIFF format adds unnecessary overhead; though
    the savings here are debatable).

    Say:
    Full RIFF in this case:
    60 bytes of header.
    Simplified format:
    32 bytes of header.
    So, saving roughly 28 bytes of overhead vs RIFF/WAVE.
    Though, the saving drops to around 12 bytes if the RIFF file omits the
    'fact' chunk and uses the 16-byte PCMWAVEFORMAT structure rather than
    WAVEFORMATEX.


    While theoretically 2-bit IMA ADPCM already exists for WAV, seemingly
    not much supports it. I also implemented support for this, as it does at
    least "exist in the wild".


    As for the 2-bit version of IMA ADPCM:
    Media Player Classic: Opens it and shows correct length,
    but sounds broken.
    Sounds like it is trying to play it with the 4 bit decoder.
    VLC Media Player:
    Basically works, though the progress bar and time display are wonky.
    It does figure out mostly the correct length at least.
    Audacity: Claims to not understand it.


    I had discovered the "adpcm-xq" library, and looked at this as a
    reference for the 2-bit IMA format. Since VLC plays it, I will assume my
    code is probably generating "mostly correct" output (at least WRT the 2b
    ADPCM part; possible wonk may remain in the WAVEFORMATEX header, and/or
    VLC is just a little buggy here).



    So, thus far:
    ADLQ:
    Slightly higher quality;
    Needs a slightly more complicated encoder for good results;
    Decoder needs to ensure values don't go out of range.
    Software support: Basically non existent.
    Could in theory allow a cheap-ish hardware decoder.
    2-bit IMA ADPCM:
    Slightly simpler encoder;
    More is needed on the decoder side;
    Requires using multiply and range clamping.
    Slightly worse audio quality ATM.
    Around 0.8% bigger for mono due to header differences.

    Block Headers:
    ADLQ:
    ( 7: 0): Initial Sample, A-Law
    (11: 8): Initial Step Index
    ( 12): Interpolation Hint
    (15:13): Block Size (Log2)
    IMA, 2b:
    (15: 0): Initial Sample, PCM16
    (23:16): Step Index
    (31:24): Zero
    ADLQ is 1016 samples in 256 bytes, IMA is 1008.

    Sample Format is common:
    00: Small Positive
    01: Large Positive
    10: Small Negative
    11: Large Negative

    Both have a scale-ratio of 1 or 3 (if normalized).
    ADLQ has a narrower range of steps, with stepping of -1/+1.
    Each step in ADLQ is 1/2 bit, so each 2 steps is a power of 2.
    So, curve of around 1.414214**n
    IMA has more steps, with a per-sample step of -1/+2.
    Doesn't map cleanly to power of 2,
    but around 8 steps per power of 2.
    Seems to be built around a curve of 1.1**n.

    But, more aggressive stepping makes sense with 2-bit samples IMO...
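
    A minimal sketch of how a decoder for this kind of 2-bit scheme can be
    organized, assuming an IMA-style arrangement: sign in bit 1, small/large
    magnitude in bit 0, a 1:3 magnitude ratio, a -1/+2 index update, and
    range clamping on the predictor. The step table here is a made-up
    placeholder, not the actual ADLQ or adpcm-xq table:

        #include <stdint.h>

        /* Placeholder step table; an IMA-style table grows by ~1.1x per
           index, i.e. roughly 8 steps per doubling. */
        static const int step_tab[16] = {
            7, 8, 9, 10, 11, 12, 14, 15, 17, 19, 21, 23, 25, 28, 31, 34 };

        static int clamp16(int v) {
            return (v < -32768) ? -32768 : (v > 32767) ? 32767 : v;
        }

        /* Decode one 2-bit code; 'pred' and 'idx' are the running
           predictor and step index carried across samples. */
        static int16_t decode2(int code, int *pred, int *idx)
        {
            int step  = step_tab[*idx];
            int delta = (code & 1) ? 3 * step : step;  /* small vs large  */
            *pred    += (code & 2) ? -delta : delta;   /* apply sign      */
            *pred     = clamp16(*pred);                /* range clamp     */
            *idx     += (code & 1) ? 2 : -1;           /* -1/+2 stepping  */
            if (*idx < 0)  *idx = 0;
            if (*idx > 15) *idx = 15;
            return (int16_t)*pred;
        }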

    I went with not doing any range clamping in the decoder, so the encoder
    is responsible for ensuring that values don't go out of range. This does
    increase encoder complexity some (it needs to evaluate possible paths
    multiple samples in advance to make sure the path doesn't go out of
    range).

    Potentially, 1/4-bit step with -1/+2 could have made sense. Would need a
    5-bit index though to have enough dynamic range.


    Both use a different strategy for stereo:
    ADLQ:
    Splits center and side, encoding side at 1/4 sample rate;
    So, stereo increases bitrate by 25%.
    2b IMA:
    Encodes both the left and right channel independently.
    So, stereo doubles the bitrate.


    As for why 2b:
    Where one cares more about size than audio quality...
    8kHz : 16 kbps
    11kHz: 24 kbps
    16kHz: 32 kbps
    Also IMHO, 16kHz at 2 bits/sample sounds better than 8kHz at 4
    bits/sample.
    At least speech is mostly still intelligible at 16 kHz.
    Basic sound effects still mostly work at 8kHz though.
    Like, if one needs a ding or chime or similar.

    Not really any good/obvious way here to reach or go below 1 bit/sample
    while still preserving passable quality (2 bit/sample is the lower limit
    for ADPCM, only real way to go lower would be to match blocks of 4 or 8 samples to a pattern table).


    Had previously been making some use of A-Law, but as can be noted, A-Law requires 8 bits per sample.

    Though, ending up back at poking around with ADPCM is similar territory
    to my projects from a decade ago...


    But, OTOH: ADPCM is/was an effective format for sound effects; even if
    not given much credit (and seemingly most people see it as obsolescent).




    As for image formats, I have a few options for low bpp, while also being
    cheap to decode:
    BMP+CRAM: 4x4x1, limited variant of CRAM ("MS Video 1")
    Roughly 2 bpp (and repurposed as a graphics format...).
    BMP+CQ: 8x8x1, similar design to CRAM.
    Roughly 1.25 bpp

    Where, these can work well for images with no more than 2 colors per 4x4
    or 8x8 pixel block (otherwise, YMMV). As it so happens, lots of UI
    graphics fit this pattern, and/or are essentially monochrome. CQ can
    deal well with monochrome or almost-monochrome graphics without too much
    space overhead.
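
    The bit budget works out as follows: a 4x4x1 block is a 16-bit pixel
    mask plus two 8-bit palette indices, so 4 bytes per 16 pixels = 2 bpp;
    an 8x8x1 block is an 8-byte mask plus two indices, so 10 bytes per 64
    pixels = 1.25 bpp. A decoder for the 4x4 case might look like the sketch
    below (the field order within the block is a guess here, not the actual
    BGBCC layout):

        #include <stdint.h>

        /* One 4x4x1 block: the mask selects between two palette indices
           (colorB for 0 bits, colorA for 1 bits). */
        typedef struct { uint16_t mask; uint8_t colorA, colorB; } Block4x4x1;

        static void decode_block4x4(const Block4x4x1 *blk,
                                    uint8_t *dst, int dst_stride)
        {
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++) {
                    int bit = (blk->mask >> (y * 4 + x)) & 1;
                    dst[y * dst_stride + x] = bit ? blk->colorA : blk->colorB;
                }
        }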

    Though, in some other cases, monochrome or 4-color images could be a
    better fit. These default to black/white or black/white/cyan/magenta,
    but don't necessarily need to be limited to this (but, may need to add
    options in BGBCC for 2/4/16 color dynamic-palette).

    Say, for example, if an image is only black/white/red/blue or similar,
    4-color could make sense (vs using CRAM or CQ and picking from the 256
    color palette; but not being able to have different sets of colors in
    close proximity). Often, 16-color works, but 16-color is rather bulky if compared with CRAM or CQ.



    For the CRAM and CQ formats, I ended up adding an option by which the
    color palette can be skipped (it is replaced by a palette hash value; OS
    can use the color palette associated with the corresponding hash number).

    Mostly this was because, say, for 32x32 or 64x64 CRAM images, the 256
    color palette was bigger than the image itself.
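
    The arithmetic is fairly stark: a 256-entry BMP-style palette at 4 bytes
    per entry is 1024 bytes, while a 32x32 image at roughly 2 bpp is only
    about 256 bytes of pixel data, so the palette alone would be around four
    times the size of the image.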

    Note that much below 32x32, it is more compact to use hi-color BMP
    images than 256-color due to the color palette issue (making the
    optional omission for small image formats desirable).

    Though, generally, these are generated with BGBCC, which can include the
    palette in the generated resource WAD (though the best format is TBD).
    For the kernel, it is stored as a 256x256 indexed color bitmap (which
    also encodes a set of dither-aware RGB555 lookup tables).

    For normal EXE/DLL files, one could either store a dummy 16x16 256-color
    image, or more compactly, a 16x16 hi-color image (with no dither table),
    since it is plausible that EXEs/DLLs could use a different default color
    palette from the OS kernel.



    Note that neither PNG, JPEG, nor even QOI, are a good fit for these use
    cases. Wonky BMP variants are a better fit.

    For SDF font images, had also used BMP, say a 256x256 8bpp image
    covering CP-1252, with a specialized color palette (X/Y distances are
    encoded in the pixels). Needed a full 8bpp here as CRAM doesn't work for
    this.
    PNG compresses them, but the overhead is too high; and QOI is not so
    effective for this scenario. Though, as 8bpp images, they do LZ compress
    pretty OK.

    But, would not be reasonable to specially address every scenario.


    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Mar 31 09:04:40 2025
    From Newsgroup: comp.arch

    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register). The CALL instruction would store a
    magic value, and the RET instruction would test it. If there was not a
    match, an exception would be generated. The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do. The downside is
    more hardware and perhaps extra overhead.
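
    A rough software model of the proposal, purely for illustration (MAGIC
    standing in for the per-run value such as the startup clock, and the
    stack handling simplified to a plain pointer):

        #include <stdint.h>
        #include <stdlib.h>

        static uint64_t MAGIC;   /* e.g. sampled from the clock at startup */

        /* CALL: push the return address, then the magic word,
           automatically. */
        static void model_call(uint64_t **sp, uint64_t return_addr)
        {
            *--(*sp) = return_addr;
            *--(*sp) = MAGIC;
        }

        /* RET: check the magic word before using the return address;
           a mismatch raises the exception (modeled here as abort()). */
        static uint64_t model_ret(uint64_t **sp)
        {
            if (*(*sp)++ != MAGIC)
                abort();
            return *(*sp)++;
        }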

    Does this make sense? What have I missed?







    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 12:17:38 2025
    From Newsgroup: comp.arch

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a match, an exception would be generated.  The value itself could be something like the clock value when the program was initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or
    similar, rather than the (more common for RISCs) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
    6b Hi (Upper bound of registers to save)
    6b Lo (Lower bound of registers to save)
    1b LR (Flag to save Link Register)
    1b GP (Flag to save Global Pointer)
    1b SK (Flag to generate a canary)

    Likely (STM):
    Pushes LR first (if bit set);
    Pushes GP second (if bit set);
    Pushes registers in range (if Hi>=Lo);
    Pushes stack canary (if bit set).

    LDM would check the canary first and fault if it doesn't see the
    expected value.
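
    A rough C model of the proposed STM/LDM behavior, using the fields
    described above; the push order and the canary constant are arbitrary
    placeholders:

        #include <stdint.h>
        #include <stdlib.h>

        #define CANARY 0x5A5AA5A5u   /* placeholder magic value */

        /* STM model: push LR, GP, the register range Lo..Hi, then the
           canary, according to the flag bits. */
        static void model_stm(uint64_t **sp, const uint64_t regs[64],
                              int hi, int lo, int sv_lr, int sv_gp, int sv_sk,
                              uint64_t lr, uint64_t gp)
        {
            if (sv_lr) *--(*sp) = lr;
            if (sv_gp) *--(*sp) = gp;
            for (int r = hi; r >= lo; r--)   /* only if hi >= lo */
                *--(*sp) = regs[r];
            if (sv_sk) *--(*sp) = CANARY;
        }

        /* LDM model: check the canary first and fault on mismatch, then
           restore the range (LR/GP restore omitted for brevity). */
        static void model_ldm(uint64_t **sp, uint64_t regs[64],
                              int hi, int lo, int chk_sk)
        {
            if (chk_sk && *(*sp)++ != CANARY)
                abort();                     /* fault */
            for (int r = lo; r <= hi; r++)
                regs[r] = *(*sp)++;
        }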

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register range.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are
    to be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
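
    The XOR-with-SP variant would look something like this (illustrative
    only; the address of a local stands in for reading SP, and the magic is
    a placeholder):

        #include <stdint.h>
        #include <stdlib.h>

        #define CANARY_MAGIC 0x5A5Au

        void some_function(void)
        {
            uintptr_t frame  = (uintptr_t)&frame;              /* ~SP      */
            uint64_t  stored = CANARY_MAGIC ^ (uint64_t)frame; /* prolog   */
            char buf[64];
            (void)buf;
            /* ... body ... */
            if ((stored ^ (uint64_t)frame) != CANARY_MAGIC)    /* epilog   */
                abort();
        }

    Tying the stored value to the frame address means a shared, folded
    epilogue sequence can still end up checking a per-frame value.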

    ...




    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Mar 31 10:57:35 2025
    From Newsgroup: comp.arch

    On 3/31/2025 10:17 AM, BGB wrote:
    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to
    me that this could be done automatically by the hardware (optionally,
    based on a bit in a control register).   The CALL instruction would
    store magic value, and the RET instruction would test it.  If there
    was not a match, an exception would be generated.  The value itself
    could be something like the clock value when the program was
    initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the
    elimination of several instructions in each prolog/epilog, reducing
    footprint, and perhaps even time as it might be possible to overlap
    some of the processing with the other things these instructions do.
    The downside is more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC into a link register...

    Sorry, you're right. I should have said, in the context of Mitch's My
    66000, the ENTER and EXIT instructions.


    Another option being if it could be a feature of a Load/Store Multiple.

    The nice thing about the ENTER/EXIT is that they combine the store
    multiple (ENTER) and the load multiple and return control (EXIT).
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 31 18:07:30 2025
    From Newsgroup: comp.arch

    On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
    -------------

    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a
    match, an exception would be generated.  The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is
    more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
    6b Hi (Upper bound of register to save)
    6b Lo (Lower bound of registers to save)
    1b LR (Flag to save Link Register)
    1b GP (Flag to save Global Pointer)
    1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
    Pushes LR first (if bit set);
    Pushes GP second (if bit set);
    Pushes registers in range (if Hi>=Lo);
    Pushes stack canary (if bit set).

    EXIT uses its 3rd flag when doing longjump() and THROW(), so as to pop
    the call-stack but not actually RET from the stack walker.

    LDM would check the canary first and fault if it doesn't see the
    expected value.

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Not conceptually any harder than DIV or FDIV and nobody complains
    about doing multi-cycle math.

    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues adds to the number of branches
    a program executes. Having an instruction like EXIT means that when you
    know you need to exit, you EXIT; you don't branch to the exit point.
    Saving instructions.

    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 13:56:32 2025
    From Newsgroup: comp.arch

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
    -------------

    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a
    match, an exception would be generated.  The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is
    more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or
    similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.


    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly no one has bothered adding FDPIC support in GCC or
    friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.


    LDM would check the canary first and fault if it doesn't see the
    expected value.

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Not conceptually any harder than DIV or FDIV and nobody complains
    about doing multi-cycle math.


    But... The only reason I have DIV and FDIV is that RISC-V's 'M'
    extension needed them, and there are generally not a whole lot of useful
    configurations supported by GCC that lack 'M'.

    There is FDIV, but it is painfully slow.


    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABIs put all of the callee-save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
    R0..R3: Special
    R4..R15: Scratch
    R16..R31: Argument
    R32..R63: Callee Save
    ...

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 31 20:52:14 2025
    From Newsgroup: comp.arch

    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers contiguous.

    Say:
    R0..R3: Special
    R4..R15: Scratch
    R16..R31: Argument
    R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 registers and remains that 1
    single instruction. Same for EXIT, and EXIT also performs the RET when
    LDing R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT, not part of them,
    IMHO.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 00:58:58 2025
    From Newsgroup: comp.arch

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.
    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.


    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These comes in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches tasks.
    Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing, as
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred
    until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch registers.


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 18:51:30 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    --------------------
    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 14:34:10 2025
    From Newsgroup: comp.arch

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.
    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These comes in 32-bit and 64-bit flavors.


    Typically 16-bit, most are within a 16-bit range of the Global Pointer.


    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    You also can't CoW the data/bss sections, as this is no longer a shared address space.


    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data
    sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
    MOV.Q (GBR, 0), Rt
    MOV.Q (Rt, 0), GBR
    Or, in RV terms:
    LD X6, 0(X3)
    LD X3, Disp33(X6)
    Or, RV64G:
    LD X6, 0(X3)
    LUI X5, DispHi
    ADD X5, X5, X6
    LD X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).


    Generally, this is needed if:
    Function may be called from outside of the current binary and:
    Accesses global variables;
    And/or, calls local functions.


    Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
    caller side...
    SD   X3, Disp(SP)   # spill the current global pointer
    LD   X3, 8(X18)     # load the callee's global pointer from the pair
    LD   X6, 0(X18)     # load the callee's code address from the pair
    JALR X1, 0(X6)      # call through the pair
    LD   X3, Disp(SP)   # restore the caller's global pointer

    With, generally, every function pointer existing as a pair of the actual
    function pointer and its associated global pointer.

    Though, caller side handling does arguably avoid the need to perform
    relocs for the table index.

    Though, seemingly no one wants to add FDPIC for RV64G, seeing it mostly
    as a 32-bit microcontroller thing.


    For normal PIE though, absent CoW, it is necessary to load a new copy of
    the binary each time a new process instance is created.


    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()


    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than fork+exec.

    Though, one tricky way to handle it is:
    vfork: effectively spawns a thread in the same address space as the
    caller, with a provisional PID, and semi-copied stack;
    exec: Creates a new process copying the PID and file-descriptors;
    Internally uses CreateProcess;
    Temporary thread disappears once exec is called.

    True "fork()" is more of an issue though...

    The true "fork()" semantics are not possible on single-address-space or
    NoMMU systems. Nor fully emulated in things like Cygwin IIRC.

    Though, the usual alternative is to give them "vfork()" semantics, and
    things will probably explode if they do anything other than call exec or similar.
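
    In POSIX-ish terms, the intended usage is roughly the classic vfork+exec
    spawn idiom (standard vfork()/execl(); error handling trimmed), e.g.:

      #include <sys/types.h>
      #include <unistd.h>

      /* Spawn-style use of vfork(): the child does (almost) nothing except
       * exec; doing more than that in the shared address space is unsafe. */
      pid_t spawn_child(const char *path)
      {
          pid_t pid = vfork();
          if (pid == 0) {
              execl(path, path, (char *)NULL);   /* replaces the child */
              _exit(127);                        /* exec failed        */
          }
          return pid;   /* parent resumes once the child has exec'd or exited */
      }

    The parent can then waitpid() on the returned PID as usual.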


    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".

    Not sure the thinking behind the RV ABI.


    In the BJX ABI, the layout directly grew out of the SH ABI mapping, effectively just mirroring the original SH layout 4 times for 64 registers.

    The SH layout was contiguous, at least for 16 registers, though a
    mirrored layout is no longer contiguous.

    The RV ABI is not contiguous, but at least still less chaotic than the
    x86-64 ABIs.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.


    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 registers and remains that 1 single instruction. Same for EXIT, and EXIT also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
    Non-zero branching overheads, when the feature is used;
    Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use calls/branches.

    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer registers).



    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    OK.


    It sorta made sense to treat canary values as part of the process of saving/restoring the registers, since their main purpose is to protect
    the saved registers, and particularly the saved PC.

    Granted, canary values are not a perfect strategy.
    They can provide some added resistance against buffer overflow exploits
    if the value can be made unknown to the attacker.

    This means, ideally:
    Unique to each function, and does not repeat across builds.
    But, by itself, insufficient if a single build is used.
    Is mangled in some other way to avoid repeats.
    Say, XOR'ing with SP and also ASLR'ing the SP.

    But, yeah, if the canary value is, say:
    (SP XOR Magic) with SP being ASLR'ed, it offers at least some added protection.
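
    As a rough C-level sketch of that idea (purely illustrative; in practice
    this lives in the compiler-generated prolog/epilog, and the magic would
    come from the build rather than a #define):

      #include <stdint.h>
      #include <stdlib.h>

      /* Per-build (ideally per-function) magic; here just a fixed constant. */
      #define CANARY_MAGIC 0x9E3779B97F4A7C15ull

      /* Prolog: derive the canary from the (ASLR'ed) stack pointer, so the
       * stored value differs per call site and per run. */
      static inline uint64_t make_canary(void *sp)
      {
          return (uint64_t)(uintptr_t)sp ^ CANARY_MAGIC;
      }

      /* Epilog: recompute and compare; a mismatch means the saved area
       * (including the saved return address) was likely overwritten. */
      static inline void check_canary(void *sp, uint64_t stored)
      {
          if (stored != make_canary(sp))
              abort();   /* stand-in for the TRAP / stack-check-fail path */
      }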


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 16:21:30 2025
    From Newsgroup: comp.arch

    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a
    bit-mask of all the registers makes some sense...
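
    Semantics-wise the bitmask form is easy to pin down; a behavioral sketch
    in C (not tied to any particular encoding; the register file is just
    modeled as an array here) might be:

      #include <stdint.h>

      /* Behavioral model of a bitmask-style STM: store every register whose
       * bit is set, lowest-numbered register ending up at the lowest address.
       * 'regs' stands in for the architectural register file. */
      static uint64_t *stm_bitmask(uint64_t *sp, const uint64_t regs[64],
                                   uint64_t mask)
      {
          for (int r = 63; r >= 0; r--)       /* push high registers first */
              if (mask & (1ull << r))
                  *--sp = regs[r];
          return sp;                          /* new stack pointer */
      }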



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some headaches because of the use of condition registers and branch registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at all frequently.

    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).
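
    E.g., a minimal userland-buffered puts-style helper; "sys_write" here is
    a hypothetical stand-in for whatever the raw write system call actually is:

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical raw system call: one kernel transition per call. */
      extern long sys_write(int fd, const void *buf, size_t len);

      #define OUTBUF_SIZE 512
      static char   outbuf[OUTBUF_SIZE];
      static size_t outlen;

      static void out_flush(int fd)
      {
          if (outlen) { sys_write(fd, outbuf, outlen); outlen = 0; }
      }

      /* Batches bytes so a whole string costs at most a couple of system
       * calls, instead of one per character. */
      static void out_puts(int fd, const char *s)
      {
          size_t n = strlen(s);
          if (outlen + n > OUTBUF_SIZE) out_flush(fd);
          if (n >= OUTBUF_SIZE) { sys_write(fd, s, n); return; }
          memcpy(outbuf + outlen, s, n);
          outlen += n;
      }

    With an explicit out_flush() wherever the output actually needs to appear.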





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and things
    like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people complaining
    about how old/obsolescent ADPCM is (and/or arguing that people should
    store all their sound effects as Ogg/Vorbis or similar; ...).



    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM and
    then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but
    subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
      PCM8, 16kHz     : 121  (128 kbps)
      A-Law, 16kHz    : 284  (128 kbps)
      IMA 4bit, 16kHz : 617  (64 kbps)
      IMA 2bit, 16kHz : 1692 (32 kbps, *)
      ADLQ 2bit, 16kHz: 2000 (32 kbps)
      PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
    Basically sounds muffled, speech is unintelligible.
    But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
    Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a metric,
    but if anything this is just kicking the can down the road slightly.
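
    One cheap option along those lines would be to compand both signals with
    the A-law curve first and take the RMSE in that (roughly log-loudness)
    domain; a sketch, assuming 16-bit PCM input and the standard A=87.6 curve:

      #include <math.h>
      #include <stddef.h>

      /* A-law compression curve (A = 87.6), applied to a sample normalized
       * to [-1, 1]; output is also in [-1, 1]. */
      static double alaw_compand(double x)
      {
          const double A = 87.6, denom = 1.0 + log(A);
          double ax = fabs(x), y;
          y = (ax < 1.0 / A) ? (A * ax) / denom
                             : (1.0 + log(A * ax)) / denom;
          return (x < 0) ? -y : y;
      }

      /* RMSE between reference and decoded audio, measured after companding,
       * so small errors in loud passages count less than in quiet ones. */
      static double rmse_alaw(const short *ref, const short *dec, size_t n)
      {
          double acc = 0.0;
          for (size_t i = 0; i < n; i++) {
              double d = alaw_compand(ref[i] / 32768.0)
                       - alaw_compand(dec[i] / 32768.0);
              acc += d * d;
          }
          return sqrt(acc / (double)n);
      }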

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better
    than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.
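
    Roughly, in C terms, that lookahead amounts to something like the sketch
    below (the 2-bit delta table and step adaptation are made up for
    illustration; the real IMA/ADLQ tables and rules differ, and it assumes
    at least 3 samples remain):

      /* Toy 2-bit ADPCM lookahead: try all 4^3 = 64 code combinations for
       * the next 3 samples, keep the combination with the least squared
       * error, and emit only its first code. */
      static const int delta_tab[4] = { -3, -1, 1, 3 };   /* scaled by step */

      typedef struct { int pred; int step; } AdpcmState;

      static int encode_one(AdpcmState *st, const short *next3)
      {
          long long best_err = -1;
          int best_code = 0;

          for (int c = 0; c < 64; c++) {          /* 3 codes, packed base-4 */
              AdpcmState s = *st;
              long long err = 0;
              for (int k = 0; k < 3; k++) {
                  int code = (c >> (2 * k)) & 3;
                  s.pred += delta_tab[code] * s.step;
                  if (s.pred >  32767) s.pred =  32767;
                  if (s.pred < -32768) s.pred = -32768;
                  /* crude step adaptation: outer codes grow the step */
                  s.step = (code == 0 || code == 3) ? s.step * 2 : (s.step + 1) / 2;
                  if (s.step < 1)     s.step = 1;
                  if (s.step > 16384) s.step = 16384;
                  long long d = (long long)next3[k] - s.pred;
                  err += d * d;
              }
              if (best_err < 0 || err < best_err) {
                  best_err  = err;
                  best_code = c & 3;              /* keep only the first code */
              }
          }

          /* advance the real predictor state by the chosen first code */
          st->pred += delta_tab[best_code] * st->step;
          if (st->pred >  32767) st->pred =  32767;
          if (st->pred < -32768) st->pred = -32768;
          st->step = (best_code == 0 || best_code == 3) ? st->step * 2
                                                        : (st->step + 1) / 2;
          if (st->step < 1)     st->step = 1;
          if (st->step > 16384) st->step = 16384;
          return best_code;
      }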

    Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as
    a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing perception
    and is computationally cheap.

    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 18:06:10 2025
    From Newsgroup: comp.arch

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    --------------------
    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

                              Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches tasks.
    Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred
    until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 18:19:03 2025
    From Newsgroup: comp.arch

    On 2025-04-01 5:21 p.m., BGB wrote:
    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting
    overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a bit-
    mask of all the registers makes some sense...



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as
    "Load
    LR" or "Load address and Branch", and/or have separate flags (Load
    LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches
    tasks. Since it is an ISR it pushes a subset of registers that it uses
    and restores them at exit. But when exiting and switching tasks it
    spinlocks on the task control block array. I am not sure this is a
    good thing. As the timer IRQ is fairly high priority. If something
    else locked the TCB array it would deadlock. I guess the context
    switching could be deferred until the app requests some other
    operating system function. But then the issue is what if the app gets
    stuck in an infinite loop, not calling the OS? I suppose I could make
    an OS heartbeat function call a requirement of apps. If the app does
    not do a heartbeat within a reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at all frequently.

    System calls for Q+ are slightly faster (but not much) than task
    switches. I just have the system saving state on the stack. I don't
    bother saving the FP registers or some of the other system registers
    that the OS controls. So, it is a little bit shorter than the task
    switch code.

    The only thing that can do a task switch in the system is the time-slicer.

    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and things
    like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people complaining about how old/obsolescent ADPCM is (and/or arguing that people should
    store all their sound effects as Ogg/Vorbis or similar; ...).


    I'm not one much for music, although I play the tunes occasionally. I'm
    a little hard of hearing.

    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM and
    then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
      PCM8, 16kHz     : 121 (128 kbps)
      A-Law, 16kHz    : 284 (128 kbps)
      IMA 4bit, 16kHz : 617 (64 kbps)
      IMA 2bit, 16kHz : 1692 (32 kbps, *)
      ADLQ 2bit, 16kHz: 2000 (32 kbps)
      PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
      Basically sounds muffled, speech is unintelligible.
      But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
      Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a metric,
    but if anything this is just kicking the can down the road slightly.

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.

    Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as
    a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing perception
    and is computationally cheap.

    ...




    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 23:21:24 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    ---------------------
    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 0x55900000
    into the address space of a different process.

    PC-rel addressing works in both cases--because the distance (-rel)
    remains the same,

    and the MMU can translate the code to the same physical, and map
    each area of data individually.

    Different virtual addresses, same code physical address, different
    data virtual and physical addresses.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.

    You also can't CoW the data/bss sections, as this is no longer a shared address space.

    You are trying to "get at" something here, but I can't see it (yet).


    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
    MOV.Q (GBR, 0), Rt
    MOV.Q (Rt, 0), GBR
    Or, in RV terms:
    LD X6, 0(X3)
    LD X3, Disp33(X6)
    Or, RV64G:
    LD X6, 0(X3)
    LUI X5, DispHi
    ADD X5, X5, X6
    LD X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).


    Generally, this is needed if:
    Function may be called from outside of the current binary and:
    Accesses global variables;
    And/or, calls local functions.

    I just use 32-bit or 64-bit displacement constants. Does not matter
    how control arrived at this subroutine, it accesses its data as the
    linker resolved addresses--without wasting a register.


    Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
    caller side...
    SD X3, Disp(SP)
    LD X3, 8(X18)
    LD X6, 0(X18)
    JALR X1, 0(X6)
    LD X3, Disp(SP)

    This is just::

    CALX [IP,,#GOT[funct_num]-.]

    In the 32-bit linking mode this is a 2 word instruction, in the 64-bit
    linking mode it is a 3 word instruction.
    ----------------

    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than
    fork+exec.

    You are 40 years late on that.

    ---------------

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".

    Not sure the thinking behind the RV ABI.

    If RISC-V removed its 16-bit instructions, there is room in its ISA
    to put my entire ISA along with all the non-compressed RISC-V instructions.

    ---------------

    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.

    So, you lose 6 cycles on just under ½ of all subroutine calls,
    while also executing 2-5 instructions manipulating your global
    pointer.


    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
    Non-zero branching overheads, when the feature is used;
    Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use calls/branches.

    My solution gets rid of the dilemma:
    a) the call code is always smaller
    b) the call code never takes more cycles

    In addition, there is a straightforward way to elide the STs of ENTER
    when the memory unit is still executing the previous EXIT.

    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer
    registers).
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 23:24:29 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 20:07:41 2025
    From Newsgroup: comp.arch

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be
    quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.
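
    (As a rough sanity check on the earlier estimate: ~100 stores plus ~100
    loads is about 200 memory accesses, times 13 clocks is about 2600 cycles,
    which lands near the ~3000 figure once the moves and other overhead are
    added in.)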

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 2 01:47:26 2025
    From Newsgroup: comp.arch

    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

    One of the reasons I went with treating the register file and thread-
    state as a write-back cache is that HW can read-up the inbound register
    values before starting to write out the outbound values (rather than
    the other way of having to do the STs first so the LDs have a place
    to land.)

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Apr 1 22:55:56 2025
    From Newsgroup: comp.arch

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 0x55900000
    into the address space of a different process.

    But then if thread A (whose state is stored at 0x35900000) sends to
    thread B (whose state is at 0x55900000) a closure whose code points
    somewhere inside 0x24680000, it will end up using the state of thread
    A instead of the state of the current thread.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 23:04:52 2025
    From Newsgroup: comp.arch

    On 4/1/2025 5:19 PM, Robert Finch wrote:
    On 2025-04-01 5:21 p.m., BGB wrote:
    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting
    overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a
    bit- mask of all the registers makes some sense...



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as
    "Load
    LR" or "Load address and Branch", and/or have separate flags (Load
    LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new
    process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they >>>>> could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want
    to add load and store multiple on top of that. They work great for
    ISRs, but not so great for task switching code. I have the
    instructions pushing or popping up to 17 registers in a group. Groups
    of registers overlap by eight. The instructions can handle all 96
    registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But
    I think I have run into an issue. It is the timer ISR that switches
    tasks. Since it is an ISR it pushes a subset of registers that it
    uses and restores them at exit. But when exiting and switching tasks
    it spinlocks on the task control block array. I am not sure this is a
    good thing. As the timer IRQ is fairly high priority. If something
    else locked the TCB array it would deadlock. I guess the context
    switching could be deferred until the app requests some other
    operating system function. But then the issue is what if the app gets
    stuck in an infinite loop, not calling the OS? I suppose I could make
    an OS heartbeat function call a requirement of apps. If the app does
    not do a heartbeat within a reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at
    all frequently.

    System calls for Q+ are slightly faster (but not much) than task
    switches. I just have the system saving state on the stack. I don't
    bother saving the FP registers or some of the other system registers
    that the OS controls. So, it is a little bit shorter than the task
    switch code.

    The only thing that can do a task switch in the system is the time-slicer.


    In my case, task switch happens by capturing and restoring all of the registers (of which there are 64 main registers, and a few CR's).

    No separate FPU or vector registers (the BJX GPR space, and RISC-V X+F
    spaces, being mostly equivalent).

    The interrupt handlers only have access to physical addresses, and will
    block all other interrupts when running, so there is a need to get
    quickly from the user-program task to the syscall handler task, and then
    back again once done (though, maybe not immediately, as it may instead
    send the results back to the caller task, and then transfer control to a different task).

    Timer interrupt can do scheduling, but mostly avoids doing so unless
    there is no other option (TestKern being mostly lacking in mutexes,
    which makes timer-driven preemptive multitasking a bit risky). However, usually programs will use system calls often enough that it is possible
    to schedule tasks this way, and generally a system call will not be made inside of a critical section.


    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and
    things like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people
    complaining about how old/obsolescent ADPCM is (and/or arguing that
    people should store all their sound effects as Ogg/Vorbis or
    similar; ...).


    I'm not one much for music, although I play the tunes occasionally. I'm a little hard of hearing.

    Not so much for music here, but more for storing sound-effects.


    I can note I seem to have a form of reverse-slope hearing impairment...

    Not a new thing, I have either always been this way, or it has happened
    very slowly.


    I can seemingly hear most stuff OK though.
    Except, IRL, I can't hear tuning forks.
    Nor car engines.
    I don't hear the engines.
    I do hear the tires rolling on the ground.
    Nor refrigerators (mostly).
    I sometimes hear the relays when they start/stop,
    or a crackling sound from the radiator coil.
    Using phones sucks hard, can't hear crap...
    Not terribly musically inclined.
    But, instruments don't sound much different from "noise" sounds.

    My ability to hear low-frequencies is a bit weird:
    Square or triangle waves, I hear these well;
    Sine waves, weakly, but I hear them in headphones if volume is high.
    If the volume isn't very high, sine waves become silent.
    Seemingly, these are harder to hear IRL.


    I seem most sensitive to frequencies between around 2 to 8 kHz. Upper
    end of hearing seems to be around 17 kHz (lower end around 1kHz for pure
    sine waves). The lower "absolute" limit seems to be around 8 Hz (but
    more because at this point, a square wave turns from a "tone" into a
    series of discrete pops, 8-20 Hz being sort of a meta-range between
    being tonal and discrete pops).

    Have noted that in YouTube videos where someone is messing with a CRT
    TV, I can still sometimes hear the squeal, particularly if the camera is
    close to the TV. Not seen a CRT IRL in a while though; no obvious sound
    from a VGA CRT monitor though (but, then again, I am using it ATM on an
    old rack server, which sounds kinda like a vacuum cleaner, so might be
    masking it if it is making a noise).


    Have noted that I still understand speech fine with a 2-8 kHz bandpass
    (with steep fall-off). I don't understand speech at all with a 2kHz
    low-pass. So, whichever parts I use for intelligibility seem to be
    between 2 and 8kHz. Had noted if I split it into 2-4 or 4-8 kHz bands,
    either works, though individually each has a notably worse quality than combined 2-8 kHz.

    The 1-2 kHz range can be heard, but doesn't seem to contain much as far
    as intelligibility goes, but its presence or absence does seem to alter
    vowel sounds slightly.

    A 1-8 kHz bandpass sounds mostly natural to me. Though, cats seem to
    respond unfavorably to band-passed audio (if cats are neutral to the
    original, but tense up and dig in their claws if I play band-passed
    audio, it seems they hear a difference).


    Although, I was using music as test-cases mostly as they can give a
    better idea of the relative audio quality than a short sound effect.

    But, for things that are going to be embedded into an EXE or DLL,
    generally these are ideally kept at a few kB or less.


    For long form audio, there is more reason to care about audio quality,
    but for something like a notification ding, not as much. Do preferably
    want it to not sound like "broken crap" though. And, if any speech is
    present, ideally it needs to be intelligible.


    In terms of being small and "not sounding like crap":
    ADPCM:
    Works well enough, but can't go below 2 bits per sample.
    Delta-Sigma:
    1 bit per sample, but sounds horrid much under 64 kHz.


    MP3 and Vorbis work well at 96 to 128 kbps, but:
    Are complex and expensive formats to decode;
    Don't give acceptable results much below around 40 kbps.

    At lower bitrates, the artifacts from MP3 and Vorbis can become rather obnoxious (lots of squealing and whistling and sounds like broken glass
    being shaken in a steel can).

    I actually much prefer the sound of ADPCM for low bitrates. Muffled and
    gritty is still preferable to "rattling a steel can full of broken
    glass" (simple loss of quality rather than the addition of other more obnoxious artifacts).



    From what I gather, the telephone network used 8kHz as a standard
    sampling rate, with one of several formats:
    u-Law, in the US
    A-Law, in Europe
    4-bit ADPCM, for lower-priority long-distance links;
    When not using u-Law or A-Law.
    2-bit ADPCM, for "overflow" links (*).

    *: Apparently, if there were too many long distance calls over a given long-distance link, they would drop to a 2-bit ADPCM (running at 16 kbps).

    I was testing with 16kHz 2-bit ADPCM, as while both 16kHz 2-bit and 8kHz
    4-bit ADPCM are both 32 kbps, the 16kHz sounds better to me (and intelligibility is higher).


    Though, if spoken language is not used, it makes sense to drop to 8kHz.
    Using 8kHz as standard is weak as intelligibility is a lot worse.

    But, I guess the thinking was "minimum where you can still 'mostly' hear
    what they are saying...".


    Even if 8kHz was standard on the telephone network, I can't easily
    understand what anyone is saying over the phone (speech is often very
    muffled and there is often a loud/obnoxious hiss).

    Actually, weirdly, actual phone quality is somehow *worse* than my
    experiments with low bitrate ADPCM. Like, the low-bit depth ADPCM mostly
    just sounds "gritty" (without any obvious hiss). Like, the phone adds
    extra levels of badness beyond just any compression issues (probably
    also crappy microphones and speakers, etc, as well).

    Using headphones with a phone is "slightly" better, but there is often
    still a rather loud/annoying hiss, even when the sound is coming from an artificial source.

    Poor quality ADPCM, by itself, does not have this particular issue
    (actually, it almost seems as if the ADPCM somehow "enhances" the audio
    and compensates slightly for the low sample rate, making details easier
    to hear compared with "cleaner" PCM audio versions).


    For sound-effects, could drop to 4kHz, but there is fairly significant distortion. Like, if you have a notification ding, it doesn't really
    sound like a bell anymore.


    So, say (ADPCM modes):
    16kHz 4-bit: Mostly Good, but needs 64 kbps.
    16kHz 2-bit: Slightly muffled, gritty, 32 kbps.
    8kHz 4-bit: More obvious muffling (but not gritty);
    8kHz 2-bit: Muffled and gritty (16 kbps);
    4kHz 4-bit: Serious muffle / distortion, 16 kbps.
    4kHz 2-bit: Muffle + distortion + grit, 8 kbps.

    Possible merit of 4kHz 2-bit is that it allows putting a bell sound
    effect in around 500 bytes. Downside is that it is no longer
    particularly recognizable as a bell (and goes more from "ding" to "plong").

    At 4 kHz, speech is basically almost entirely unintelligible, but one
    can still hear that speech is present (its "shape" can still be heard,
    but words are no longer recognizable; sort of like the muffling when people are talking in a different room, where one can still hear that they are saying "something").

    At 2 kHz; it is barely recognizable as being speech (it sounds almost
    more like wind). Percussive sounds are still recognizable though (so,
    music is turned into "howling wind with drums").


    Early 90s games (such as Doom) mostly used 11 kHz as standard.
    IMHO, 16kHz is a better quality/space tradeoff.
    Where, 22/32/44 can sound better, but may not be worth the overhead.
    Sample rates above 44 kHz are overkill though.

    I, personally, can't hear the difference between 44 and 48 kHz audio.
    I suspect anything 48kHz and beyond is likely needless overkill.



    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM
    and then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but
    subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
       PCM8, 16kHz     : 121 (128 kbps)
       A-Law, 16kHz    : 284 (128 kbps)
       IMA 4bit, 16kHz : 617 (64 kbps)
       IMA 2bit, 16kHz : 1692 (32 kbps, *)
       ADLQ 2bit, 16kHz: 2000 (32 kbps)
       PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
       Basically sounds muffled, speech is unintelligible.
       But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
       Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a
    metric, but if anything this is just kicking the can down the road
    slightly.
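
    Something like the following, as a rough sketch of that idea (the
    companding curve here is the continuous u-Law formula standing in for
    a real A-Law/u-Law encoder; the function names are made up):

       #include <math.h>
       #include <stdint.h>
       #include <stddef.h>

       static double compand(int16_t s)
       {
           double x = s / 32768.0;                    /* normalize to -1..1 */
           double y = log(1.0 + 255.0 * fabs(x)) / log(256.0);
           return (s < 0) ? -y : y;
       }

       /* RMSE measured on companded samples rather than raw PCM. */
       double companded_rmse(const int16_t *ref, const int16_t *dec, size_t n)
       {
           double acc = 0.0;
           for (size_t i = 0; i < n; i++) {
               double d = compand(ref[i]) - compand(dec[i]);
               acc += d * d;
           }
           return sqrt(acc / (double)n);
       }

    This at least weights small-signal errors more like hearing does, but,
    as noted, it is still only kicking the can down the road.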

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds
    better than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.
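
    Roughly, the encoder-side search looks like the sketch below;
    decode_step() here is a made-up placeholder 2-bit quantizer (not ADLQ
    or the IMA variant), just to show the shape of the search:

       typedef struct { int predictor, step; } adpcm_state;

       /* placeholder 2-bit decode step: bit 0 = magnitude, bit 1 = sign */
       static int decode_step(adpcm_state *st, int code)
       {
           int diff = (code & 1) ? st->step : (st->step >> 1);
           st->predictor += (code & 2) ? -diff : diff;
           st->step = (code & 1) ? (st->step * 2) : ((st->step * 3) >> 2);
           if (st->step < 1) st->step = 1;
           if (st->step > 16384) st->step = 16384;
           if (st->predictor >  32767) st->predictor =  32767;
           if (st->predictor < -32768) st->predictor = -32768;
           return st->predictor;
       }

       /* Try all 4^3 = 64 combinations of the next three 2-bit codes
          against a copy of the decoder state, keep the best combination. */
       static void encode_next3(adpcm_state *st, const short *target, int out_codes[3])
       {
           long best_err = -1;
           int  best = 0;
           for (int c = 0; c < 64; c++) {
               adpcm_state trial = *st;
               long err = 0;
               for (int i = 0; i < 3; i++) {
                   long d = decode_step(&trial, (c >> (2 * i)) & 3) - target[i];
                   err += d * d;
               }
               if (best_err < 0 || err < best_err) { best_err = err; best = c; }
           }
           for (int i = 0; i < 3; i++) {
               out_codes[i] = (best >> (2 * i)) & 3;
               decode_step(st, out_codes[i]);   /* advance the real state */
           }
       }

    A variant would commit only the first code and slide the window along
    by one sample, at 3x the search cost.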

    Though, which is "better", or whether or not even lower RMSE
    "improves" quality here, is debatable (the PCM8 numbers clearly throw
    using RMSE as a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing
    perception and is computationally cheap.

    ...





    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 23:19:11 2025
    From Newsgroup: comp.arch

    On 4/1/2025 9:55 PM, Stefan Monnier wrote:
    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 55900000
    into the address space of a different process.

    But then if thread A (whose state is stored at 0x35900000) sends to
    thread B (whose state is at 55900000) a closure whose code points
    somewhere inside 0x24680000, it will end up using the state of thread
    A instead of the state of the current thread.


    Generally, threads and processes are seen as different...

    But, yeah, passing lambdas between processes is theoretically possible
    in this scheme, but not advised.

    If done, any pointers captured by the lambda would likely point into the originating process, but if called with a GBR from the new process, any
    global variables would either be mapped to the corresponding DLL index
    in the new process, or to NULL (if from a DLL that was not loaded in
    the new process), or possibly to a random address if it was from the
    main EXE and the EXEs differ...


    But, yeah, inter-process function pointers aren't really a thing, and
    should not be a thing.

    The eventual plan is to disallow them in the memory protection scheme,
    but enforcing the ACL-based memory protection is still on the TODO
    list (it was only very recently that stuff actually started running in
    a proper usermode and so can't just stomp all over the kernel's memory...).

    But... Yeah, the kernel and program are still hanging out in the same
    VAS, along with every other running program...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 2 00:43:39 2025
    From Newsgroup: comp.arch

    On 4/1/2025 6:21 PM, MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    ---------------------
    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 55900000
    into the address space of a different process.

    PC-rel addressing works in both cases--because the distance (-rel)
    remains the same,

    and the MMU can translate the code to the same physical, and map
    each area of data individually.

    Different virtual addresses, same code physical address, different
    data virtual and physical addresses.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.


    OK.

    PE/COFF had defined Abs64 relocs, but I am using a 48-bit VAS.

    Would not have made sense to define separate Abs48 relocs, but much of
    the time, we can just assume the HOBs are zero.

    Well, except for function pointers, where the base-reloc handling
    detects pointers into ".text" and does some special secret-sauce magic regarding the HOBs to make sure they are correctly tagged.


    Binaries are not generally fully PIE though, but are instead
    base-relocated (more like EXE/DLL handling in Windows). Though, most
    things within the core proper are either PC-rel or GBR rel, and there
    are usually a relatively small number of base-relocations.

    Things like DLL calls are essentially absolute addressed though. Where, mapping instances at different virtual addresses would be messy for
    things like DLL handling (in the absence of a GOT or similar).


    You also can't CoW the data/bss sections, as this is no longer a shared
    address space.

    You are trying to "get at" something here, but I can't see it (yet).


    Shared address space assumes all processes have the same page tables and shared address mappings and TLB contents (though, ACL checking can be different, as the ACL/KRR stuff is not based on having separate contents
    in the page tables or TLB, *).

    By definition, CoW can't be used in this constraint.

    But, multiple VAS's adds new problems (both hassles and potential
    performance effects, so better here to delay this if possible).


    *: A smaller 4-entry full-assoc cache is used for ACL checks, so it is
    more of a "what access does the current task have to this particular
    ACL" check. But, admittedly, some of this part is still TODO regarding
    making use of it in the OS.



    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data
    sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
       MOV.Q (GBR, 0), Rt
       MOV.Q (Rt, 0), GBR
    Or, in RV terms:
       LD    X6, 0(X3)
       LD    X3, Disp33(X6)
    Or, RV64G:
       LD    X6, 0(X3)
       LUI   X5, DispHi
       ADD   X5, X5, X6
       LD    X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every
    loaded binary (can be assumed read-only from userland).
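
    In C terms, the reload ritual amounts to something like this (names
    hypothetical, just restating the (GBR+0) table layout above):

       /* The first slot of a module's data section holds the process-wide
          table of data-section pointers, indexed by a per-module number
          fixed at load time (illustrative sketch only). */
       typedef void *module_data_t;

       static inline module_data_t reload_gbr(module_data_t cur_gbr, int my_index)
       {
           module_data_t *table = *(module_data_t **)cur_gbr;  /* (GBR+0) */
           return table[my_index];   /* EXE assumes index 0; a DLL's index
                                        is fixed up by a base-reloc at load */
       }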


    Generally, this is needed if:
       Function may be called from outside of the current binary and:
         Accesses global variables;
         And/or, calls local functions.

    I just use 32-bit or 64-bit displacement constants. Does not matter
    how control arrived at this subroutine, it accesses its data as the
    linker resolved addresses--without wasting a register.


    GBR or GP is specially designated as a global pointer though.
    Not so starved for registers that it would make sense to reclaim it as a
    GPR.

    But, yeah, do need to care how control can arrive at a given function.




    Though, still generally lower average-case overhead than the strategy
    typically used by FDPIC, which would handle this reload process on the
    caller side...
       SD    X3, Disp(SP)
       LD    X3, 8(X18)
       LD    X6, 0(X18)
       JALR  X1, 0(X6)
       LD    X3, Disp(SP)

    This is just::

        CALX    [IP,,#GOT[funct_num]-.]

    In the 32-bit linking mode this is a 2 word instruction, in the 64-bit linking mode it is a 3 word instruction.
    ----------------

    OK.

    Neither BJX nor RISC-V has special instructions to deal with FDPIC call semantics.



    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than
    fork+exec.

    You are 40 years late on that.


    I am just doing it the Windows (or Cygwin) way...

    Most POSIX-style programs still work, but with a slightly higher risk that "stuff may catastrophically explode" (say, if one tries to use "fork()"
    to fork off copies of the parent process, and then returns from the call-frame that called "fork()").

    Fork could be made to clone the global variables, though avoiding
    tangled addresses could be an issue (could maybe be done by relying on debuginfo or similar, to walk the globals and then redirect any pointers
    from the old data/bss into the new one; kinda SOL for anything on the
    heap though).

    Better may just be to be like "yeah, fork() doesn't really work, don't
    use it...".
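
    FWIW, on targets where fork() is a problem, the usual portable escape
    hatch is a spawn-style call; a plain-POSIX sketch (nothing specific to
    the OS being discussed here):

       #include <spawn.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>

       extern char **environ;

       /* Launch a child without fork(): posix_spawn() does the
          "CreateProcess-like" create-and-exec in one step. */
       int run_child(char *const argv[])
       {
           pid_t pid;
           int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
           if (rc != 0) {
               fprintf(stderr, "posix_spawn failed: %d\n", rc);
               return -1;
           }
           int status;
           waitpid(pid, &status, 0);   /* wait for it, like a simple system() */
           return status;
       }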



    ---------------

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit
    register numbers".

    Not sure the thinking behind the RV ABI.

    If RISC-V removed its 16-bit instructions, there is room in its ISA
    to put my entire ISA along with all the non-compressed RISC-V inst-
    ructions.


    Yeah, errm, how do you think XG3 came about?...

    I just sort of dropped the C instructions and shoved nearly the entirety
    of XG2 into that space.

    There would still have been half the encoding space left, if predication
    were disallowed.

    But, say, RV64G + XG3 (sans predication) + 2/3 of the 'C' extension,
    would be a bit picky...


    Granted, did need to shuffle the bits for the ISAs to be
    encoding-compatible; and went a little further than the bare minimum to
    avoid dog chew (gluing them together with entirely mismatched encodings
    and disjoint register numbering would have been possible; but I wanted
    at least some semblance of encoding consistency between them).



    ---------------

    Prolog needs a call, but epilog can just be a branch, since no need to return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.

    So, you lose 6 cycles on just under ½ of all subroutine calls,
    while also executing 2-5 instructions manipulating your global
    pointer.


    Possibly, but I don't think it is quite that bad on average...

    Would need to run some stats and do some math to try to figure out the percentages and relative impact from each of these.


    But, even with all this, and using stack canaries (which add around 6 or
    so instructions when applicable), it is still outperforming GCC's RV64G
    output (along with smaller binaries).



    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT, and EXIT also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
       Non-zero branching overheads, when the feature is used;
       Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use
    calls/branches.

    My solution gets rid of the dilemma:
    a) the call code is always smaller
    b) the call code never takes more cycles

    In addition, there is a straightforward way to elide the STs of ENTER
    when the memory unit is still executing the previous EXIT.


    OK.
    I was trying to keep the CPU implementation from being too complicated.


    In my case though, there is an advantage over plain RV64G:
    I have a Load/Store Pair, so need fewer Load/Store operations.

    Though, my RV+Jx experiment does also have this...



    Though, there were also variants defined for RV32 but not for RV64 (because apparently there was indecision about encodings, and some arguments from
    the "opcode fusion" camp that 64-bit RV processors could fuse groups of
    LD or SD instructions...).

    Decided to leave out complaining about "opcode fusion" distractions (from actually addressing ISA issues) and the seeming over-reliance on SpecInt and CoreMark to drive ISA design choices...


    Granted, one might say the same about Doom, but at least I am treating
    Doom more as a representation of a workload, and not the end-goal
    arbiter of what is added or dropped.



    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer
    registers).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Apr 3 10:09:20 2025
    From Newsgroup: comp.arch

    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the
    whole point of sharing an address space is to be able to exchange data.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 3 12:22:57 2025
    From Newsgroup: comp.arch

    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.


    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is not allowed.


    Though, things may change later, there is a plan to move to separate global/local address ranges. Likely things like code will remain in the
    shared range, and program data will be in the local range.


    Stefan

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 3 23:49:31 2025
    From Newsgroup: comp.arch

    On 2025-04-03 1:22 p.m., BGB wrote:
    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and
    should
    not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the
    whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.
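
    Roughly, per the RISC-V privileged spec (mstatus.MPRV is bit 17, and
    mstatus.MPP selects whose translation/permissions apply), a small and
    untested sketch of M-mode code using it:

       #include <stdint.h>

       #define MSTATUS_MPRV (1UL << 17)

       /* Load through the previous mode's address space; assumes M-mode,
          MPP already holding the target mode, and interrupts disabled
          around the sequence. */
       static inline uint64_t load_via_prev_mode(const uint64_t *vaddr)
       {
           uint64_t val;
           __asm__ volatile (
               "csrs mstatus, %[mprv]\n\t"   /* memory ops now use MPP's translation */
               "ld   %[val], 0(%[ptr])\n\t"
               "csrc mstatus, %[mprv]"       /* back to normal M-mode accesses */
               : [val] "=&r" (val)
               : [ptr] "r" (vaddr), [mprv] "r" (MSTATUS_MPRV)
               : "memory");
           return val;
       }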


    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is not allowed.


    Though, things may change later, there is a plan to more to separate global/local address ranges. Likely things like code will remain in the shared range, and program data will be in the local range.

    Thinking of having a CPU local address space in Q+ to store vars for
    that particular CPU. It looks like only a small RAM is required. I guess
    it would be hardware thread local storage. May place the RAM in the CPU itself.


             Stefan


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 4 12:41:39 2025
    From Newsgroup: comp.arch

    On 4/3/2025 10:49 PM, Robert Finch wrote:
    On 2025-04-03 1:22 p.m., BGB wrote:
    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing,
    and should
    not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they may often be implicit, e.g. within the method table of objects), and the whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having
    context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    I am not aware of this one. If it is in the privileged spec or similar,
    may have missed it.

    Thus far, my core doesn't implement that much of the RV privileged spec, mostly just the userland ISA. If I wanted to run an RV OS, it is
    debatable if it would make more sense to try to mimic a hardware
    interface it understands, or have the "firmware" manage the real HW interfaces, and then fake the rest in software.




    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is
    not allowed.


    Though, things may change later, there is a plan to more to separate
    global/local address ranges. Likely things like code will remain in
    the shared range, and program data will be in the local range.

    Thinking of having a CPU local address space in Q+ to store vars for
    that particular CPU. It looks like only a small RAM is required. I guess
    it would be hardware thread local storage. May place the RAM in the CPU itself.


    I am aware of at least a few CPUs that have banked register sets that
    may be backed to memory addresses (with the CPU itself evicting
    registers on context switch).

    I have not done so.

    I had considered the possibility of 4 rings each with their own set of registers. This could make things like interrupts and system calls
    cheaper, but (ironically) using this would make context switching more expensive.

    An intermediate option could be a special RAM area for a "task cache",
    say, 8 or 16K, and then have a "Task Cache Miss" interrupt for cases
    where one tries to switch to a task's register bank that isn't in the
    cache. While this would have a high cost (for a task cache miss), if
    the cache is bigger than the number of currently running tasks, it could
    still work out ahead.

    But, better for performance would be if the task-cache were RAM backed
    and the HW spills and reloads from RAM (then, one could have maybe 64K
    or 256K or more for task register banks; probably enough for a decent
    number of active PIDs).



    Though naive, "always save and restore all the registers to RAM" seems
    to have a fairly reasonable cost (and is among the lowest in "actual
    task-switch cost", aside from the possibility of "let hardware lazily
    spill and reload register banks from main RAM", which could potentially
    be lower).


    The main "bad" cost of switching between processes is the storm of
    TLB misses that would happen if not using a shared address space
    (granted, there are "global pages"). In my design, there are not true
    global pages, rather pages that are "global" within an ASID group (if
    the low 10 bits of the page's ASID are 0, it is assumed global within
    this group, whereas non-zero values are ASID specific; and the high 6
    bits of the ASID gives the group, where pages are not global between
    groups).

    Though, my existing OS, still being single address space, doesn't make
    use of this. The idea is that ASID will be tied to PID.

    As it is unclear how best to scale this past 1024 PIDs, likely the
    ASID will be the PID modulo 1024, and reassigning a previously
    assigned ASID to a new PID would require a TLB flush.
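
    As a sketch of that policy in C (hypothetical names, not actual kernel
    code): high 6 ASID bits are the group, low 10 bits are a per-process
    slot with 0 reserved as "global within the group", and reusing a slot
    for a different PID forces a flush of that ASID.

       #include <stdint.h>

       #define ASID_SLOTS   1023   /* usable low-10-bit values, 1..1023 */

       static int asid_owner[ASID_SLOTS + 1];      /* PID currently holding each slot */

       extern void tlb_flush_asid(uint16_t asid);  /* hypothetical TLB maintenance hook */

       uint16_t asid_for_pid(int pid, int group)
       {
           int slot = 1 + (pid % ASID_SLOTS);      /* 1..1023; 0 stays "global" */
           uint16_t asid = (uint16_t)(((group & 0x3F) << 10) | slot);

           if (asid_owner[slot] != 0 && asid_owner[slot] != pid)
               tlb_flush_asid(asid);               /* slot reused by a new PID */

           asid_owner[slot] = pid;
           return asid;
       }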



             Stefan



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 4 21:07:09 2025
    From Newsgroup: comp.arch

    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote: -------------------------
    Why is it not 13 cycles to get started and then each register is one
    cycle.

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

    By placing all the CRs together, and treating thread-state as a write-
    back cache, all the storing and loading happens without any
    serialization,
    in cache line quanta, where the LD can begin before the STs
    begin--giving
    the overlap that reduces the cycle count.

    For example, once a core has decided to run "this-thread" all it has to
    do is to execute a single HR instruction which writes a pointer to
    thread-
    state. Then upon SVR, that thread begins running. Between HE and SVR, HW
    can preload the inbound data, and push out the outbound data after the
    inbound data has arrived.

    But, also note: Due to the way CR's are mapped into MMI/O memory, one
    core can write that same HR available CR on another core and cause a
    remote context switch of that other core.

    The main use is more likely to be remote diagnostics of a core that
    has quit responding to the system (crashed hard) so its CRs can be
    read out and examined to see why it quit responding.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 4 21:13:27 2025
    From Newsgroup: comp.arch

    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context
    switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Apr 4 23:45:51 2025
    From Newsgroup: comp.arch

    On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context
    switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    So, there is a need to be able to go back two or three levels? I suppose
    it could also be done by manipulating the stack, although adding an
    extra bit may be easier. How often does it happen?

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 16:37:19 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:

    On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GuestOS has a 64-bit VAS
    HyperVisor has a 64-bit VAS
    and so does
    Secure, which has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    And this has nothing to do with system calls; it has to do with
    accessing (rather simultaneously) any of the 4 VASs.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    So, there is a need to be able to go back two or three levels? I suppose
    it could also be done by manipulating the stack, although adding an
    extra bit may be easier. How often does it happen?

    I have no idea, and I suspect GuestOS people don't either.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Apr 5 18:31:44 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.
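
    For example, a minimal sketch of using the unprivileged load form from
    privileged C code (AArch64, register allocation left to the compiler;
    purely illustrative):

       #include <stdint.h>

       /* Load a 64-bit value through the EL0 (user) translation regime,
          i.e. with unprivileged permission checking.  With PSTATE.UAO set,
          the same instruction instead uses the current (privileged)
          level's permissions. */
       static inline uint64_t load_as_user(const uint64_t *user_va)
       {
           uint64_t val;
           __asm__ volatile ("ldtr %0, [%1]"
                             : "=r" (val)
                             : "r" (user_va)
                             : "memory");
           return val;
       }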

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Apr 5 17:57:50 2025
    From Newsgroup: comp.arch

    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app? It's why I assumed it
    found the mode from the stack. Those two select bits have to be set
    somehow. It seems like extra code to access the right address space.
    I got the thought to use the three bits a bit differently.
    111 = use current mode
    110 = use mode from stack
    100 = debug? mode
    011 = secure (machine) mode
    010 = hypervisor mode
    001 = supervisor mode
    000 = user/app mode
    I was just using inline code to select the proper address space. But if
    it is necessary to dig around to figure out the mode, it may turn into a subroutine call.



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 23:06:38 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables.

    When Secure Monitor executes a "user" instruction, which layer
    of the SW stack is accessed:: {HV, SV, User} ??

    Is this 1-layer down the stack, or all layers down the stack ??

    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    I/O MMU has similar issues to solve, in that a device can read
    write/execute-only memory and write read/execute-only memory.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    I call these "paranoid" applications--generally requiring no
    privilege, but they don't want GuestOS or HyperVisor to look
    at their data and at the same time, they want GuestOS or HV
    to perform I/O to said data--so some devices have an effective
    privilege above that of the driver commanding them.

    I understand the reasons and rationale.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 23:11:00 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 21:57:50 +0000, Robert Finch wrote:

    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.
    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    More interesting is the concept that there are multiple HVs that
    have been virtualized--in this case the sender of the address may
    think it has HV privilege but is currently operating as if it only
    has GuestOS privilege. ...

    It's why I assumed it found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.
    I got the thought to use the three bits a bit differently.
    111 = use current mode
    110 = use mode from stack
    100 = debug? mode
    011 = secure (machine) mode
    010 = hypervisor mode
    001 = supervisor mode
    000 = user/app mode
    I was just using inline code to select the proper address space. But if
    it is necessary to dig around to figure the mode, it may turn into a subroutine call.

    All the machines I have used/designed/programmed in the past use 000
    as highest privilege and 111 as lowest.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Apr 6 14:21:26 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    It's why I assumed it
    found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.

    I haven't spent much time with RISC-V, but surely the processor
    has a state register that stores the current mode, and which
    must be preserved over exceptions/upcalls, which would require
    that they be recorded in an exception syndrome register for
    restoration when the upcall returns.
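
    In RISC-V terms the prior privilege does get recorded on a trap:
    mstatus.MPP (bits 12:11) for traps into M-mode, sstatus.SPP for traps
    into S-mode. E.g., a small sketch of checking it (assuming M-mode):

       #include <stdint.h>

       /* Returns nonzero if the trap came from U-mode (MPP == 0b00). */
       static inline int trap_came_from_user(void)
       {
           uint64_t ms;
           __asm__ volatile ("csrr %0, mstatus" : "=r" (ms));
           return ((ms >> 11) & 3) == 0;
       }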

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Apr 6 14:32:43 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:


    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.

    When Secure Monitor executes a "user" instructions which layer
    of the SW stack is accessed:: {HV, SV, User} ?

    The Secure Monitor will never execute a user instruction. If
    it does, it will act as any other load/store executed by the
    secure monitor.

    The "user" instructions are only used by a bare-metal OS
    or a guest OS to access user application address spaces.


    Is this 1-layer down the stack, or all layers down the stack ??

    One layer down, and only the least privileged non-user level.


    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege level.

    [*] A primary goal must be to avoid privilege level
    upcalls as much as possible.



    I/O MMU has similar issues to solve in that a device can Read
    write-execute only memory and write read-execute only memory.

    By the time the IOMMU translates the inbound address, it is
    a physical machine address, so I don't see any issue here.
    And in the ARM case, the IOMMU translation tables are identical
    to the processor translation tables in format and can actually
    share some or all of the tables between the core(s) and the IOMMU.

    Note that for various reasons, the IOMMU translation tables
    may cover only a portion of the target address space at any particular privilege level.


    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    I call these "paranoid" applications--generally requiring no
    privilege, but they don't want GuestOS of HyperVisor to look
    at their data and at the same time, they want GuestOS or HV
    to perform I/O to said data--so some devices have a effective
    privilege above that of the driver commanding them.

    I understand the reasons and rational.

    The primary reason is for encrypted video decoding where
    the decoded video is fed directly to the graphics processor
    and the end-user cannot intercept the decrypted video stream. Closing
    the barn door after the horse has left, but c'est la vie.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 6 15:01:31 2025
    From Newsgroup: comp.arch

    On 2025-04-06 10:21 a.m., Scott Lurndal wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left
    wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Yes, Q+ works that way, I think RISCV does as well. Q+ stacks the PC and
    SR on an internal stack which is basically a shift register. The TOS is visible as a CR. The mode state is saved in the SR. Interrupts and
    exceptions do not have to store the state in memory. The far end of the
    stack is hard coded to do a reset if the stack underflows.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    Allows two-directional virtualization, I think. Q+ has all exceptions and interrupts going to the secure monitor, which can then delegate them back
    to a lower level.

    It's why I assumed it
    found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.'

    I haven't spent much time with RISC-V, but surely the processor
    has a state register that stores the current mode, and which
    must be preserved over exceptions/upcalls, which would require
    that they be recorded in an exception syndrome register for
    restoration when the upcall returns.


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 7 00:51:08 2025
    From Newsgroup: comp.arch

    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Apr 7 14:04:37 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-06 10:21 a.m., Scott Lurndal wrote:

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    Allows two directional virtualization I think. Q+ has all exceptions and interrupts going to the secure monitor, which can then delegate it back
    to a lower level.

    If that adds latency to the interrupt handler, that will not
    be a positive benefit.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Apr 7 14:09:50 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.
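
    To make that last point concrete: about the only architectural hint an
    x86 guest gets is CPUID's "hypervisor present" bit (leaf 1, ECX bit 31),
    and a distribution can boot the same kernel either way and merely peek
    at it for minor tuning. A minimal sketch (GCC/Clang on x86-64; the
    helper name and messages are illustrative):

        /* Check CPUID leaf 1, ECX bit 31: set when a hypervisor announces
         * itself, clear on bare metal (or when the HV hides itself). */
        #include <cpuid.h>
        #include <stdio.h>

        static int hypervisor_present(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return 0;               /* CPUID leaf 1 not available */
            return (ecx >> 31) & 1;
        }

        int main(void)
        {
            printf("hypervisor present: %s\n",
                   hypervisor_present() ? "yes" : "no (or hidden)");
            return 0;
        }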

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 9 00:23:09 2025
    From Newsgroup: comp.arch

    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.

    Thank you for updating a piece of history apparently I did not
    live through !!
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 15 00:43:43 2025
    From Newsgroup: comp.arch

    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Apr 15 14:02:37 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.
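
    A toy model of that distinction -- purely illustrative, not any real
    architecture's register layout -- is sketched below: each privilege
    layer carries its own Interrupt Enable, and an incoming interrupt is
    gated only by the enable of the layer it targets, not by whatever the
    currently running layer did with its own bit.

        /* Illustrative model: per-layer interrupt enables rather than one
         * core-wide IE bit.  Layer names and numbering are assumptions. */
        #include <stdbool.h>
        #include <stdio.h>

        enum layer { USER = 0, SUPER = 1, HYPER = 2, SECURE = 3, NLAYERS = 4 };

        struct core_state {
            bool ie[NLAYERS];      /* one Interrupt Enable per layer */
            enum layer running;    /* layer currently executing      */
        };

        /* Deliverable iff the *target* layer has interrupts enabled and is
         * at least as privileged as the code currently running. */
        static bool deliverable(const struct core_state *c, enum layer target)
        {
            return c->ie[target] && target >= c->running;
        }

        int main(void)
        {
            struct core_state c = {
                .ie = { true, false, true, true },   /* GuestOS has DI'd */
                .running = SUPER,
            };
            printf("HV-targeted interrupt:    %d\n", deliverable(&c, HYPER)); /* 1 */
            printf("Super-targeted interrupt: %d\n", deliverable(&c, SUPER)); /* 0 */
            return 0;
        }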


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare and only if the HV overcommits physical
    memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.
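
    For instance, a guest thread that timestamps an access it expects to be
    fast can notice the blip left behind by an HV servicing a fault on its
    behalf. A rough sketch of such a probe follows; the 10-microsecond
    threshold is an arbitrary illustration, not a calibrated figure.

        /* Rough sketch of the observable side of that covert channel:
         * time one access and flag anything far slower than expected. */
        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        static volatile uint64_t probe_word;   /* normally resident */

        static uint64_t access_latency_ns(void)
        {
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            (void)probe_word;                  /* the measured access */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000ull
                 + (uint64_t)(t1.tv_nsec - t0.tv_nsec);
        }

        int main(void)
        {
            uint64_t ns = access_latency_ns();

            if (ns > 10000)
                printf("suspiciously slow access: %llu ns\n",
                       (unsigned long long)ns);
            else
                printf("ordinary access: %llu ns\n",
                       (unsigned long long)ns);
            return 0;
        }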


    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    The higher privilege level must not unilaterally modify guest OS or
    application state.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 15 20:46:28 2025
    From Newsgroup: comp.arch

    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.

    Yes, but we have previously established HV does its virtualization
    without touching GuestOS memory. {Which is why I used page fault as
    the example.}


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That there is some control
    is (IS) required--how many seems to be a choice at this stage.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedence ??

    Anyone know of any literature where this was simulated or measured ??


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare and only if the HV overcommits physical
    memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??


    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring. Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between them, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 16 14:07:36 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:




    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    AArch64 also has interrupt enables at each privilege level.


    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That there is some control
    is (IS) required--how many seems to be a choice at this stage.

    I'm not aware of any architecture that supports virtualization that
    doesn't have enables for each privilege level; either there are
    distinct levels in hardware, or the hypervisor needs to handle
    all interrupts and inject them into the guest in some fashion. Best
    to have hardware support for all of this rather than the overhead
    of the HV handling all interrupts and the consequent context switches.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedence ??

    The former is optimal. Assuming the guest is independent of the
    HV, any delay in the critical section (e.g. due to an HV interrupt
    being handled) is inconsequential. The critical section is only
    critical to the privilege layer it occurs on.

    <snip>



    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??

    Generally, yes. Usually modeled with an offset register in
    the HV that gets applied to the guest view of current time.
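
    That is essentially the AArch64 arrangement: the guest's virtual counter
    is the physical counter minus an HV-owned offset (CNTVCT = CNTPCT -
    CNTVOFF_EL2). A sketch of the arithmetic, with an ordinary monotonic
    clock standing in for the hardware counter so it compiles anywhere:

        /* Sketch: guest "virtual time" = physical time - HV-owned offset,
         * in the style of AArch64's CNTVOFF_EL2.  read_physical_count()
         * is a stand-in for the real counter read. */
        #include <stdint.h>
        #include <time.h>

        static uint64_t read_physical_count(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
        }

        struct vcpu {
            uint64_t cntvoff;   /* written by the HV at create/migrate time */
        };

        /* What the guest sees when it reads its virtual counter. */
        static inline uint64_t guest_virtual_count(const struct vcpu *v)
        {
            return read_physical_count() - v->cntvoff;
        }

        /* HV-side helper: make the guest's clock appear to start "now". */
        static inline void reset_guest_epoch(struct vcpu *v)
        {
            v->cntvoff = read_physical_count();
        }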



    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active
    interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core. Early hypervisors would field
    all non-secure interrupts and either handle them themselves or inject them
    into the guest. The first ARM64 cores would field all interrupts in the HV,
    and the interrupt controller had special registers the HV could use to
    inject interrupts into the guest. The overhead was not insignificant, so
    they added a mechanism to allow some interrupts to be directly fielded by
    the guest itself - avoiding the round trip through the HV on every
    interrupt (called virtual LPIs).
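
    The two delivery paths described above look roughly like this in
    pseudo-C -- a conceptual model only; the list registers and doorbells
    of a real GIC are deliberately elided, and the function names are
    invented:

        /* Conceptual model of interrupt delivery under a hypervisor:
         * direct virtual-LPI delivery vs. the trap-and-inject slow path. */
        #include <stdbool.h>
        #include <stdio.h>

        struct irq  { int id; bool is_virtual_lpi; };
        struct vcpu { int id; bool resident; };

        static void guest_take_irq(struct vcpu *v, int id)
        {   printf("guest %d takes irq %d\n", v->id, id); }

        static void hv_record_pending(struct vcpu *v, int id)
        {   printf("HV marks irq %d pending for guest %d\n", id, v->id); }

        static void deliver(const struct irq *irq, struct vcpu *vcpu)
        {
            if (irq->is_virtual_lpi && vcpu->resident) {
                /* Fast path: hardware hands the interrupt straight to the
                 * running guest; the HV never sees it. */
                guest_take_irq(vcpu, irq->id);
            } else {
                /* Slow path: HV fields the physical interrupt and injects a
                 * virtual one, taken when the guest next runs. */
                hv_record_pending(vcpu, irq->id);
                if (vcpu->resident)
                    guest_take_irq(vcpu, irq->id);
            }
        }

        int main(void)
        {
            struct vcpu g = { .id = 0, .resident = true };
            struct irq fast = { .id = 8200, .is_virtual_lpi = true  };
            struct irq slow = { .id = 42,   .is_virtual_lpi = false };
            deliver(&fast, &g);
            deliver(&slow, &g);
            return 0;
        }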


    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or
    application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between them, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.

    The HV owns the translation tables for guest-to-physical address
    translation; it can pretty much do anything it wants with that access[*],
    including modifying guest processor and memory state at any time - absent
    potential future features such as hardware guest memory encryption
    or memory access controls at a level higher than the HV (e.g. the
    secure monitor - see AArch64 Realms, for example).

    https://developer.arm.com/documentation/den0126/0101/Overview

    [*] the hypervisor can easily double map a page in both the guest PAS
    and the HV VAS - a technique common in paravirtualized environments.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 16 21:13:43 2025
    From Newsgroup: comp.arch

    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    Roughly: HW maintains 4 copies of state and generally indexes state
    with a 2-bit value, and the "structure" of thread-header is identical
    between layers; thus, indexing down to {user} falls out for free.

    {{But I could be off my rocker...again}}
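
    A data-structure sketch of that "4 copies of state, indexed by a 2-bit
    value" idea is below. The field names are invented for illustration;
    this is not a claim about My 66000's actual thread-header layout.

        /* Sketch: banked per-layer control state selected by a 2-bit index.
         * Field names are illustrative only. */
        #include <stdint.h>

        enum layer { USER = 0, SUPER = 1, HYPER = 2, SECURE = 3 };

        struct thread_header {         /* same shape at every layer        */
            uint64_t root_pointer;     /* translation-table root           */
            uint64_t asid;
            uint64_t interrupt_table;
            uint64_t saved_ip;
        };

        struct core {
            struct thread_header bank[4];  /* one copy per layer           */
            unsigned current;              /* 2-bit index of active layer  */
        };

        /* Because every bank has the same shape, "indexing down to {user}"
         * is just another index value - no special-case format needed. */
        static inline struct thread_header *layer_state(struct core *c,
                                                        enum layer l)
        {
            return &c->bank[l & 3];
        }
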
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Apr 16 17:48:49 2025
    From Newsgroup: comp.arch

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 16 22:12:22 2025
    From Newsgroup: comp.arch

    On Wed, 16 Apr 2025 21:48:49 +0000, Stefan Monnier wrote:

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    The goal is that::
    the two layers in the middle can be managed as an accordion, supporting
    any number of HVs and GuestOSs between Secure and User.

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?

    Use the accordion


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 16 15:26:12 2025
    From Newsgroup: comp.arch

    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:47:38 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Architecturally, the ARM64 interrupt priority can vary from 3 to 8
    bits. Most implementations implement 5 bits, allowing 16 secure
    and 16 non-secure priority levels. They can be grouped using
    a binary point register, if required.
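
    The binary point works roughly as sketched below: it splits the priority
    field into a group part, which alone decides preemption, and a
    sub-priority part, which only orders pending interrupts of equal group
    priority. (The exact mapping of binary-point register values to bit
    positions is elided here.)

        /* Sketch of GICv3-style priority grouping.  'bp' is the number of
         * low-order bits treated as sub-priority; lower numeric values
         * mean higher priority. */
        #include <stdbool.h>
        #include <stdint.h>

        struct split { uint8_t group; uint8_t sub; };

        static struct split split_priority(uint8_t prio, unsigned bp)
        {
            struct split s;
            s.group = (uint8_t)(prio >> bp);
            s.sub   = (uint8_t)(prio & ((1u << bp) - 1u));
            return s;
        }

        /* A newly pending interrupt preempts the running one only if it
         * wins on group priority; sub-priority never causes preemption. */
        static bool preempts(uint8_t new_prio, uint8_t running_prio, unsigned bp)
        {
            return split_priority(new_prio, bp).group <
                   split_priority(running_prio, bp).group;
        }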


    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    On ARM there are only two interrupt signals from the interrupt controller
    to each core: FIQ and IRQ.

    Each of the signals can be 'claimed' by one, and only one privilege
    level on that core; if the secure monitor claims FIQ, then it can only be delivered
    to EL3.

    If running bare-metal, the OS (EL1) will claim the IRQ signal (by default if none of the more privileged levels claim it).

    If a hypervisor (EL2) is running, it will claim the IRQ signal and field
    all physical interrupts, except for virtual LPI and IPI interrupts which the hardware can inject directly into the guest (which may result in an
    interrupt to the hypervisor if the guest isn't resident on the target
    CPU).

    In a virtualized environment, one needs to be very careful when
    exposing hardware interrupt signals directly to the guest operating
    system, as that often requires exposing some of the interrupt controller.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:49:37 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?

    ARM has hardware support for nested hypervisors. It can be tricky.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:57:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    Hardware access is normally done in the context of a 'sandboxed'
    PCI Express SR-IOV function which the application can access directly;
    the hardware guarantees that the user process cannot adversely
    affect the hardware or other guests using other virtual functions.

    However, the interrupt controller itself (e.g. the mechanism used
    to acknowledge the interrupt to the interrupt controller after it
    has been serviced - e.g. the LAPIC) isn't virtualized, and direct
    access to that shouldn't be available to user-mode for fairly obvious
    reasons.

    That's why DPDK/ODP require the OS to handle interrupts and notify
    the application via standard OS notification mechanisms even
    when using SR-IOV capable hardware for the actual packet handling.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Apr 17 01:04:10 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Apr 16 21:07:13 2025
    From Newsgroup: comp.arch

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Not only the thing that's interrupting but also the thing
    it's interrupting. Maybe it's easier for My 66000 where I understand
    that the hardware has a notion of threads/processes so it may be able to
    know how to deliver the interrupt to the appropriate thread/process.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 16 23:30:28 2025
    From Newsgroup: comp.arch

    On 4/16/2025 6:04 PM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts.

    Right. Thanks John. I was careful to say exceptions, not interrupts.


    I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has done that for decades.

    I was suggesting that things like divide faults could go directly
    back to the user, assuming the user had set up a place to handle them.
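
    On a conventional Unix the supervisor already offers a cut-down version
    of that: it fields the fault and reflects it to a handler the user
    registered. A small POSIX sketch for the divide-fault case (the
    siglongjmp is the standard way to avoid re-executing the faulting
    divide; whether integer divide-by-zero raises SIGFPE is
    target-dependent):

        /* POSIX sketch: user code registering its own divide-fault handler.
         * The kernel still takes the exception first and reflects it back. */
        #include <setjmp.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>

        static sigjmp_buf recover;

        static void fpe_handler(int sig)
        {
            (void)sig;
            siglongjmp(recover, 1);        /* skip past the faulting divide */
        }

        int main(void)
        {
            struct sigaction sa;

            memset(&sa, 0, sizeof sa);
            sa.sa_handler = fpe_handler;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGFPE, &sa, NULL);

            volatile int zero = 0;
            if (sigsetjmp(recover, 1) == 0)
                printf("1/0 = %d\n", 1 / zero);
            else
                printf("divide fault handled by user-registered handler\n");
            return 0;
        }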


    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Yup.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 13:32:54 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    All the current processors (Intel, AMD, ARM, MIPS) that have hardware
    virtualization support handle faults in the context in which they
    arise; e.g. a divide fault will be handled directly by the guest
    OS without any hypervisor intervention. The single standard exception
    is user mode, where the faults are handled by the Guest/Bare-metal
    OS.


    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Indeed, although it's not so much about the 'thing that's interrupting'
    as it is about the interrupt infrastructure (i.e. interrupt controller)
    itself.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 17 18:22:35 2025
    From Newsgroup: comp.arch

    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User
    thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != GuestOS[k] -- because the various translation tables are not properly
    available to perform the nested MMU VAS->UAS translation.

    In effect, the SW-stack becomes some kind of "closure" where control
    can be transferred asynchronously. Enough information is passed (as
    arguments) across this boundary that efficient dispatch to the proper
    ISR is but a few instructions (3 typically in My 66000).

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 20:10:11 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    In ARM64, an interrupt is just a maskable asynchronous exception.


    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User
    thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want >HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != >GuestOS[k] -- because the various translation tables are not properly >available to perform the nested MMU VAS->UAS translation.

    Note that while any one layer is executing _on a core/hardware thread_,
    the other layers aren't running on that core, by definition. However,
    there is no synchronization with other cores, so other cores in the same
    system may be executing in any one or all of the privilege levels/security
    layers while a given core is taking an exception (synchronous or
    asynchronous).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 17 21:45:42 2025
    From Newsgroup: comp.arch

    On Thu, 17 Apr 2025 20:10:11 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    In ARM64, an interrupt is just a maskable asynchronous exception.

    My 66000 defines:
    a) exception: something wrong in the attempt to execute an instruction
    b) interrupt: asynchronous events not related to instruction execution
    c) trap ... : request for service to next higher privilege layer
    d) check .. : something that should (almost) never happen

    Unlike many RISC architectures, My 66000 has arithmetic exceptions
    {Operand Domain, Result Range, privilege, 5-IEEE exceptions} along
    with typical {GuestOS page fault, Hypervisor page fault} everybody
    has. Much more IBM 360-like than MIPS-like.

    Exceptions are then categorized as repairable faults or terminations.
    SW determines whether arithmetic faults are recognized and what to do
    with the exception if one is raised and recognized {terminate, repair,
    complete}. Page faults operate under "repair": the state is repaired such
    that re-execution of the instruction should now succeed. Complete is
    for situations where HW cannot deliver "an acceptable" result, but
    SW can. Here, SW "completes" the work and returns following the causing
    instruction.

    Checks are things like
    1) unrepairable ECC failure
    2) special privilege violations
    3) hardware failures
    4) power or reset events

    Which either log the event, attempt repair, or panic the VMM.

    Checks are simply exceptions that deliver control to {secure}
    instead of {next higher privilege} and checks are not maskable.
    {I may come to regret this non-maskable part ...}


    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User >>thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want >>HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != >>GuestOS[k] -- because the various translation tables are not properly >>available to perform the nested MMU VAS->UAS translation.

    Note that while any one layer is executing _on a core/hardware thread_,
    the other layers aren't running on that core,

    Not "running" but those layer's CRs are still supporting lower privilege
    layers that ARE running on that core. Mostly in the nested Root pointer
    and ASID categories, sometimes in the interrupt-table category.

    by definition. However, there is no synchronization with other cores,
    so other cores in the same system may be executing in any one or all of
    the privilege levels/security layers while a given core is taking an
    exception (synchronous or asynchronous).

    Yes, obviously. Any core can be operating at any priority, any privilege,
    any layer, unbeknownst to any other core; until and unless SW tries to
    synchronize with said other core to find out.
    --- Synchronet 3.20c-Linux NewsLink 1.2