• Constant Stack Canaries

    From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 30 08:16:52 2025
    From Newsgroup: comp.arch

    Just got to thinking about stack canaries. I was going to have a special purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 30 12:47:59 2025
    From Newsgroup: comp.arch

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
    Prolog stores the value;
    Epilog loads it and verifies that the value is intact.
    Using a magic number generated by the compiler.

    Nothing fancy needed in the assembly or link stages.
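
    Roughly, the compiler-inserted checks behave like the hand-written C
    sketch below. This is only an illustration of the idea; the magic
    constant, the variable placement, and the abort() on mismatch are
    placeholders rather than any particular compiler's actual codegen:

        #include <stdint.h>
        #include <stdlib.h>

        #define STACK_MAGIC 0x7C3Au   /* compile-time constant, e.g. 16-bit */

        void some_function(void)
        {
            uint16_t canary = STACK_MAGIC;   /* prolog: store the value */
            char buf[64];
            /* ... body that may overflow buf ... */
            (void)buf;
            if (canary != STACK_MAGIC)       /* epilog: verify it is intact */
                abort();                     /* trap / fault on mismatch */
        }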


    In my case, canary behavior is one of:
    Use them in functions with arrays or similar (default);
    Use them everywhere (optional);
    Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger values).

    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 30 20:14:53 2025
    From Newsgroup: comp.arch

    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
    Prolog stores the value;
    Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
    Use them in functions with arrays or similar (default);
    Use them everywhere (optional);
    Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Mar 30 21:26:23 2025
    From Newsgroup: comp.arch

    On 2025-03-30 4:14 p.m., MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    Ah, okay. I had thought the stack canaries were defined at run-time.
    Much easier to handle with the compiler. But, what happens when multiple instances of a program are loaded? Would it not be better to have
    separate stack canaries? I had thought the stack canaries would be
    different for each run of a program, otherwise could not some bad
    software discover the canary value?




    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 01:34:14 2025
    From Newsgroup: comp.arch

    On 3/30/2025 3:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.


    It is possible that the magic number could have been generated by the
    CPU itself, or specified on the command-line by the user, or, ...

    Rather than, say, the compiler coming up with a magic number for each
    function (say, based on a hash function or "rand()" or something).


    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.


    Yes.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....



    ( Well, anyways, going off on a tangent here... )


    Meanwhile, in my own goings on... It took way too much effort to figure
    out the specific quirks in the RIFF/WAVE headers to get Audacity to
    accept IMA-ADPCM output from BGBCC's resource converter.

    It was like:
    Media Player Classic: Yeah, fine.
    VLC Media Player: Yeah, fine.
    Audacity: "I have no idea what this is...".

    Turns out Audacity is not happy unless:
    The size of the 'fmt ' is 20 bytes, cbSize is 2,
    with an additional 16 bit member specifying the samples per block.
    With a 'fact' chunk, specifying the overall length of the WAV in samples.

    Pretty much everything else accepted the 16-byte PCMWAVEFORMAT with no
    'fact' chunk (and calculating the samples per block based on nBlockAlign).
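
    For reference, the 20-byte 'fmt ' layout that Audacity wants is the
    usual WAVEFORMATEX fields plus the IMA-ADPCM extension word, roughly as
    in the C sketch below (field names follow the common Windows-header
    convention; the values in the comments are just examples):

        #include <stdint.h>

        #pragma pack(push, 1)           /* assume byte-exact packing */
        typedef struct {
            uint16_t wFormatTag;        /* 0x0011 = IMA/DVI ADPCM          */
            uint16_t nChannels;
            uint32_t nSamplesPerSec;
            uint32_t nAvgBytesPerSec;
            uint16_t nBlockAlign;       /* bytes per ADPCM block           */
            uint16_t wBitsPerSample;
            uint16_t cbSize;            /* 2: one extra 16-bit member      */
            uint16_t wSamplesPerBlock;  /* samples decoded from each block */
        } ImaAdpcmFmt;                  /* 20 bytes total                  */
        #pragma pack(pop)

    The accompanying 'fact' chunk then carries a single uint32_t giving the
    total sample count per channel.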

    ...


    Though, in this case, I am mostly poking at stuff for "Resource WADs", typically images/etc that are intended to be hidden inside EXE or DLL
    files (where size matters more than quality, and any sound effects are
    likely to be limited to under 1 second).

    Say, one has a sound effect that is, say:
    0.5 seconds;
    8kHz
    2 bits/sample

    This is roughly 1kB of audio data.
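
    As a sanity check on that figure: 0.5 s * 8000 samples/s * 2 bits/sample
    = 8000 bits = 1000 bytes, so roughly 1 kB before any header overhead.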




    I also defined a 2-bit ADPCM variant (ADLQ), and ended up using a
    customized simplified header for it (using a similar structure to the
    BMP format; where the full RIFF format adds unnecessary overhead; though
    the savings here are debatable).

    Say:
    Full RIFF in this case:
    60 bytes of header.
    Simplified format:
    32 bytes of header.
    So, saving roughly 28 bytes of overhead vs RIFF/WAVE.
    Though, the saving drops to around 12 bytes if the RIFF file omits the
    'fact' chunk and uses the 16-byte PCMWAVEFORMAT structure rather than
    WAVEFORMATEX.


    While theoretically 2-bit IMA ADPCM already exists for WAV, seemingly
    not much supports it. I also implemented support for this, as it does at
    least "exist in the wild".


    As for the 2-bit version of IMA ADPCM:
    Media Player Classic: Opens it and shows correct length,
    but sounds broken.
    Sounds like it is trying to play it with the 4 bit decoder.
    VLC Media Player:
    Basically works, though the progress bar and time display are wonky.
    It does figure out mostly the correct length at least.
    Audacity: Claims to not understand it.


    I had discovered the "adpcm-xq" library, and looked at this as a
    reference for the 2-bit IMA format. Since VLC plays it, I will assume my
    code is probably generating "mostly correct" output (at least WRT the 2b
    ADPCM part; possible wonk may remain in the WAVEFORMATEX header, and/or
    VLC is just a little buggy here).



    So, thus far:
    ADLQ:
    Slightly higher quality;
    Needs a slightly more complicated encoder for good results;
    Decoder needs to ensure values don't go out of range.
    Software support: Basically non existent.
    Could in theory allow a cheap-ish hardware decoder.
    2-bit IMA ADPCM:
    Slightly simpler encoder;
    More is needed on the decoder side;
    Requires using multiply and range clamping.
    Slightly worse audio quality ATM.
    Around 0.8% bigger for mono due to header differences.

    Block Headers:
    ADLQ:
    ( 7: 0): Initial Sample, A-Law
    (11: 8): Initial Step Index
    ( 12): Interpolation Hint
    (15:13): Block Size (Log2)
    IMA, 2b:
    (15: 0): Initial Sample, PCM16
    (23:16): Step Index
    (31:24): Zero
    ADLQ is 1016 samples in 256 bytes, IMA is 1008.

    Sample Format is common:
    00: Small Positive
    01: Large Positive
    10: Small Negative
    11: Large Negative

    Both have a scale-ratio of 1 or 3 (if normalized).
    ADLQ has a narrower range of steps, with stepping of -1/+1.
    Each step in ADLQ is 1/2 bit, so each 2 steps is a power of 2.
    So, curve of around 1.414214**n
    IMA has more steps, with a per-sample step of -1/+2.
    Doesn't map cleanly to power of 2,
    but around 8 steps per power of 2.
    Seems to be built around a curve of 1.1**n.

    But, more aggressive stepping makes sense with 2-bit samples IMO...
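
    A minimal sketch of how a decoder for this kind of 2-bit scheme can be
    organized, assuming an IMA-style arrangement: sign in bit 1, small/large
    magnitude in bit 0, a 1:3 magnitude ratio, a -1/+2 index update, and
    range clamping on the predictor. The step table here is a made-up
    placeholder, not the actual ADLQ or adpcm-xq table:

        #include <stdint.h>

        /* Placeholder step table; an IMA-style table grows by ~1.1x per
           index, i.e. roughly 8 steps per doubling. */
        static const int step_tab[16] = {
            7, 8, 9, 10, 11, 12, 14, 15, 17, 19, 21, 23, 25, 28, 31, 34 };

        static int clamp16(int v) {
            return (v < -32768) ? -32768 : (v > 32767) ? 32767 : v;
        }

        /* Decode one 2-bit code; 'pred' and 'idx' are the running
           predictor and step index carried across samples. */
        static int16_t decode2(int code, int *pred, int *idx)
        {
            int step  = step_tab[*idx];
            int delta = (code & 1) ? 3 * step : step;  /* small vs large  */
            *pred    += (code & 2) ? -delta : delta;   /* apply sign      */
            *pred     = clamp16(*pred);                /* range clamp     */
            *idx     += (code & 1) ? 2 : -1;           /* -1/+2 stepping  */
            if (*idx < 0)  *idx = 0;
            if (*idx > 15) *idx = 15;
            return (int16_t)*pred;
        }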

    I went with not doing any range clamping in the decoder, so the encoder
    is responsible for ensuring that values don't go out of range. This does
    increase encoder complexity some (it needs to evaluate possible paths
    multiple samples in advance to make sure the path doesn't go out of
    range).

    Potentially, 1/4-bit step with -1/+2 could have made sense. Would need a
    5-bit index though to have enough dynamic range.


    Both use a different strategy for stereo:
    ADLQ:
    Splits center and side, encoding side at 1/4 sample rate;
    So, stereo increases bitrate by 25%.
    2b IMA:
    Encodes both the left and right channel independently.
    So, stereo doubles the bitrate.


    As for why 2b:
    Where one cares more about size than audio quality...
    8kHz : 16 kbps
    11kHz: 24 kbps
    16kHz: 32 kbps
    Also IMHO, 16kHz at 2 bits/sample sounds better than 8kHz at 4
    bits/sample.
    At least speech is mostly still intelligible at 16 kHz.
    Basic sound effects still mostly work at 8kHz though.
    Like, if one needs a ding or chime or similar.

    Not really any good/obvious way here to reach or go below 1 bit/sample
    while still preserving passable quality (2 bit/sample is the lower limit
    for ADPCM, only real way to go lower would be to match blocks of 4 or 8 samples to a pattern table).


    Had previously been making some use of A-Law, but as can be noted, A-Law requires 8 bits per sample.

    Though, ending up back at poking around with ADPCM is similar territory
    to my projects from a decade ago...


    But, OTOH: ADPCM is/was an effective format for sound effects; even if
    not given much credit (and seemingly most people see it as obsolescent).




    As for image formats, I have a few options for low bpp, while also being
    cheap to decode:
    BMP+CRAM: 4x4x1, limited variant of CRAM ("MS Video 1")
    Roughly 2 bpp (and repurposed as a graphics format...).
    BMP+CQ: 8x8x1, similar design to CRAM.
    Roughly 1.25 bpp

    Where, these can work well for images with no more than 2 colors per 4x4
    or 8x8 pixel block (otherwise, YMMV). As it so happens, lots of UI
    graphics fit this pattern, and/or are essentially monochrome. CQ can
    deal well with monochrome or almost-monochrome graphics without too much
    space overhead.
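
    The bit budget works out as follows: a 4x4x1 block is a 16-bit pixel
    mask plus two 8-bit palette indices, so 4 bytes per 16 pixels = 2 bpp;
    an 8x8x1 block is an 8-byte mask plus two indices, so 10 bytes per 64
    pixels = 1.25 bpp. A decoder for the 4x4 case might look like the sketch
    below (the field order within the block is a guess here, not the actual
    BGBCC layout):

        #include <stdint.h>

        /* One 4x4x1 block: the mask selects between two palette indices
           (colorB for 0 bits, colorA for 1 bits). */
        typedef struct { uint16_t mask; uint8_t colorA, colorB; } Block4x4x1;

        static void decode_block4x4(const Block4x4x1 *blk,
                                    uint8_t *dst, int dst_stride)
        {
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++) {
                    int bit = (blk->mask >> (y * 4 + x)) & 1;
                    dst[y * dst_stride + x] = bit ? blk->colorA : blk->colorB;
                }
        }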

    Though, in some other cases, monochrome or 4-color images could be a
    better fit. These default to black/white or black/white/cyan/magenta,
    but don't necessarily need to be limited to this (but, may need to add
    options in BGBCC for 2/4/16 color dynamic-palette).

    Say, for example, if an image is only black/white/red/blue or similar,
    4-color could make sense (vs using CRAM or CQ and picking from the 256
    color palette; but not being able to have different sets of colors in
    close proximity). Often, 16-color works, but 16-color is rather bulky if compared with CRAM or CQ.



    For the CRAM and CQ formats, I ended up adding an option by which the
    color palette can be skipped (it is replaced by a palette hash value; OS
    can use the color palette associated with the corresponding hash number).

    Mostly this was because, say, for 32x32 or 64x64 CRAM images, the 256
    color palette was bigger than the image itself.
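
    The arithmetic is fairly stark: a 256-entry BMP-style palette at 4 bytes
    per entry is 1024 bytes, while a 32x32 image at roughly 2 bpp is only
    about 256 bytes of pixel data, so the palette alone would be around four
    times the size of the image.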

    Note that much below 32x32, it is more compact to use hi-color BMP
    images than 256-color due to the color palette issue (making the
    optional omission for small image formats desirable).

    Though, generally, these are generated with BGBCC, which can include the
    palette in the generated resource WAD (though the best format is TBD).
    For the kernel, it is stored as a 256x256 indexed color bitmap (which
    also encodes a set of dither-aware RGB555 lookup tables).

    For normal EXE/DLL files, one could either store a dummy 16x16 256-color
    image, or more compactly, a 16x16 hi-color image (with no dither table),
    since it is plausible that EXEs/DLLs could use a different default color
    palette from the OS kernel.



    Note that neither PNG, JPEG, nor even QOI, are a good fit for these use
    cases. Wonky BMP variants are a better fit.

    For SDF font images, had also used BMP, say a 256x256 8bpp image
    covering CP-1252, with a specialized color palette (X/Y distances are
    encoded in the pixels). Needed a full 8bpp here as CRAM doesn't work for
    this.
    PNG compresses them, but the overhead is too high; and QOI is not so
    effective for this scenario. Though, as 8bpp images, they do LZ compress
    pretty OK.

    But, would not be reasonable to specially address every scenario.


    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Mar 31 09:04:40 2025
    From Newsgroup: comp.arch

    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register). The CALL instruction would store a
    magic value, and the RET instruction would test it. If there was not a
    match, an exception would be generated. The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do. The downside is
    more hardware and perhaps extra overhead.
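
    A rough software model of the proposal, purely for illustration (MAGIC
    standing in for the per-run value such as the startup clock, and the
    stack handling simplified to a plain pointer):

        #include <stdint.h>
        #include <stdlib.h>

        static uint64_t MAGIC;   /* e.g. sampled from the clock at startup */

        /* CALL: push the return address, then the magic word,
           automatically. */
        static void model_call(uint64_t **sp, uint64_t return_addr)
        {
            *--(*sp) = return_addr;
            *--(*sp) = MAGIC;
        }

        /* RET: check the magic word before using the return address;
           a mismatch raises the exception (modeled here as abort()). */
        static uint64_t model_ret(uint64_t **sp)
        {
            if (*(*sp)++ != MAGIC)
                abort();
            return *(*sp)++;
        }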

    Does this make sense? What have I missed?







    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 12:17:38 2025
    From Newsgroup: comp.arch

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a match, an exception would be generated.  The value itself could be something like the clock value when the program was initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or
    similar, rather than the (more common for RISCs) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
    6b Hi (Upper bound of registers to save)
    6b Lo (Lower bound of registers to save)
    1b LR (Flag to save Link Register)
    1b GP (Flag to save Global Pointer)
    1b SK (Flag to generate a canary)

    Likely (STM):
    Pushes LR first (if bit set);
    Pushes GP second (if bit set);
    Pushes registers in range (if Hi>=Lo);
    Pushes stack canary (if bit set).

    LDM would check the canary first and fault if it doesn't see the
    expected value.
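
    A rough C model of the proposed STM/LDM behavior, using the fields
    described above; the push order and the canary constant are arbitrary
    placeholders:

        #include <stdint.h>
        #include <stdlib.h>

        #define CANARY 0x5A5AA5A5u   /* placeholder magic value */

        /* STM model: push LR, GP, the register range Lo..Hi, then the
           canary, according to the flag bits. */
        static void model_stm(uint64_t **sp, const uint64_t regs[64],
                              int hi, int lo, int sv_lr, int sv_gp, int sv_sk,
                              uint64_t lr, uint64_t gp)
        {
            if (sv_lr) *--(*sp) = lr;
            if (sv_gp) *--(*sp) = gp;
            for (int r = hi; r >= lo; r--)   /* only if hi >= lo */
                *--(*sp) = regs[r];
            if (sv_sk) *--(*sp) = CANARY;
        }

        /* LDM model: check the canary first and fault on mismatch, then
           restore the range (LR/GP restore omitted for brevity). */
        static void model_ldm(uint64_t **sp, uint64_t regs[64],
                              int hi, int lo, int chk_sk)
        {
            if (chk_sk && *(*sp)++ != CANARY)
                abort();                     /* fault */
            for (int r = lo; r <= hi; r++)
                regs[r] = *(*sp)++;
        }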

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register range.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are
    to be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
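
    The XOR-with-SP variant would look something like this (illustrative
    only; the address of a local stands in for reading SP, and the magic is
    a placeholder):

        #include <stdint.h>
        #include <stdlib.h>

        #define CANARY_MAGIC 0x5A5Au

        void some_function(void)
        {
            uintptr_t frame  = (uintptr_t)&frame;              /* ~SP      */
            uint64_t  stored = CANARY_MAGIC ^ (uint64_t)frame; /* prolog   */
            char buf[64];
            (void)buf;
            /* ... body ... */
            if ((stored ^ (uint64_t)frame) != CANARY_MAGIC)    /* epilog   */
                abort();
        }

    Tying the stored value to the frame address means a shared, folded
    epilogue sequence can still end up checking a per-frame value.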

    ...




    Using a magic number

    Remove excess words.

    Nothing fancy needed in the assemble or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Mar 31 10:57:35 2025
    From Newsgroup: comp.arch

    On 3/31/2025 10:17 AM, BGB wrote:
    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to
    me that this could be done automatically by the hardware (optionally,
    based on a bit in a control register).   The CALL instruction would
    store magic value, and the RET instruction would test it.  If there
    was not a match, an exception would be generated.  The value itself
    could be something like the clock value when the program was
    initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the
    elimination of several instructions in each prolog/epilog, reducing
    footprint, and perhaps even time as it might be possible to overlap
    some of the processing with the other things these instructions do.
    The downside is more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC into a link register...

    Sorry, you're right. I should have said, in the context of Mitch's My
    66000, the ENTER and EXIT instructions.


    Another option being if it could be a feature of a Load/Store Multiple.

    The nice thing about the ENTER/EXIT is that they combine the store
    multiple (ENTER) and the load multiple and return control (EXIT).
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 31 18:07:30 2025
    From Newsgroup: comp.arch

    On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
    -------------

    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a
    match, an exception would be generated.  The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is
    more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
    6b Hi (Upper bound of register to save)
    6b Lo (Lower bound of registers to save)
    1b LR (Flag to save Link Register)
    1b GP (Flag to save Global Pointer)
    1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
    Pushes LR first (if bit set);
    Pushes GP second (if bit set);
    Pushes registers in range (if Hi>=Lo);
    Pushes stack canary (if bit set).

    EXIT uses its 3rd flag when doing longjump() and THROW(), so as to pop
    the call-stack but not actually RET from the stack walker.

    LDM would check the canary first and fault if it doesn't see the
    expected value.

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Not conceptually any harder than DIV or FDIV and nobody complains
    about doing multi-cycle math.

    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues adds to the number of branches
    a program executes. Having an instruction like EXIT means that when you
    know you need to exit, you EXIT; you don't branch to the exit point.
    Saving instructions.

    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 31 13:56:32 2025
    From Newsgroup: comp.arch

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
    -------------

    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a
    match, an exception would be generated.  The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is
    more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or
    similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.


    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly no one has bothered adding FDPIC support in GCC or
    friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.


    LDM would check the canary first and fault if it doesn't see the
    expected value.

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Not conceptually any harder than DIV or FDIV and nobody complains
    about doing multi-cycle math.


    But... The only reason I have DIV and FDIV is that RISC-V's 'M'
    extension needed them, and there are generally not a whole lot of useful
    configurations supported by GCC that lack 'M'.

    There is FDIV, but it is painfully slow.


    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABIs put all of the callee-save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
    R0..R3: Special
    R4..R15: Scratch
    R16..R31: Argument
    R32..R63: Callee Save
    ...

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 31 20:52:14 2025
    From Newsgroup: comp.arch

    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers contiguous.

    Say:
    R0..R3: Special
    R4..R15: Scratch
    R16..R31: Argument
    R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 registers and remains that 1
    single instruction. Same for EXIT, and EXIT also performs the RET when
    LDing R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT, not part of them,
    IMHO.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 00:58:58 2025
    From Newsgroup: comp.arch

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.
    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.


    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These comes in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generates as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches tasks.
    Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing, as
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred
    until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch registers.


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 18:51:30 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    --------------------
    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 14:34:10 2025
    From Newsgroup: comp.arch

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.
    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These comes in 32-bit and 64-bit flavors.


    Typically 16-bit, most are within a 16-bit range of the Global Pointer.


    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    You also can't CoW the data/bss sections, as this is no longer a shared address space.


    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data
    sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
    MOV.Q (GBR, 0), Rt
    MOV.Q (Rt, 0), GBR
    Or, in RV terms:
    LD X6, 0(X3)
    LD X3, Disp33(X6)
    Or, RV64G:
    LD X6, 0(X3)
    LUI X5, DispHi
    ADD X5, X5, X6
    LD X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).


    Generally, this is needed if:
    Function may be called from outside of the current binary and:
    Accesses global variables;
    And/or, calls local functions.


    Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
    caller side...
    SD   X3, Disp(SP)   # spill the current global pointer
    LD   X3, 8(X18)     # load the callee's global pointer from the pair
    LD   X6, 0(X18)     # load the callee's code address from the pair
    JALR X1, 0(X6)      # call through the pair
    LD   X3, Disp(SP)   # restore the caller's global pointer

    With, generally, every function pointer existing as a pair of the actual
    function pointer and its associated global pointer.

    Though, caller side handling does arguably avoid the need to perform
    relocs for the table index.

    Though, seemingly no one wants to add FDPIC for RV64G, seeing it mostly
    as a 32-bit microcontroller thing.


    For normal PIE though, absent CoW, it is necessary to load a new copy of
    the binary each time a new process instance is created.


    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()


    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than fork+exec.

    Though, one tricky way to handle it is:
    vfork: effectively spawns a thread in the same address space as the
    caller, with a provisional PID, and semi-copied stack;
    exec: Creates a new process copying the PID and file-descriptors;
    Internally uses CreateProcess;
    Temporary thread disappears once exec is called.

    True "fork()" is more of an issue though...

    The true "fork()" semantics are not possible on single-address-space or
    NoMMU systems. Nor fully emulated in things like Cygwin IIRC.

    Though, the usual alternative is to give them "vfork()" semantics, and
    things will probably explode if they do anything other than call exec or similar.
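
    In POSIX-ish terms, the intended usage is roughly the classic vfork+exec
    spawn idiom (standard vfork()/execl(); error handling trimmed), e.g.:

      #include <sys/types.h>
      #include <unistd.h>

      /* Spawn-style use of vfork(): the child does (almost) nothing except
       * exec; doing more than that in the shared address space is unsafe. */
      pid_t spawn_child(const char *path)
      {
          pid_t pid = vfork();
          if (pid == 0) {
              execl(path, path, (char *)NULL);   /* replaces the child */
              _exit(127);                        /* exec failed        */
          }
          return pid;   /* parent resumes once the child has exec'd or exited */
      }

    The parent can then waitpid() on the returned PID as usual.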


    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".

    Not sure the thinking behind the RV ABI.


    In the BJX ABI, the layout directly grew out of the SH ABI mapping, effectively just mirroring the original SH layout 4 times for 64 registers.

    The SH layout was contiguous, at least for 16 registers, though a
    mirrored layout is no longer contiguous.

    The RV ABI is not contiguous, but at least still less chaotic than the
    x86-64 ABIs.


    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.


    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 registers and remains that 1 single instruction. Same for EXIT, and EXIT also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
    Non-zero branching overheads, when the feature is used;
    Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use calls/branches.

    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer registers).



    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    OK.


    It sorta made sense to treat canary values as part of the process of saving/restoring the registers, since their main purpose is to protect
    the saved registers, and particularly the saved PC.

    Granted, canary values are not a perfect strategy.
    They can provide some added resistance against buffer overflow exploits
    if the value can be made unknown to the attacker.

    This means, ideally:
    Unique to each function, and does not repeat across builds.
    But, by itself, insufficient if a single build is used.
    Is mangled in some other way to avoid repeats.
    Say, XOR'ing with SP and also ASLR'ing the SP.

    But, yeah, if the canary value is, say:
    (SP XOR Magic) with SP being ASLR'ed, it offers at least some added protection.
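
    As a rough C-level sketch of that idea (purely illustrative; in practice
    this lives in the compiler-generated prolog/epilog, and the magic would
    come from the build rather than a #define):

      #include <stdint.h>
      #include <stdlib.h>

      /* Per-build (ideally per-function) magic; here just a fixed constant. */
      #define CANARY_MAGIC 0x9E3779B97F4A7C15ull

      /* Prolog: derive the canary from the (ASLR'ed) stack pointer, so the
       * stored value differs per call site and per run. */
      static inline uint64_t make_canary(void *sp)
      {
          return (uint64_t)(uintptr_t)sp ^ CANARY_MAGIC;
      }

      /* Epilog: recompute and compare; a mismatch means the saved area
       * (including the saved return address) was likely overwritten. */
      static inline void check_canary(void *sp, uint64_t stored)
      {
          if (stored != make_canary(sp))
              abort();   /* stand-in for the TRAP / stack-check-fail path */
      }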


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 16:21:30 2025
    From Newsgroup: comp.arch

    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a
    bit-mask of all the registers makes some sense...
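
    Semantics-wise the bitmask form is easy to pin down; a behavioral sketch
    in C (not tied to any particular encoding; the register file is just
    modeled as an array here) might be:

      #include <stdint.h>

      /* Behavioral model of a bitmask-style STM: store every register whose
       * bit is set, lowest-numbered register ending up at the lowest address.
       * 'regs' stands in for the architectural register file. */
      static uint64_t *stm_bitmask(uint64_t *sp, const uint64_t regs[64],
                                   uint64_t mask)
      {
          for (int r = 63; r >= 0; r--)       /* push high registers first */
              if (mask & (1ull << r))
                  *--sp = regs[r];
          return sp;                          /* new stack pointer */
      }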



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some headaches because of the use of condition registers and branch registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at all frequently.

    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).
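
    E.g., a minimal userland-buffered puts-style helper; "sys_write" here is
    a hypothetical stand-in for whatever the raw write system call actually is:

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical raw system call: one kernel transition per call. */
      extern long sys_write(int fd, const void *buf, size_t len);

      #define OUTBUF_SIZE 512
      static char   outbuf[OUTBUF_SIZE];
      static size_t outlen;

      static void out_flush(int fd)
      {
          if (outlen) { sys_write(fd, outbuf, outlen); outlen = 0; }
      }

      /* Batches bytes so a whole string costs at most a couple of system
       * calls, instead of one per character. */
      static void out_puts(int fd, const char *s)
      {
          size_t n = strlen(s);
          if (outlen + n > OUTBUF_SIZE) out_flush(fd);
          if (n >= OUTBUF_SIZE) { sys_write(fd, s, n); return; }
          memcpy(outbuf + outlen, s, n);
          outlen += n;
      }

    With an explicit out_flush() wherever the output actually needs to appear.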





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and things
    like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people complaining
    about how old/obsolescent ADPCM is (and/or arguing that people should
    store all their sound effects as Ogg/Vorbis or similar; ...).



    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM and
    then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but
    subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
      PCM8, 16kHz     : 121  (128 kbps)
      A-Law, 16kHz    : 284  (128 kbps)
      IMA 4bit, 16kHz : 617  (64 kbps)
      IMA 2bit, 16kHz : 1692 (32 kbps, *)
      ADLQ 2bit, 16kHz: 2000 (32 kbps)
      PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
    Basically sounds muffled, speech is unintelligible.
    But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
    Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a metric,
    but if anything this is just kicking the can down the road slightly.
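
    One cheap option along those lines would be to compand both signals with
    the A-law curve first and take the RMSE in that (roughly log-loudness)
    domain; a sketch, assuming 16-bit PCM input and the standard A=87.6 curve:

      #include <math.h>
      #include <stddef.h>

      /* A-law compression curve (A = 87.6), applied to a sample normalized
       * to [-1, 1]; output is also in [-1, 1]. */
      static double alaw_compand(double x)
      {
          const double A = 87.6, denom = 1.0 + log(A);
          double ax = fabs(x), y;
          y = (ax < 1.0 / A) ? (A * ax) / denom
                             : (1.0 + log(A * ax)) / denom;
          return (x < 0) ? -y : y;
      }

      /* RMSE between reference and decoded audio, measured after companding,
       * so small errors in loud passages count less than in quiet ones. */
      static double rmse_alaw(const short *ref, const short *dec, size_t n)
      {
          double acc = 0.0;
          for (size_t i = 0; i < n; i++) {
              double d = alaw_compand(ref[i] / 32768.0)
                       - alaw_compand(dec[i] / 32768.0);
              acc += d * d;
          }
          return sqrt(acc / (double)n);
      }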

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better
    than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.
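
    Roughly, in C terms, that lookahead amounts to something like the sketch
    below (the 2-bit delta table and step adaptation are made up for
    illustration; the real IMA/ADLQ tables and rules differ, and it assumes
    at least 3 samples remain):

      /* Toy 2-bit ADPCM lookahead: try all 4^3 = 64 code combinations for
       * the next 3 samples, keep the combination with the least squared
       * error, and emit only its first code. */
      static const int delta_tab[4] = { -3, -1, 1, 3 };   /* scaled by step */

      typedef struct { int pred; int step; } AdpcmState;

      static int encode_one(AdpcmState *st, const short *next3)
      {
          long long best_err = -1;
          int best_code = 0;

          for (int c = 0; c < 64; c++) {          /* 3 codes, packed base-4 */
              AdpcmState s = *st;
              long long err = 0;
              for (int k = 0; k < 3; k++) {
                  int code = (c >> (2 * k)) & 3;
                  s.pred += delta_tab[code] * s.step;
                  if (s.pred >  32767) s.pred =  32767;
                  if (s.pred < -32768) s.pred = -32768;
                  /* crude step adaptation: outer codes grow the step */
                  s.step = (code == 0 || code == 3) ? s.step * 2 : (s.step + 1) / 2;
                  if (s.step < 1)     s.step = 1;
                  if (s.step > 16384) s.step = 16384;
                  long long d = (long long)next3[k] - s.pred;
                  err += d * d;
              }
              if (best_err < 0 || err < best_err) {
                  best_err  = err;
                  best_code = c & 3;              /* keep only the first code */
              }
          }

          /* advance the real predictor state by the chosen first code */
          st->pred += delta_tab[best_code] * st->step;
          if (st->pred >  32767) st->pred =  32767;
          if (st->pred < -32768) st->pred = -32768;
          st->step = (best_code == 0 || best_code == 3) ? st->step * 2
                                                        : (st->step + 1) / 2;
          if (st->step < 1)     st->step = 1;
          if (st->step > 16384) st->step = 16384;
          return best_code;
      }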

    Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as
    a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing perception
    and is computationally cheap.

    ...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 18:06:10 2025
    From Newsgroup: comp.arch

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    --------------------
    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

                              Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches tasks.
    Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
    on the task control block array. I am not sure this is a good thing. As
    the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred
    until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 18:19:03 2025
    From Newsgroup: comp.arch

    On 2025-04-01 5:21 p.m., BGB wrote:
    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting
    overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a bit-
    mask of all the registers makes some sense...



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as
    "Load
    LR" or "Load address and Branch", and/or have separate flags (Load
    LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the
    machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I
    think I have run into an issue. It is the timer ISR that switches
    tasks. Since it is an ISR it pushes a subset of registers that it uses
    and restores them at exit. But when exiting and switching tasks it
    spinlocks on the task control block array. I am not sure this is a
    good thing. As the timer IRQ is fairly high priority. If something
    else locked the TCB array it would deadlock. I guess the context
    switching could be deferred until the app requests some other
    operating system function. But then the issue is what if the app gets
    stuck in an infinite loop, not calling the OS? I suppose I could make
    an OS heartbeat function call a requirement of apps. If the app does
    not do a heartbeat within a reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at all frequently.

    System calls for Q+ are slightly faster (but not much) than task
    switches. I just have the system saving state on the stack. I don't
    bother saving the FP registers or some of the other system registers
    that the OS controls. So, it is a little bit shorter than the task
    switch code.

    The only thing that can do a task switch in the system is the time-slicer.

    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and things
    like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people complaining about how old/obsolescent ADPCM is (and/or arguing that people should
    store all their sound effects as Ogg/Vorbis or similar; ...).


    I'm not one much for music, although I play the tunes occasionally. I'm
    a little hard of hearing.

    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM and
    then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
      PCM8, 16kHz     : 121 (128 kbps)
      A-Law, 16kHz    : 284 (128 kbps)
      IMA 4bit, 16kHz : 617 (64 kbps)
      IMA 2bit, 16kHz : 1692 (32 kbps, *)
      ADLQ 2bit, 16kHz: 2000 (32 kbps)
      PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
      Basically sounds muffled, speech is unintelligible.
      But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
      Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a metric,
    but if anything this is just kicking the can down the road slightly.

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.

    Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as
    a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing perception
    and is computationally cheap.

    ...




    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 23:21:24 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    ---------------------
    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 0x55900000
    into the address space of a different process.

    PC-rel addressing works in both cases--because the distance (-rel)
    remains the same,

    and the MMU can translate the code to the same physical, and map
    each area of data individually.

    Different virtual addresses, same code physical address, different
    data virtual and physical addresses.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.

    You also can't CoW the data/bss sections, as this is no longer a shared address space.

    You are trying to "get at" something here, but I can't see it (yet).


    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
    MOV.Q (GBR, 0), Rt
    MOV.Q (Rt, 0), GBR
    Or, in RV terms:
    LD X6, 0(X3)
    LD X3, Disp33(X6)
    Or, RV64G:
    LD X6, 0(X3)
    LUI X5, DispHi
    ADD X5, X5, X6
    LD X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).


    Generally, this is needed if:
    Function may be called from outside of the current binary and:
    Accesses global variables;
    And/or, calls local functions.

    I just use 32-bit or 64-bit displacement constants. Does not matter
    how control arrived at this subroutine, it accesses its data as the
    linker resolved addresses--without wasting a register.


    Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
    caller side...
    SD X3, Disp(SP)
    LD X3, 8(X18)
    LD X6, 0(X18)
    JALR X1, 0(X6)
    LD X3, Disp(SP)

    This is just::

    CALX [IP,,#GOT[funct_num]-.]

    In the 32-bit linking mode this is a 2 word instruction, in the 64-bit
    linking mode it is a 3 word instruction.
    ----------------

    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than
    fork+exec.

    You are 40 years late on that.

    ---------------

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".

    Not sure the thinking behind the RV ABI.

    If RISC-V removed its 16-bit instructions, there is room in its ISA
    to put my entire ISA along with all the non-compressed RISC-V instructions.

    ---------------

    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.

    So, you lose 6 cycles on just under ½ of all subroutine calls,
    while also executing 2-5 instructions manipulating your global
    pointer.


    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
    Non-zero branching overheads, when the feature is used;
    Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use calls/branches.

    My solution gets rid of the dilemma:
    a) the call code is always smaller
    b) the call code never takes more cycles

    In addition, there is a straightforward way to elide the STs of ENTER
    when the memory unit is still executing the previous EXIT.

    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer
    registers).
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 1 23:24:29 2025
    From Newsgroup: comp.arch

    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Apr 1 20:07:41 2025
    From Newsgroup: comp.arch

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be
    quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.
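
    (As a rough sanity check on the earlier estimate: ~100 stores plus ~100
    loads is about 200 memory accesses, times 13 clocks is about 2600 cycles,
    which lands near the ~3000 figure once the moves and other overhead are
    added in.)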

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 2 01:47:26 2025
    From Newsgroup: comp.arch

    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then 1 cycle for each
    register?

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

    One of the reasons I went with treating the register file and thread-
    state as a write-back cache is that HW can read-up the inbound register
    values before starting to write out the outbound values (rather than
    the other way of having to do the STs first so the LDs have a place
    to land.)

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Apr 1 22:55:56 2025
    From Newsgroup: comp.arch

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 0x55900000
    into the address space of a different process.

    But then if thread A (whose state is stored at 0x35900000) sends to
    thread B (whose state is at 0x55900000) a closure whose code points
    somewhere inside 0x24680000, it will end up using the state of thread
    A instead of the state of the current thread.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 23:04:52 2025
    From Newsgroup: comp.arch

    On 4/1/2025 5:19 PM, Robert Finch wrote:
    On 2025-04-01 5:21 p.m., BGB wrote:
    On 3/31/2025 11:58 PM, Robert Finch wrote:
    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store
    Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    Q+3 uses a bitmap of register selection with four more bits selecting
    overlapping groups. It can work with up to 17 registers.


    OK.

    If I did LDM/STM style ops, not sure which strategy I would take.

    The possibility of using a 96-bit encoding with an Imm64 holding a
    bit- mask of all the registers makes some sense...



    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as
    "Load
    LR" or "Load address and Branch", and/or have separate flags (Load
    LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
    These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

    Vs, say, for PIE ELF binaries where it is needed to load a new copy for
    each process instance because of this (well, excluding an FDPIC style
    ABI, but seemingly still no one seems to have bothered adding FDPIC
    support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new
    process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a
    contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they >>>>> could make all of the argument registers and callee save registers
    contiguous.

    Say:
       R0..R3: Special
       R4..R15: Scratch
       R16..R31: Argument
       R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.

    Well, also excluding the possibility where the LDM/STM is essentially
    just a function call (say, if beyond a certain number of registers are to
    be saved/restored, the compiler generates a call to a save/restore
    sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT; you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want
    to add load and store multiple on top of that. They work great for
    ISRs, but not so great for task switching code. I have the
    instructions pushing or popping up to 17 registers in a group. Groups
    of registers overlap by eight. The instructions can handle all 96
    registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run. Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But
    I think I have run into an issue. It is the timer ISR that switches
    tasks. Since it is an ISR it pushes a subset of registers that it
    uses and restores them at exit. But when exiting and switching tasks
    it spinlocks on the task control block array. I am not sure this is a
    good thing. As the timer IRQ is fairly high priority. If something
    else locked the TCB array it would deadlock. I guess the context
    switching could be deferred until the app requests some other
    operating system function. But then the issue is what if the app gets
    stuck in an infinite loop, not calling the OS? I suppose I could make
    an OS heartbeat function call a requirement of apps. If the app does
    not do a heartbeat within a reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.


    OK.

    Ironically, I seem to have comparably low task-switch cost...
    However, each system call is essentially 2 task switches, and it is
    still slow enough to negatively affect performance if they happen at
    all frequently.

    System calls for Q+ are slightly faster (but not much) than task
    switches. I just have the system saving state on the stack. I don't
    bother saving the FP registers or some of the other system registers
    that the OS controls. So, it is a little bit shorter than the task
    switch code.

    The only thing that can do a task switch in the system is the time-slicer.


    In my case, task switch happens by capturing and restoring all of the registers (of which there are 64 main registers, and a few CR's).

    No separate FPU or vector registers (the BJX GPR space, and RISC-V X+F
    spaces, being mostly equivalent).

    The interrupt handlers only have access to physical addresses, and will
    block all other interrupts when running, so there is a need to get
    quickly from the user-program task to the syscall handler task, and then
    back again once done (though, maybe not immediately, as it may instead
    send the results back to the caller task, and then transfer control to a different task).

    Timer interrupt can do scheduling, but mostly avoids doing so unless
    there is no other option (TestKern being mostly lacking in mutexes,
    which makes timer-driven preemptive multitasking a bit risky). However, usually programs will use system calls often enough that it is possible
    to schedule tasks this way, and generally a system call will not be made inside of a critical section.


    So, say, one needs to try to minimize the number of unnecessary system
    calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).





    Unlike on a modern PC, one generally needs to care more about efficiency.

    Hence, all the fiddling with low bit-depth graphics formats, and
    things like my recent fiddling with 2-bit ADPCM audio.

    And, online, one is (if anything) more likely to find people
    complaining about how old/obsolescent ADPCM is (and/or arguing that
    people should store all their sound effects as Ogg/Vorbis or
    similar; ...).


    I'm not one much for music, although I play the tunes occasionally. I'm a little hard of hearing.

    Not so much for music here, but more for storing sound-effects.


    I can note I seem to have a form of reverse-slope hearing impairment...

    Not a new thing, I have either always been this way, or it has happened
    very slowly.


    I can seemingly hear most stuff OK though.
    Except, IRL, I can't hear tuning forks.
    Nor car engines.
    I don't hear the engines.
    I do hear the tires rolling on the ground.
    Nor refrigerators (mostly).
    I sometimes hear the relays when they start/stop,
    or a crackling sound from the radiator coil.
    Using phones sucks hard, can't hear crap...
    Not terribly musically inclined.
    But, instruments don't sound much different from "noise" sounds.

    My ability to hear low-frequencies is a bit weird:
    Square or triangle waves, I hear these well;
    Sine waves, weakly, but I hear them in headphones if volume is high.
    If the volume isn't very high, sine waves become silent.
    Seemingly, these are harder to hear IRL.


    I seem most sensitive to frequencies between around 2 to 8 kHz. Upper
    end of hearing seems to be around 17 kHz (lower end around 1kHz for pure
    sine waves). The lower "absolute" limit seems to be around 8 Hz (but
    more because at this point, a square wave turns from a "tone" into a
    series of discrete pops, 8-20 Hz being sort of a meta-range between
    being tonal and discrete pops).

    Have noted that in YouTube videos where someone is messing with a CRT
    TV, I can still sometimes hear the squeal, particularly if the camera is
    close to the TV. Not seen a CRT IRL in a while though; no obvious sound
    from a VGA CRT monitor though (but, then again, I am using it ATM on an
    old rack server, which sounds kinda like a vacuum cleaner, so might be
    masking it if it is making a noise).


    Have noted that I still understand speech fine with a 2-8 kHz bandpass
    (with steep fall-off). I don't understand speech at all with a 2kHz
    low-pass. So, whichever parts I use for intelligibility seem to be
    between 2 and 8kHz. Had noted if I split it into 2-4 or 4-8 kHz bands,
    either works, though individually each has a notably worse quality than combined 2-8 kHz.

    The 1-2 kHz range can be heard, but doesn't seem to contain much as far
    as intelligibility goes, but its presence or absence does seem to alter
    vowel sounds slightly.

    A 1-8 kHz bandpass sounds mostly natural to me. Though, cats seem to
    respond unfavorably to band-passed audio (if cats are neutral to the
    original, but tense up and dig in their claws if I play band-passed
    audio, it seems they hear a difference).


    Although, I was using music as test-cases mostly as they can give a
    better idea of the relative audio quality than a short sound effect.

    But, for things that are going to be embedded into an EXE or DLL,
    generally these are ideally kept at a few kB or less.


    For long form audio, there is more reason to care about audio quality,
    but for something like a notification ding, not as much. Do preferably
    want it to not sound like "broken crap" though. And, if any speech is
    present, ideally it needs to be intelligible.


    In terms of being small and "not sounding like crap":
    ADPCM:
    Works well enough, but can't go below 2 bits per sample.
    Delta-Sigma:
    1 bit per sample, but sounds horrid much under 64 kHz.


    MP3 and Vorbis work well at 96 to 128 kbps, but:
    Are complex and expensive formats to decode;
    Don't give acceptable results much below around 40 kbps.

    At lower bitrates, the artifacts from MP3 and Vorbis can become rather obnoxious (lots of squealing and whistling and sounds like broken glass
    being shaken in a steel can).

    I actually much prefer the sound of ADPCM for low bitrates. Muffled and
    gritty is still preferable to "rattling a steel can full of broken
    glass" (simple loss of quality rather than the addition of other more obnoxious artifacts).



    From what I gather, the telephone network used 8kHz as a standard
    sampling rate, with one of several formats:
    u-Law, in the US
    A-Law, in Europe
    4-bit ADPCM, for lower-priority long-distance links;
    When not using u-Law or A-Law.
    2-bit ADPCM, for "overflow" links (*).

    *: Apparently, if there were too many long distance calls over a given long-distance link, they would drop to a 2-bit ADPCM (running at 16 kbps).

    I was testing with 16kHz 2-bit ADPCM, as while both 16kHz 2-bit and 8kHz
    4-bit ADPCM are both 32 kbps, the 16kHz sounds better to me (and intelligibility is higher).


    Though, if spoken language is not used, it makes sense to drop to 8kHz.
    Using 8kHz as standard is weak as intelligibility is a lot worse.

    But, I guess the thinking was "minimum where you can still 'mostly' hear
    what they are saying...".


    Even if 8kHz was standard on the telephone network, I can't easily
    understand what anyone is saying over the phone (speech is often very
    muffled and there is often a loud/obnoxious hiss).

    Actually, weirdly, actual phone quality is somehow *worse* than my
    experiments with low bitrate ADPCM. Like, the low-bit depth ADPCM mostly
    just sounds "gritty" (without any obvious hiss). Like, the phone adds
    extra levels of badness beyond just any compression issues (probably
    also crappy microphones and speakers, etc, as well).

    Using headphones with a phone is "slightly" better, but there is often
    still a rather loud/annoying hiss, even when the sound is coming from an artificial source.

    Poor quality ADPCM, by itself, does not have this particular issue
    (actually, it almost seems as if the ADPCM somehow "enhances" the audio
    and compensates slightly for the low sample rate, making details easier
    to hear compared with "cleaner" PCM audio versions).


    For sound-effects, could drop to 4kHz, but there is fairly significant distortion. Like, if you have a notification ding, it doesn't really
    sound like a bell anymore.


    So, say (ADPCM modes):
    16kHz 4-bit: Mostly Good, but needs 64 kbps.
    16kHz 2-bit: Slightly muffled, gritty, 32 kbps.
    8kHz 4-bit: More obvious muffling (but not gritty);
    8kHz 2-bit: Muffled and gritty (16 kbps);
    4kHz 4-bit: Serious muffle / distortion, 16 kbps.
    4kHz 2-bit: Muffle + distortion + grit, 8 kbps.

    Possible merit of 4kHz 2-bit is that it allows putting a bell sound
    effect in around 500 bytes. Downside is that it is no longer
    particularly recognizable as a bell (and goes more from "ding" to "plong").

    At 4 kHz, speech is basically almost entirely unintelligible, but one
    can still hear that speech is present (its "shape" can still be heard,
    but words are no longer recognizable; sort of like the muffling when people are talking in a different room, where one can still hear that they are saying "something").

    At 2 kHz; it is barely recognizable as being speech (it sounds almost
    more like wind). Percussive sounds are still recognizable though (so,
    music is turned into "howling wind with drums").


    Early 90s games (such as Doom) mostly used 11 kHz as standard.
    IMHO, 16kHz is a better quality/space tradeoff.
    Where, 22/32/44 can sound better, but may not be worth the overhead.
    Sample rates above 44 kHz are overkill though.

    I, personally, can't hear the difference between 44 and 48 kHz audio.
    I suspect anything 48kHz and beyond is likely needless overkill.



    Then again, I did note that I may need to find some other "quality
    metric" for audio, as RMSE isn't really working...

    At least going by RMSE, the "best" option would be to use 8-bit PCM
    and then downsample it.

    Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but
    subjectively the 2-bit ADPCM sounds significantly better.

    Say: for 16kHz, and a test file (using a song here):
       PCM8, 16kHz     : 121 (128 kbps)
       A-Law, 16kHz    : 284 (128 kbps)
       IMA 4bit, 16kHz : 617 (64 kbps)
       IMA 2bit, 16kHz : 1692 (32 kbps, *)
       ADLQ 2bit, 16kHz: 2000 (32 kbps)
       PCM8, 4kHz      : 242  (32 kbps)

    However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
       Basically sounds muffled, speech is unintelligible.
       But, it would be the "best" option if going solely by RMSE.

    Also A-Law sounds better than PCM8 (at the same sample rate).
       Even with the higher RMSE score.

    Seems like it could be possible to do RMSE on A-Law samples as a
    metric, but if anything this is just kicking the can down the road
    slightly.
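
    Something like the following, as a rough sketch of that idea (the
    companding curve here is the continuous u-Law formula standing in for
    a real A-Law/u-Law encoder; the function names are made up):

       #include <math.h>
       #include <stdint.h>
       #include <stddef.h>

       static double compand(int16_t s)
       {
           double x = s / 32768.0;                    /* normalize to -1..1 */
           double y = log(1.0 + 255.0 * fabs(x)) / log(256.0);
           return (s < 0) ? -y : y;
       }

       /* RMSE measured on companded samples rather than raw PCM. */
       double companded_rmse(const int16_t *ref, const int16_t *dec, size_t n)
       {
           double acc = 0.0;
           for (size_t i = 0; i < n; i++) {
               double d = compand(ref[i]) - compand(dec[i]);
               acc += d * d;
           }
           return sqrt(acc / (double)n);
       }

    This at least weights small-signal errors more like hearing does, but,
    as noted, it is still only kicking the can down the road.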

    Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds
    better than the 2-bit ADPCM's at least...


    *: Previously it was worse, around 4500, but the RMSE score dropped
    after switching it to using a similar encoder strategy to ADLQ, namely
    doing a brute-force search over the next 3 samples to find the values
    that best approximate the target samples.
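
    Roughly, the encoder-side search looks like the sketch below;
    decode_step() here is a made-up placeholder 2-bit quantizer (not ADLQ
    or the IMA variant), just to show the shape of the search:

       typedef struct { int predictor, step; } adpcm_state;

       /* placeholder 2-bit decode step: bit 0 = magnitude, bit 1 = sign */
       static int decode_step(adpcm_state *st, int code)
       {
           int diff = (code & 1) ? st->step : (st->step >> 1);
           st->predictor += (code & 2) ? -diff : diff;
           st->step = (code & 1) ? (st->step * 2) : ((st->step * 3) >> 2);
           if (st->step < 1) st->step = 1;
           if (st->step > 16384) st->step = 16384;
           if (st->predictor >  32767) st->predictor =  32767;
           if (st->predictor < -32768) st->predictor = -32768;
           return st->predictor;
       }

       /* Try all 4^3 = 64 combinations of the next three 2-bit codes
          against a copy of the decoder state, keep the best combination. */
       static void encode_next3(adpcm_state *st, const short *target, int out_codes[3])
       {
           long best_err = -1;
           int  best = 0;
           for (int c = 0; c < 64; c++) {
               adpcm_state trial = *st;
               long err = 0;
               for (int i = 0; i < 3; i++) {
                   long d = decode_step(&trial, (c >> (2 * i)) & 3) - target[i];
                   err += d * d;
               }
               if (best_err < 0 || err < best_err) { best_err = err; best = c; }
           }
           for (int i = 0; i < 3; i++) {
               out_codes[i] = (best >> (2 * i)) & 3;
               decode_step(st, out_codes[i]);   /* advance the real state */
           }
       }

    A variant would commit only the first code and slide the window along
    by one sample, at 3x the search cost.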

    Though, which is "better", or whether or not even lower RMSE
    "improves" quality here, is debatable (the PCM8 numbers clearly throw
    using RMSE as a quality metric into question for this case).

    Ideally I would want some metric that better reflects hearing
    perception and is computationally cheap.

    ...





    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 1 23:19:11 2025
    From Newsgroup: comp.arch

    On 4/1/2025 9:55 PM, Stefan Monnier wrote:
    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 55900000
    into the address space of a different process.

    But then if thread A (whose state is stored at 0x35900000) sends to
    thread B (whose state is at 55900000) a closure whose code points
    somewhere inside 0x24680000, it will end up using the state of thread
    A instead of the state of the current thread.


    Generally, threads and processes are seen as different...

    But, yeah, passing lambdas between processes is theoretically possible
    in this scheme, but not advised.

    If done, any pointers captured by the lambda would likely point into the originating process, but if called with a GBR from the new process, any
    global variables would either be mapped to the corresponding DLL index
    in the new process, or to NULL (if from a DLL that was not loaded in
    the new process), or possibly to a random address if it was from the
    main EXE and the EXEs differ...


    But, yeah, inter-process function pointers aren't really a thing, and
    should not be a thing.

    The eventual plan is to disallow them in the memory protection scheme,
    but enforcing the ACL-based memory protection is still on the TODO
    list (it was only very recently that stuff actually started running in
    a proper usermode and so can't just stomp all over the kernel's memory...).

    But... Yeah, the kernel and program are still hanging out in the same
    VAS, along with every other running program...

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 2 00:43:39 2025
    From Newsgroup: comp.arch

    On 4/1/2025 6:21 PM, MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    ---------------------
    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 55900000
    into the address space of a different process.

    PC-rel addressing works in both cases--because the distance (-rel)
    remains the same,

    and the MMU can translate the code to the same physical, and map
    each area of data individually.

    Different virtual addresses, same code physical address, different
    data virtual and physical addresses.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.


    OK.

    PE/COFF had defined Abs64 relocs, but I am using a 48-bit VAS.

    Would not have made sense to define separate Abs48 relocs, but much of
    the time, we can just assume the HOBs are zero.

    Well, except for function pointers, where the base-reloc handling
    detects pointers into ".text" and does some special secret-sauce magic regarding the HOBs to make sure they are correctly tagged.


    Binaries are not generally fully PIE though, but are instead
    base-relocated (more like EXE/DLL handling in Windows). Though, most
    things within the core proper are either PC-rel or GBR rel, and there
    are usually a relatively small number of base-relocations.

    Things like DLL calls are essentially absolute addressed though. Where, mapping instances at different virtual addresses would be messy for
    things like DLL handling (in the absence of a GOT or similar).


    You also can't CoW the data/bss sections, as this is no longer a shared
    address space.

    You are trying to "get at" something here, but I can't see it (yet).


    Shared address space assumes all processes have the same page tables and shared address mappings and TLB contents (though, ACL checking can be different, as the ACL/KRR stuff is not based on having separate contents
    in the page tables or TLB, *).

    By definition, CoW can't be used in this constraint.

    But, multiple VAS's adds new problems (both hassles and potential
    performance effects, so better here to delay this if possible).


    *: A smaller 4-entry full-assoc cache is used for ACL checks, so it is
    more of a "what access does the current task have to this particular
    ACL" check. But, admittedly, some of this part is still TODO regarding
    making use of it in the OS.



    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data
    sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
       MOV.Q (GBR, 0), Rt
       MOV.Q (Rt, 0), GBR
    Or, in RV terms:
       LD    X6, 0(X3)
       LD    X3, Disp33(X6)
    Or, RV64G:
       LD    X6, 0(X3)
       LUI   X5, DispHi
       ADD   X5, X5, X6
       LD    X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every
    loaded binary (can be assumed read-only from userland).
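
    In C terms, the reload ritual amounts to something like this (names
    hypothetical, just restating the (GBR+0) table layout above):

       /* The first slot of a module's data section holds the process-wide
          table of data-section pointers, indexed by a per-module number
          fixed at load time (illustrative sketch only). */
       typedef void *module_data_t;

       static inline module_data_t reload_gbr(module_data_t cur_gbr, int my_index)
       {
           module_data_t *table = *(module_data_t **)cur_gbr;  /* (GBR+0) */
           return table[my_index];   /* EXE assumes index 0; a DLL's index
                                        is fixed up by a base-reloc at load */
       }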


    Generally, this is needed if:
       Function may be called from outside of the current binary and:
         Accesses global variables;
         And/or, calls local functions.

    I just use 32-bit or 64-bit displacement constants. Does not matter
    how control arrived at this subroutine, it accesses its data as the
    linker resolved addresses--without wasting a register.


    GBR or GP is specially designated as a global pointer though.
    Not so starved for registers that it would make sense to reclaim it as a
    GPR.

    But, yeah, do need to care how control can arrive at a given function.




    Though, still generally lower average-case overhead than the strategy
    typically used by FDPIC, which would handle this reload process on the
    caller side...
       SD    X3, Disp(SP)
       LD    X3, 8(X18)
       LD    X6, 0(X18)
       JALR  X1, 0(X6)
       LD    X3, Disp(SP)

    This is just::

        CALX    [IP,,#GOT[funct_num]-.]

    In the 32-bit linking mode this is a 2 word instruction, in the 64-bit linking mode it is a 3 word instruction.
    ----------------

    OK.

    Neither BJX nor RISC-V has special instructions to deal with FDPIC call semantics.



    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than
    fork+exec.

    You are 40 years late on that.


    I am just doing it the Windows (or Cygwin) way...

    Most POSIX-style programs still work, but with a slightly higher risk that "stuff may catastrophically explode" (say, if one tries to use "fork()"
    to fork off copies of the parent process, and then returns from the call-frame that called "fork()").

    Fork could be made to clone the global variables, though avoiding
    tangled addresses could be an issue (could maybe be done by relying on debuginfo or similar, to walk the globals and then redirect any pointers
    from the old data/bss into the new one; kinda SOL for anything on the
    heap though).

    Better may just be to be like "yeah, fork() doesn't really work, don't
    use it...".
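
    FWIW, on targets where fork() is a problem, the usual portable escape
    hatch is a spawn-style call; a plain-POSIX sketch (nothing specific to
    the OS being discussed here):

       #include <spawn.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>

       extern char **environ;

       /* Launch a child without fork(): posix_spawn() does the
          "CreateProcess-like" create-and-exec in one step. */
       int run_child(char *const argv[])
       {
           pid_t pid;
           int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
           if (rc != 0) {
               fprintf(stderr, "posix_spawn failed: %d\n", rc);
               return -1;
           }
           int status;
           waitpid(pid, &status, 0);   /* wait for it, like a simple system() */
           return status;
       }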



    ---------------

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit
    register numbers".

    Not sure the thinking behind the RV ABI.

    If RISC-V removed its 16-bit instructions, there is room in its ISA
    to put my entire ISA along with all the non-compressed RISC-V inst-
    ructions.


    Yeah, errm, how do you think XG3 came about?...

    I just sort of dropped the C instructions and shoved nearly the entirety
    of XG2 into that space.

    There would still have been half the encoding space left, if predication
    were disallowed.

    But, say, RV64G + XG3 (sans predication) + 2/3 of the 'C' extension,
    would be a bit picky...


    Granted, did need to shuffle the bits for the ISAs to be
    encoding-compatible; and went a little further than the bare minimum to
    avoid dog chew (gluing them together with entirely mismatched encodings
    and disjoint register numbering would have been possible; but I wanted
    at least some semblance of encoding consistency between them).



    ---------------

    Prolog needs a call, but epilog can just be a branch, since no need to return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.

    So, you lose 6 cycles on just under ½ of all subroutine calls,
    while also executing 2-5 instructions manipulating your global
    pointer.


    Possibly, but I don't think it is quite that bad on average...

    Would need to run some stats and do some math to try to figure out the percentages and relative impact from each of these.


    But, even with all this, and using stack canaries (which add around 6 or
    so instructions when applicable), it is still outperforming GCC's RV64G
    output (along with smaller binaries).



    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT, and EXIT also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
       Non-zero branching overheads, when the feature is used;
       Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use
    calls/branches.

    My solution gets rid of the dilemma:
    a) the call code is always smaller
    b) the call code never takes more cycles

    In addition, there is a straightforward way to elide the STs of ENTER
    when the memory unit is still executing the previous EXIT.


    OK.
    I was trying to keep the CPU implementation from being too complicated.


    In my case though, there is an advantage over plain RV64G:
    I have a Load/Store Pair, so need fewer Load/Store operations.

    Though, my RV+Jx experiment does also have this...



    Though, there were also variants defined for RV32 but not for RV64 (because apparently there was indecision about encodings, and some arguments from
    the "opcode fusion" camp that 64-bit RV processors could fuse groups of
    LD or SD instructions...).

    Decided to leave out complaining about "opcode fusion" distractions (from actually addressing ISA issues) and the seeming over-reliance on SpecInt and CoreMark to drive ISA design choices...


    Granted, one might say the same about Doom, but at least I am treating
    Doom more as a representation of a workload, and not the end-goal
    arbiter of what is added or dropped.



    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer
    registers).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Apr 3 10:09:20 2025
    From Newsgroup: comp.arch

    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the
    whole point of sharing an address space is to be able to exchange data.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 3 12:22:57 2025
    From Newsgroup: comp.arch

    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.


    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is not allowed.


    Though, things may change later, there is a plan to move to separate global/local address ranges. Likely things like code will remain in the
    shared range, and program data will be in the local range.


    Stefan

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 3 23:49:31 2025
    From Newsgroup: comp.arch

    On 2025-04-03 1:22 p.m., BGB wrote:
    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and
    should
    not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the
    whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.
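
    Roughly, per the RISC-V privileged spec (mstatus.MPRV is bit 17, and
    mstatus.MPP selects whose translation/permissions apply), a small and
    untested sketch of M-mode code using it:

       #include <stdint.h>

       #define MSTATUS_MPRV (1UL << 17)

       /* Load through the previous mode's address space; assumes M-mode,
          MPP already holding the target mode, and interrupts disabled
          around the sequence. */
       static inline uint64_t load_via_prev_mode(const uint64_t *vaddr)
       {
           uint64_t val;
           __asm__ volatile (
               "csrs mstatus, %[mprv]\n\t"   /* memory ops now use MPP's translation */
               "ld   %[val], 0(%[ptr])\n\t"
               "csrc mstatus, %[mprv]"       /* back to normal M-mode accesses */
               : [val] "=&r" (val)
               : [ptr] "r" (vaddr), [mprv] "r" (MSTATUS_MPRV)
               : "memory");
           return val;
       }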


    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is not allowed.


    Though, things may change later, there is a plan to more to separate global/local address ranges. Likely things like code will remain in the shared range, and program data will be in the local range.

    Thinking of having a CPU local address space in Q+ to store vars for
    that particular CPU. It looks like only a small RAM is required. I guess
    it would be hardware thread local storage. May place the RAM in the CPU itself.


             Stefan


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 4 12:41:39 2025
    From Newsgroup: comp.arch

    On 4/3/2025 10:49 PM, Robert Finch wrote:
    On 2025-04-03 1:22 p.m., BGB wrote:
    On 4/3/2025 9:09 AM, Stefan Monnier wrote:
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing,
    and should
    not be a thing.

    AFAIK, this point was brought in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they may often be implicit, e.g. within the method table of objects), and the whole point of sharing an address space is to be able to exchange data.


    Or, to allow for NOMMU operation, or reduce costs by not having
    context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    I am not aware of this one. If it is in the privileged spec or similar,
    may have missed it.

    Thus far, my core doesn't implement that much of the RV privileged spec, mostly just the userland ISA. If I wanted to run an RV OS, it is
    debatable if it would make more sense to try to mimic a hardware
    interface it understands, or have the "firmware" manage the real HW interfaces, and then fake the rest in software.




    Some data sharing is used for IPC, but directly sharing function
    pointers between processes, or local memory (stack, malloc, etc), is
    not allowed.


    Though, things may change later, there is a plan to more to separate
    global/local address ranges. Likely things like code will remain in
    the shared range, and program data will be in the local range.

    Thinking of having a CPU local address space in Q+ to store vars for
    that particular CPU. It looks like only a small RAM is required. I guess
    it would be hardware thread local storage. May place the RAM in the CPU itself.


    I am aware of at least a few CPUs that have banked register sets that
    may be backed to memory addresses (with the CPU itself evicting
    registers on context switch).

    I have not done so.

    I had considered the possibility of 4 rings each with their own set of registers. This could make things like interrupts and system calls
    cheaper, but (ironically) using this would make context switching more expensive.

    An intermediate option could be a special RAM area for a "task cache",
    say, 8 or 16K, and then have a "Task Cache Miss" interrupt for cases
    where one tries to switch to a task's register bank that isn't in the
    cache. While this would have a high cost (for a task cache miss), if
    the cache is bigger than the number of currently running tasks, it could
    still work out ahead.

    But, better for performance would be if the task-cache were RAM backed
    and the HW spills and reloads from RAM (then, one could have maybe 64K
    or 256K or more for task register banks; probably enough for a decent
    number of active PIDs).



    Though naive, "always save and restore all the registers to RAM" seems
    to have a fairly reasonable cost (and is among the lowest in "actual
    task-switch cost", aside from the possibility of "let hardware lazily
    spill and reload register banks from main RAM", which could potentially
    be lower).


    The main "bad" cost of switching between processes is the storm of
    TLB misses that would happen if not using a shared address space
    (granted, there are "global pages"). In my design, there are not true
    global pages, rather pages that are "global" within an ASID group (if
    the low 10 bits of the page's ASID are 0, it is assumed global within
    this group, whereas non-zero values are ASID specific; and the high 6
    bits of the ASID gives the group, where pages are not global between
    groups).

    Though, my existing OS, still being single address space, doesn't make
    use of this. The idea is that ASID will be tied to PID.

    As it is unclear how best to scale this past 1024 PIDs, likely the
    ASID will be the PID modulo 1024, and reassigning a previously
    assigned ASID to a new PID would require a TLB flush.
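
    As a sketch of that policy in C (hypothetical names, not actual kernel
    code): high 6 ASID bits are the group, low 10 bits are a per-process
    slot with 0 reserved as "global within the group", and reusing a slot
    for a different PID forces a flush of that ASID.

       #include <stdint.h>

       #define ASID_SLOTS   1023   /* usable low-10-bit values, 1..1023 */

       static int asid_owner[ASID_SLOTS + 1];      /* PID currently holding each slot */

       extern void tlb_flush_asid(uint16_t asid);  /* hypothetical TLB maintenance hook */

       uint16_t asid_for_pid(int pid, int group)
       {
           int slot = 1 + (pid % ASID_SLOTS);      /* 1..1023; 0 stays "global" */
           uint16_t asid = (uint16_t)(((group & 0x3F) << 10) | slot);

           if (asid_owner[slot] != 0 && asid_owner[slot] != pid)
               tlb_flush_asid(asid);               /* slot reused by a new PID */

           asid_owner[slot] = pid;
           return asid;
       }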



             Stefan



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 4 21:07:09 2025
    From Newsgroup: comp.arch

    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote: -------------------------
    Why is it not 13 cycles to get started and then each register is one
    cycle.

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
    It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

    By placing all the CRs together, and treating thread-state as a write-
    back cache, all the storing and loading happens without any
    serialization,
    in cache line quanta, where the LD can begin before the STs
    begin--giving
    the overlap that reduces the cycle count.

    For example, once a core has decided to run "this-thread" all it has to
    do is to execute a single HR instruction which writes a pointer to
    thread-
    state. Then upon SVR, that thread begins running. Between HE and SVR, HW
    can preload the inbound data, and push out the outbound data after the
    inbound data has arrived.

    But, also note: Due to the way CR's are mapped into MMI/O memory, one
    core can write that same HR available CR on another core and cause a
    remote context switch of that other core.

    The main use is more likely to be remote diagnostics of a core that
    has quit responding to the system (crashed hard) so its CRs can be
    read out and examined to see why it quit responding.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 4 21:13:27 2025
    From Newsgroup: comp.arch

    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context
    switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Apr 4 23:45:51 2025
    From Newsgroup: comp.arch

    On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context
    switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    So, there is a need to be able to go back two or three levels? I suppose
    it could also be done by manipulating the stack, although adding an
    extra bit may be easier. How often does it happen?

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 16:37:19 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:

    On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

    Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GuestOS has a 64-bit VAS
    HyperVisor has a 64-bit VAS
    and so does
    Secure, which has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    And this has nothing to do with system calls; it has to do with
    accessing (rather simultaneously) any of the 4 VASs.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    So, there is a need to be able to go back two or three levels? I suppose
    it could also be done by manipulating the stack, although adding an
    extra bit may be easier. How often does it happen?

    I have no idea, and I suspect GuestOS people don't either.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Apr 5 18:31:44 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.
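
    For example, a minimal sketch of using the unprivileged load form from
    privileged C code (AArch64, register allocation left to the compiler;
    purely illustrative):

       #include <stdint.h>

       /* Load a 64-bit value through the EL0 (user) translation regime,
          i.e. with unprivileged permission checking.  With PSTATE.UAO set,
          the same instruction instead uses the current (privileged)
          level's permissions. */
       static inline uint64_t load_as_user(const uint64_t *user_va)
       {
           uint64_t val;
           __asm__ volatile ("ldtr %0, [%1]"
                             : "=r" (val)
                             : "r" (user_va)
                             : "memory");
           return val;
       }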

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Apr 5 17:57:50 2025
    From Newsgroup: comp.arch

    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app? It's why I assumed it
    found the mode from the stack. Those two select bits have to be set
    somehow. It seems like extra code to access the right address space.
    I got the thought to use the three bits a bit differently.
    111 = use current mode
    110 = use mode from stack
    100 = debug? mode
    011 = secure (machine) mode
    010 = hypervisor mode
    001 = supervisor mode
    000 = user/app mode
    I was just using inline code to select the proper address space. But if
    it is necessary to dig around to figure out the mode, it may turn into a subroutine call.



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 23:06:38 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables.

    When Secure Monitor executes a "user" instruction, which layer
    of the SW stack is accessed:: {HV, SV, User} ??

    Is this 1-layer down the stack, or all layers down the stack ??

    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    I/O MMU has similar issues to solve, in that a device can read
    write/execute-only memory and write read/execute-only memory.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    I call these "paranoid" applications--generally requiring no
    privilege, but they don't want GuestOS or HyperVisor to look
    at their data and at the same time, they want GuestOS or HV
    to perform I/O to said data--so some devices have an effective
    privilege above that of the driver commanding them.

    I understand the reasons and rationale.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 5 23:11:00 2025
    From Newsgroup: comp.arch

    On Sat, 5 Apr 2025 21:57:50 +0000, Robert Finch wrote:

    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.
    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    More interesting is the concept that there are multiple HVs that
    have been virtualized--in this case the sender of the address may
    think it has HV privilege but is currently operating as if it only
    has GuestOS privilege. ...

    It's why I assumed it found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.
    I got the thought to use the three bits a bit differently.
    111 = use current mode
    110 = use mode from stack
    100 = debug? mode
    011 = secure (machine) mode
    010 = hypervisor mode
    001 = supervisor mode
    000 = user/app mode
    I was just using inline code to select the proper address space. But if
    it is necessary to dig around to figure the mode, it may turn into a subroutine call.

    All the machines I have used/designed/programmed in the past use 000
    as highest privilege and 111 as lowest.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Apr 6 14:21:26 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    It's why I assumed it
    found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.

    I haven't spent much time with RISC-V, but surely the processor
    has a state register that stores the current mode, and which
    must be preserved over exceptions/upcalls, which would require
    that they be recorded in an exception syndrome register for
    restoration when the upcall returns.
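
    In RISC-V terms the prior privilege does get recorded on a trap:
    mstatus.MPP (bits 12:11) for traps into M-mode, sstatus.SPP for traps
    into S-mode. E.g., a small sketch of checking it (assuming M-mode):

       #include <stdint.h>

       /* Returns nonzero if the trap came from U-mode (MPP == 0b00). */
       static inline int trap_came_from_user(void)
       {
           uint64_t ms;
           __asm__ volatile ("csrr %0, mstatus" : "=r" (ms));
           return ((ms >> 11) & 3) == 0;
       }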

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Apr 6 14:32:43 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:


    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.

    When Secure Monitor executes a "user" instructions which layer
    of the SW stack is accessed:: {HV, SV, User} ?

    The Secure Monitor will never execute a user instruction. If
    it does, it will act as any other load/store executed by the
    secure monitor.

    The "user" instructions are only used by a bare-metal OS
    or a guest OS to access user application address spaces.


    Is this 1-layer down the stack, or all layers down the stack ??

    One layer down, and only the least privileged non-user level.


    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege level.

    [*] A primary goal must be to avoid privilege level
    upcalls as much as possible.



    I/O MMU has similar issues to solve in that a device can Read
    write-execute only memory and write read-execute only memory.

    By the time the IOMMU translates the inbound address, it is
    a physical machine address, so I don't see any issue here.
    And in the ARM case, the IOMMU translation tables are identical
    to the processor translation tables in format and can actually
    share some or all of the tables between the core(s) and the IOMMU.

    Note that for various reasons, the IOMMU translation tables
    may cover only a portion of the target address space at any particular privilege level.


    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    I call these "paranoid" applications--generally requiring no
    privilege, but they don't want GuestOS of HyperVisor to look
    at their data and at the same time, they want GuestOS or HV
    to perform I/O to said data--so some devices have a effective
    privilege above that of the driver commanding them.

    I understand the reasons and rational.

    The primary reason is for encrypted video decoding where
    the decoded video is fed directly to the graphics processor
    and the end-user cannot intercept the decrypted video stream. Closing
    the barn door after the horse has left, but c'est la vie.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 6 15:01:31 2025
    From Newsgroup: comp.arch

    On 2025-04-06 10:21 a.m., Scott Lurndal wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left
    wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Yes, Q+ works that way, I think RISCV does as well. Q+ stacks the PC and
    SR on an internal stack which is basically a shift register. The TOS is visible as a CR. The mode state is saved in the SR. Interrupts and
    exceptions do not have to store the state in memory. The far end of the
    stack is hard coded to do a reset if the stack underflows.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    Allows two-directional virtualization, I think. Q+ has all exceptions and interrupts going to the secure monitor, which can then delegate them back
    to a lower level.

    It's why I assumed it
    found the mode from the stack. Those two select bits have to set
    somehow. It seems like extra code to access the right address space.'

    I haven't spent much time with RISC-V, but surely the processor
    has a state register that stores the current mode, and which
    must be preserved over exceptions/upcalls, which would require
    that they be recorded in an exception syndrome register for
    restoration when the upcall returns.


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 7 00:51:08 2025
    From Newsgroup: comp.arch

    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Apr 7 14:04:37 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-06 10:21 a.m., Scott Lurndal wrote:

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    Allows two directional virtualization I think. Q+ has all exceptions and interrupts going to the secure monitor, which can then delegate it back
    to a lower level.

    If that adds latency to the interrupt handler, that will not
    be a positive benefit.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Apr 7 14:09:50 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.
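
    To make that last point concrete: about the only architectural hint an
    x86 guest gets is CPUID's "hypervisor present" bit (leaf 1, ECX bit 31),
    and a distribution can boot the same kernel either way and merely peek
    at it for minor tuning. A minimal sketch (GCC/Clang on x86-64; the
    helper name and messages are illustrative):

        /* Check CPUID leaf 1, ECX bit 31: set when a hypervisor announces
         * itself, clear on bare metal (or when the HV hides itself). */
        #include <cpuid.h>
        #include <stdio.h>

        static int hypervisor_present(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return 0;               /* CPUID leaf 1 not available */
            return (ecx >> 31) & 1;
        }

        int main(void)
        {
            printf("hypervisor present: %s\n",
                   hypervisor_present() ? "yes" : "no (or hidden)");
            return 0;
        }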

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 9 00:23:09 2025
    From Newsgroup: comp.arch

    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.

    Thank you for updating a piece of history apparently I did not
    live through !!
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 15 00:43:43 2025
    From Newsgroup: comp.arch

    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Apr 15 14:02:37 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.
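
    A toy model of that distinction -- purely illustrative, not any real
    architecture's register layout -- is sketched below: each privilege
    layer carries its own Interrupt Enable, and an incoming interrupt is
    gated only by the enable of the layer it targets, not by whatever the
    currently running layer did with its own bit.

        /* Illustrative model: per-layer interrupt enables rather than one
         * core-wide IE bit.  Layer names and numbering are assumptions. */
        #include <stdbool.h>
        #include <stdio.h>

        enum layer { USER = 0, SUPER = 1, HYPER = 2, SECURE = 3, NLAYERS = 4 };

        struct core_state {
            bool ie[NLAYERS];      /* one Interrupt Enable per layer */
            enum layer running;    /* layer currently executing      */
        };

        /* Deliverable iff the *target* layer has interrupts enabled and is
         * at least as privileged as the code currently running. */
        static bool deliverable(const struct core_state *c, enum layer target)
        {
            return c->ie[target] && target >= c->running;
        }

        int main(void)
        {
            struct core_state c = {
                .ie = { true, false, true, true },   /* GuestOS has DI'd */
                .running = SUPER,
            };
            printf("HV-targeted interrupt:    %d\n", deliverable(&c, HYPER)); /* 1 */
            printf("Super-targeted interrupt: %d\n", deliverable(&c, SUPER)); /* 0 */
            return 0;
        }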


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare and only if the HV overcommits physical
    memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.
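
    For instance, a guest thread that timestamps an access it expects to be
    fast can notice the blip left behind by an HV servicing a fault on its
    behalf. A rough sketch of such a probe follows; the 10-microsecond
    threshold is an arbitrary illustration, not a calibrated figure.

        /* Rough sketch of the observable side of that covert channel:
         * time one access and flag anything far slower than expected. */
        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        static volatile uint64_t probe_word;   /* normally resident */

        static uint64_t access_latency_ns(void)
        {
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            (void)probe_word;                  /* the measured access */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000ull
                 + (uint64_t)(t1.tv_nsec - t0.tv_nsec);
        }

        int main(void)
        {
            uint64_t ns = access_latency_ns();

            if (ns > 10000)
                printf("suspiciously slow access: %llu ns\n",
                       (unsigned long long)ns);
            else
                printf("ordinary access: %llu ns\n",
                       (unsigned long long)ns);
            return 0;
        }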


    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    The higher privilege level must not unilaterally modify guest OS or
    application state.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 15 20:46:28 2025
    From Newsgroup: comp.arch

    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.

    Yes, but we have previously established HV does its virtualization
    without touching GuestOS memory. {Which is why I used page fault as
    the example.}


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That there is some control
    is (IS) required--how many seems to be a choice at this stage.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedence ??

    Anyone know of any literature where this was simulated or measured ??


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare and only if the HV overcommits physical
    memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??


    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring. Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between them, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 16 14:07:36 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:




    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    AArch64 also has interrupt enables at each privilege level.


    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That there is some control
    is (IS) required--how many seems to be a choice at this stage.

    I'm not aware of any architecture that supports virtualization that
    doesn't have enables for each privilege level; either there are
    distinct levels in hardware, or the hypervisor needs to handle
    all interrupts and inject them into the guest in some fashion. Best
    to have hardware support for all of this rather than the overhead
    of the HV handling all interrupts and the consequent context switches.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedence ??

    The former is optimal. Assuming the guest is independent of the
    HV, any delay in the critical section (e.g. due to an HV interrupt
    being handled) is inconsequential. The critical section is only
    critical to the privilege layer it occurs on.

    <snip>



    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??

    Generally, yes. Usually modeled with an offset register in
    the HV that gets applied to the guest view of current time.
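
    That is essentially the AArch64 arrangement: the guest's virtual counter
    is the physical counter minus an HV-owned offset (CNTVCT = CNTPCT -
    CNTVOFF_EL2). A sketch of the arithmetic, with an ordinary monotonic
    clock standing in for the hardware counter so it compiles anywhere:

        /* Sketch: guest "virtual time" = physical time - HV-owned offset,
         * in the style of AArch64's CNTVOFF_EL2.  read_physical_count()
         * is a stand-in for the real counter read. */
        #include <stdint.h>
        #include <time.h>

        static uint64_t read_physical_count(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
        }

        struct vcpu {
            uint64_t cntvoff;   /* written by the HV at create/migrate time */
        };

        /* What the guest sees when it reads its virtual counter. */
        static inline uint64_t guest_virtual_count(const struct vcpu *v)
        {
            return read_physical_count() - v->cntvoff;
        }

        /* HV-side helper: make the guest's clock appear to start "now". */
        static inline void reset_guest_epoch(struct vcpu *v)
        {
            v->cntvoff = read_physical_count();
        }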



    Now, sure, that lock is held while the page fault is being serviced,
    and priority inversion rears its ugly head. But ... I am in need of
    some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active
    interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core. Early hypervisors would field
    all non-secure interrupts and either handle them themselves or inject them
    into the guest. The first ARM64 cores would field all interrupts in the HV,
    and the interrupt controller had special registers the HV could use to
    inject interrupts into the guest. The overhead was not insignificant, so
    they added a mechanism to allow some interrupts to be directly fielded by
    the guest itself - avoiding the round trip through the HV on every
    interrupt (called virtual LPIs).
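
    The two delivery paths described above look roughly like this in
    pseudo-C -- a conceptual model only; the list registers and doorbells
    of a real GIC are deliberately elided, and the function names are
    invented:

        /* Conceptual model of interrupt delivery under a hypervisor:
         * direct virtual-LPI delivery vs. the trap-and-inject slow path. */
        #include <stdbool.h>
        #include <stdio.h>

        struct irq  { int id; bool is_virtual_lpi; };
        struct vcpu { int id; bool resident; };

        static void guest_take_irq(struct vcpu *v, int id)
        {   printf("guest %d takes irq %d\n", v->id, id); }

        static void hv_record_pending(struct vcpu *v, int id)
        {   printf("HV marks irq %d pending for guest %d\n", id, v->id); }

        static void deliver(const struct irq *irq, struct vcpu *vcpu)
        {
            if (irq->is_virtual_lpi && vcpu->resident) {
                /* Fast path: hardware hands the interrupt straight to the
                 * running guest; the HV never sees it. */
                guest_take_irq(vcpu, irq->id);
            } else {
                /* Slow path: HV fields the physical interrupt and injects a
                 * virtual one, taken when the guest next runs. */
                hv_record_pending(vcpu, irq->id);
                if (vcpu->resident)
                    guest_take_irq(vcpu, irq->id);
            }
        }

        int main(void)
        {
            struct vcpu g = { .id = 0, .resident = true };
            struct irq fast = { .id = 8200, .is_virtual_lpi = true  };
            struct irq slow = { .id = 42,   .is_virtual_lpi = false };
            deliver(&fast, &g);
            deliver(&slow, &g);
            return 0;
        }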


    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or
    application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between them, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.

    The HV owns the translation tables for guest-to-physical address
    translation; it can pretty much do anything it wants with that access[*],
    including modifying guest processor and memory state at any time - absent
    potential future features such as hardware guest memory encryption
    or memory access controls at a level higher than the HV (e.g. the
    secure monitor - see AArch64 Realms, for example).

    https://developer.arm.com/documentation/den0126/0101/Overview

    [*] the hypervisor can easily double map a page in both the guest PAS
    and the HV VAS - a technique common in paravirtualized environments.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 16 21:13:43 2025
    From Newsgroup: comp.arch

    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    Roughly: HW maintains 4 copies of state and generally indexes state
    with a 2-bit value, and the "structure" of thread-header is identical
    between layers; thus, indexing down to {user} falls out for free.

    {{But I could be off my rocker...again}}
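
    A data-structure sketch of that "4 copies of state, indexed by a 2-bit
    value" idea is below. The field names are invented for illustration;
    this is not a claim about My 66000's actual thread-header layout.

        /* Sketch: banked per-layer control state selected by a 2-bit index.
         * Field names are illustrative only. */
        #include <stdint.h>

        enum layer { USER = 0, SUPER = 1, HYPER = 2, SECURE = 3 };

        struct thread_header {         /* same shape at every layer        */
            uint64_t root_pointer;     /* translation-table root           */
            uint64_t asid;
            uint64_t interrupt_table;
            uint64_t saved_ip;
        };

        struct core {
            struct thread_header bank[4];  /* one copy per layer           */
            unsigned current;              /* 2-bit index of active layer  */
        };

        /* Because every bank has the same shape, "indexing down to {user}"
         * is just another index value - no special-case format needed. */
        static inline struct thread_header *layer_state(struct core *c,
                                                        enum layer l)
        {
            return &c->bank[l & 3];
        }
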
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Apr 16 17:48:49 2025
    From Newsgroup: comp.arch

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 16 22:12:22 2025
    From Newsgroup: comp.arch

    On Wed, 16 Apr 2025 21:48:49 +0000, Stefan Monnier wrote:

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    The goal is that::
    the two layers in the middle can be managed as an accordion, supporting
    any number of HVs and GuestOSs between Secure and User.

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?

    Use the accordion


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 16 15:26:12 2025
    From Newsgroup: comp.arch

    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:47:38 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Architecturally, the ARM64 interrupt priority can vary from 3 to 8
    bits. Most implementations implement 5 bits, allowing 16 secure
    and 16 non-secure priority levels. They can be grouped using
    a binary point register, if required.
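
    The binary point works roughly as sketched below: it splits the priority
    field into a group part, which alone decides preemption, and a
    sub-priority part, which only orders pending interrupts of equal group
    priority. (The exact mapping of binary-point register values to bit
    positions is elided here.)

        /* Sketch of GICv3-style priority grouping.  'bp' is the number of
         * low-order bits treated as sub-priority; lower numeric values
         * mean higher priority. */
        #include <stdbool.h>
        #include <stdint.h>

        struct split { uint8_t group; uint8_t sub; };

        static struct split split_priority(uint8_t prio, unsigned bp)
        {
            struct split s;
            s.group = (uint8_t)(prio >> bp);
            s.sub   = (uint8_t)(prio & ((1u << bp) - 1u));
            return s;
        }

        /* A newly pending interrupt preempts the running one only if it
         * wins on group priority; sub-priority never causes preemption. */
        static bool preempts(uint8_t new_prio, uint8_t running_prio, unsigned bp)
        {
            return split_priority(new_prio, bp).group <
                   split_priority(running_prio, bp).group;
        }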


    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    On ARM there are only two interrupt signals from the interrupt controller
    to each core: FIQ and IRQ.

    Each of the signals can be 'claimed' by one, and only one privilege
    level on that core; if the secure monitor claims FIQ, then it can only be delivered
    to EL3.

    If running bare-metal, the OS (EL1) will claim the IRQ signal (by default if none of the more privileged levels claim it).

    If a hypervisor (EL2) is running, it will claim the IRQ signal and field
    all physical interrupts, except for virtual LPI and IPI interrupts which the hardware can inject directly into the guest (which may result in an
    interrupt to the hypervisor if the guest isn't resident on the target
    CPU).

    In a virtualized environment, one needs to be very careful when
    exposing hardware interrupt signals directly to the guest operating
    system, as that often requires exposing some of the interrupt controller.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:49:37 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    All these discussions seem to presume a very fixed structure that (I
    presume) corresponds to a typical situation in servers nowadays.

    But shouldn't the hardware aim for something more flexible to account
    for other use cases?

    E.g. What if I want to run my own VM as a user? Or my own HV?
    That's likely to be a common desire for people working on the
    development and testing of OSes and HVs?

    ARM has hardware support for nested hypervisors. It can be tricky.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 00:57:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64 priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be directly fielded by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    Hardware access is normally done in the context of a 'sandboxed'
    PCI Express SR-IOV function which the application can access directly;
    the hardware guarantees that the user process cannot adversely
    affect the hardware or other guests using other virtual functions.

    However, the interrupt controller itself (e.g. the mechanism used
    to acknowledge the interrupt to the interrupt controller after it
    has been serviced - e.g. the LAPIC) isn't virtualized, and direct
    access to that shouldn't be available to user-mode for fairly obvious
    reasons.

    That's why DPDK/ODP require the OS to handle interrupts and notify
    the application via standard OS notification mechanisms even
    when using SR-IOV capable hardware for the actual packet handling.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Apr 17 01:04:10 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Apr 16 21:07:13 2025
    From Newsgroup: comp.arch

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Not only the thing that's interrupting but also the thing
    it's interrupting. Maybe it's easier for My 66000 where I understand
    that the hardware has a notion of threads/processes so it may be able to
    know how to deliver the interrupt to the appropriate thread/process.


    Stefan
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 16 23:30:28 2025
    From Newsgroup: comp.arch

    On 4/16/2025 6:04 PM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts.

    Right. Thanks John. I was careful to say exceptions, not interrupts.


    I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has done that for decades.

    I was suggesting that things like divide faults could go directly
    back to the user, assuming the user had set up a place to handle them.
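
    On a conventional Unix the supervisor already offers a cut-down version
    of that: it fields the fault and reflects it to a handler the user
    registered. A small POSIX sketch for the divide-fault case (the
    siglongjmp is the standard way to avoid re-executing the faulting
    divide; whether integer divide-by-zero raises SIGFPE is
    target-dependent):

        /* POSIX sketch: user code registering its own divide-fault handler.
         * The kernel still takes the exception first and reflects it back. */
        #include <setjmp.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>

        static sigjmp_buf recover;

        static void fpe_handler(int sig)
        {
            (void)sig;
            siglongjmp(recover, 1);        /* skip past the faulting divide */
        }

        int main(void)
        {
            struct sigaction sa;

            memset(&sa, 0, sizeof sa);
            sa.sa_handler = fpe_handler;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGFPE, &sa, NULL);

            volatile int zero = 0;
            if (sigsetjmp(recover, 1) == 0)
                printf("1/0 = %d\n", 1 / zero);
            else
                printf("divide fault handled by user-registered handler\n");
            return 0;
        }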


    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Yup.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 13:32:54 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    All the current processors (Intel, AMD, ARM, MIPS) that have hardware
    virtualization support handle faults in the context in which they
    arise; e.g. a divide fault will be handled directly by the guest
    OS without any hypervisor intervention. The single standard exception
    is user mode, where the faults are handled by the Guest/Bare-metal
    OS.


    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.

    Indeed, although it's not so much about the 'thing that's interrupting'
    as it is about the interrupt infrastructure (i.e. interrupt controller)
    itself.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 17 18:22:35 2025
    From Newsgroup: comp.arch

    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User
    thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != GuestOS[k] -- because the various translation tables are not properly
    available to perform the nested MMU VAS->UAS translation.

    In effect, the SW-stack becomes some kind of "closure" where control
    can be transferred asynchronously. Enough information is passed (as
    arguments) across this boundary that efficient dispatch to the proper
    ISR is but a few instructions (3 typically in My 66000).

    External interrupts are indeed a lot harder unless you know a whole lot
    about the thing that's interrupting.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 17 20:10:11 2025
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    In ARM64, an interrupt is just a maskable asynchronous exception.


    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User
    thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want >HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != >GuestOS[k] -- because the various translation tables are not properly >available to perform the nested MMU VAS->UAS translation.

    Note that while any one layer is executing _on a core/hardware thread_,
    the other layers aren't running on that core, by definition. However,
    there is no synchronization with other cores, so other cores in the same
    system may be executing in any one or all of the privilege levels/security
    layers while a given core is taking an exception (synchronous or
    asynchronous).

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 17 21:45:42 2025
    From Newsgroup: comp.arch

    On Thu, 17 Apr 2025 20:10:11 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up to handle certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    I think he was talking about exceptions, not interrupts. I don't see much
    danger in reflecting divide faults and supervisor calls directly back
    to the virtual machine. I gather that IBM's virtualization microcode has
    done that for decades.

    I used (I think) the word interrupted as in "the thread currently in
    control has its instruction stream interrupted", which could stand in
    for interrupts or exceptions or faults; to see how the conversation
    develops.

    In ARM64, an interrupt is just a maskable asynchronous exception.

    My 66000 defines:
    a) exception: something wrong in the attempt to execute an instruction
    b) interrupt: asynchronous events not related to instruction execution
    c) trap ... : request for service to next higher privilege layer
    d) check .. : something that should (almost) never happen

    Unlike many RISC architectures, My 66000 has arithmetic exceptions
    {Operand Domain, Result Range, privilege, 5-IEEE exceptions} along
    with typical {GuestOS page fault, Hypervisor page fault} everybody
    has. Much more IBM 360-like than MIPS-like.

    Exceptions are then categorized as repairable faults or terminations.
    SW determines whether arithmetic faults are recognized and what to do
    with the exception if one is raised and recognized {terminate, repair,
    complete}. Page faults operate under "repair": the state is repaired such
    that re-execution of the instruction should now succeed. Complete is
    for situations where HW cannot deliver "an acceptable" result, but
    SW can. Here, SW "completes" the work and returns following the causing
    instruction.

    Checks are things like
    1) unrepairable ECC failure
    2) special privilege violations
    3) hardware failures
    4) power or reset events

    Which either log the event, attempt repair, or panic the VMM.

    Checks are simply exceptions that deliver control to {secure}
    instead of {next higher privilege} and checks are not maskable.
    {I may come to regret this non-maskable part ...}


    It seems to me that to "take" and interrupt at user layer in SW-stack,
    that the 3-upper layers have to be in the same state as when that User >>thread is in control of a core. But, It also seems to me that to "take"
    an interrupt into Super, the 2 higher layers of SW-stack also have to
    be as they were when that Super thread has control. You don't want >>HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super != >>GuestOS[k] -- because the various translation tables are not properly >>available to perform the nested MMU VAS->UAS translation.

    Note that while any one layer is executing _on a core/hardware thread_,
    the other layers aren't running on that core,

    Not "running" but those layer's CRs are still supporting lower privilege
    layers that ARE running on that core. Mostly in the nested Root pointer
    and ASID categories, sometimes in the interrupt-table category.

    by definition. However, there is no synchronization with other cores,
    so other cores in the same system may be executing in any one or all of
    the privilege levels/security layers while a given core is taking an
    exception (synchronous or asynchronous).

    Yes, obviously. Any core can be operating at any priority, any privilege,
    any layer, unbeknownst to any other core; until and unless SW tries to
    synchronize with said other core to find out.
    --- Synchronet 3.20c-Linux NewsLink 1.2