Just got to thinking about stack canaries. I was going to have a special purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Using a magic number
Nothing fancy needed in the assemble or link stages.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
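As a rough C-level picture of what the compiler-inserted sequence amounts to
(the function, buffer, and 16-bit magic value below are made up for
illustration; a real implementation places the word between the locals and
the saved return state rather than in an ordinary local):

  #include <stdint.h>
  #include <stdlib.h>

  #define STACK_MAGIC 0x7C5Bu        /* made-up 16-bit magic, cheap immediate */

  void func_with_array(void)
  {
      uint16_t canary = STACK_MAGIC; /* prolog: store the magic on the stack */
      char buf[64];                  /* the array that might be overrun */
      /* ... function body ... */
      (void)buf;
      if (canary != STACK_MAGIC)     /* epilog: reload and verify */
          abort();                   /* trap / fault on mismatch */
  }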
....
Ah, okay. I had thought the stack canaries were defined at run-time.
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
Using a magic number
Remove excess words.
Nothing fancy needed in the assemble or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
Using a magic number
Remove excess words.
Nothing fancy needed in the assemble or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
--Using a magic number
Remove excess words.
Nothing fancy needed in the assemble or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a
special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed.
Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the
loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a
TRAP
instruction could check for the immediate value and trap if not
present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store the
magic value, and the RET instruction would test it. If there was not a
match, an exception would be generated. The value itself could be
something like the clock value when the program was initiated, thus
guaranteeing uniqueness.
The advantage over the software approach, of course, is the elimination
of several instructions in each prolog/epilog, reducing footprint, and perhaps even time as it might be possible to overlap some of the
processing with the other things these instructions do. The downside is more hardware and perhaps extra overhead.
Does this make sense? What have I missed?
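In pseudo-C, the proposed semantics might look roughly like this (purely a
sketch of the idea; the control-register bit and the clock-derived value are
as described above, while the names and the word-stack model are invented):

  #include <stdint.h>

  extern void raise_protection_exception(void);  /* hypothetical trap entry */

  uint64_t stack[1024];        /* simplified model of the memory stack */
  uint64_t sp = 1024;          /* stack pointer, in 64-bit words */
  uint64_t proc_canary;        /* captured from a clock at program start */
  int      canary_enable;      /* the control-register bit */

  void hw_call(uint64_t *pc, uint64_t target)
  {
      stack[--sp] = *pc;                      /* push return address */
      if (canary_enable)
          stack[--sp] = proc_canary;          /* CALL also pushes the magic */
      *pc = target;
  }

  void hw_ret(uint64_t *pc)
  {
      if (canary_enable && stack[sp++] != proc_canary)
          raise_protection_exception();       /* mismatch => exception */
      *pc = stack[sp++];                      /* pop return address */
  }

One practical wrinkle is that the extra word changes the frame layout, so the
bit would have to be set consistently for a whole program, which the
clock-at-startup scheme already implies.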
Using a magic number
Remove excess words.
Nothing fancy needed in the assemble or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On 3/31/2025 11:04 AM, Stephen Fuld wrote:
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a
special
purpose register holding the canary value for testing while the
program
was running. But I just realized today that it may not be needed.
Canary
values could be handled by the program loader as constants,
eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the
loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a
TRAP
instruction could check for the immediate value and trap if not
present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to
me that this could be done automatically by the hardware (optionally,
based on a bit in a control register). The CALL instruction would
store magic value, and the RET instruction would test it. If there
was not a match, an exception would be generated. The value itself
could be something like the clock value when the program was
initiated, thus guaranteeing uniqueness.
The advantage over the software approach, of course, is the
elimination of several instructions in each prolog/epilog, reducing
footprint, and perhaps even time as it might be possible to overlap
some of the processing with the other things these instructions do.
The downside is more hardware and perhaps extra overhead.
Does this make sense? What have I missed.
This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC into a link register...
Another option being if it could be a feature of a Load/Store Multiple.
On 3/31/2025 11:04 AM, Stephen Fuld wrote:
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
-------------
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store
magic value, and the RET instruction would test it. If there was not a
match, an exception would be generated. The value itself could be
something like the clock value when the program was initiated, thus
guaranteeing uniqueness.
The advantage over the software approach, of course, is the elimination
of several instructions in each prolog/epilog, reducing footprint, and
perhaps even time as it might be possible to overlap some of the
processing with the other things these instructions do. The downside is
more hardware and perhaps extra overhead.
Does this make sense? What have I missed.
This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
into a link register...
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
LDM would check the canary first and fault if it doesn't see the
expected value.
Downside, granted, is needing the relative complexity of an LDM/STM
style instruction.
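A rough C model of the STM side as described (field widths and push order per
the sketch above; the struct and the simple word-pointer stack are only
illustration, not a real encoding):

  #include <stdint.h>

  struct stm_fields {
      unsigned hi, lo;        /* 6b Hi / 6b Lo register range */
      unsigned save_lr;       /* 1b LR flag */
      unsigned save_gp;       /* 1b GP flag */
      unsigned save_canary;   /* 1b SK flag */
  };

  void stm_execute(struct stm_fields f, const uint64_t regs[64],
                   uint64_t **sp, uint64_t lr, uint64_t gp, uint64_t canary)
  {
      if (f.save_lr) *--(*sp) = lr;           /* push LR first (if bit set)  */
      if (f.save_gp) *--(*sp) = gp;           /* push GP second (if bit set) */
      if (f.hi >= f.lo) {
          for (unsigned r = f.hi; ; r--) {    /* push registers Hi..Lo       */
              *--(*sp) = regs[r];
              if (r == f.lo) break;
          }
      }
      if (f.save_canary) *--(*sp) = canary;   /* push stack canary last      */
  }

LDM would walk the same list in reverse, checking the canary word before
restoring anything and faulting on a mismatch.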
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond certain number of registers are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
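For example, a shared, folded epilog could still get a per-frame check value
by folding the stack pointer into a fixed magic (sketch, constant made up):

  #include <stdint.h>

  static uint64_t frame_canary(uint64_t sp)
  {
      return 0x5A5AC3A5C3A55A5AULL ^ sp;   /* same code, different value per frame */
  }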
On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:
On 3/31/2025 11:04 AM, Stephen Fuld wrote:
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
-------------
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store
magic value, and the RET instruction would test it. If there was not a
match, an exception would be generated. The value itself could be
something like the clock value when the program was initiated, thus
guaranteeing uniqueness.
The advantage over the software approach, of course, is the elimination
of several instructions in each prolog/epilog, reducing footprint, and
perhaps even time as it might be possible to overlap some of the
processing with the other things these instructions do. The downside is
more hardware and perhaps extra overhead.
Does this make sense? What have I missed.
This would seem to imply an ISA where CALL/RET push onto the stack or
similar, rather than the (more common for RISC's) strategy of copying PC
into a link register...
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
LDM would check the canary first and fault if it doesn't see the
expected value.
Downside, granted, is needing the relative complexity of an LDM/STM
style instruction.
Not conceptually any harder than DIV or FDIV and nobody complains
about doing multi-cycle math.
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond certain number of registers are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generates as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT; you don't branch to the exit
point. Saving instructions.
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond certain number of registers are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generates as-needed). Granted, this is basically >>> the strategy used by BGBCC. If multiple functions happen to save/restore >>> the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple. >>>>
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
I use constants to access globals.
These come in 32-bit and 64-bit flavors.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
CoW and execl()
--------------
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a >>>> 64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a
contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers
contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous register groupings.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond certain number of registers are to >>>> be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generates as-needed). Granted, this is
basically
the strategy used by BGBCC. If multiple functions happen to save/
restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but doing so >>>> in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
IMHO.
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
--------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but doing so >>>>> in the reused portions would limit the range of unique canary values >>>>> (well, unless the canary magic is XOR'ed with SP or something...).
IMHO.
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers
overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
Not wanting to disable interrupts for that
long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
restores them at exit. But when exiting and switching tasks it spinlocks
on the task control block array. I am not sure this is a good thing, as
the timer IRQ is fairly high priority. If something else locked the TCB
array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
the issue is what if the app gets stuck in an infinite loop, not calling
the OS? I suppose I could make an OS heartbeat function call a
requirement of apps. If the app does not do a heartbeat within a
reasonable time, it could be terminated.
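One common way around the "ISR blocks on a lock" problem is to only try the
lock from the ISR and defer the switch when it fails; a rough sketch (names
and the C11-atomics spinlock are invented here, not Q+ code):

  #include <stdatomic.h>
  #include <stdbool.h>

  extern void do_task_switch(void);   /* hypothetical: pop ready list, switch */

  static atomic_flag tcb_lock = ATOMIC_FLAG_INIT;
  static volatile bool switch_pending = false;

  static bool tcb_trylock(void) { return !atomic_flag_test_and_set(&tcb_lock); }
  static void tcb_unlock(void)  { atomic_flag_clear(&tcb_lock); }

  void timer_isr(void)
  {
      /* ... acknowledge the timer interrupt ... */
      if (tcb_trylock()) {            /* never spin at interrupt level */
          do_task_switch();
          tcb_unlock();
      } else {
          switch_pending = true;      /* current holder (or the next tick, or
                                         the next OS call) performs the
                                         deferred switch */
      }
  }

That sidesteps the deadlock without needing an app heartbeat, at the cost of
occasionally delaying a switch by one lock-hold period.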
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch
registers.
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple. >>>>
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
I use constants to access globals.
These comes in 32-bit and 64-bit flavors.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
CoW and execl()
--------------
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a >>>> 64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a
contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers
contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous register groupings.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond certain number of registers are to >>>> be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generates as-needed). Granted, this is
basically
the strategy used by BGBCC. If multiple functions happen to save/
restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but doing so >>>> in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
IMHO.
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store
Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Q+3 uses a bitmap of register selection with four more bits selecting overlapping groups. It can work with up to 17 registers.
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load >>> LR" or "Load address and Branch", and/or have separate flags (Load LR vs >>> Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
I use constants to access globals.
These comes in 32-bit and 64-bit flavors.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
CoW and execl()
--------------
Other ISAs use a flag bit for each register, but this is less viable >>>>> with an ISA with a larger number of registers, well, unless one uses a >>>>> 64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register >>>>> range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a
contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers
contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
Well, also excluding the possibility where the LDM/STM is essentially >>>>> just a function call (say, if beyond certain number of registers
are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generates as-needed). Granted, this is
basically
the strategy used by BGBCC. If multiple functions happen to save/
restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the >>>>> function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but doing so >>>>> in the reused portions would limit the range of unique canary values >>>>> (well, unless the canary magic is XOR'ed with SP or something...).
IMHO.
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run. Not wanting to disable interrupts for that
long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
restores them at exit. But when exiting and switching tasks it spinlocks
on the task control block array. I am not sure this is a good thing. As
the timer IRQ is fairly high priority. If something else locked the TCB array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
the issue is what if the app gets stuck in an infinite loop, not calling
the OS? I suppose I could make an OS heartbeat function call a
requirement of apps. If the app does not do a heartbeat within a
reasonable time, it could be terminated.
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some headaches because of the use of condition registers and branch registers.
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
--------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but
doing so
in the reused portions would limit the range of unique canary values >>>>>> (well, unless the canary magic is XOR'ed with SP or something...). >>>>>>
IMHO.
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers
overlap by eight. The instructions can handle all 96 registers in the
machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has
been decided, make the context switch manifest ??
Not wanting to disable interrupts for that
long, I put a spinlock on the system’s task control block array. But I
think I have run into an issue. It is the timer ISR that switches tasks.
Since it is an ISR it pushes a subset of registers that it uses and
restores them at exit. But when exiting and switching tasks it spinlocks
on the task control block array. I am not sure this is a good thing. As
the timer IRQ is fairly high priority. If something else locked the TCB
array it would deadlock. I guess the context switching could be deferred
until the app requests some other operating system function. But then
the issue is what if the app gets stuck in an infinite loop, not calling
the OS? I suppose I could make an OS heartbeat function call a
requirement of apps. If the app does not do a heartbeat within a
reasonable time, it could be terminated.
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch
registers.
On 3/31/2025 11:58 PM, Robert Finch wrote:
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store
Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Q+3 uses a bitmap of register selection with four more bits selecting
overlapping groups. It can work with up to 17 registers.
OK.
If I did LDM/STM style ops, not sure which strategy I would take.
The possibility of using a 96-bit encoding with an Imm64 holding a bit-
mask of all the registers makes some sense...
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP >>>>> are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as
"Load
LR" or "Load address and Branch", and/or have separate flags (Load
LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary >>>> way of accessing globals, and each binary image has its own address
range here.
I use constants to access globals.
These comes in 32-bit and 64-bit flavors.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for >>>> each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process >>>> into its own address space and/or use CoW.
CoW and execl()
--------------
Other ISAs use a flag bit for each register, but this is less viable >>>>>> with an ISA with a larger number of registers, well, unless one
uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not >>>>>> needing multiple LDM's / STM's to deal with a discontinuous register >>>>>> range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a >>>> contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers
contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
Well, also excluding the possibility where the LDM/STM is essentially >>>>>> just a function call (say, if beyond certain number of registers
are to
be saved/restored, the compiler generates a call to a save/restore >>>>>> sequence, which is also generates as-needed). Granted, this is
basically
the strategy used by BGBCC. If multiple functions happen to save/ >>>>>> restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the >>>>>> function in question).
Calling a subroutine to perform epilogues is adding to the number of >>>>> branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT you don't branch to the exit >>>>> point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to >>>> return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but
doing so
in the reused portions would limit the range of unique canary values >>>>>> (well, unless the canary magic is XOR'ed with SP or something...). >>>>>>
IMHO.
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers
overlap by eight. The instructions can handle all 96 registers in the
machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run. Not wanting to disable interrupts for that
long, I put a spinlock on the system’s task control block array. But I
think I have run into an issue. It is the timer ISR that switches
tasks. Since it is an ISR it pushes a subset of registers that it uses
and restores them at exit. But when exiting and switching tasks it
spinlocks on the task control block array. I am not sure this is a
good thing. As the timer IRQ is fairly high priority. If something
else locked the TCB array it would deadlock. I guess the context
switching could be deferred until the app requests some other
operating system function. But then the issue is what if the app gets
stuck in an infinite loop, not calling the OS? I suppose I could make
an OS heartbeat function call a requirement of apps. If the app does
not do a heartbeat within a reasonable time, it could be terminated.
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch registers.
OK.
Ironically, I seem to have comparably low task-switch cost...
However, each system call is essentially 2 task switches, and it is
still slow enough to negatively affect performance if they happen at all frequently.
So, say, one needs to try to minimize the number of unnecessary system
calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).
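The usual fix is a small user-side buffer so that many characters cross into
the kernel in one call; a minimal sketch (POSIX write() assumed, error
handling and locking omitted):

  #include <stddef.h>
  #include <unistd.h>

  static char   outbuf[512];
  static size_t outlen;

  static void flush_out(int fd)
  {
      if (outlen) { write(fd, outbuf, outlen); outlen = 0; }  /* one syscall */
  }

  void buffered_fputs(const char *s, int fd)
  {
      while (*s) {
          outbuf[outlen++] = *s++;
          if (outlen == sizeof outbuf)
              flush_out(fd);          /* only trap to the kernel when full */
      }
  }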
Unlike on a modern PC, one generally needs to care more about efficiency.
Hence, all the fiddling with low bit-depth graphics formats, and things
like my recent fiddling with 2-bit ADPCM audio.
And, online, one is (if anything) more likely to find people complaining about how old/obsolescent ADPCM is (and/or arguing that people should
store all their sound effects as Ogg/Vorbis or similar; ...).
Then again, I did note that I may need to find some other "quality
metric" for audio, as RMSE isn't really working...
At least going by RMSE, the "best" option would be to use 8-bit PCM and
then downsample it.
Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but subjectively the 2-bit ADPCM sounds significantly better.
Say: for 16kHz, and a test file (using a song here):
PCM8, 16kHz : 121 (128 kbps)
A-Law, 16kHz : 284 (128 kbps)
IMA 4bit, 16kHz : 617 (64 kbps)
IMA 2bit, 16kHz : 1692 (32 kbps, *)
ADLQ 2bit, 16kHz: 2000 (32 kbps)
PCM8, 4kHz : 242 (32 kbps)
However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
Basically sounds muffled, speech is unintelligible.
But, it would be the "best" option if going solely by RMSE.
Also A-Law sounds better than PCM8 (at the same sample rate).
Even with the higher RMSE score.
Seems like it could be possible to do RMSE on A-Law samples as a metric,
but if anything this is just kicking the can down the road slightly.
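A cheap version of that idea is to compand both signals before taking the
RMSE, so equal-sized errors in quiet passages count for more than in loud
ones. Sketch below, using a generic log compander as a stand-in rather than
exact G.711 A-law:

  #include <math.h>
  #include <stddef.h>
  #include <stdint.h>

  /* mu-law-like log companding, standing in for A-law */
  static double compand(int16_t s)
  {
      double x = s / 32768.0;
      return copysign(log1p(255.0 * fabs(x)) / log1p(255.0), x);
  }

  double companded_rmse(const int16_t *ref, const int16_t *test, size_t n)
  {
      double acc = 0.0;
      for (size_t i = 0; i < n; i++) {
          double d = compand(ref[i]) - compand(test[i]);
          acc += d * d;
      }
      return 32767.0 * sqrt(acc / (double)n);  /* scale back to sample units */
  }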
Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds better than the 2-bit ADPCM's at least...
*: Previously it was worse, around 4500, but the RMSE score dropped
after switching it to using a similar encoder strategy to ADLQ, namely
doing a brute-force search over the next 3 samples to find the values
that best approximate the target samples.
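The search itself is tiny: with 2-bit codes there are only 4^3 = 64 ways to
encode the next 3 samples, so the encoder can run all of them through a
decoder model and keep the best. Sketch with a toy 2-bit quantizer (the delta
and step-adaptation tables here are made up, not the actual IMA or ADLQ ones):

  #include <stdint.h>

  static const int dq[4]    = { -3, -1, +1, +3 };   /* delta per code, in steps */
  static const int dstep[4] = { -1, -1, +1, +1 };   /* shrink/grow the step size */

  static int16_t clamp16(int v)
  { return v > 32767 ? 32767 : v < -32768 ? -32768 : (int16_t)v; }

  /* Returns 3 codes packed into 6 bits; updates predictor and step in place. */
  unsigned encode3(const int16_t tgt[3], int16_t *pred, int *step)
  {
      unsigned best = 0;
      long long best_err = -1;
      for (unsigned c = 0; c < 64; c++) {
          int p = *pred, st = *step;
          long long err = 0;
          for (int i = 0; i < 3; i++) {
              unsigned code = (c >> (2 * i)) & 3;
              p = clamp16(p + dq[code] * st);
              st += dstep[code];
              if (st < 1) st = 1; else if (st > 4096) st = 4096;
              long long d = (long long)tgt[i] - p;
              err += d * d;
          }
          if (best_err < 0 || err < best_err) { best_err = err; best = c; }
      }
      for (int i = 0; i < 3; i++) {           /* commit the winning codes */
          unsigned code = (best >> (2 * i)) & 3;
          *pred = clamp16(*pred + dq[code] * *step);
          *step += dstep[code];
          if (*step < 1) *step = 1; else if (*step > 4096) *step = 4096;
      }
      return best;
  }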
Though, which is "better", or whether or not even lower RMSE "improves" quality here, is debatable (the PCM8 numbers clearly throw using RMSE as
a quality metric into question for this case).
Ideally I would want some metric that better reflects hearing perception
and is computationally cheap.
...
On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
---------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Can't happen within a shared address space.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
You also can't CoW the data/bss sections, as this is no longer a shared address space.
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data sections needing to be allocated.
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
EXE's generally assume they are index 0, so:
MOV.Q (GBR, 0), Rt
MOV.Q (Rt, 0), GBR
Or, in RV terms:
LD X6, 0(X3)
LD X3, Disp33(X6)
Or, RV64G:
LD X6, 0(X3)
LUI X5, DispHi
ADD X5, X5, X6
LD X3, DispLo(X5)
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).
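In C terms the reload ritual is just two dependent loads through that table
(the image-index convention is as described: the EXE assumes slot 0, DLL slots
get patched via base-relocs; the names here are only illustrative):

  /* gbr points at the current data section; its first word points at the
     per-process table holding every loaded image's global pointer. */
  void *reload_gbr(void *gbr, unsigned image_index)
  {
      void **table = *(void ***)gbr;   /* MOV.Q (GBR, 0), Rt     */
      return table[image_index];       /* MOV.Q (Rt, idx*8), GBR */
  }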
Generally, this is needed if:
Function may be called from outside of the current binary and:
Accesses global variables;
And/or, calls local functions.
Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
caller side...
SD X3, Disp(SP)
LD X3, 8(X18)
LD X6, 0(X18)
JALR X1, 0(X6)
LD X3, Disp(SP)
Though, execl() effectively replaces the current process.
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".
Not sure the thinking behind the RV ABI.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Granted.
Each predicted branch adds 2 cycles.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Granted.
My strategy isn't perfect:
Non-zero branching overheads, when the feature is used;
Per-function load/store slides in prolog/epilog, when not used.
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use calls/branches.
Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has
been decided, make the context switch manifest ??
That was just for making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has >>> been decided, make the context switch manifest ??
That was just for the making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Why is it not 13 cycles to get started and then each register is 1
cycle?
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about >>>>> 3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has >>>> been decided, make the context switch manifest ??
That was just for the making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Why is it not 13 cycles to get started and then each register is 1 one
cycle.
The CPU does not do pipe-lined burst loads. To load the cache line it is
two independent loads, 256 bits at a time. Stores post to the bus, but
I seem to remember having to space out the stores so the queue in the
memory controller did not overflow. Needs more work.
Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
It may not be as bad as I think. It is still 300 LOC, about 100 loads and
stores each way. Lots of move instructions for regs that cannot be
directly loaded or stored. And with CRs serializing the processor. But
the processor should eat up all the moves fairly quickly.
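(As a back-of-envelope check on the earlier ~3000-cycle figure: roughly 100
memory accesses to save state plus roughly 100 to restore it is about 200
accesses, and at ~13 clocks each that is already around 2600 cycles before
counting the moves and other overhead.)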
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
On 2025-04-01 5:21 p.m., BGB wrote:
On 3/31/2025 11:58 PM, Robert Finch wrote:
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:-------------
Another option being if it could be a feature of a Load/Store
Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Q+3 uses a bitmap of register selection with four more bits selecting
overlapping groups. It can work with up to 17 registers.
OK.
If I did LDM/STM style ops, not sure which strategy I would take.
The possibility of using a 96-bit encoding with an Imm64 holding a
bit- mask of all the registers makes some sense...
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP >>>>>> are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
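A minimal C sketch of that push order (purely illustrative: reg[], push(),
LR, GP, and CANARY are hypothetical stand-ins for the register file, the
stack push, and the saved values):

  #include <stdint.h>
  #include <stdbool.h>

  extern uintptr_t reg[64], LR, GP, CANARY;  /* hypothetical machine state */
  extern void push(uintptr_t v);             /* hypothetical stack push    */

  void stm(unsigned hi, unsigned lo, bool save_lr, bool save_gp, bool sk)
  {
      if (save_lr) push(LR);                 /* LR first, if flagged   */
      if (save_gp) push(GP);                 /* GP second, if flagged  */
      if (hi >= lo)                          /* register range, if any */
          for (unsigned r = hi; ; r--) {
              push(reg[r]);
              if (r == lo) break;
          }
      if (sk) push(CANARY);                  /* stack canary last      */
  }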
EXIT uses its 3rd flag used when doing longjump() and THROW()
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary >>>>> way of accessing globals, and each binary image has its own address
range here.
I use constants to access globals.
These comes in 32-bit and 64-bit flavors.
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Vs, say, for PIE ELF binaries where it is needed to load a new copy >>>>> for
each process instance because of this (well, excluding an FDPIC style >>>>> ABI, but seemingly still no one seems to have bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new
process
into its own address space and/or use CoW.
CoW and execl()
--------------
Other ISAs use a flag bit for each register, but this is less viable >>>>>>> with an ISA with a larger number of registers, well, unless one >>>>>>> uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not >>>>>>> needing multiple LDM's / STM's to deal with a discontinuous register >>>>>>> range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a >>>>> contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they >>>>> could make all of the argument registers and callee save registers
contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous >>>> register groupings.
Well, also excluding the possibility where the LDM/STM is
essentially
just a function call (say, if beyond certain number of registers >>>>>>> are to
be saved/restored, the compiler generates a call to a save/restore >>>>>>> sequence, which is also generates as-needed). Granted, this is
basically
the strategy used by BGBCC. If multiple functions happen to save/ >>>>>>> restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the >>>>>>> function in question).
Calling a subroutine to perform epilogues is adding to the number of >>>>>> branches a program executes. Having an instruction like EXIT means >>>>>> when you know you need to exit, you EXIT you don't branch to the exit >>>>>> point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to >>>>> return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT not part of them
Granted, the folding strategy can still do canary values, but
doing so
in the reused portions would limit the range of unique canary values >>>>>>> (well, unless the canary magic is XOR'ed with SP or something...). >>>>>>>
IMHO.
In Q+3 there are push and pop multiple instructions. I did not want
to add load and store multiple on top of that. They work great for
ISRs, but not so great for task switching code. I have the
instructions pushing or popping up to 17 registers in a group. Groups
of registers overlap by eight. The instructions can handle all 96
registers in the machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run. Not wanting to disable interrupts for that
long, I put a spinlock on the system’s task control block array. But
I think I have run into an issue. It is the timer ISR that switches
tasks. Since it is an ISR it pushes a subset of registers that it
uses and restores them at exit. But when exiting and switching tasks
it spinlocks on the task control block array. I am not sure this is a
good thing. As the timer IRQ is fairly high priority. If something
else locked the TCB array it would deadlock. I guess the context
switching could be deferred until the app requests some other
operating system function. But then the issue is what if the app gets
stuck in an infinite loop, not calling the OS? I suppose I could make
an OS heartbeat function call a requirement of apps. If the app does
not do a heartbeat within a reasonable time, it could be terminated.
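One way around that particular deadlock (a sketch of the idea only, not the
Q+ code; do_task_switch() and the pending flag are hypothetical) is to
try-lock the TCB array in the ISR instead of spinning on it, and defer the
switch to the next tick or system call when someone else holds the lock:

  #include <stdatomic.h>
  #include <stdbool.h>

  extern void do_task_switch(void);                /* hypothetical */

  static atomic_flag tcb_lock = ATOMIC_FLAG_INIT;  /* guards the TCB array */
  static volatile bool switch_pending = false;

  void timer_isr(void)
  {
      /* Never spin at timer-IRQ priority: if the TCB array is busy,
         remember that a switch is owed and get out. */
      if (!atomic_flag_test_and_set(&tcb_lock)) {
          switch_pending = false;
          do_task_switch();
          atomic_flag_clear(&tcb_lock);
      } else {
          switch_pending = true;  /* retried on the next tick or syscall */
      }
  }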
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch
registers.
OK.
Ironically, I seem to have comparably low task-switch cost...
However, each system call is essentially 2 task switches, and it is
still slow enough to negatively affect performance if they happen at
all frequently.
System calls for Q+ are slightly faster (but not much) than task
switches. I just have the system saving state on the stack. I don't
bother saving the FP registers or some of the other system registers
that the OS controls. So, it is a little bit shorter than the task
switch code.
The only thing that can do a task switch in the system is the time-slicer.
So, say, one needs to try to minimize the number of unnecessary system
calls (say, don't implement "fputs()" by sending 1 byte at a time, ...).
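For instance, a minimal sketch of the fputs() point (write_syscall() is a
hypothetical raw write; a real libc would buffer as well):

  #include <stddef.h>
  #include <string.h>

  extern long write_syscall(int fd, const void *buf, size_t len);  /* hypothetical */

  /* One system call for the whole string instead of one per byte. */
  int my_fputs(const char *s, int fd)
  {
      size_t len = strlen(s);
      return write_syscall(fd, s, len) == (long)len ? 0 : -1;
  }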
Unlike on a modern PC, one generally needs to care more about efficiency.
Hence, all the fiddling with low bit-depth graphics formats, and
things like my recent fiddling with 2-bit ADPCM audio.
And, online, one is (if anything) more likely to find people
complaining about how old/obsolescent ADPCM is (and/or arguing that
people should store all their sound effects as Ogg/Vorbis or
similar; ...).
Then again, I did note that I may need to find some other "quality
metric" for audio, as RMSE isn't really working...
At least going by RMSE, the "best" option would be to use 8-bit PCM
and then downsample it.
Say, 4kHz 8-bit PCM has a lower RMSE score than 2-bit ADPCM, but
subjectively the 2-bit ADPCM sounds significantly better.
Say: for 16kHz, and a test file (using a song here):
PCM8, 16kHz : 121 (128 kbps)
A-Law, 16kHz : 284 (128 kbps)
IMA 4bit, 16kHz : 617 (64 kbps)
IMA 2bit, 16kHz : 1692 (32 kbps, *)
ADLQ 2bit, 16kHz: 2000 (32 kbps)
PCM8, 4kHz : 242 (32 kbps)
However, 4kHz PCM8 sounds terrible vs either 2-bit IMA or ADLQ.
Basically sounds muffled, speech is unintelligible.
But, it would be the "best" option if going solely by RMSE.
Also A-Law sounds better than PCM8 (at the same sample rate).
Even with the higher RMSE score.
Seems like it could be possible to do RMSE on A-Law samples as a
metric, but if anything this is just kicking the can down the road
slightly.
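A sketch of that idea (using a mu-law-style companding curve as a stand-in
for A-Law, just to show the shape of such a metric):

  #include <math.h>
  #include <stddef.h>

  /* Map a sample in [-1,1] onto a log-like curve before differencing. */
  static double compand(double x)
  {
      const double mu = 255.0;
      return copysign(log1p(mu * fabs(x)) / log1p(mu), x);
  }

  /* RMSE measured in the companded domain rather than on raw 16-bit PCM. */
  double companded_rmse(const short *ref, const short *test, size_t n)
  {
      double acc = 0.0;
      for (size_t i = 0; i < n; i++) {
          double d = compand(ref[i] / 32768.0) - compand(test[i] / 32768.0);
          acc += d * d;
      }
      return n ? sqrt(acc / (double)n) : 0.0;
  }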
Granted, A-Law sounds better than 4-bit IMA, and 4-bit IMA sounds
better than the 2-bit ADPCM's at least...
*: Previously it was worse, around 4500, but the RMSE score dropped
after switching it to using a similar encoder strategy to ADLQ, namely
doing a brute-force search over the next 3 samples to find the values
that best approximate the target samples.
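As a sketch of that brute-force lookahead (a generic 2-bit ADPCM model with
made-up quantizer and step tables, not the actual ADLQ or IMA tables): try
all 4^3 combinations of the next three 2-bit codes, keep the combination
with the least squared error, and emit only its first code.

  #include <stddef.h>

  static const int qtab[4] = { -3, -1, 1, 3 };  /* hypothetical quantizer       */
  static const int stab[4] = {  3,  3, 5, 5 };  /* hypothetical step adapt (/4) */

  typedef struct { int pred, step; } st_t;

  static int decode1(st_t *s, int code)
  {
      s->pred += s->step * qtab[code];
      if (s->pred >  32767) s->pred =  32767;
      if (s->pred < -32768) s->pred = -32768;
      s->step = s->step * stab[code] / 4;
      if (s->step < 1) s->step = 1;
      return s->pred;
  }

  /* Choose the next code by brute-forcing the next 3 codes (4^3 = 64 paths). */
  static int pick_code(st_t s, const short *next, size_t avail)
  {
      long long best = -1;
      int best_c0 = 0;
      for (int path = 0; path < 64; path++) {
          st_t t = s;
          long long err = 0;
          for (size_t i = 0; i < 3 && i < avail; i++) {
              int code = (path >> (2 * i)) & 3;
              long long d = decode1(&t, code) - next[i];
              err += d * d;
          }
          if (best < 0 || err < best) { best = err; best_c0 = path & 3; }
      }
      return best_c0;
  }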
Though, which is "better", or whether or not even lower RMSE
"improves" quality here, is debatable (the PCM8 numbers clearly throw
using RMSE as a quality metric into question for this case).
Ideally I would want some metric that better reflects hearing
perception and is computationally cheap.
...
I'm not one much for music, although I play the tunes occasionally. I'm a
little hard of hearing.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
But then if thread A (whose state is stored at 0x35900000) sends to
thread B (whose state is at 0x55900000) a closure whose code points
somewhere inside 0x24680000, it will end up using the state of thread
A instead of the state of the current thread.
On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:
On 3/31/2025 3:52 PM, MitchAlsup1 wrote:---------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
PC-Rel not being used as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Can't happen within a shared address space.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
PC-rel addressing works in both cases--because the distance (-rel)
remains the same,
and the MMU can translate the code to the same physical, and map
each area of data individually.
Different virtual addresses, same code physical address, different
data virtual and physical addresses.
You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.
You also can't CoW the data/bss sections, as this is no longer a shared
address space.
You are trying to "get at" something here, but I can't see it (yet).
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data
sections needing to be allocated.
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
EXE's generally assume they are index 0, so:
MOV.Q (GBR, 0), Rt
MOV.Q (Rt, 0), GBR
Or, in RV terms:
LD X6, 0(X3)
LD X3, Disp33(X6)
Or, RV64G:
LD X6, 0(X3)
LUI X5, DispHi
ADD X5 X5, X6
LD X3, DispLo(X5)
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
Where, (GBR+0) gives the address of a table of global pointers for every
loaded binary (can be assumed read-only from userland).
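In C-ish terms, the reload ritual above boils down to something like this
(gbr stands in for the GBR register; module_index is the slot the loader
fixes up, with the EXE at index 0):

  extern void **gbr;                      /* current module's global pointer   */

  void reload_gbr(unsigned module_index)
  {
      void **table = (void **)gbr[0];     /* (GBR+0): table of global pointers */
      gbr = (void **)table[module_index]; /* reload this module's own entry    */
  }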
Generally, this is needed if:
Function may be called from outside of the current binary and:
Accesses global variables;
And/or, calls local functions.
I just use 32-bit or 64-bit displacement constants. Does not matter
how control arrived at this subroutine, it accesses its data as the
linker resolved addresses--without wasting a register.
Though, still generally lower average-case overhead than the strategy
typically used by FDPIC, which would handle this reload process on the
caller side...
SD X3, Disp(SP)     # save caller's global pointer
LD X3, 8(X18)       # load callee's global pointer from the descriptor
LD X6, 0(X18)       # load callee's code address from the descriptor
JALR X1, 0(X6)      # call through the descriptor
LD X3, Disp(SP)     # restore caller's global pointer
This is just::
CALX [IP,,#GOT[funct_num]-.]
In the 32-bit linking mode this is a 2 word instruction, in the 64-bit linking mode it is a 3 word instruction.
----------------
Though, execl() effectively replaces the current process.
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
You are 40 years late on that.
---------------
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit
register numbers".
Not sure the thinking behind the RV ABI.
If RISC-V removed its 16-bit instructions, there is room in its ISA
to put my entire ISA along with all the non-compressed RISC-V inst-
ructions.
---------------
Prolog needs a call, but epilog can just be a branch, since no need to >>>> return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Granted.
Each predicted branch adds 2 cycles.
So, you lose 6 cycles on just under ½ of all subroutine calls,
while also executing 2-5 instructions manipulating your global
pointer.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Granted.
My strategy isn't perfect:
Non-zero branching overheads, when the feature is used;
Per-function load/store slides in prolog/epilog, when not used.
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use
calls/branches.
My solution gets rid of the dilemma:
a) the call code is always smaller
b) the call code never takes more cycles
In addition, there is a straightforward way to elide the STs of ENTER
when the memory unit is still executing the previous EXIT.
Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.
BGB [2025-04-01 23:19:11] wrote:
But, yeah, inter-process function pointers aren't really a thing, and should >> not be a thing.
AFAIK, this point was brought in the context of a shared address space
(I assumed it was some kind of SASOS situation, but the same thing
happens with per-thread data inside a POSIX-style process).
Function pointers are perfectly normal and common in data (even tho they
may often be implicit, e.g. within the method table of objects), and the whole point of sharing an address space is to be able to exchange data.
Stefan
On 4/3/2025 9:09 AM, Stefan Monnier wrote:
BGB [2025-04-01 23:19:11] wrote:
But, yeah, inter-process function pointers aren't really a thing, and
should
not be a thing.
AFAIK, this point was brought in the context of a shared address space
(I assumed it was some kind of SASOS situation, but the same thing
happens with per-thread data inside a POSIX-style process).
Function pointers are perfectly normal and common in data (even tho they
may often be implicit, e.g. within the method table of objects), and the
whole point of sharing an address space is to be able to exchange data.
Or, to allow for NOMMU operation, or reduce costs by not having context switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Some data sharing is used for IPC, but directly sharing function
pointers between processes, or local memory (stack, malloc, etc), is not allowed.
Though, things may change later; there is a plan to move to separate global/local address ranges. Likely things like code will remain in the shared range, and program data will be in the local range.
Stefan
On 2025-04-03 1:22 p.m., BGB wrote:
On 4/3/2025 9:09 AM, Stefan Monnier wrote:
BGB [2025-04-01 23:19:11] wrote:
But, yeah, inter-process function pointers aren't really a thing,
and should
not be a thing.
AFAIK, this point was brought in the context of a shared address space
(I assumed it was some kind of SASOS situation, but the same thing
happens with per-thread data inside a POSIX-style process).
Function pointers are perfectly normal and common in data (even tho they >>> may often be implicit, e.g. within the method table of objects), and the >>> whole point of sharing an address space is to be able to exchange data.
Or, to allow for NOMMU operation, or reduce costs by not having
context switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
the previous mode / address space. The bit just has to be set, then do
the memory op, then reset the bit. Makes it easy to access data using
the process address space.
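Roughly, for RV64 (a sketch only, assuming M-mode with mstatus.MPP already
holding the mode whose address space is wanted; not tested):

  #define MSTATUS_MPRV (1UL << 17)  /* loads/stores translate as the mode in MPP */

  static inline unsigned long load_from_prev_mode(const unsigned long *addr)
  {
      unsigned long v;
      __asm__ volatile(
          "csrs mstatus, %[mprv]\n\t"   /* set MPRV             */
          "ld   %[v], 0(%[a])\n\t"      /* access via MPP's VAS */
          "csrc mstatus, %[mprv]"       /* clear MPRV again     */
          : [v] "=&r"(v)
          : [a] "r"(addr), [mprv] "r"(MSTATUS_MPRV)
          : "memory");
      return v;
  }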
Some data sharing is used for IPC, but directly sharing function
pointers between processes, or local memory (stack, malloc, etc), is
not allowed.
Though, things may change later; there is a plan to move to separate
global/local address ranges. Likely things like code will remain in
the shared range, and program data will be in the local range.
Thinking of having a CPU local address space in Q+ to store vars for
that particular CPU. It looks like only a small RAM is required. I guess
it would be hardware thread local storage. May place the RAM in the CPU itself.
Stefan
On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:
On 2025-04-03 1:22 p.m., BGB wrote:-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
the previous mode / address space. The bit just has to be set, then do
the memory op, then reset the bit. Makes it easy to access data using
the process address space.
Let us postulate you are running in RISC-V HyperVisor on core[j]
and you want to write into GuestOS VAS and into application VAS
more or less simultaneously.
Seems to me like you need a MPRV to be more than a single bit
so it could index which layer of the SW stack's VAS it needs
to touch.
On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:
On 2025-04-03 1:22 p.m., BGB wrote:-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context >>>> switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
the previous mode / address space. The bit just has to be set, then do
the memory op, then reset the bit. Makes it easy to access data using
the process address space.
Let us postulate you are running in RISC-V HyperVisor on core[j]
and you want to write into GuestOS VAS and into application VAS
more or less simultaneously.
Would not writing to the GuestOs VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Seems to me like you need a MPRV to be more than a single bit
so it could index which layer of the SW stack's VAS it needs
to touch.
So, there is a need to be able to go back two or three levels? I suppose
it could also be done by manipulating the stack, although adding an
extra bit may be easier. How often does it happen?
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOs VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure has a 64-bit VAS
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOs VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure has a 64-bit VAS
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
be set to force those instructions to use the permissions associated
with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOs VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure has a 64-bit VAS
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's
also a processor state bit (UAO - User Access Override) that can
be set to force those instructions to use the permissions associated
with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
Okay, I was interpreting RISCV specs wrong. They have three bits dedicated to
this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
passed a pointer to a VAS variable in a register, how does it know that
the pointer is for the supervisor or the user/app?
It's why I assumed it found the mode from the stack. Those two select bits have to be set
somehow. It seems like extra code to access the right address space.
I got the thought to use the three bits a bit differently.
111 = use current mode
110 = use mode from stack
100 = debug? mode
011 = secure (machine) mode
010 = hypervisor mode
001 = supervisor mode
000 = user/app mode
I was just using inline code to select the proper address space. But if
it is necessary to dig around to figure the mode, it may turn into a subroutine call.
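A sketch of decoding that selector (the values mirror the list above;
current_mode() and prev_mode_from_stack() are hypothetical accessors):

  typedef enum { VAS_USER = 0, VAS_SUPER = 1, VAS_HYPER = 2, VAS_SECURE = 3 } vas_t;

  extern vas_t current_mode(void);          /* hypothetical */
  extern vas_t prev_mode_from_stack(void);  /* hypothetical */

  vas_t select_vas(unsigned sel)            /* sel is the 3-bit field */
  {
      switch (sel) {
      case 7:  return current_mode();          /* 111 = use current mode    */
      case 6:  return prev_mode_from_stack();  /* 110 = use mode from stack */
      case 4:  return current_mode();          /* 100 = debug? not modelled */
      default: return (vas_t)(sel & 3);        /* 000..011 = explicit mode  */
      }
  }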
On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
When Secure Monitor executes a "user" instruction, which layer
of the SW stack is accessed:: {HV, SV, User} ?
Is this 1-layer down the stack, or all layers down the stack ??
There's
also a processor state bit (UAO - User Access Override) that can
be set to force those instructions to use the permissions associated
with the current processor privilege level.
That is how My 66000 MMU is defined--higher privilege layers
have R/W access to the next lower privilege layer--without
doing anything other than a typical LD or ST instruction.
I/O MMU has similar issues to solve in that a device can Read
write-execute only memory and write read-execute only memory.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
I call these "paranoid" applications--generally requiring no
privilege, but they don't want GuestOS or HyperVisor to look
at their data and at the same time, they want GuestOS or HV
to perform I/O to said data--so some devices have an effective
privilege above that of the driver commanding them.
I understand the reasons and rationale.
Robert Finch <robfi680@gmail.com> writes:
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:Okay,
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOs VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for >>>>> the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure has a 64-bit VAS
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables. There's >>> also a processor state bit (UAO - User Access Override) that can
be set to force those instructions to use the permissions associated
with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
I was interpreting RISCV specs wrong. They have three bits dedicated to
this. 1 is an on/off and the other two are the mode to use. I am left
wondering how it is determined which mode to use. If the hypervisor is
passed a pointer to a VAS variable in a register, how does it know that
the pointer is for the supervisor or the user/app?
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
It's why I assumed it
found the mode from the stack. Those two select bits have to set
somehow. It seems like extra code to access the right address space.'
I haven't spent much time with RISC-V, but surely the processor
has a state register that stores the current mode, and which
must be preserved over exceptions/upcalls, which would require
that they be recorded in an exception syndrome register for
restoration when the upcall returns.
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
On 2025-04-06 10:21 a.m., Scott Lurndal wrote:
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
Allows two-directional virtualization, I think. Q+ has all exceptions and
interrupts going to the secure monitor, which can then delegate them back
to a lower level.
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
With modern hardware support, yes.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
Yes, that's also a truism.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
Typically an OS doesn't know if it is a guest or bare metal.
That characteristic means that a given distribution can
operate as either.
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the >>>> prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote: >>>>----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the >>>>> prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
Generally, the Guest should execute "as if" it were running on
Bare Metal. Consider an intel/amd processor running a bare-metal
operating system that takes an interrupt into SMM mode; from the
POV of a guest, an HV interrupt is similar to an SMM interrupt.
If the SMM, Secure Monitor or HV modify guest memory in any way,
all bets are off.
In effect, I am asking is Disable Interrupt is SW-stack-wide or only >>applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interrupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
Note that these will be rare and only if the HV overcommits physical
memory.
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am ni
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring. Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.
The higher privilege level must not unilaterally modify guest OS or application state.
On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
This agrees with the RISC-V approach where each layer in the stack
has its own Interrupt Enable configuration. {Which is what lead to
my questions}.
However, many architectures have only a single control bit for the
whole core--which is why I am trying to get a complete understanding
of what is required and what is choice. That there is some control
is (IS) required--how many seems to be choice at this stage.
Would it be unwise of me to speculate that a control at each layer
is more optimal, or that the critical section that is delayed due
to "other stuff needing to be handled" should have taken precedent.
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Damn that high precision clock .....
Which also leads to the question of should a Virtual Machine have
its own virtual time ?? {Or VM and VMM share the concept of virtual
time} ??
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am ni
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring.
Interrupts to a higher privilege level cannot be masked by an active
interrupt at a lower priority level.
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
This is really a question of what priority means across the entire
SW stack--and real-time versus Linux may have different answers on
this matter.
The higher privilege level must not unilaterally modify guest OS or
application state.
Given the almost complete lack of shared address spaces in a manner
where pointers can be passed between, there is almost nothing an HV
can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
paravirtualization entry point.
mitchalsup@aol.com (MitchAlsup1) writes:---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt >>arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into halves
- one half is assigned to the secure monitor and the other is assigned to
the non-secure software running on the core.
Early hypervisors would field all non-secure interrupts and either handle
them themselves or inject them into the guest. The first ARM64 cores would
field all interrupts in the HV and the int controller had special registers
the HV could use to inject interrupts into the guest. The overhead was not
insignificant, so they added a mechanism to allow some interrupts to be
directly fielded by the guest itself - avoiding the round trip through the
HV on every interrupt (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to
{user} ?? (the 4th element).
All these discussions seem to presume a very fixed structure that (I
presume) corresponds to a typical situation in servers nowadays.
But shouldn't the hardware aim for something more flexible to account
for other use cases?
E.g. What if I want to run my own VM as a user? Or my own HV?
That's likely to be a common desire for people working on the
development and testing of OSes and HVs?
Stefan
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves
- one half is assigned to the secure monitor and the other is assigned
to the
non-secure software running on the core.
Thus, my predilection for 64 priority levels (rather than ~8 as suggested
by another participant) allows for this distribution of priorities across
layers in the SW stack at the discretion of trustable-SW.
Early hypervisors would field
all
non-secure interrupts and either handle them itself or inject them into
the guest. The first ARM64 cores would field all interrupts in the HV >> and the int controller had special registers the HV could use to inject
interrupts
into the guest. The overhead was not insignifcant, so they added
a mechanism to allow some interrupts to be directly fielded by the
guest itself - avoiding the round trip through the HV on every interrupt
(called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).
On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves
- one half is assigned to the secure monitor and the other is assigned
to the
non-secure software running on the core.
Thus, my predilection for 64-priority levels (rather than ~8 as
suggested
by another participant) allows for this distribution of priorities
across
layers in the SW stack at the discretion of trustable-SW.
Early hypervisors would field
all
non-secure interrupts and either handle them itself or inject them into
the guest. The first ARM64 cores would field all interrupts in the HV >>> and the int controller had special registers the HV could use to inject
interrupts
into the guest. The overhead was not insignifcant, so they added
a mechanism to allow some interrupts to be directly fielded by the
guest itself - avoiding the round trip through the HV on every interrupt >>> (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to
{user} ?? (the 4th element).
I think you could gain a tiny amount of efficiency if the OS (super)
allowed the user to set up handlers for certain classes of exceptions
(e.g. divide faults) itself rather than having to go through the super.
I think you could gain a tiny amount of efficiency if the OS (super)
allowed the user to set up handlers for certain classes of exceptions
(e.g. divide faults) itself rather than having to go through the super.
Think carefully about the security implications of user-mode interrupt
delivery. Particularly with respect to potential impacts on other
processes running on the system, and to overall system functionality.
Handling interrupts requires direct access to the hardware from
user-mode.
According to Scott Lurndal <slp53@pacbell.net>:
I think you could gain a tiny amount of efficiency if the OS (super)
allowed the user to set up handlers for certain classes of exceptions
(e.g. divide faults) itself rather than having to go through the super.
Think carefully about the security implications of user-mode interrupt
delivery. Particularly with respect to potential impacts on other
processes running on the system, and to overall system functionality.
Handling interrupts requires direct access to the hardware from
user-mode.
I think he was talking about exceptions, not interrupts. I don't see much
danger in reflecting divide faults and supervisor calls directly back
to the virtual machine. I gather that IBM's virtualization microcode has
done that for decades.
External interrupts are indeed a lot harder unless you know a whole lot
about the thing that's interrupting.
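
For reference, the conventional path today already reflects divide faults
back to user code, just with a trip through the kernel: the OS fields the
fault and re-delivers it to the process as a signal. A minimal POSIX sketch
of that path follows (ordinary signal handling; a direct user-mode delivery
mechanism would shorten exactly this round trip):

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void fpe_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    static const char msg[] = "divide fault reflected to user code\n";
    if (info->si_code == FPE_INTDIV)              /* integer divide by zero */
        write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);      /* returning would re-execute the faulting divide */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, NULL);

    volatile int zero = 0;
    return 1 / zero;   /* traps on machines where integer divide-by-zero faults (e.g. x86) */
}
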
On Thu, 17 Apr 2025 1:04:10 +0000, John Levine wrote:
I think he was talking about exceptions, not interrupts. I don't see
much danger in reflecting divide faults and supervisor calls directly back
to the virtual machine. I gather that IBM's virtualization microcode
has done that for decades.
I used (I think) the word interrupted as in "the thread currently in
control has its instruction stream interrupted" which could stand in for
interrupts or exceptions or faults; to see how the conversation develops.
It seems to me that to "take" an interrupt at the user layer in the
SW-stack, the 3 upper layers have to be in the same state as when that
User thread is in control of a core. But it also seems to me that to
"take" an interrupt into Super, the 2 higher layers of the SW-stack also
have to be as they were when that Super thread has control. You don't want
HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super !=
GuestOS[k] -- because the various translation tables are not properly
available to perform the nested MMU VAS->UAS translation.
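
A minimal sketch of that delivery constraint, with invented names and
structures (nothing here is from a real design): an interrupt aimed at a
particular HV[j].GuestOS[k] pair can be taken directly only when those are
the contexts currently resident on the core; otherwise it must be held
pending or routed up a layer.

#include <stdbool.h>
#include <stdio.h>

struct core_ctx   { int hv; int guest; };  /* translation contexts resident on the core  */
struct irq_target { int hv; int guest; };  /* HV[j].GuestOS[k] the interrupt is aimed at */

/* Direct delivery is only possible when the nested translation for the
 * target can actually be walked, i.e. the target's hypervisor and guest
 * contexts are the ones currently installed on the core. */
static bool can_take_directly(struct core_ctx c, struct irq_target t)
{
    return c.hv == t.hv && c.guest == t.guest;
}

int main(void)
{
    struct core_ctx   running = { .hv = 0, .guest = 3 };
    struct irq_target matches = { .hv = 0, .guest = 3 };
    struct irq_target other   = { .hv = 0, .guest = 7 };

    printf("matching guest: %s\n", can_take_directly(running, matches) ? "deliver" : "hold pending");
    printf("other guest   : %s\n", can_take_directly(running, other)   ? "deliver" : "hold pending");
    return 0;
}
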
mitchalsup@aol.com (MitchAlsup1) writes:
I used (I think) the word interrupted as in "the thread currently in
control has its instruction stream interrupted" which could stand in for
interrupts or exceptions or faults; to see how the conversation develops.
In ARM64, an interrupt is just a maskable asynchronous exception.
It seems to me that to "take" an interrupt at the user layer in the
SW-stack, the 3 upper layers have to be in the same state as when that
User thread is in control of a core. But it also seems to me that to
"take" an interrupt into Super, the 2 higher layers of the SW-stack also
have to be as they were when that Super thread has control. You don't want
HV[j].GuestOS[k] to take an interrupt when Hyper != HV[j] && Super !=
GuestOS[k] -- because the various translation tables are not properly
available to perform the nested MMU VAS->UAS translation.
Note that while any one layer is executing _on a core/hardware thread_,
the other layers aren't running on that core, by definition. However,
there is no synchronization with other cores, so other cores in the same
system may be executing in any one or all of the privilege levels/security
layers while a given core is taking an exception (synchronous or
asynchronous).