• NES games on Apple IIgs

    From D Finnigan@dog_cow@macgui.com to comp.sys.apple2 on Fri Apr 19 16:42:33 2024
    From Newsgroup: comp.sys.apple2

    Last week, Lucas Scharenbroich (of Super Mario GTE fame) released silky-gs,
    which he calls "an NES runtime for the Apple IIgs that provides a NES PPU
    and APU compatibility layer."

    https://github.com/lscharen/silky-gs


    Here are some of his remarks on development:

    ---

    Fixed a long-standing rendering issue.  Left half of the video is the "jitter", right side is fixed.

    Since practically I can only render graphics on byte boundaries on the IIgs, the NES graphics are effectively snapped to even pixel boundaries.  I
    noticed that there are quite a few cases where the sprites look "jittery" or don't exactly line up with their expected positions.  Turns out there were two bugs.

    First, I had been naively clamping both the scroll position and the sprite horizontal coordinates.  This is not correct when both of the values are odd in their original NES coordinates, e.g. scroll_x = 1 and sprite_x = 49.  In this case the sprite should be placed at x = 50 (byte 25) to align with
    the background, but was being clamped to x = 48 instead.  Fixing this corrected the calculation, but the jittering was still present.
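
    A minimal sketch of the corrected snapping, written in Python for readability rather than 65816 assembly.  The function names are hypothetical, and the formula is only one way to reproduce the numbers above, not necessarily the calculation silky-gs uses: the sprite is snapped in background ("world") coordinates, and the snapped scroll position is then subtracted back out.

```python
def naive_snap(sprite_x):
    """Clamp a sprite's screen X to an even pixel (the original approach)."""
    return sprite_x & ~1


def aligned_snap(scroll_x, sprite_x):
    """Snap the sprite relative to the background instead of the screen.

    Assumption: the background itself is drawn at (scroll_x & ~1).
    """
    world_x = scroll_x + sprite_x  # position within the scrolled background
    return (world_x & ~1) - (scroll_x & ~1)


# The example from the text: scroll_x = 1, sprite_x = 49.
print(naive_snap(49))            # 48
print(aligned_snap(1, 49))       # 50
print(aligned_snap(1, 49) // 2)  # 25 (two pixels per byte)
```

    The two functions only disagree when both inputs are odd, which matches the case described above.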

    The second bug was more subtle.  In the Super Mario Bros ROM, the sprite
    data is uploaded to the PPU at the start of each frame and then the scroll positions are set just before exiting the NMI handler mid-frame right after
    the status bar.  In order to optimize my rendering, I had been ignoring the sprite data upload and reading the data directly from the game's RAM just before blitting to the screen.  However, my code only gets control after the ROM
    has executed and it has already calculated the sprite positions for the next frame at that point.  So my code was always reading the sprite data one
    frame ahead of the scroll position.
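
    The one-frame lag can be reproduced with a toy model.  Everything here (ToyRom, its field names, a sprite pinned to background position 100) is a hypothetical stand-in written in Python, not the real ROM or the runtime's code.  It only mimics the order of events described above: upload the sprite data, set the scroll, then compute the next frame's sprite positions.

```python
WORLD_X = 100  # hypothetical sprite that is fixed to the background


class ToyRom:
    def __init__(self):
        self.next_scroll = 0
        self.ram_sprite_x = WORLD_X - self.next_scroll  # stands in for RAM at $0200
        self.oam_sprite_x = 0                           # stands in for PPU OAM
        self.scroll_x = 0                               # stands in for the PPU scroll

    def nmi(self):
        # 1. Sprite data for THIS frame is uploaded at the start of the handler.
        self.oam_sprite_x = self.ram_sprite_x
        # 2. The scroll position for THIS frame is set.
        self.scroll_x = self.next_scroll
        # 3. Game logic runs and overwrites RAM with the NEXT frame's positions.
        self.next_scroll += 1
        self.ram_sprite_x = WORLD_X - self.next_scroll


rom = ToyRom()
for _ in range(3):
    rom.nmi()
    # Control returns to the runtime here, after the ROM has finished.
    print("via upload:", rom.oam_sprite_x + rom.scroll_x,
          "via RAM:", rom.ram_sprite_x + rom.scroll_x)
```

    In this model the uploaded copy always lines up with the scroll position (sum 100), while the value read back from RAM afterwards is one step ahead (sum 99).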

    I resolved this by falling back and actually performing the data copy.  I'll find a way to remove it later because it's quite the performance hog: even
    copying the 256 bytes of data in an unrolled loop takes ~2,500 cycles,
    and this copy is happening during every VBL interrupt instead of just when
    the IIgs code tries to render the screen.  This ends up burning nearly
    100,000 extra cycles per second -- a fair chunk of the entire CPU budget,
    but working around it will be a game-specific tweak and not a generic improvement to the runtime.
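
    As a rough check on that figure: the extra cost is the copy time multiplied by the number of VBL interrupts per second that do not correspond to a rendered frame.  The sketch below assumes a 60 Hz VBL and tries a few hypothetical render rates; neither the VBL rate nor any render rate is stated above, so treat the inputs as illustrative only.

```python
CYCLES_PER_COPY = 2_500  # from the text: unrolled 256-byte copy
VBL_HZ = 60              # assumption: 60 Hz vertical blank


def copy_cycles_per_second(copies_per_second, cycles_per_copy=CYCLES_PER_COPY):
    """Total cycles spent copying sprite data each second."""
    return copies_per_second * cycles_per_copy


def extra_cycles_per_second(render_fps, vbl_hz=VBL_HZ):
    """Cost of copying on every VBL instead of only on rendered frames."""
    return copy_cycles_per_second(vbl_hz) - copy_cycles_per_second(render_fps)


for fps in (15, 20, 30):  # hypothetical render rates
    print(fps, "fps ->", extra_cycles_per_second(fps), "extra cycles/s")
```

    Under these assumptions a render rate of around 20 frames per second gives 100,000 extra cycles per second, which is in line with the estimate quoted above.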


    --------------


    And to show concretely how much CPU time the ROM code consumes, here's
    Balloon Fight running at stock speeds with a red border that is set when the framework calls the NES interrupt vector and cleared once control returns.

    Once the gameplay starts, more than half of the CPU time is spent running the ROM code, so I'm actually pretty happy getting the frame rates we do.

    Incidentally, this is why an accelerator really helps on this code. The NES code is all in 16KB or 32KB of memory, so it is very cache-friendly, and it's pure code that doesn't touch anything that requires the system to slow down to
    1MHz.

    --------------


    This might be interesting / useful for people and I need to start
    documenting things anyway. Here is a breakdown of how the dirty rendering works on the IIgs side of things. I have linked into the relevant bits of
    code and provided some sidebar comments on what might be possible optimizations. Most of these are ideas I'm planning to look into
    post-release.

    We'll assume that the ROM code does not update any tiles in the frame and is only moving sprites around.

    1. The IIgs fires a native VBL interrupt and begins executing the interrupt handler.
    2. The interrupt handler calls NES_TriggerNMI, which simulates the NMI interrupt on the NES.
    3. NES code runs for one frame (this could take a while).
    4. Return from the native interrupt.

    At this point the sprite information is sitting in NES RAM at $0200. This
    is technically supposed to be uploaded to the PPU OAM memory via DMA, but
    the runtime cheats to avoid copying 256 bytes. The IIgs frame is built as follows.

    1. From the main event loop the framework calls the NES_RenderFrame function.
    2. This function disables interrupts and scans the NES sprite information.
       - First it clears 30 bytes used for a bitmap that tracks which scanlines the sprites are on.
       - Then it scans all 64 sprites (probably some loop unrolling / register optimizations here).
         - Each game defines a macro for game-specific exclusions (like fully transparent sprites).
         - Sprites outside of the visible IIgs screen area are skipped.
         - A couple of table lookups are used to set the bits in the bitmap.
    3. It freezes a few essential variables so the VBL interrupts don't change them while rendering.
    4. Any PPU tiles that changed since the last IIgs render are copied into shadow memory (we assume no tiles change, so almost no time spent here).
    5. Calls back to the game-specific RenderScreen function.
       - This does the necessary work to set up the graphics screen (this can be done once and then skipped for a static screen; optimization not yet implemented).
       - Then calls the dirty screen rendering function.
         - SHR shadowing is turned off.
         - A macro is used to walk current and previous frame's bitmap and draw the background on lines that are occupied by sprites on the current frame and the prior frame. These are the lines that need to be erased before drawing the current frame.
         - The sprites are drawn. (this can definitely be optimized. No compiled sprites or any serious work at simplifying yet)
         - SHR shadowing is turned on.
         - The background is drawn only on lines occupied by sprites in the prior frame, but not the current frame.
         - Finally, the lines that the new sprites were drawn on are exposed via a PEI Slam.
       - Cleanup is done to get ready for the next frame (as above, possible to be deferred until a non-dirty update happens).

    The macro that does the bitmap walking is moderately expensive. It has to
    look at 25 bytes (25 * 8 = 200 scanlines) three times. Not terrible, but a couple thousand cycles for sure.
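
    The scanline bookkeeping in the steps above can be sketched as follows.  This is a Python illustration with hypothetical names, not the runtime's 65816 code.  It assumes the standard NES OAM layout (4 bytes per sprite: Y, tile, attributes, X), 8-pixel-tall sprites, and that "current frame and the prior frame" means the union of both sets of lines; it also ignores horizontal clipping and finer PPU details.

```python
BITMAP_BYTES = 30     # from the text: 30 bytes are cleared for the bitmap
VISIBLE_LINES = 200   # from the text: 25 bytes * 8 = 200 scanlines are walked
SPRITE_HEIGHT = 8     # assumption: 8x8 sprites


def build_scanline_bitmap(oam, skip=lambda y, tile, attr, x: False):
    """Set one bit for every visible scanline touched by a sprite.

    `skip` stands in for the per-game exclusion macro.
    """
    bitmap = bytearray(BITMAP_BYTES)
    for i in range(0, 256, 4):               # 64 sprites, 4 bytes each
        y, tile, attr, x = oam[i:i + 4]
        if skip(y, tile, attr, x):
            continue
        for line in range(y, y + SPRITE_HEIGHT):
            if line >= VISIBLE_LINES:        # off the visible IIgs screen
                break
            bitmap[line >> 3] |= 1 << (line & 7)
    return bitmap


def plan_dirty_lines(current, previous):
    """Return (erase, restore, expose) scanline lists for one dirty update."""
    erase, restore, expose = [], [], []
    for line in range(VISIBLE_LINES):
        mask = 1 << (line & 7)
        cur = current[line >> 3] & mask
        prev = previous[line >> 3] & mask
        if cur or prev:
            erase.append(line)    # background redrawn with shadowing off
        if prev and not cur:
            restore.append(line)  # background redrawn with shadowing on
        if cur:
            expose.append(line)   # shown by the final PEI slam
    return erase, restore, expose


# One sprite moving from Y=16 to Y=20; all others parked off-screen at Y=$F8.
prev_oam = bytearray([0xF8, 0, 0, 0] * 64)
cur_oam = bytearray(prev_oam)
prev_oam[0], cur_oam[0] = 16, 20
erase, restore, expose = plan_dirty_lines(build_scanline_bitmap(cur_oam),
                                          build_scanline_bitmap(prev_oam))
print(erase[0], erase[-1], restore, expose[0], expose[-1])
```

    Here only 12 of the 200 scanlines are touched, which is the point of the dirty-line approach when the background tiles are unchanged.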


    ----------

    So I started looking at the Balloon Fight ROM and it has turned my
    experience on its head!

    In the Super Mario Bros ROM the Reset vector calls a small routine that initializes the hardware and memory to a known state and then jumps into an infinite loop. All of the game logic is driven by the VBL/NMI interrupt
    which exits cleanly via an RTI once the logic for the frame is finished.

    This is a nice setup because my framework can call the reset vector, get control back and then set up a native VBL interrupt that simply calls the
    ROM routine. A clean 1-to-1 mapping.

    In Balloon Fight, the Reset code initializes the system as expected, but
    then continues on to the game logic. The code is full of places where it busy-waits on the hardware VBL flag to clear before continuing on with the
    main program’s execution.

    The Balloon Fight VBL/NMI logic is just a tiny routine that copies the
    current sprite data to the PPU and updates the sound registers.

    This is exactly the opposite of how the SMB code was structured, so I’ll
    need to find a way to break out of the program code so my framework can get control back in time to do the actual drawing on the IIgs side of things.

    Since the BF ROM has a “wait for vbl” subroutine, I can patch that out to behave as a “yield” back to my code. However, I’ll need to add some simple context-switching management, since I’ll no longer be calling an interrupt vector in the ROM on each native VBL interrupt, but instead returning
    control to the point that was yielded on the previous frame.
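
    The control flow being described resembles a coroutine.  As an illustration only, a Python generator can play the part of the patched ROM: each "wait for vbl" becomes a yield, and the framework resumes the ROM from that point on the next frame.  The names below are hypothetical, and the actual context-switch mechanism on the IIgs is not described above.

```python
def balloon_fight_main(log):
    """Toy stand-in for a ROM whose reset code runs straight into the game loop."""
    log.append("init")            # reset code initializes the system
    frame = 0
    while True:
        log.append(f"logic {frame}")
        frame += 1
        yield                     # patched "wait for vbl": hand control back


def run_frames(count):
    """Toy framework loop: resume the ROM once per frame, then render."""
    log = []
    rom = balloon_fight_main(log)
    for _ in range(count):
        next(rom)                 # continue from the previous yield point
        log.append("render")      # the IIgs-side drawing would happen here
    return log


print(run_frames(2))
# ['init', 'logic 0', 'render', 'logic 1', 'render']
```

    Note that the first resume runs the initialization as well as the first frame of logic, since the reset code never returns on its own.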

    Doable, but a surprising twist to be sure.
