• large pages access time

    From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 16:24:04 2024
    From Newsgroup: comp.lang.c++

    I wanted to test the performance of 2MiB pages against 4kiB pages.
    My Zen4 CPU has a fully associative L1-TLB with 72 entries for all
    page sizes, and an L2-TLB with 3072 entries each for 4kiB and for
    2/4MiB pages. So I wrote a little benchmark that allocates 32GiB of
    memory with 4kiB and with 2MiB pages and touches each 4kiB block
    once with a byte at a random page address. If I touch the pages all
    at the same page offset, large pages are only a quarter faster. But
    if I touch the pages at a random offset, large pages become 2.75
    times faster. I can't explain this huge difference since the page
    address is random, so no prefetching could help. But nevertheless it
    shows that large pages can make a big difference.

    Here is the benchmark I wrote:


    #include <Windows.h>
    #undef max
    #include <iostream>
    #include <memory>
    #include <chrono>
    #include <span>
    #include <random>
    #include <vector>
    #include <atomic>
    #include <cstdlib>
    #include "invoke_on_destruct.h"
    #include "ndi.h"

    using namespace std;
    using namespace chrono;

    using XHANDLE = unique_ptr<void, decltype([]( HANDLE h ) { h && h != INVALID_HANDLE_VALUE && CloseHandle( h ); })>;

    DWORD enablePrivilege( char const *privilege, bool enable );

    static atomic_char ac;

    int main()
    {
        if( enablePrivilege( "SeLockMemoryPrivilege", true ) )
            return EXIT_FAILURE;
        constexpr size_t
            BLOCK_SIZE = 32 * 1024ull * 1024 * 1024,
            N_PAGES = BLOCK_SIZE / 0x1000;
        vector<ndi<size_t>> wheres;
        wheres.resize( N_PAGES );
        mt19937_64 mt;
        using uid = uniform_int_distribution<size_t>;
        uid rndInPage( 0, 0xFFF );
        // one random in-page offset per 4kiB block
        for( size_t i = 0; i != N_PAGES; ++i )
            wheres[i] = i * 0x1000 + rndInPage( mt );
        uid rndPage( 0, N_PAGES - 1 );
        // shuffle so the blocks are visited in random order
        for( size_t i = 0; i != N_PAGES; ++i )
            swap( wheres[i], wheres[rndPage( mt )] );
        for( int large = 0; large <= 1; ++large )
        {
            void *p = VirtualAlloc( nullptr, BLOCK_SIZE, MEM_RESERVE | MEM_COMMIT
                | (large ? MEM_LARGE_PAGES : 0), PAGE_READWRITE );
            if( !p )
                return EXIT_FAILURE;
            invoke_on_destruct free( [&] { VirtualFree( p, 0, MEM_RELEASE ); } );
            span sp( (char *)p, BLOCK_SIZE );
            // touch one byte per page up front to fault everything in
            ptrdiff_t dist = !large ? 0x1000 : 0x200000;
            for( size_t i = 0; i != BLOCK_SIZE; i += dist )
                sp[i] = 0;
            using dur_t = high_resolution_clock::duration;
            dur_t tMin = dur_t::max();
            for( unsigned n = 10; n; --n )
            {
                auto start = high_resolution_clock::now();
                char sum = 0;
                for( size_t where : wheres )
                    sum += sp[where];
                dur_t t = high_resolution_clock::now() - start;
                ::ac.store( sum, memory_order_relaxed ); // keep sum observable
                tMin = t < tMin ? t : tMin;
            }
            double nsPerPage = duration_cast<nanoseconds>( tMin ).count() /
                (double)N_PAGES;
            char const *head = !large ? "4kiB: " : "2MiB: ";
            cout << head << nsPerPage << "ns/page" << endl;
        }
    }

    DWORD enablePrivilege( char const *privilege, bool enable )
    {
        TOKEN_PRIVILEGES tp;
        HANDLE h;
        if( !OpenProcessToken( GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &h ) )
            return GetLastError();
        XHANDLE xhToken( h );
        if( !LookupPrivilegeValueA( nullptr, privilege, &tp.Privileges[0].Luid ) )
            return GetLastError();
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = enable ? SE_PRIVILEGE_ENABLED : 0;
        if( !AdjustTokenPrivileges( xhToken.get(), FALSE, &tp, 0, nullptr, 0 ) )
            return GetLastError();
        // AdjustTokenPrivileges succeeds but sets ERROR_NOT_ALL_ASSIGNED
        // if the account doesn't actually hold the privilege
        return GetLastError();
    }
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Marcel Mueller@news.5.maazl@spamgourmet.org to comp.lang.c++ on Sat Oct 12 17:07:35 2024

    On 12.10.24 at 16:24, Bonita Montero wrote:
    > I wanted to test the performance of 2MiB pages against 4kiB pages.
    > My Zen4 CPU has a fully associative L1-TLB of 72 entries for all
    > page sizes and 3072 entries for each 4kiB and 2/4MiB pages.
    > So I wrote a little benchmark that allocates 32GiB memory with 4kiB
    > and 2MiB pages and that touches each 4kiB block once with a byte at
    > a random page address. If I touch the pages all at the same page
    > offset large pages are only a quarter faster. But if I touch the pages
    > at a random offset large pages become 2.75 times faster. I can't
    > explain this huge difference since the page address is random so
    > no prefetching could help. But nevertheless it shows that large
    > pages could make a big difference.

    Probably you should ask why 4k pages are so much slower at random
    access: you simply need many more TLB entries.

    With linear access it is likely that contiguous physical memory is
    mapped, which does not require additional TLB entries regardless of
    the page size.


    Marcel
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 17:24:43 2024

    On 12.10.2024 at 17:07, Marcel Mueller wrote:

    > Probably you should ask why 4k pages are much slower at random access.
    > You simply need many more TLB entries.

    Of course, but why is the difference so small when I'm touching the
    pages at the same offset, but at random page indices?

    > With linear access it is likely that contiguous physical memory is
    > mapped, which does not require additional TLB entries regardless of
    > the page size.

    I'm not accessing linearly.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c++ on Sat Oct 12 15:27:38 2024

    Marcel Mueller <news.5.maazl@spamgourmet.org> writes:
    > On 12.10.24 at 16:24, Bonita Montero wrote:
    >> I wanted to test the performance of 2MiB pages against 4kiB pages.
    >> My Zen4 CPU has a fully associative L1-TLB of 72 entries for all
    >> page sizes and 3072 entries for each 4kiB and 2/4MiB pages.
    >> So I wrote a little benchmark that allocates 32GiB memory with 4kiB
    >> and 2MiB pages and that touches each 4kiB block once with a byte at
    >> a random page address. If I touch the pages all at the same page
    >> offset large pages are only a quarter faster. But if I touch the pages
    >> at a random offset large pages become 2.75 times faster. I can't
    >> explain this huge difference since the page address is random so
    >> no prefetching could help. But nevertheless it shows that large
    >> pages could make a big difference.
    >
    > Probably you should ask why 4k pages are much slower at random access.
    > You simply need many more TLB entries.
    >
    > With linear access it is likely that contiguous physical memory is
    > mapped, which does not require additional TLB entries regardless of
    > the page size.

    ARM64 has a 'contiguous' hint bit in the translation table entry that
    supports coalescing multiple consecutively addressed TLB entries
    into a single entry when the OAs (output addresses) are contiguous.
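    The eligibility rules Scott describes can be sketched as a check over
    a group of entries. The descriptor layout below (`Pte` with `pa` and
    `attrs` fields) is a simplified assumption, not the real ARMv8
    descriptor format; at a 4kiB granule the hint covers 16 adjacent
    entries, and the group must be aligned in both virtual and output
    address space with identical attributes throughout.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified page-table entry: pa = output (physical) address of the page,
// attrs = the entry's combined attribute bits. Hypothetical layout.
struct Pte { uint64_t pa; uint64_t attrs; };

constexpr uint64_t PAGE  = 0x1000; // 4kiB granule
constexpr size_t   GROUP = 16;     // entries one contiguous hint covers

// True if the 16 entries starting at va may carry the contiguous hint.
bool contiguousHintEligible( uint64_t va, const Pte *ptes )
{
    if( va % (GROUP * PAGE) )          // VA must start a 64kiB-aligned block
        return false;
    if( ptes[0].pa % (GROUP * PAGE) )  // so must the first output address
        return false;
    for( size_t i = 1; i < GROUP; ++i ) // OAs contiguous, attributes identical
        if( ptes[i].pa != ptes[0].pa + i * PAGE || ptes[i].attrs != ptes[0].attrs )
            return false;
    return true;
}
```

    A TLB can then cache the whole 64kiB range as one entry; breaking any
    one of the three conditions forbids the hint.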
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 17:38:04 2024

    On 12.10.2024 at 17:27, Scott Lurndal wrote:

    > ARM64 has a 'contiguous' hint bit in the translation table entry that
    > supports coalescing multiple consecutively addressed TLB entries
    > into a single entry when the OAs (output addresses) are contiguous.

    1: I'm accessing the pages with only one byte each, in random order.
    The offset within the page is also random.
    2: Current AMD CPUs have L1-TLBs that can cover any page size and
    which can cover 16kB "pages", i.e. multiple PTEs are joined if they
    refer to a contiguous and 16kB-aligned block, which is common because
    page-colouring results in pages arranged that way.