• large pages access time

    From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 16:24:04 2024
    From Newsgroup: comp.lang.c++

    I wanted to test the performance of 2MiB pages against 4kiB pages.
    My Zen4 CPU has a fully associative L1-TLB with 72 entries for all
    page sizes, and an L2-TLB with 3072 entries each for 4kiB and for
    2/4MiB pages. So I wrote a little benchmark that allocates 32GiB of
    memory with 4kiB and with 2MiB pages and touches each 4kiB block
    once with a byte at a random page address. If I touch the pages all
    at the same page offset, large pages are only a quarter faster. But
    if I touch the pages at a random offset, large pages become 2.75
    times faster. I can't explain this huge difference since the page
    address is random, so no prefetching could help. But nevertheless it
    shows that large pages can make a big difference.

    Here is the benchmark I wrote:


    #include <Windows.h>
    #undef max
    #include <iostream>
    #include <memory>
    #include <chrono>
    #include <span>
    #include <random>
    #include <vector>
    #include <atomic>
    #include <cstdlib>
    #include "invoke_on_destruct.h"
    #include "ndi.h"

    using namespace std;
    using namespace chrono;

    using XHANDLE = unique_ptr<void, decltype([]( HANDLE h ) { h && h != INVALID_HANDLE_VALUE && CloseHandle( h ); })>;

    DWORD enablePrivilege( char const *privilege, bool enable );

    static atomic_char ac;

    int main()
    {
        if( enablePrivilege( "SeLockMemoryPrivilege", true ) )
            return EXIT_FAILURE;
        constexpr size_t
            BLOCK_SIZE = 32 * 1024ull * 1024 * 1024,
            N_PAGES = BLOCK_SIZE / 0x1000;
        vector<ndi<size_t>> wheres;
        wheres.resize( N_PAGES );
        mt19937_64 mt;
        using uid = uniform_int_distribution<size_t>;
        uid rndInPage( 0, 0xFFF );
        // one random in-page offset per 4kiB block
        for( size_t i = 0; i != N_PAGES; ++i )
            wheres[i] = i * 0x1000 + rndInPage( mt );
        uid rndPage( 0, N_PAGES - 1 );
        // shuffle so the blocks are visited in random order
        for( size_t i = 0; i != N_PAGES; ++i )
            swap( wheres[i], wheres[rndPage( mt )] );
        for( int large = 0; large <= 1; ++large )
        {
            void *p = VirtualAlloc( nullptr, BLOCK_SIZE, MEM_RESERVE | MEM_COMMIT
                | (large ? MEM_LARGE_PAGES : 0), PAGE_READWRITE );
            if( !p )
                return EXIT_FAILURE;
            invoke_on_destruct free( [&] { VirtualFree( p, 0, MEM_RELEASE ); } );
            span sp( (char *)p, BLOCK_SIZE );
            // touch one byte per page up front to fault everything in
            ptrdiff_t dist = !large ? 0x1000 : 0x200000;
            for( size_t i = 0; i != BLOCK_SIZE; i += dist )
                sp[i] = 0;
            using dur_t = high_resolution_clock::duration;
            dur_t tMin = dur_t::max();
            for( unsigned n = 10; n; --n )
            {
                auto start = high_resolution_clock::now();
                char sum = 0;
                for( size_t where : wheres )
                    sum += sp[where];
                dur_t t = high_resolution_clock::now() - start;
                ::ac.store( sum, memory_order_relaxed ); // keep sum observable
                tMin = t < tMin ? t : tMin;
            }
            double nsPerPage = duration_cast<nanoseconds>( tMin ).count() /
                (double)N_PAGES;
            char const *head = !large ? "4kiB: " : "2MiB: ";
            cout << head << nsPerPage << "ns/page" << endl;
        }
    }

    DWORD enablePrivilege( char const *privilege, bool enable )
    {
        TOKEN_PRIVILEGES tp;
        HANDLE h;
        if( !OpenProcessToken( GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &h ) )
            return GetLastError();
        XHANDLE xhToken( h );
        if( !LookupPrivilegeValueA( nullptr, privilege, &tp.Privileges[0].Luid ) )
            return GetLastError();
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = enable ? SE_PRIVILEGE_ENABLED : 0;
        if( !AdjustTokenPrivileges( xhToken.get(), FALSE, &tp, 0, nullptr, 0 ) )
            return GetLastError();
        // AdjustTokenPrivileges succeeds but sets ERROR_NOT_ALL_ASSIGNED
        // if the account doesn't actually hold the privilege
        return GetLastError();
    }
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Marcel Mueller@news.5.maazl@spamgourmet.org to comp.lang.c++ on Sat Oct 12 17:07:35 2024

    On 12.10.24 at 16:24, Bonita Montero wrote:
    > I wanted to test the performance of 2MiB pages against 4kiB pages.
    > My Zen4 CPU has a fully associative L1-TLB of 72 entries for all
    > page sizes and 3072 entries for each 4kiB and 2/4MiB pages.
    > So I wrote a little benchmark that allocates 32GiB memory with 4kiB
    > and 2MiB pages and that touches each 4kiB block once with a byte at
    > a random page address. If I touch the pages all at the same page
    > offset large pages are only a quarter faster. But if I touch the pages
    > at a random offset large pages become 2.75 times faster. I can't
    > explain this huge difference since the page address is random so
    > no prefetching could help. But nevertheless it shows that large
    > pages could make a big difference.

    Probably you should ask why 4k pages are so much slower at random
    access: you simply need many more TLB entries.

    With linear access it is likely that contiguous physical memory is
    mapped, which does not require additional TLB entries regardless of
    the page size.


    Marcel
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 17:24:43 2024

    On 12.10.2024 at 17:07, Marcel Mueller wrote:

    > Probably you should ask why 4k pages are much slower at random access.
    > You simply need many more TLB entries.

    Of course, but why is the difference so small when I'm touching the
    pages at the same offset, but at random page indices?

    > With linear access it is likely that contiguous physical memory is
    > mapped, which does not require additional TLB entries regardless of
    > the page size.

    I'm not accessing linearly.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.lang.c++ on Sat Oct 12 15:27:38 2024

    Marcel Mueller <news.5.maazl@spamgourmet.org> writes:
    > On 12.10.24 at 16:24, Bonita Montero wrote:
    >> I wanted to test the performance of 2MiB pages against 4kiB pages.
    >> My Zen4 CPU has a fully associative L1-TLB of 72 entries for all
    >> page sizes and 3072 entries for each 4kiB and 2/4MiB pages.
    >> So I wrote a little benchmark that allocates 32GiB memory with 4kiB
    >> and 2MiB pages and that touches each 4kiB block once with a byte at
    >> a random page address. If I touch the pages all at the same page
    >> offset large pages are only a quarter faster. But if I touch the pages
    >> at a random offset large pages become 2.75 times faster. I can't
    >> explain this huge difference since the page address is random so
    >> no prefetching could help. But nevertheless it shows that large
    >> pages could make a big difference.
    >
    > Probably you should ask why 4k pages are much slower at random access.
    > You simply need many more TLB entries.
    >
    > With linear access it is likely that contiguous physical memory is
    > mapped, which does not require additional TLB entries regardless of
    > the page size.

    ARM64 has a 'contiguous' hint bit in the translation table entry that
    supports coalescing multiple consecutively addressed TLB entries
    into a single entry when the OAs (output addresses) are contiguous.
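    The eligibility rules Scott describes can be sketched as a check over
    a group of entries. The descriptor layout below (`Pte` with `pa` and
    `attrs` fields) is a simplified assumption, not the real ARMv8
    descriptor format; at a 4kiB granule the hint covers 16 adjacent
    entries, and the group must be aligned in both virtual and output
    address space with identical attributes throughout.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified page-table entry: pa = output (physical) address of the page,
// attrs = the entry's combined attribute bits. Hypothetical layout.
struct Pte { uint64_t pa; uint64_t attrs; };

constexpr uint64_t PAGE  = 0x1000; // 4kiB granule
constexpr size_t   GROUP = 16;     // entries one contiguous hint covers

// True if the 16 entries starting at va may carry the contiguous hint.
bool contiguousHintEligible( uint64_t va, const Pte *ptes )
{
    if( va % (GROUP * PAGE) )          // VA must start a 64kiB-aligned block
        return false;
    if( ptes[0].pa % (GROUP * PAGE) )  // so must the first output address
        return false;
    for( size_t i = 1; i < GROUP; ++i ) // OAs contiguous, attributes identical
        if( ptes[i].pa != ptes[0].pa + i * PAGE || ptes[i].attrs != ptes[0].attrs )
            return false;
    return true;
}
```

    A TLB can then cache the whole 64kiB range as one entry; breaking any
    one of the three conditions forbids the hint.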
  • From Bonita Montero@Bonita.Montero@gmail.com to comp.lang.c++ on Sat Oct 12 17:38:04 2024

    On 12.10.2024 at 17:27, Scott Lurndal wrote:

    > ARM64 has a 'contiguous' hint bit in the translation table entry that
    > supports coalescing multiple consecutively addressed TLB entries
    > into a single entry when the OAs (output addresses) are contiguous.

    1: I'm accessing the pages with only one byte each, in random order.
    The offset within the page is also random.
    2: Current AMD CPUs have L1-TLBs that can cover any page size and
    which can cover 16kB "pages", i.e. multiple PTEs are joined if they
    refer to a contiguous and 16kB-aligned block, which is common because
    page-colouring results in pages arranged that way.