I wanted to test the performance of 2MiB pages against 4kiB pages.
My Zen4 CPU has a fully associative L1-TLB of 72 entries for all
page sizes and 3072 entries each for 4kiB and 2/4MiB pages.
So I wrote a little benchmark that allocates 32GiB of memory with 4kiB
and 2MiB pages and touches each 4kiB block once with a single byte at
a random page address. If I touch all the pages at the same page
offset, large pages are only a quarter faster. But if I touch the pages
at a random offset, large pages become 2.75 times faster. I can't
explain this huge difference, since the page address is random, so
no prefetching could help. Nevertheless, it shows that large
pages can make a big difference.
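
For anyone who wants to reproduce the idea, here is a minimal sketch of the access pattern described above. It is not the original benchmark: the names random_block_order and touch_blocks are made up for illustration, and the allocation is left out. On Linux the buffer could come from mmap with MAP_HUGETLB, or from a normal mapping plus madvise with MADV_HUGEPAGE; which one the original used is not stated.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

constexpr std::size_t kBlock = 4096;  // touch granularity: one 4 KiB block

// Build a shuffled list of block indices so every block is visited exactly once.
std::vector<std::size_t> random_block_order(std::size_t blocks, std::uint64_t seed) {
    std::vector<std::size_t> order(blocks);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Touch one byte in each block, in the given order. With random_offset == false
// every access hits offset 0 of its block; otherwise the offset is drawn
// uniformly from [0, kBlock). Returns the elapsed time in seconds.
double touch_blocks(volatile unsigned char* base,
                    const std::vector<std::size_t>& order,
                    bool random_offset, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> dist(0, kBlock - 1);
    auto start = std::chrono::steady_clock::now();
    for (std::size_t block : order) {
        std::size_t off = random_offset ? dist(rng) : 0;
        std::size_t idx = block * kBlock + off;
        base[idx] = static_cast<unsigned char>(base[idx] + 1);  // read-modify-write
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Note that drawing the random offset inside the timed loop adds a little overhead of its own; a stricter version would precompute the offsets. The same two functions would be run once on a 4kiB-page buffer and once on a 2MiB-page buffer to compare.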
You should probably ask why 4kiB pages are so much slower at random
access. You simply need many more TLB entries.
With linear access it is likely that contiguous physical memory is
mapped, which does not require additional TLB entries regardless of
the page size.
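
To put numbers on "many more TLB entries", here is a rough back-of-the-envelope sketch using only the figures quoted in this thread (72 and 3072 entries, 32GiB working set). The helper names tlb_reach and covered_fraction are made up for illustration, and the calculation ignores page-walk caches and any coalescing the hardware may do.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB, GiB = 1024 * MiB;

// Memory covered by `entries` TLB entries of `page_bytes` each ("TLB reach").
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_bytes) {
    return entries * page_bytes;
}

// Figures quoted in the thread: 72 first-level entries, 3072 entries per page size.
static_assert(tlb_reach(72, 4 * KiB) == 288 * KiB, "72 x 4 KiB");
static_assert(tlb_reach(72, 2 * MiB) == 144 * MiB, "72 x 2 MiB");
static_assert(tlb_reach(3072, 4 * KiB) == 12 * MiB, "3072 x 4 KiB");
static_assert(tlb_reach(3072, 2 * MiB) == 6 * GiB, "3072 x 2 MiB");

// Fraction of a working set that fits in the reach, capped at 1.
constexpr double covered_fraction(std::uint64_t reach, std::uint64_t working_set) {
    return reach >= working_set
               ? 1.0
               : static_cast<double>(reach) / static_cast<double>(working_set);
}
```

So with 4kiB pages the 3072 entries cover 12MiB of the 32GiB, well under a tenth of a percent, and almost every random access misses the TLB. With 2MiB pages the same number of entries covers 6GiB, a bit under a fifth of the buffer.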
On 12.10.24 at 16:24, Bonita Montero wrote:
[...]
ARM64 has a 'contiguous' hint bit in the translation table entry that supports coalescing multiple consecutively addressed TLB entries
into a single entry when the OA (output addresses) are contiguous.
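
As a simplified model of when that hint can apply, the sketch below checks a group of translations for eligibility. It assumes the 4kiB granule, where the hint covers 16 adjacent last-level entries (64kiB); other granules use other group sizes. The function contiguous_hint_eligible is a made-up helper, not an OS or architecture API, and it ignores the further requirement that all entries in the group carry the same attributes and permissions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 4 KiB granule, 16 entries per contiguous group (64 KiB span).
constexpr std::uint64_t kPage = 4096;
constexpr std::size_t kGroup = 16;

// Hypothetical helper: true if the 16 pages starting at virtual address `va`
// could share one TLB entry, i.e. the group is naturally aligned in both
// address spaces and the output addresses rise by exactly one page per entry.
bool contiguous_hint_eligible(std::uint64_t va,
                              const std::vector<std::uint64_t>& out_addrs) {
    const std::uint64_t span = kPage * kGroup;
    if (out_addrs.size() != kGroup) return false;
    if (va % span != 0 || out_addrs[0] % span != 0) return false;
    for (std::size_t i = 1; i < kGroup; ++i) {
        if (out_addrs[i] != out_addrs[0] + i * kPage) return false;
    }
    return true;
}
```

If the check passes, the OS may set the hint on all 16 entries, and the TLB is then allowed (not required) to cache the whole 64kiB range in one entry.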