Hello Benny!
01 Sep 22 12:08, you wrote to all:
Linux mx 5.15.63-gentoo-dist #1 SMP Thu Aug 25 12:40:44 -00 2022 x86_64 Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz GenuineIntel GNU/Linux
Linux localhost 5.19.6-gentoo-dist #1 SMP PREEMPT_DYNAMIC Wed Aug 31 18:48:13 -00 2022 x86_64 AMD EPYC 7642 48-Core Processor AuthenticAMD GNU/Linux
We've been switching over to EPYC boxes at work, and it's been a bit of a nightmare, although mostly that's been due to software limitations.
We started out with dual-CPU 64-core EPYC boxes, and ran into limitations with applications that couldn't deal with 256 logical processors. We had to manually pin each thread/application to a core or set of cores.
We eventually switched over to purchasing single-CPU 64-core EPYC boxes, which resolved our issues with CPU pinning for the most part.
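For anyone who hasn't had to do this, here is a minimal sketch of the kind of manual pinning described above. It assumes Linux and Python's os.sched_setaffinity; the helper name pin_to_cores and the choice of cores are purely illustrative, not the tooling actually used on these boxes.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict `pid` (0 = the caller) to the given set of CPU core IDs.

    Hypothetical helper for illustration. Only cores the process is
    currently allowed to run on are accepted.
    """
    allowed = os.sched_getaffinity(pid)   # cores we may use right now
    wanted = set(cores) & allowed         # drop cores that are not available
    if not wanted:
        raise ValueError(f"none of {sorted(cores)} are in {sorted(allowed)}")
    os.sched_setaffinity(pid, wanted)     # apply the new CPU mask
    return os.sched_getaffinity(pid)      # report the mask actually in effect
```

Calling pin_to_cores({0, 1}) from inside an application restricts it to cores 0 and 1, provided those cores are in its current affinity mask. The same effect is available from the shell via taskset.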
However, every single EPYC box we're running has to have IOMMU disabled in BIOS. Otherwise, after about 3 months of uptime, the servers will start spewing "AMD-Vi: Completion wait loop timed out" errors. This causes the PCIe devices to rapidly disable/re-enable, which knocks out networking. We've yet to nail down the actual cause of the issue, and it doesn't seem to matter what kernel version we're running. The odd thing is that it happens in short bursts: groups of servers (usually assigned to the same application) will start blowing up every hour or two, one after another. Not a fun thing when I'm on call, because I swear it starts happening overnight every time. :D
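If you want to watch for this on your own boxes, a rough sketch of a check might look like the following. It only counts lines containing the error text quoted above; the function name is made up, and it assumes you feed it kernel log text you have already captured (for example the output of dmesg).

```python
# Error text exactly as quoted in the message above.
IOMMU_TIMEOUT = "AMD-Vi: Completion wait loop timed out"


def count_iommu_timeouts(log_text):
    """Return how many lines of `log_text` contain the AMD-Vi timeout error.

    Hypothetical helper for illustration; `log_text` is plain kernel log
    output passed in as a single string.
    """
    return sum(1 for line in log_text.splitlines() if IOMMU_TIMEOUT in line)
```

A non-zero count on a box that still has IOMMU enabled would be the cue to look at it before networking drops out.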
Mike
... Dancers do it with rhythm.
--- GoldED+/LNX 1.1.5-b20220504