Forum: War Ensemble BBS

Re: A very slow program

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Wed Oct 2 11:16:37 2024

From Newsgroup: comp.lang.c

On 16/09/2024 06:22, Paul wrote:

Good prime code, comes with its own timing routines 🙂

http://cr.yp.to/primegen.html

http://cr.yp.to/primegen/primegen-0.97.tar.gz

Paul

I pricked my ears up when I saw this - I've been spending a quite
ridiculous amount of time tuning prime code.

Happily for my sanity the program I published over on comp.lang.c++(1)
is quite a bit faster than that one. On the other hand, I'm tuning for
modern processors, and the comment in there say inter alia "It generates
the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350"

I don't remember the last time I saw one of those.

<https://en.wikipedia.org/wiki/List_of_Intel_Pentium_II_processors>

tells me it was release in 1998...

Andy
--

(1) See thread "Sieve of Eratosthenes optimized to the max" and look for
a post from me which is quite big
--- Synchronet 3.20a-Linux NewsLink 1.114

From Paul@nospam@needed.invalid to alt.comp.lang.c,comp.lang.c on Wed Oct 2 06:37:46 2024

From Newsgroup: comp.lang.c

On Wed, 10/2/2024 6:16 AM, Vir Campestris wrote:

On 16/09/2024 06:22, Paul wrote:

Good prime code, comes with its own timing routines 🙂

http://cr.yp.to/primegen.html

http://cr.yp.to/primegen/primegen-0.97.tar.gz

Paul

I pricked my ears up when I saw this - I've been spending a quite ridiculous amount of time tuning prime code.

Happily for my sanity the program I published over on comp.lang.c++(1) is quite a bit faster than that one. On the other hand, I'm tuning for modern processors, and the comment in there say inter alia "It generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350"

I don't remember the last time I saw one of those.

<https://en.wikipedia.org/wiki/List_of_Intel_Pentium_II_processors>

tells me it was release in 1998...

Andy

I started on PCs, with a Celery 300 for home use.
That was the one that didn't have cache.

And the generation before I got a machine, there
were cache DIMMs for adding a cache function, and
the cache DIMM in some cases, only covered half
the address space, instead of covering the whole
thing. The things you have to put up with.

Paul

--- Synchronet 3.20a-Linux NewsLink 1.114

From Bonita Montero@Bonita.Montero@gmail.com to alt.comp.lang.c,comp.lang.c on Wed Oct 2 14:44:04 2024

From Newsgroup: comp.lang.c

Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
performance. With this code I get all primes <= 2 ^ 32 in 140ms
on my AMD 7950X 16-core Zen4-system.
It calculates first only the primes up to the square root of the
maximum. The remaining bits are partitioned according to the num-
ber of CPU-cores. Within each thread the bitmap is partitioned in
parts that fit into the L2-cache (queried via CPUID).
All operations are aligned on a cacheline-boundary to avoid false
sharing and locking of shared words.

#include <iostream>
#include <cstdlib>
#include <charconv>
#include <cstring>
#include <vector>
#include <algorithm>
#include <cmath>
#include <bit>
#include <fstream>
#include <string_view>
#include <thread>
#include <utility>
#include <new>
#include <span>
#include <array>
#include <cassert>
#include <sstream>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

#if defined(_MSC_VER)
#pragma warning(disable: 26495) // uninitialized member
#endif

using namespace std;

#if defined(__cpp_lib_hardware_interference_size)
constexpr ptrdiff_t CL_SIZE = hardware_destructive_interference_size;
#else
constexpr ptrdiff_t CL_SIZE = 64;
#endif

size_t getL2Size();
bool smt();

int main( int argc, char **argv )
{
try
{
if( argc < 2 )
return EXIT_FAILURE;
auto parse = []( char const *str, auto &value, bool bytes )
{
bool hex = str[0] == '0' && tolower( str[1] ) == 'x';
str += 2 * hex;
auto [ptr, ec] = from_chars( str, str + strlen( str ), value, !hex ?
10 : 16 );
if( ec != errc() )
return false;
if( bytes && tolower( *ptr ) == 'x' && !ptr[1] )
{
value *= 8;
--value;
++ptr;
}
return !*ptr;
};
size_t end;
if( !parse( argv[1], end, true ) )
return EXIT_FAILURE;
if( end < 2 )
return EXIT_FAILURE;
if( (ptrdiff_t)end++ < 0 )
throw bad_alloc();
unsigned
hc = jthread::hardware_concurrency(),
nThreads;
if( argc < 4 || !parse( argv[3], nThreads, false ) )
nThreads = hc;
if( !nThreads )
return EXIT_FAILURE;
using word_t = size_t;
constexpr size_t
BITS_PER_CL = CL_SIZE * 8,
BITS = sizeof(word_t) * 8;
size_t nBits = end / 2;
union alignas(CL_SIZE) ndi_words_cl { word_t words[CL_SIZE / sizeof(word_t)]; ndi_words_cl() {} };
vector<ndi_words_cl> sieveCachelines( (nBits + BITS_PER_CL - 1) /
BITS_PER_CL );
span<word_t> sieve( &sieveCachelines[0].words[0], (nBits + BITS - 1) /
BITS );
assert(to_address( sieve.end() ) <= &to_address( sieveCachelines.end()

words[0]);

fill( sieve.begin(), sieve.end(), -1 );
auto punch = [&]( size_t start, size_t end, size_t prime, auto )
{
size_t
bit = start / 2,
bitEnd = end / 2;
if( bit >= bitEnd ) [[unlikely]]
return;
if( prime >= BITS ) [[likely]]
do [[likely]]
sieve[bit / BITS] &= rotl( (word_t)-2, bit % BITS );
while( (bit += prime) < bitEnd );
else
{
auto pSieve = sieve.begin() + bit / BITS;
do [[likely]]
{
word_t
word = *pSieve,
mask = rotl( (word_t)-2, bit % BITS ),
oldMask;
do
{
word &= mask;
oldMask = mask;
mask = rotl( mask, prime % BITS );
bit += prime;
} while( mask < oldMask );
*pSieve++ = word;
} while( bit < bitEnd );
}
};
size_t sqrt = (ptrdiff_t)ceil( ::sqrt( (ptrdiff_t)end ) );
[&]
{
for( size_t bSqrt = sqrt / 2, bit = 1; bit < bSqrt; ++bit ) [[likely]]
{
auto pSieve = sieve.begin () + bit / BITS;
for( ; ; )
{
if( word_t w = *pSieve >> bit % BITS; w ) [[likely]]
{
bit += countr_zero ( w );
break;
}
++pSieve;
bit = bit + BITS & -(ptrdiff_t)BITS;
if( bit >= bSqrt ) [[unlikely]]
return;
}
size_t prime = 2 * bit + 1;
punch( prime * prime, sqrt, prime, false_type () );
}
}();
auto scan = [&]( size_t start, size_t end, auto fn )
{
size_t
bit = start / 2,
bitEnd = end / 2;
if( bit >= bitEnd ) [[unlikely]]
return;
auto pSieve = sieve.begin() + bit / BITS;
word_t word = *pSieve++ >> bit % BITS;
for( ; ; )
{
size_t nextWordBit = bit + BITS & -(ptrdiff_t)BITS;
for( unsigned gap; word; word >>= gap, word >>= 1, ++bit )
{
gap = countr_zero( word );
bit += gap;
if( bit >= bitEnd ) [[unlikely]]
return;
fn( bit * 2 + 1, true_type() );
}
bit = nextWordBit;
if( bit >= bitEnd ) [[unlikely]]
return;
word = *pSieve++;
}
};
vector<jthread> threads;
threads.reserve( nThreads );
static auto dbl = []( ptrdiff_t x ) { return (double)x; };
auto initiate = [&]( size_t start, auto fn )
{
double threadRange = dbl( end - start ) / nThreads;
ptrdiff_t t = nThreads;
size_t trEnd = end;
size_t prevStart, prevEnd;
bool prevValid = false;
do [[likely]]
{
size_t trStart = start + (ptrdiff_t)(--t * threadRange + 0.5);
trStart &= -(2 * CL_SIZE * 8);
trStart = trStart >= start ? trStart : start;
if( trStart < trEnd ) [[likely]]
{
if( prevValid ) [[likely]]
threads.emplace_back( fn, prevStart, prevEnd );
prevStart = trStart;
prevEnd = trEnd;
prevValid = true;
}
trEnd = trStart;
} while( t );
if( prevValid ) [[likely]]
fn( prevStart, prevEnd );
threads.resize( 0 );
};
double maxCacheRange = dbl( getL2Size() * 8 * 2 ) / 3 * (smt() &&
nThreads > hc / 2 ? 1 : 2);
initiate( sqrt,
[&]( size_t rStart, size_t rEnd )
{
double
thisThreadRange = dbl( rEnd - rStart ),
nCachePartitions = ceil( thisThreadRange / maxCacheRange ),
cacheRange = thisThreadRange / nCachePartitions;
for( size_t p = (ptrdiff_t)nCachePartitions, cacheEnd = rEnd,
cacheStart; p--; cacheEnd = cacheStart ) [[likely]]
{
cacheStart = rStart + (ptrdiff_t)((double)(ptrdiff_t)p * cacheRange
+ 0.5);
cacheStart &= -(2 * CL_SIZE * 8);
cacheStart = cacheStart >= sqrt ? cacheStart : sqrt;
scan( 3, sqrt,
[&]( size_t prime, auto )
{
size_t start = (cacheStart + prime - 1) / prime * prime;
start += start % 2 ? 0 : prime;
punch( start, cacheEnd, prime, true_type() );
} );
}
} );
if( bool count = false; argc >= 3 && (!*argv[2] || (count = strcmp(
argv[2], "*" ) == 0)) )
{
if( !count )
return EXIT_SUCCESS;
atomic<size_t> aNPrimes( 1 );
initiate( 3,
[&]( size_t rStart, size_t rEnd )
{
size_t nPrimes = 0;
scan( rStart, rEnd, [&]( ... ) { ++nPrimes; } );
aNPrimes.fetch_add( nPrimes, memory_order::relaxed );
} );
cout << aNPrimes.load( memory_order::relaxed ) << " primes" << endl;
return EXIT_SUCCESS;
}
ofstream ofs;
ofs.exceptions( ofstream::failbit | ofstream::badbit );
ofs.open( argc >= 3 ? argv[2] : "primes.txt", ofstream::binary |
ofstream::trunc );
constexpr size_t
BUF_SIZE = 0x100000,
AHEAD = 32;
union ndi_char { char c; ndi_char() {} };
vector<ndi_char> rawPrintBuf( BUF_SIZE + AHEAD - 1 );
span printBuf( &to_address( rawPrintBuf.begin() )->c, &to_address(
rawPrintBuf.end() )->c );
auto
bufBegin = printBuf.begin(),
bufEnd = bufBegin,
bufFlush = bufBegin + BUF_SIZE;
auto print = [&]() { ofs << string_view( bufBegin, bufEnd ); };
auto printPrime = [&]( size_t prime, auto )
{
auto [ptr, ec] = to_chars( &*bufEnd, &bufEnd[AHEAD - 1], prime );
if( (bool)ec ) [[unlikely]]
throw system_error( (int)ec, generic_category(), "converson failed" );
bufEnd = ptr - &*bufBegin + bufBegin; // pointer to iterator - NOP
*bufEnd++ = '\n';
if( bufEnd >= bufFlush ) [[unlikely]]
print(), bufEnd = bufBegin;
};
printPrime( 2, false_type() );
scan( 3, end, printPrime );
print();
}
catch( exception const &exc )
{
cout << exc.what() << endl;
}
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__)) || defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
array<unsigned, 4> cpuId( unsigned code )
{
int regs[4];
#if defined(_MSC_VER)
__cpuid( regs, code );
#elif defined(__GNUC__) || defined(__clang__)
__cpuid(code, regs[0], regs[1], regs[2], regs[3]);
#endif
using u = unsigned;
return array<u, 4> { (u)regs[0], (u)regs[1], (u)regs[2], (u)regs[3] }; }

bool smt()
{
if( cpuId( 0 )[0] < 1 )
return false;
return cpuId( 1 )[3] >> 28 & 1;
}

size_t getL2Size()
{
if( cpuId( 0x80000000u )[0] < 0x80000006u )
return 512ull * 1024;
return (cpuId( 0x80000006u )[2] >> 16) * 1024;
}
#else
size_t getL2Size()
{
return 512ull * 1024;
}

bool smt()
{
return false;
}
#endif
--- Synchronet 3.20a-Linux NewsLink 1.114

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.c on Wed Oct 2 16:41:32 2024

From Newsgroup: comp.lang.c

On 16.09.2024 00:45, Student Project wrote:

[ snip C program ]
/*
The result on my slow machine:
Program takes an average of 0.018000 seconds to find 78498 primes.
*/

Note, this is the same runtime magnitude that Unix'es 'primes'
command needs to run on 1000000 numbers to create 78498 primes
(0m00.01s) [on my very old (and slow) Unix system].

Is there a point in still writing own programs for that task?
I see some point when examining performance optimizations (as
someone downthread seems to have done). For cryptography there's
certainly demands for yet faster (parallelized) algorithms - if
quantum system algorithms don't make it superfluous (lately, or
in near future). But for cryptography you'd anyway need larger
numeric domains, beyond '[long]* integer' arithmetics. (I don't
know whether the [downthread posted] C++ code was designed to
support arbitrary length arithmetics.)

Optimized algorithms of new methods alone might not be a C topic.
But given your posting name, "Student Project", I suppose you're
anyway just using it as example for learning the C language?

For ordinary users it's probably sufficient to use an existing
program; on Unix 'primes 1 1000000000' runs in about 8 seconds.
And it's checking 2^32 numbers in 50 seconds (including 'wc'
and creating output [suppressed by redirection]), generating
50847534 primes. A simple prime test of some "arbitrary large"
number runs in no time, as does factorization of numbers with
the Unix'es 'factor' program. Just a simple command line call.
That existing 'primes' program also has some limit; the man page
documents a value of 4294967295, but no error message is created
for values up to around 1.8446744*10^19 on my system.

Janis
--- Synchronet 3.20a-Linux NewsLink 1.114

From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Wed Oct 2 15:43:45 2024

From Newsgroup: comp.lang.c

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
[...]

For ordinary users it's probably sufficient to use an existing
program; on Unix 'primes 1 1000000000' runs in about 8 seconds.
And it's checking 2^32 numbers in 50 seconds (including 'wc'
and creating output [suppressed by redirection]), generating
50847534 primes. A simple prime test of some "arbitrary large"
number runs in no time, as does factorization of numbers with
the Unix'es 'factor' program. Just a simple command line call.
That existing 'primes' program also has some limit; the man page
documents a value of 4294967295, but no error message is created
for values up to around 1.8446744*10^19 on my system.

On my system (Ubuntu 22.04.5 LTS, 64 bits), the primes(1) man page says:

"The stop value must not be greater than 3825123056546413050."

4294967295 is of course 2**31-1. I don't know the significance of 3825123056546413050. The change was made by Debian, but the sources
don't explicitly give a reason. Anyone who wants to look into it can
see the patches here:

https://salsa.debian.org/games-team/bsdgames
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Thu Oct 3 17:06:13 2024

From Newsgroup: comp.lang.c

On 02/10/2024 13:44, Bonita Montero wrote:

Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
performance. With this code I get all primes <= 2 ^ 32 in 140ms
on my AMD 7950X 16-core Zen4-system.
It calculates first only the primes up to the square root of the
maximum. The remaining bits are partitioned according to the num-
ber of CPU-cores. Within each thread the bitmap is partitioned in
parts that fit into the L2-cache (queried via CPUID).
All operations are aligned on a cacheline-boundary to avoid false
sharing and locking of shared words.

Unbeaten?

This isn't the same program you posted before over on comp.lang.c++, but
it builds and runs fine on my machine.

for 4e9 primes my machine take nearly 10 seconds for 4294967295 primes.
The one I posted before over there takes 1.5 seconds.

This version is if anything slightly slower than your other one.

You might like to have a look.

(And Keith, yes, I realise there is no point whatsoever in doing this...
but it's been fun. First time in years I've written anything I'm not
getting paid for!)

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From Paul@nospam@needed.invalid to alt.comp.lang.c,comp.lang.c on Thu Oct 3 17:25:14 2024

From Newsgroup: comp.lang.c

On Wed, 10/2/2024 8:44 AM, Bonita Montero wrote:

Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
performance. With this code I get all primes <= 2 ^ 32 in 140ms
on my AMD 7950X 16-core Zen4-system.

Why not measure against the programs provided in a Linux distro ?

$ time primesieve 1e6
Sieve size = 256 KiB
Threads = 1
100%
Seconds: 0.000
Primes: 78498

real 0m0.002s <=== Without printing the numbers, is pretty fast
user 0m0.000s
sys 0m0.000s

$ time primesieve 1e6 --print > NUL

real 0m0.009s <=== Without Terminal to slow I/O, is in the Unix ballpark user 0m0.005s
sys 0m0.000s

$ time primesieve 4294967295
Sieve size = 256 KiB
Threads = 16 <=== 5700G Zen3 (8C 16T), stock speed, execution under WSL2
100%
Seconds: 0.072
Primes: 203280221

real 0m0.073s
user 0m0.869s
sys 0m0.009s

Paul
--- Synchronet 3.20a-Linux NewsLink 1.114

From Bonita Montero@Bonita.Montero@gmail.com to alt.comp.lang.c,comp.lang.c on Fri Oct 4 06:53:00 2024

From Newsgroup: comp.lang.c

Am 03.10.2024 um 18:06 schrieb Vir Campestris:

for 4e9 primes my machine take nearly 10 seconds for 4294967295
primes. The one I posted before over there takes 1.5 seconds.

Show me your code. I'm pretty sure you never posted faster code.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Sun Oct 20 11:49:22 2024

From Newsgroup: comp.lang.c

On 04/10/2024 05:53, Bonita Montero wrote:

Am 03.10.2024 um 18:06 schrieb Vir Campestris:

for 4e9 primes my machine take nearly 10 seconds for 4294967295
primes. The one I posted before over there takes 1.5 seconds.

Show me your code. I'm pretty sure you never posted faster code.

Try this.
--
/*
Programme to calculate primes using the Sieve or Eratosthenes
Copyright (C) 2024 the person known as Vir Campestris

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

#include <algorithm>
#include <cassert>
#include <chrono>
#include <climits>
#include <cmath>
#include <csignal>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <new>
#include <thread>
#include <vector>

// If these trigger you have a really odd architecture. static_assert(sizeof(uint8_t) == 1);
static_assert(CHAR_BIT == 8);

// The type of a prime number.
// size_t isn't big enough on a 32-bit machine with 2GB memory allocated
and 16 primes per byte.
typedef uint64_t PRIME;

// This is the type of the word I operate in. The code will run OK with
uin8_t or uint32_t,
// but *on my computer* it's slower.
typedef uint64_t StoreType;

// tuneable constants
constexpr size_t tinyCount = 5;
constexpr size_t smallCount = 10;
constexpr size_t bigCache = 16384;

// this is a convenience class for timing.
class Stopwatch {
typedef std::chrono::steady_clock CLOCK;
typedef std::pair<unsigned, std::chrono::time_point<CLOCK> > LAP;
typedef std::vector<LAP> LAPS;
LAPS mLaps;
public:
typedef std::vector<std::pair<unsigned, double>> LAPTIMES;
Stopwatch() {}

// record the start of an interval
void start()
{
mLaps.clear();
mLaps.emplace_back(std::make_pair(0, CLOCK::now()));
}

void lap(unsigned label)
{
mLaps.emplace_back(std::make_pair(label, CLOCK::now()));
}

// record the end of an interval
void stop()
{
mLaps.emplace_back(std::make_pair(-1, CLOCK::now()));
}

// report the difference between the last start and stop
double delta() const
{
assert(mLaps.size() >= 2);
assert(mLaps.front().first == 0);
assert(mLaps.back().first == -1);
return std::chrono::duration_cast<std::chrono::nanoseconds>(mLaps.back().second
- mLaps.front().second).count() / (double)1e9;
}

LAPTIMES const laps() const
{
LAPTIMES ret;
assert(mLaps.size() >= 2);
ret.reserve(mLaps.size() - 1);
auto lap = mLaps.begin();
auto start = lap->second;

for(++lap; lap != mLaps.end(); ++lap)
{
ret.emplace_back(
std::make_pair(
lap->first,

std::chrono::duration_cast<std::chrono::nanoseconds>(lap->second - start).count() / (double)1e9));

}
return ret;
}
};

// Internal class used to store the mask data for one or more
// primes rather than all primes. This has an initial block,
// followed by a block which contains data which is repeated
// for all higher values.
// Example: StoreType== uint8_t, prime == 3, the mask should be
// 90 meaning 1,3,5,7,11,13 are prime but not 9, 15
// 24,49,92 covers odds 17-63
// 24,49,92 covers 65-111, with an identical mask
// As the mask is identical we only need to store the
// non-repeat, plus one copy of the repeated block and data to
// show where it starts and ends. Note that for big primes the
// initial block may be more than one StoreType.
// As there are no common factors between any prime and the
// number of bits in StoreType the repeat is the size of the
// prime.
// Note also that a masker for two primes will have an
// initial block which is as big as the larger of the sources,
// and a repeat which is the product of the sources.
class Masker final
{
// The length of the initial block.
size_t initSize;

// The size of the repeated part. The entire store size is the
sum of these two.
size_t repSize;

// Buffer management local class. I started using std::vector
for the store,
// but I found it was slow for large sizes as it insists on
calling the constructor
// for each item in turn. And there may be billions of them.
class Buffer final
{
StoreType* mpBuffer; // base of the data referred to
size_t mSize; // size of the buffer in
StoreType-s.

public:
// Normal constructor
Buffer(size_t size = 0) : mpBuffer(nullptr), mSize(size)
{
if (size > 0)
{
mpBuffer = static_cast<StoreType*>(std::malloc(size * sizeof(StoreType)));

if (!mpBuffer)
throw std::bad_alloc();
}
}

Buffer(Buffer&) = delete;

// Move constructor
Buffer(Buffer&& source)
: mpBuffer(source.mpBuffer), mSize(source.mSize)
{
source.mSize = 0;
source.mpBuffer = nullptr;
};

Buffer& operator=(Buffer&) = delete;

// Move assignment
Buffer& operator=(Buffer&& source)
{
if (mpBuffer)
std::free(mpBuffer);
mpBuffer = source.mpBuffer;
mSize = source.mSize;
source.mSize = 0;
source.mpBuffer = nullptr;
return *this;
}

void resize(size_t newSize)
{
mpBuffer = static_cast<StoreType*>(std::realloc(mpBuffer, newSize *
sizeof(StoreType)));
if (!mpBuffer)
throw std::bad_alloc();
mSize = newSize;
}

// Get the data start
inline StoreType* operator*() const
{
return mpBuffer;
}

// Get the owned size
inline size_t const & size() const
{
return mSize;
}

// clean up.
~Buffer()
{
if (mpBuffer)
std::free(mpBuffer);
}
} mBuffer;

// Raw pointer to the store for speed.
StoreType * mStorePtr;

// shift from the prime value to the index in mStore
static constexpr size_t wordshift = sizeof(StoreType) == 1 ? 4
: sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
static_assert(wordshift);

// Mask for the bits in the store that select a prime.
static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

// convert a prime to the index of a word in the store
static inline size_t index(PRIME const & prime)
{
return prime >> wordshift;
}

// get a mask representing the bit for a prime in a word
static inline StoreType mask(PRIME const & prime)
{
return (StoreType)1 << ((prime >> 1) & wordmask);
}

public:
// This class allows us to iterate through a Masker
indefinitely, retrieving words,
// automatically wrapping back to the beginning of the repat
part when the end is met.
class MaskIterator
{
StoreType const * index, *repStart, *repEnd;
public:
MaskIterator(Masker const * origin)
: index(origin->mStorePtr)
, repStart(index + origin->initSize)
, repEnd(repStart + origin->repSize)
{}

inline StoreType const & operator*() const
{
return *index;
}

inline StoreType next() // returns value with iterator post-increment
{
auto & retval = *index;
if (++index >= repEnd)
index = repStart;
return retval;
}
};

// Default constructor. Makes a dummy mask for overwriting.
Masker()
: initSize(0)
, repSize(1)
, mBuffer(1)
, mStorePtr(*mBuffer)
{
*mStorePtr = 0; // nothing marked unprime
}

// Move constructor (destroys source)
Masker(Masker&& source)
: initSize(source.initSize)
, repSize(source.repSize)
, mBuffer(std::move(source.mBuffer))
, mStorePtr(*mBuffer)
{
}

// Copy constructor (preserves source)
Masker(Masker& source)
: initSize(source.initSize)
, repSize(source.repSize)
, mBuffer(initSize + repSize)
, mStorePtr(*mBuffer)
{
memcpy(*mBuffer, *source.mBuffer, mBuffer.size() * sizeof(StoreType));
}

// move assignment (destroys source)
Masker& operator=(Masker&& source)
{
initSize = source.initSize;
repSize = source.repSize;
mStorePtr = source.mStorePtr;
mBuffer = std::move(source.mBuffer);
return *this;
}

// Construct for a single prime
Masker(PRIME prime)
: initSize(this->index(prime)+1)
, repSize(prime)
, mBuffer(initSize + prime)
, mStorePtr(*mBuffer)
{
// Pre-zero the buffer, because for bigger primes we
don't write every word.
// This is fast enough not to care about excess writes.
memset(mStorePtr, 0, mBuffer.size() * sizeof(StoreType));

// The max prime we'll actually store.
// *2 because we don't store evens.
PRIME const maxPrime = (initSize + repSize) * CHAR_BIT
* sizeof(StoreType) * 2;

// Filter all the bits. Note that although we
// don't strictly need to filter from prime*3,
// but only from prime**2, doing from *3 makes
// the init block smaller.
PRIME const prime2 = prime * 2;

for (PRIME value = prime * 3; value < maxPrime; value
+= prime2)
{
mStorePtr[this->index(value)] |= this->mask(value);
}
}

// Construct from two others, making a mask that excludes all
numbers
// marked not prime in either of them. The output init block
// will be the largest from either of the inputs; the repeat
block size will be the
// product of the input repeat sizes.
// The output could get quite large - 3.7E12 for all primes up
to 37
Masker(const Masker& left, const Masker& right, size_t
storeLimit = 0)
: initSize(std::max(left.initSize, right.initSize))
, repSize (left.repSize * right.repSize)
, mBuffer()
{
if (storeLimit)
repSize = std::min(initSize + repSize,
storeLimit) - initSize;

auto storeSize = initSize + repSize;

// Actually construct the store with the desired size
mBuffer = storeSize;
mStorePtr = *mBuffer;

// get iterators to the two inputs. These automatically
wrap
// when their repsize is reached.
auto li = left.begin();
auto ri = right.begin();

for (size_t word = 0; word < storeSize; ++word)
{
mStorePtr[word] = li.next() | ri.next();
}
}

// Construct from several others, making a mask that excludes
all numbers
// marked not prime in any of them. The output init block
// will be the largest from any of the inputs; the repeat size
will be the
// product of all the input repeat sizes.
// The output could get quite large - 3.7E12 for all primes up
to 37
// the type iterator should be an iterator into a collection of Masker-s.
template <typename iterator> Masker(const iterator& begin,
const iterator& end, size_t storeLimit = 0)
: initSize(0)
, repSize(1)
, mBuffer() // empty for now
{
// Iterate over the inputs. We will
// * Determine the maximum init size
// * Determine the product of all the repeat sizes.
// * Record the number of primes we represent.
size_t nInputs = std::distance(begin, end);
std::vector<MaskIterator> iterators;
iterators.reserve(nInputs);

for (auto i = begin; i != end; ++i)
{
initSize = std::max(initSize, i->initSize);
repSize *= i->repSize;
iterators.push_back(i->begin());
}
if (storeLimit)
repSize = std::min(initSize + repSize,
storeLimit) - initSize;
auto storeSize = initSize + repSize;
// Actually construct the store with the desired size
mBuffer = storeSize;
mStorePtr = *mBuffer;

// take the last one off (most efficient to remove)
// and use it as the initial mask value
auto last = iterators.back();
iterators.pop_back();
for (auto word = 0; word < storeSize; ++word)
{
StoreType mask = last.next();
for(auto &i: iterators)
{
mask |= i.next();
}
mStorePtr[word] = mask;
}
}

template <typename iterator> Masker& andequals(size_t
cachesize, iterator begin, iterator end)
{
static constexpr size_t wordshift = sizeof(StoreType)
== 1 ? 4 : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

struct value
{
PRIME step; // this is the
prime. The step should be 2P, but only half the bits are stored
PRIME halfMultiple; // this is the current multiple, _without_ the last bit
value(PRIME prime): step(prime), halfMultiple((prime*prime) >> 1) {}
bool operator<(PRIME halfLimit) const
{
return halfMultiple < halfLimit;
}
void next(StoreType * store)
{
static constexpr size_t wsm1 =
wordshift - 1;
auto index = halfMultiple >> wsm1;
auto mask = (StoreType)1 <<
(halfMultiple & wordmask);
*(store + index) |= mask;
halfMultiple += step;
}
};
std::vector<value> values;
values.reserve(std::distance(begin, end));
for (auto prime = begin; prime != end; ++prime)
{
auto p = *prime;
values.emplace_back(p);
}

size_t limit = 0;
do
{
limit = std::min(limit + cachesize, initSize + repSize + 1);
PRIME halfMaxPrime = (limit * 16 *
sizeof(StoreType)) / 2 + 1;
for (auto & value: values)
{
while (value < halfMaxPrime)
value.next(mStorePtr);
}
}
while (limit < initSize + repSize);

return *this;
}

// duplicates the data up to the amount needed for a prime newSize
void resize(PRIME newSize)
{
size_t sizeWords = this->index(newSize) + 1;
assert(sizeWords > (initSize + repSize)); // not
allowed to shrink!
mBuffer.resize(sizeWords);
mStorePtr = *mBuffer;

auto copySource = mStorePtr + initSize;
do
{
auto copySize = std::min(sizeWords - repSize - initSize, repSize);
auto dest = copySource + repSize;
memcpy(dest, copySource, copySize * sizeof(StoreType));
repSize += copySize;
} while ((initSize + repSize) < sizeWords);
}

// returns true if this mask thinks value is prime.
bool get(PRIME value) const
{
assert(value < sizeof(StoreType) * CHAR_BIT * 2 * mBuffer.size());
auto ret =
(value <= 3)
|| ((value & 1) &&
(mStorePtr[this->index(value)] & this->mask(value)) == 0);

return ret;
}

// Get the beginning of the bitmap.
// Incrementing this iterator can continue indefinitely,
regardless of the bitmap size.
MaskIterator begin() const
{
return MaskIterator(this);
}

size_t repsize() const
{
return repSize;
}

size_t size() const
{
return initSize + repSize;
}

// prints a collection to stderr
void dump(size_t limit = 0) const
{
std::cerr << std::dec << repSize;
std::cerr << std::hex;

if (limit == 0) limit = initSize + repSize;

auto iter = begin();
for (auto i = 0; i < limit; ++i)
{
// cast prevents attempting to print uint8_t as
a char
std::cerr << ',' << std::setfill('0') << std::setw(sizeof(StoreType) * 2) << uint64_t(mStorePtr[i]);
}
std::cerr << std::endl;
std::cerr << std::dec << repSize;

for (auto p = 1; p < limit * 16 * sizeof(StoreType); ++p)
{
if (get(p))
{
std::cerr << ',' << p;
}
}
std::cerr << std::endl;
}
};

int main(int argc, char**argv)
{
// find out if the user asked for a different number of primes
PRIME PrimeCount = 1e9;
if (argc >= 2)
{
std::istringstream arg1(argv[1]);
std::string crud;
arg1 >> PrimeCount >> crud;
if (!crud.empty() || PrimeCount < 3)
{
double isitfloat;
arg1.seekg(0, arg1.beg);
crud.clear();
arg1 >> isitfloat >> crud;
if (!crud.empty() || isitfloat < 3)
{
std::cerr << "Invalid first argument
\"" << argv[1] << "\" Should be decimal number of primes required\n";
exit(42);
}
else PrimeCount = isitfloat;
}
}

std::string outName;
if (argc >= 3)
{
outName = argv[2];
}

Stopwatch s; s.start();

Masker primes;
PRIME nextPrime;
{
nextPrime = 3;
Masker tiny(nextPrime);
std::vector<Masker> smallMasks;
smallMasks.emplace_back(tiny);

// Make a mask for the really small primes, 3 5 7 11
size_t count = 1; // allows for the 3
for ( ; count < tinyCount; ++count)
{
do
{
nextPrime += 2;
} while (!tiny.get(nextPrime));
tiny = Masker(tiny, nextPrime);
}

// that masker for three will be correct up until 5
squared,
// the first non-prime whose smallest root is not 2 or 3
assert (nextPrime < 25 && "If this triggers tinyCount
is too big and the code will give wrong results");

// this limit is too small, but is easy to calculate
and big enough
auto limit = nextPrime*nextPrime;

smallMasks.clear();
for (; count < smallCount; ++count)
{
do
{
nextPrime += 2;
} while (!tiny.get(nextPrime));
smallMasks.emplace_back(nextPrime);
}
assert(nextPrime <= limit && "If this triggers
smallCount is too big and the code will give wrong results");

// Make a single masker for all the small primes. Don't
make it bigger than needed.
auto small = Masker(smallMasks.begin(),
smallMasks.end(), PrimeCount / (2 * CHAR_BIT * sizeof(StoreType)) + 1);
s.lap(__LINE__);

// This is the first step that takes an appreciable
time. 2.5s for primes up to 1e11 on my machine.
primes = Masker(tiny, small, PrimeCount / (2 * CHAR_BIT
* sizeof(StoreType)) + 1);
s.lap(__LINE__);
}

// if the buffer isn't big enough yet expand it by duplicating
early entries.
if (primes.size() < PrimeCount / (2 * CHAR_BIT *
sizeof(StoreType)) + 1)
{
primes.resize(PrimeCount);
s.lap(__LINE__);
}

do
{
std::vector<PRIME> quickies;
PRIME quickMaxPrime = std::min(nextPrime*nextPrime, PRIME(sqrt(PrimeCount) + 1));
do
{
do
{
nextPrime += 2;
} while (!primes.get(nextPrime));
quickies.emplace_back(nextPrime);
} while (nextPrime < quickMaxPrime);
s.lap(__LINE__);

// this function adds all the quickies into the mask,
but does all of them in
// one area of memory before moving onto the next. The
area of memory,
// and the list of primes, both fit into the L1 cache.
// You may need to adjust bigCache for best performance.
// This step takes a while - two lots of 20s for qe11
primes.
primes.andequals(bigCache, quickies.begin(),
quickies.end());
s.lap(__LINE__);
} while (nextPrime*nextPrime < PrimeCount);

s.stop();
std::cerr << primes.size() << "," << s.laps().back().second << std::endl;
for (auto l: s.laps()) std::cerr << l.first << "," << l.second
<< std::endl;

if (outName.length())
{
std::ofstream outfile(outName);
outfile << "2\n";
for (PRIME prime = 3; prime < PrimeCount; prime += 2)
{
if (primes.get(prime))
{
outfile << prime << '\n';
}
}
}

return 0;
}

--- Synchronet 3.20a-Linux NewsLink 1.114

From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to alt.comp.lang.c,comp.lang.c on Sun Oct 20 12:14:42 2024

From Newsgroup: comp.lang.c

In article <vf2n7i$cko5$1@dont-email.me>,
Vir Campestris <vir.campestris@invalid.invalid> wrote:
...

#include <algorithm>

What newsgroup(s) is/are we in?

Let's check:

Newsgroups: alt.comp.lang.c,comp.lang.c

Hmmm.
--
A pervert, a racist, and a con man walk into a bar...

Bartender says, "What will you have, Donald!"

--- Synchronet 3.20a-Linux NewsLink 1.114

From Bonita Montero@Bonita.Montero@gmail.com to alt.comp.lang.c,comp.lang.c on Mon Oct 21 17:36:57 2024

From Newsgroup: comp.lang.c

Am 20.10.2024 um 12:49 schrieb Vir Campestris:

On 04/10/2024 05:53, Bonita Montero wrote:

Am 03.10.2024 um 18:06 schrieb Vir Campestris:

for 4e9 primes my machine take nearly 10 seconds for 4294967295
primes. The one I posted before over there takes 1.5 seconds.

Show me your code. I'm pretty sure you never posted faster code.

Try this.

Still about half the speed of my code for 2 ^ 32.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Richard Harnden@richard.nospam@gmail.invalid to comp.lang.c on Tue Oct 22 17:13:08 2024

From Newsgroup: comp.lang.c

On 20/10/2024 11:49, Vir Campestris wrote:

On 04/10/2024 05:53, Bonita Montero wrote:

Am 03.10.2024 um 18:06 schrieb Vir Campestris:

for 4e9 primes my machine take nearly 10 seconds for 4294967295
primes. The one I posted before over there takes 1.5 seconds.

Show me your code. I'm pretty sure you never posted faster code.

Try this.

I just want to say that those are very good comments.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Sat Oct 26 15:28:24 2024

From Newsgroup: comp.lang.c

On 21/10/2024 16:36, Bonita Montero wrote:

Am 20.10.2024 um 12:49 schrieb Vir Campestris:

On 04/10/2024 05:53, Bonita Montero wrote:

Am 03.10.2024 um 18:06 schrieb Vir Campestris:

for 4e9 primes my machine take nearly 10 seconds for 4294967295
primes. The one I posted before over there takes 1.5 seconds.

Show me your code. I'm pretty sure you never posted faster code.

Try this.

Still about half the speed of my code for 2 ^ 32.

Fascinating.

But after my latest experience perhaps not surprising.

You're on Windows aren't you? I do know you have a far more powerful
machine than I do. I'm running Ubuntu with Gnu compiler. (I tried Clang,
but it's slower)

When I try your code on my system it takes 10.4s eleapsed, and 12.1s of
CPU to calculate 4294967296. That code I supplied up there takes only
1.47s elapsed, and a little less CPU - it's single threaded.

So on _my_ system my code is about 7 times faster!

But my recent experience - I've been playing with using a modulo30
store, not just odds.

My code makes bitmasks for 2 sets of small primes (these can be small,
there's a repeat at the product of all the primes) then combines the 2
to make the initial mask for the desired number of primes. It then masks
out the bigger primes.

With a mask of odds-only the mask words can be 64 bit, and combining
them is way faster than Mod30 - where the mask is a byte. So I put
together some code so that it would put the two small masks together
with 64 bit operations. That bit of the code got twice as fast. Then I realised that the next part, for reasons I have been unable to
establish, runs three times more slowly.

But only when I'm using GCC (not Clang) and only when -Ofast
optimisation is turned on. That loop is the same speed in all 0other cases.

I then tried putting some debug timing statements in, and I could get
the same effect by inserting some in in various places.

I looked at the output assembly, and there seem to be random changes (instructions that are not dependent on each other being re-ordered) so
it's not obvious why this is happening.
I think I'm bored enough with this project to just leave it. I've been
more interested in correctness, not speed, in recent years.

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Sat Oct 26 15:30:15 2024

From Newsgroup: comp.lang.c

On 20/10/2024 13:14, Kenny McCormack wrote:

In article <vf2n7i$cko5$1@dont-email.me>,
Vir Campestris <vir.campestris@invalid.invalid> wrote:
...

#include <algorithm>

What newsgroup(s) is/are we in?

Let's check:

Newsgroups: alt.comp.lang.c,comp.lang.c

Hmmm.

And the thread title is... ?

Just kill the thread if you don't care. I won't start another one.

Andy

--- Synchronet 3.20a-Linux NewsLink 1.114

From Bonita Montero@Bonita.Montero@gmail.com to alt.comp.lang.c,comp.lang.c on Sat Oct 26 16:33:45 2024

From Newsgroup: comp.lang.c

Am 26.10.2024 um 16:28 schrieb Vir Campestris:

You're on Windows aren't you? I do know you have a far more powerful
machine than I do. I'm running Ubuntu with Gnu compiler. (I tried Clang,
but it's slower)

I compared your code against mine on a AMD 7950X.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Sat Oct 26 21:02:10 2024

From Newsgroup: comp.lang.c

On 26/10/2024 15:33, Bonita Montero wrote:

Am 26.10.2024 um 16:28 schrieb Vir Campestris:

You're on Windows aren't you? I do know you have a far more powerful
machine than I do. I'm running Ubuntu with Gnu compiler. (I tried
Clang, but it's slower)

I compared your code against mine on a AMD 7950X.

Which operating system, and which compiler?

It's probably also worth trying for a much larger number of primes. I
wrote my command line parser so it will try to read the number as a
double if it isn't an integer.

For 100000000000 primes my code on my machine is 39s, and yours is 3:46.

(I've got the mod30 version down to 28s now, but met an intractable
optimiser funny)

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From Bonita Montero@Bonita.Montero@gmail.com to alt.comp.lang.c,comp.lang.c on Sun Oct 27 06:02:25 2024

From Newsgroup: comp.lang.c

Am 26.10.2024 um 22:02 schrieb Vir Campestris:

Which operating system, and which compiler?

Win11, clang-cl 17.0.3. MSVC is about the same performance.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to alt.comp.lang.c,comp.lang.c on Mon Oct 28 16:21:24 2024

From Newsgroup: comp.lang.c

On 27/10/2024 05:02, Bonita Montero wrote:

Am 26.10.2024 um 22:02 schrieb Vir Campestris:

Which operating system, and which compiler?

Win11, clang-cl 17.0.3. MSVC is about the same performance.

The plot thickens.

I'm in Ubuntu 24.01 LTS.

I normally use Gnu C++, g++ (Ubuntu 13.2.0-23ubuntu4) 13.2.0.

Using g++ for 10000000000 primes my code took 3.3s, and yours 22.7.

If I rebuild them using clang (Ubuntu clang version 18.1.3) then
my code takes 14.5s (far slower) and yours takes 11.6 (much quicker than
with Gnu and a bit quicker than mine).

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

Sysop:	DaiTengu
Location:	Appleton, WI
Users:	991
Nodes:	10 (0 / 10)
Uptime:	119:43:19
Calls:	12,958
Files:	186,574
Messages:	3,265,638

Re: A very slow program

Who's Online

System Info