Improve serial building of a string with openMP {Copeland-Erdős constant} - string

I'm building a program in C++11 to find substrings of the Copeland-Erdős constant.
The Copeland-Erdős constant is the string formed by writing all primes in order:
2,3,5,7,11,13… → 23571113…
I need to check whether a given substring occurs in that constant, and do it quickly.
So far I've built a serial program that uses a Miller-Rabin function to test whether each number produced by a counter is prime, and appends the primes to the main string (the constant). Finding the 8th Mersenne prime (2^31-1) takes the program 8 minutes.
I then use find to check whether the given substring is in the constant and at which position it starts.
PROBLEMS:
Everything is serial: I start at 0 and test every number for primality, appending it if it is prime. I don't know whether there is a better way. The substring can span several primes, e.g. 1..{1131}..7 is a substring of 11, 13, 17.
Do you have any proposal to improve the program's execution time using OpenMP?
I want to find the 9th Mersenne prime in "human time". I have spent more than a day and the program still hasn't reached that number.
gcc version 4.4.7 20120313
Main.cpp
while (found == -1 && lastNumber < LIMIT) //while not found & not pass our limit
{
//I generate at least a string with double size of the input (llargada)
for (lastNumber; primers.length() <= 2*llargada; lastNumber++){
if (is_prime_mr(lastNumber))
primers += to_string(lastNumber); //if prime, we add it to the main string
}
found = primers.find(sequencia); //search substring and keep position
if (found == string::npos){ //if not found
indexOfZero += primers.length()/2; //keep IndexOfZero, the position of string in global constant
primers.erase(0,primers.length()/2); //delete first middle part of calculated string
}
}
if (found != -1){
cout << "FOUNDED!" << endl;
cout << "POS: " << indexOfZero << " + " << found << " = " << indexOfZero+found << endl;} //that give us the real position of the substring in the main string
//although we only spend 2*inputString.size() memory
else
cout << "NOT FOUND" << endl;

Improving serial execution:
For starters, you do not need to check every number to see if it's prime, but rather every odd number (except for 2). We know that no even number past two can be prime. This should cut down your execution time in half.
Also, I do not understand why you have a nested loop. You should only have to check your list once.
Also, I fear that your algorithm might not be correct. Currently, if you do not find the substring, you delete half of your string and move on. However, if you have 50 non-primes in a row, you could end up deleting the entire string except for the very last character. But what if the substring you're looking for is 3 digits and needed 2 of the previous characters? Then you've erased some of the information needed to find your solution!
Finally, you should only search for your substring if you've actually found a prime number. Otherwise, you have already searched for it last iteration and nothing has been added to your string.
Combining all of these ideas, you have:
primers = "23";
lastNumber = 3;
found = -1;
while (found == -1)
{
lastNumber += 2;
if (is_prime_mr(lastNumber)) {
primers += to_string(lastNumber); //if prime, we add it to the main string
found = primers.find(sequencia); //search substring and keep position
if (found == string::npos)
found = -1;
else
break;
}
}
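Also, you should write your own find function that only checks the last few digits (where few = the length of your most recent concatenation to the global string primers, plus the length of the substring minus one). If the substring wasn't in the previous global string, there are only a few places it could pop up in your newest string. The cost of that search depends only on the lengths of the substring and of the newest prime, not on the length n of the whole string, as opposed to the O(n) of a full find.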
int findSub(std::string total, std::string substring, std::string lastAddition);
With this change your if statement should change to:
if (found != -1)
break;
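The tail-only search described above could be sketched as follows. This is a minimal illustration, not the original poster's code: it assumes `total` already contains `lastAddition` at its end, takes its arguments by const reference, and returns -1 when there is no match, which is the convention the `found != -1` test relies on.

```cpp
#include <cstddef>
#include <string>

// Hypothetical findSub: searches only the part of `total` where a new match
// can appear after `lastAddition` has been appended to it.
long findSub(const std::string& total, const std::string& substring,
             const std::string& lastAddition) {
    if (substring.empty() || substring.size() > total.size()) return -1;
    // A new match must overlap the appended digits, so it cannot start earlier
    // than (substring.size() - 1) characters before the addition begins.
    const std::size_t window = lastAddition.size() + substring.size() - 1;
    const std::size_t start = total.size() > window ? total.size() - window : 0;
    const std::size_t pos = total.find(substring, start);
    return pos == std::string::npos ? -1 : static_cast<long>(pos);
}
```

Because matches that lie entirely in the older part of the string are skipped, this only works if the search is run after every append, as in the loop above.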
Adding parallelism:
Unfortunately, as-is, your algorithm is inherently serial because you have to iterate through all the primes one-by-one, adding them to the list in a row in order to find your answer. There's no simple OpenMP way to parallelize your algorithm.
However, you can take advantage of parallelism by breaking up your string into pieces and having each thread work separately. Then, the only tricky thing you have to do is consider the boundaries between the final strings to double check you haven't missed anything. Something like the following:
bool globalFound = false;
bool found;
std::vector<std::string> primers;
#pragma omp parallel private(lastNumber, myFinalNumber, found, my_id, num_threads)
{
my_id = omp_get_thread_num();
num_threads = omp_get_num_threads();
// one thread sizes the vector; the implicit barrier at the end of "single"
// ensures every thread sees the resized vector before writing to it
#pragma omp single
primers.resize(num_threads);
if (my_id == 0) { // first thread starts at 0... well, actually 3
primers[my_id] = "23";
lastNumber = 3;
}
else {
primers[my_id] = "";
lastNumber = (my_id/(double)num_threads)*LIMIT - 2; // figure out my starting place
if (lastNumber % 2 == 0) // ensure I'm not even
lastNumber++;
}
found = false;
myFinalNumber = ((my_id+1)/(double)num_threads)*LIMIT - 2;
while (!globalFound && lastNumber < myFinalNumber)
{
lastNumber += 2;
if (is_prime_mr(lastNumber)) {
primers[my_id] += to_string(lastNumber);
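found = findSub(primers[my_id], sequencia, to_string(lastNumber)) != -1; // your new version of find; -1 means no match
if (found) {
// a plain assignment is not a valid "atomic" statement before OpenMP 3.1, so use critical
#pragma omp critical
globalFound = true;
break;
}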
}
}
}
}
if (!globalFound) {
// Result was not found in any thread, so check for boundaries/endpoints
globalFound = findVectorSubstring(primers, sequencia);
}
I'll let you finish this (by writing the smart find and findVectorSubstring, which should only check the boundaries between elements of primers, and by double checking that you understand the logic of this new algorithm). Furthermore, if the arbitrary LIMIT that you set up turns out to be too small, you can always wrap this whole thing in a loop that searches between i*LIMIT and (i+1)*LIMIT.
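One possible shape for the boundary check is sketched below. It is an assumption about what findVectorSubstring could look like, not code from the answer: it only reports matches that straddle at least one boundary between consecutive per-thread strings, since matches inside a single string are already covered by the per-thread search.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical findVectorSubstring: true if `needle` occurs across a boundary
// between consecutive elements of `parts` (taken as one concatenated string).
bool findVectorSubstring(const std::vector<std::string>& parts,
                         const std::string& needle) {
    if (needle.size() < 2) return false;  // one character cannot straddle a boundary
    const std::size_t keep = needle.size() - 1;
    std::string carry;  // last `keep` characters of everything seen so far
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (!carry.empty() && !parts[i].empty()) {
            // Both sides are shorter than the needle, so any hit spans the boundary.
            const std::string window = carry + parts[i].substr(0, keep);
            if (window.find(needle) != std::string::npos) return true;
        }
        carry += parts[i];
        if (carry.size() > keep) carry.erase(0, carry.size() - keep);
    }
    return false;
}
```

Keeping a running tail instead of looking at only two neighbours also handles the case where one thread's string is shorter than the substring.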
Lastly, yes there will be load balancing issues. I can certainly imagine threads finding an uneven amount of prime numbers. Therefore, certain threads will be doing more work in the find function than others. However, a smart version of find() should be O(1) whereas is_prime_mr() is probably O(n) or O(logn), so I'm assuming that the majority of the execution time will be spent in the is_prime_mr() function. Therefore, I do not believe the load balancing will be too bad.
Hope this helps.

Related

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudocode, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else passed to another kernel that does this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflicts by reading from local memory essentially sequentially. I would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
If you have support for OpenCL C 2.0 features, you can use the new work_group_reduce_add() function for a sum reduction inside a single work-group.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
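To see why the "m" and "n" arithmetic handles sizes that are not powers of two, the same fold can be replayed on the CPU. The following is only a sequential stand-in for the kernel plus host loop (the name foldSum is made up for this sketch); the inner loop plays the role of the m work-items launched in each pass.

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of the fold reduction: each pass adds the upper part of the
// data onto the lower part until a single value remains.
float foldSum(std::vector<float> values) {
    if (values.empty()) return 0.0f;
    std::size_t t = values.size();
    while (t > 1) {
        const std::size_t m = t / 2;        // work-items in this pass
        const std::size_t n = (t + 1) / 2;  // offset of the upper part, and the new size
        for (std::size_t gid = 0; gid < m; ++gid) {
            values[gid] += values[gid + n];
        }
        t = n;  // when t was odd, the middle element is carried over untouched
    }
    return values[0];
}
```

With five elements the passes use (m, n) = (2, 3), (1, 2), (1, 1), so every element is added exactly once.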

Read a String with spaces till a new line in C

I am in a pickle right now. I'm having trouble taking in an input of example
1994 The Shawshank Redemption
1994 Pulp Fiction
2008 The Dark Knight
1957 12 Angry Men
I first read the number into an integer, then I need to read the name of the movie into a string using a character array; however, I have not been able to get this done.
Here is the code at the moment:
while(scanf("%d", &myear) != EOF)
{
i = 0;
while(scanf("%[^\n]", &ch))
{
title[i] = ch;
i++;
}
addNode(makeData(title,myear));
}
The title array is arbitrarily large and the function is to add the data as a node to a linked list. right now the output I keep getting for each node is as follows
" hank Redemption"
" ion"
" Knight"
" Men"
Yes, it oddly prints a space in front of the cut-off title. I checked the variables and it adds the space in the data. (I am not printing the year as that is taken in correctly)
How can I fix this?
You are using the wrong type of argument passed to scanf() -- instead of scanning a character, try scanning to the string buffer immediately. %[^\n] scans an entire string up to (but not including) the newline. It does not scan only one character.
(Marginal secondary problem: comparing scanf()'s return value against EOF alone is not enough. scanf() returns EOF only if input ends or fails before the first conversion; on a matching failure it returns 0, which the != EOF test treats as success. Compare the return value with the number of conversions you expect instead.)
I hope you see now: scanf() is hard to get right. It's evil. Why not input the whole line at once then parse it using sane functions?
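char buf[LINE_MAX]; // LINE_MAX comes from <limits.h> on POSIX systems
while (fgets(buf, sizeof buf, stdin) != NULL) {
buf[strcspn(buf, "\n")] = '\0'; // drop the trailing newline that fgets keeps
int year = strtol(buf, NULL, 10); // base 10, so a leading 0 is not read as octal
const char *p = strchr(buf, ' ');
if (p != NULL) {
char name[LINE_MAX];
strcpy(name, p + 1); // safe because strlen(p + 1) < sizeof(name)
addNode(makeData(name, year)); // as in the question
}
}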
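For comparison, the same "read the whole line, then split it" idea can be sketched in C++ with string streams. The helper name parseMovieLine is made up for this illustration, and the sketch assumes each line has the form "year title".

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: splits a line such as "1994 Pulp Fiction" into a year
// and a title. Returns false if the line has no leading number or no title.
bool parseMovieLine(const std::string& line, int& year, std::string& title) {
    title.clear();
    std::istringstream in(line);
    if (!(in >> year)) return false;  // the line does not start with a number
    in >> std::ws;                    // skip the blanks after the year
    std::getline(in, title);          // the rest of the line, spaces included
    return !title.empty();
}
```

A title that itself starts with digits, such as "12 Angry Men", is kept intact because only the first token is parsed as the year.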

OpenMP implement switch ... case

I'm trying to parallelize switch ... case (c++) using OpenMP directive, but despite my best efforts, the code goes slower than normal sequential execution.
I have used #pragma omp parallel and #pragma omp sections, and I have tried rewriting the switch as an if ... else chain, but with no good result:
switch (number) {
case 1:
f1();
break;
case 2:
f2();
break;
case 3:
f3();
break;
case 4:
fn();
break;
}
Then there is a second problem: OpenMP does not let me use break or return inside the parallel construct.
A switch statement cannot be parallelized just by adding pragmas such as parallel or sections, because only one case runs each time the switch is evaluated. Threads in a parallel region either divide work among themselves through a loop index or all execute the same code. A worksharing construct needs to know either how many elements it has to work on or a master condition that determines where the work starts and ends. What you want to parallelize is the handling of the inputs rather than the functions (f1, f2, ..., fn), so I am guessing you are processing a lot of "number" values. One way is to collect these numbers in an array or vector. Then you can run a parallel for over this array or vector, calling the corresponding function for each element.
while(some_condition_on_numbers)
{
// Collect Numbers in a vector / some array
}
#pragma omp parallel for
for(int counter = 0; counter < elements_to_process; counter++)
{
F(array_of_number[counter]);
}
void F(int choice)
{
if(choice == 1) {f1(); } // == compares; a single = would assign and always be true
else if(choice == 2) {f2(); }
// ... remaining cases
}
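A self-contained version of this pattern might look like the sketch below. The functions f1 to f3 are placeholders that return distinct values so the result can be checked, and processAll is a made-up name. Compile with OpenMP enabled (for example -fopenmp with GCC) to run the loop in parallel; without it the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <vector>

// Placeholder work functions standing in for f1()..fn() from the question.
int f1() { return 10; }
int f2() { return 20; }
int f3() { return 30; }

// The switch stays serial; it simply selects the work for one input.
int dispatch(int number) {
    switch (number) {
        case 1: return f1();
        case 2: return f2();
        case 3: return f3();
        default: return 0;  // unknown selector
    }
}

// The parallelism is across inputs, not across the cases of the switch.
std::vector<int> processAll(const std::vector<int>& numbers) {
    std::vector<int> results(numbers.size());
    const long count = static_cast<long>(numbers.size());
    // Each iteration writes only its own slot, so no synchronization is needed.
    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        results[i] = dispatch(numbers[i]);
    }
    return results;
}
```

This only pays off when each call does enough work to outweigh the cost of starting and synchronizing the threads.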

Longest Subsequence with all occurrences of a character at 1 place

Consider a sequence S of n characters, in which each character may occur many times. You want to find the longest subsequence of S in which all occurrences of the same character are together in one place.
For example, if S = aaaccaaaccbccbbbab, then the longest such subsequence (the answer) is aaaaaaccccbbbb, i.e. aaa__aaacc_ccbbb_b.
In other words, any alphabet character that appears in S may only appear in one contiguous block in the subsequence. If possible, give a polynomial-time algorithm to determine the solution.
Design
Below I give a C++ implementation of a dynamic programming algorithm that solves this problem. An upper bound on the running time (which is probably not tight) is given by O(g*(n^2 + log(g))), where n is the length of the string and g is the number of distinct subsequences in the input. (A subsequence, unlike a substring, may consist of noncontiguous characters from the original string.) I don't know a good way to characterise g, but it can be as bad as O(2^n) for a string consisting of n distinct characters, making this algorithm exponential-time in the worst case. It also uses O(ng) space to hold the DP memoisation table. In practice, the algorithm will be fast whenever the number of distinct characters is small.
The two key ideas used in coming up with this algorithm were:
Every subsequence of a length-n string is either (a) the empty string or (b) a subsequence whose first element is at some position 1 <= i <= n and which is followed by another subsequence on the suffix beginning at position i+1.
If we append characters (or more specifically character positions) one at a time to a subsequence, then in order to build all and only the subsequences that satisfy the validity criteria, whenever we add a character c, if the previous character added, p, was different from c, then it is no longer possible to add any p characters later on.
There are at least 2 ways to manage the second point above. One way is to maintain a set of disallowed characters (e.g. using a 256-bit array), which we add to as we add characters to the current subsequence. Every time we want to add a character to the current subsequence, we first check whether it is allowed.
Another way is to realise that whenever we have to disallow a character from appearing later in the subsequence, we can achieve this by simply deleting all copies of the character from the remaining suffix, and using this (probably shorter) string as the subproblem to solve recursively. This strategy has the advantage of making it more likely that the solver function will be called multiple times with the same string argument, which means more computation can be avoided when the recursion is converted to DP. This is how the code below works.
The recursive function ought to take 2 parameters: the string to work on, and the character most recently appended to the subsequence that the function's output will be appended to. The second parameter must be allowed to take on a special value to indicate that no characters have been appended yet (which happens in the top-level recursive case). One way to accomplish this would be to choose a character that does not appear in the input string, but this introduces a requirement not to use that character. The obvious workaround is to pass a 3rd parameter, a boolean indicating whether or not any characters have already been added. But it's slightly more convenient to use just 2 parameters: a boolean indicating whether any characters have been added yet, and a string. If the boolean is false, then the string is simply the string to be worked on. If it is true, then the first character of the string is taken to be the last character added, and the rest is the string to be worked on. Adopting this approach means the function takes only 2 parameters, which simplifies memoisation.
As I said at the top, this algorithm is exponential-time in the worst case. I can't think of a way to completely avoid this, but some optimisations can help certain cases. One that I've implemented is to always add maximal contiguous blocks of the same character in a single step, since if you add at least one character from such a block, it can never be optimal to add fewer than the entire block. Other branch-and-bound-style optimisations are possible, such as keeping track of a globally best string so far and cutting short the recursion whenever we can be certain that the current subproblem cannot produce a longer one -- e.g. when the number of characters added to the subsequence so far, plus the total number of characters remaining, is less than the length of the best subsequence so far.
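The validity criterion itself (every character confined to one contiguous block) is easy to check directly, which is handy for testing a solver's output. The following is a small sketch of such a checker; the name hasContiguousBlocks is made up here and it is not part of the solution code.

```cpp
#include <cstddef>
#include <string>

// Returns true when every distinct character of s occupies a single
// contiguous block, i.e. s satisfies the validity criterion.
bool hasContiguousBlocks(const std::string& s) {
    bool closed[256] = {false};  // closed[c]: the block of c has already ended
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (closed[c]) return false;  // c reappears after its block ended
        if (i + 1 < s.size() && s[i + 1] != s[i]) {
            closed[c] = true;         // the block of c ends at position i
        }
    }
    return true;
}
```

It runs in time linear in the length of the candidate string and does not check that the candidate is actually a subsequence of the input.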
Code
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <map>
using namespace std;
class RunFinder {
string s;
map<string, string> memo[2]; // DP matrix
// If skip == false, compute the longest valid subsequence of t.
// Otherwise, compute the longest valid subsequence of the string
// consisting of t without its first character, taking that first character
// to be the last character of a preceding subsequence that we will be
// adding to.
string calc(string const& t, bool skip) {
map<string, string>::iterator m(memo[skip].find(t));
// Only calculate if we haven't already solved this case.
if (m == memo[skip].end()) {
// Try the empty subsequence. This is always valid.
string best;
// Try starting a subsequence whose leftmost position is one of
// the remaining characters. Instead of trying each character
// position separately, consider only contiguous blocks of identical
// characters, since if we choose one character from this block there
// is never any harm in choosing all of them.
for (string::const_iterator i = t.begin() + skip; i != t.end();) {
if (t.end() - i < best.size()) {
// We can't possibly find a longer string now.
break;
}
string::const_iterator next = find_if(i + 1, t.end(), bind1st(not_equal_to<char>(), *i));
// Just use next - 1 to cheaply give us an extra char at the start; this is safe
string u(next - 1, t.end());
u[0] = *i; // Record the previous char for the recursive call
if (skip && *i != t[0]) {
// We have added a new segment that is different from the
// previous segment. This means we can no longer use the
// character from the previous segment.
u.erase(remove(u.begin() + 1, u.end(), t[0]), u.end());
}
string v(i, next);
v += calc(u, true);
if (v.size() > best.size()) {
best = v;
}
i = next;
}
m = memo[skip].insert(make_pair(t, best)).first;
}
return (*m).second;
}
public:
RunFinder(string s) : s(s) {}
string calc() {
return calc(s, false);
}
};
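int main(int argc, char **argv) {
if (argc < 2) { // the input string is expected as the first argument
cerr << "usage: runfinder <string>\n";
return 1;
}
RunFinder rf(argv[1]);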
cout << rf.calc() << '\n';
return 0;
}
Example results
C:\runfinder>stopwatch runfinder aaaccaaaccbccbbbab
aaaaaaccccbbbb
stopwatch: Terminated. Elapsed time: 0ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch runfinder abbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,mnnsdbbbf
stopwatch: Terminated. Elapsed time: 609ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch -v runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop
stopwatch: Command to be run: <runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop>.
stopwatch: Global memory situation before commencing: Used 2055507968 (49%) of 4128813056 virtual bytes, 1722564608 (80%) of 2145353728 physical bytes.
stopwatch: Process start time: 21/11/2012 02:53:14
abcdefghijklmnopqrstuvwxyz123456
stopwatch: Terminated. Elapsed time: 8062ms, CPU time: 7437ms, User time: 7328ms, Kernel time: 109ms, CPU usage: 92.25%, Page faults: 35473 (+35473), Peak working set size: 145440768, Peak VM usage: 145010688, Quota peak paged pool usage: 11596, Quota peak non paged pool usage: 1256
stopwatch: Process completed with exit code 0.
stopwatch: Process completion time: 21/11/2012 02:53:22
The last run, which took 8 s and used 145 MB, shows how it can have problems with strings containing many distinct characters.
EDIT: Added in another optimisation: we now exit the loop that looks for the place to start the subsequence if we can prove that it cannot possibly be better than the best one discovered so far. This drops the time needed for the last example from 32s down to 8s!
EDIT: This solution is wrong for OP's problem. I'm not deleting it because it might be right for someone else. :)
Consider a related problem: find the longest subsequence of S of consecutive occurrences of a given character. This can be solved in linear time:
char c = . . .; // the given character
int start = -1;
int bestStart = -1;
int bestLength = 0;
int currentLength = 0;
for (int i = 0; i < S.length(); ++i) {
if (S.charAt(i) == c) {
if (start == -1) {
start = i;
}
++currentLength;
} else {
if (currentLength > bestLength) {
bestStart = start;
bestLength = currentLength;
}
start = -1;
currentLength = 0;
}
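}
if (currentLength > bestLength) { // a run that reaches the end of S is never closed inside the loop
bestStart = start;
bestLength = currentLength;
}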
if (bestStart >= 0) {
// longest sequence of c starts at bestStart
} else {
// character c does not occur in S
}
If the number of distinct characters (call it m) is reasonably small, just apply this algorithm in parallel to each character. This can be easily done by converting start, bestStart, currentLength, bestLength to arrays m long. At the end, scan the bestLength array for the index of the largest entry and use the corresponding entry in the bestStart array as your answer. The total complexity is O(mn).
import java.util.*;
public class LongestSubsequence {
/**
* @param args
*/
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String str = sc.next();
execute(str);
}
static void execute(String str) {
int[] hash = new int[256];
String ans = "";
for (int i = 0; i < str.length(); i++) {
char temp = str.charAt(i);
hash[temp]++;
}
for (int i = 0; i < hash.length; i++) {
if (hash[i] != 0) {
for (int j = 0; j < hash[i]; j++)
ans += (char) i;
}
}
System.out.println(ans);
}
}
Space: 256 entries -> O(256). I don't know if it's correct to put it this way, because I think O(256) is just O(1).
Time: O(n)

How can I make this prime finder operate in parallel

I know prime finding is well studied, and there are a lot of different implementations. My question is, using the provided method (code sample), how can I go about breaking up the work? The machine it will be running on has 4 quad core hyperthreaded processors and 16GB of ram. I realize that there are some improvements that could be made, particularly in the IsPrime method. I also know that problems will occur once the list has more than int.MaxValue items in it. I don't care about any of those improvements. The only thing I care about is how to break up the work.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Prime
{
class Program
{
static List<ulong> primes = new List<ulong>() { 2 };
static void Main(string[] args)
{
ulong reportValue = 10;
for (ulong possible = 3; possible <= ulong.MaxValue; possible += 2)
{
if (possible > reportValue)
{
Console.WriteLine(String.Format("\nThere are {0} primes less than {1}.", primes.Count, reportValue));
try
{
checked
{
reportValue *= 10;
}
}
catch (OverflowException)
{
reportValue = ulong.MaxValue;
}
}
if (IsPrime(possible))
{
primes.Add(possible);
Console.Write("\r" + possible);
}
}
Console.WriteLine(primes[primes.Count - 1]);
Console.ReadLine();
}
static bool IsPrime(ulong value)
{
foreach (ulong prime in primes)
{
if (value % prime == 0) return false;
if (prime * prime > value) break;
}
return true;
}
}
}
There are 2 basic schemes I see: 1) using all threads to test a single number, which is probably great for higher primes but I cannot really think of how to implement it, or 2) using each thread to test a single possible prime, which can cause a non-continuous string of primes to be found and run into unused resources problems when the next number to be tested is greater than the square of the highest prime found.
To me it feels like both of these situations are challenging only in the early stages of building the list of primes, but I'm not entirely sure. This is being done for a personal exercise in breaking this kind of work.
If you want, you can parallelize both operations: the checking of a single candidate, and the checking of multiple candidates at once. Though I'm not sure this would help. To be honest I'd consider removing the threading in main().
I've tried to stay faithful to your algorithm, but to speed it up a lot I've used x*x instead of reportValue; this is something you could easily revert if you wish.
To further improve on my core splitting you could estimate the number of computations the divisions require based on the size of the numbers, and split the list that way (smaller numbers take less time to divide by, so make the first partitions larger).
Also, my concept of a thread pool may not exist in the way I want to use it.
Here's my go at it (pseudo-ish code):
List<int> primes = {2};
List<int> nextPrimes = {};
int cores = 4;
main()
{
for (int x = 3; x < MAX; x=x*x){
int localmax = x*x;
for(int y = x; y < localmax; y+=2){
thread{primecheck(y);}
}
"wait for all threads to be executed"
primes.add(nextPrimes);
nextPrimes = {};
}
}
void primecheck(int y)
{
bool primality = true; // assume prime until some thread finds a divisor
threadpool? pool;
for(int x = 0; x < cores; x++){
pool.add(thread{
if (!smallcheck(x*primes.length/cores,(x+1)*primes.length/cores ,y)){
primality = false;
pool.kill();
}
});
}
"wait for all threads to be executed or killed"
if (primality)
nextPrimes.add(y);
}
bool smallcheck(int a, int b, int y){
foreach (int div in primes[a to b])
if (y%div == 0)
return false;
return true;
}
E: I added what I think pooling should look like, look at revision if you want to see it without.
Use the sieve of Eratosthenes instead. It's not worthwhile to parallelize unless you use a good algorithm in the first place.
Separate the space to sieve into large regions and sieve each in its own thread. Or better use some workqueue concept for large regions.
Use a bit array to represent the prime numbers, it takes less space than representing them explicitly.
See also this answer for a good implementation of a sieve (in Java, no split into regions).
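The region-splitting idea can be sketched as a segmented sieve. The code below is an illustration under a few assumptions: the names and the segment size parameter are made up, it counts primes rather than storing them, and it relies on std::vector<bool> as the bit array. Each segment depends only on the shared list of small base primes, which is what makes the regions independent.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain sieve for the small "base" primes up to limit.
std::vector<std::uint64_t> basePrimes(std::uint64_t limit) {
    std::vector<bool> composite(limit + 1, false);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t p = 2; p <= limit; ++p) {
        if (composite[p]) continue;
        primes.push_back(p);
        for (std::uint64_t q = p * p; q <= limit; q += p) composite[q] = true;
    }
    return primes;
}

// Counts the primes in [lo, hi) using only the base primes.
std::uint64_t countPrimesInSegment(std::uint64_t lo, std::uint64_t hi,
                                   const std::vector<std::uint64_t>& base) {
    if (lo < 2) lo = 2;
    if (hi <= lo) return 0;
    std::vector<bool> composite(hi - lo, false);  // bit array for this region
    for (std::uint64_t p : base) {
        if (p * p >= hi) break;
        std::uint64_t first = ((lo + p - 1) / p) * p;  // first multiple of p >= lo
        if (first < p * p) first = p * p;              // smaller multiples have smaller factors
        for (std::uint64_t q = first; q < hi; q += p) composite[q - lo] = true;
    }
    std::uint64_t count = 0;
    for (bool c : composite) if (!c) ++count;
    return count;
}

// Counts the primes below n, one independent region per loop iteration.
std::uint64_t countPrimesBelow(std::uint64_t n, std::uint64_t segmentSize) {
    if (n < 3 || segmentSize == 0) return 0;
    std::uint64_t root = 1;
    while ((root + 1) * (root + 1) <= n) ++root;  // floor(sqrt(n))
    const std::vector<std::uint64_t> base = basePrimes(root);
    const long long segments =
        static_cast<long long>((n + segmentSize - 1) / segmentSize);
    std::uint64_t total = 0;
    #pragma omp parallel for reduction(+ : total) schedule(dynamic)
    for (long long s = 0; s < segments; ++s) {
        const std::uint64_t lo = static_cast<std::uint64_t>(s) * segmentSize;
        const std::uint64_t hi = std::min(lo + segmentSize, n);
        total += countPrimesInSegment(lo, hi, base);
    }
    return total;
}
```

The OpenMP pragma is one way to hand regions to threads; the loop gives the same result when it runs serially, and a work queue over the same segments would fit the second suggestion above.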

Resources