How can I use multithreading to compute the intersection on a very large dataset?

I have a file composed of 4 million sets; every set contains 1 to n words. The size of the file is 120 MB.
set1 = {w11, w12,...,w1i}
set2 = {w21, w22,...,w2j}
...
setm = {wm1, wm2,...,wmk}
I want to compute the intersection of every set with all the sets:
set1 ∩ {set1,...,setm}
set2 ∩ {set1,...,setm}
...
setm ∩ {set1,...,setm}
Every operation takes around 1.2 seconds, so I did the following:
Divide the 4 million sets into 6 chunks, each chunk containing 666,666 sets.
Then I do the following. Here I create 36 threads and compute the intersections between the chunks. It is too slow, and I have overcomplicated the problem.
vector<thread> threads;
for (size_t i = 0; i < chunk.size(); i++)
{
    for (size_t j = 0; j < chunk.size(); j++)
    {
        threads.push_back(thread(&Transform::call_intersection, this,
                                 ref(chunk[i]), ref(tmp[j]), ref(results)));
    }
}
for (auto &t : threads) { t.join(); }
Do you have an idea of how to divide the problem into sub-problems and then join them all together at the end? And is there any good way to do this on Linux?
Sample
The first column represents the ID of the set and the rest of the columns represents the words.
m.06fl3b|hadji|barbarella catton|haji catton|haji cat|haji
m.06flgy|estadio neza 86
m.06fm8g|emd gp39dc
m.0md41|pavees|barbarella catton
m.06fmg|round
m.01012g|hadji|fannin county windom town|windom
m.0101b|affray
Example
m.06fl3b has an intersection with m.01012g and m.0md41. The output file will be as follows:
m.06fl3b m.01012g m.0md41
m.06flgy
m.06fm8g
m.0md41 m.06fl3b
m.06fmg
m.01012g m.06fl3b
m.0101b

Set intersection is associative and therefore amenable to parallel folding (which is one of many use cases of MapReduce). For each pair of sets ((1, 2), (3, 4), ...), you can compute the intersection of each pair, and put the results into a new collection of sets, which will have half the size. Repeat until you're left with only one set. The total number of intersection operations will be equal to the number of sets minus one.
Launching millions of threads is going to bog down your machine, however, so you will probably want to use a thread pool: Make a number of threads that is close to the amount of CPU cores you have available, and create a list of tasks, where each task is two sets that are to be intersected. Each thread repeatedly checks the task list and grabs the first available task (make sure that you access the task list in a thread-safe manner).
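As a rough illustration of both points (pairwise folding, with a bounded number of worker threads instead of one thread per pair), here is a minimal C++11 sketch. The StringSet alias, the use of std::set_intersection, and the worker count are assumptions for the example, not the asker's actual types.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <thread>
#include <vector>

using StringSet = std::set<std::string>;

// Intersect the sets pairwise with a bounded number of threads, repeating
// until a single set remains (a simple parallel fold).
StringSet parallel_intersection(std::vector<StringSet> sets)
{
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    while (sets.size() > 1)
    {
        const std::size_t pairs = sets.size() / 2;
        std::vector<StringSet> next(pairs + sets.size() % 2);

        // Each worker handles a strided subset of the pairs.
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
        {
            pool.emplace_back([&, w] {
                for (std::size_t p = w; p < pairs; p += workers)
                {
                    std::set_intersection(sets[2 * p].begin(), sets[2 * p].end(),
                                          sets[2 * p + 1].begin(), sets[2 * p + 1].end(),
                                          std::inserter(next[p], next[p].begin()));
                }
            });
        }
        for (auto &t : pool) t.join();

        // An odd leftover set is carried over unchanged to the next round.
        if (sets.size() % 2 == 1)
            next.back() = std::move(sets.back());

        sets = std::move(next);
    }
    return sets.empty() ? StringSet{} : std::move(sets.front());
}

Each round halves the number of sets, so the total work is still one intersection per input set (minus one), but only a fixed number of threads ever exists at once.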

Related

Stacking and dynamic programming

Basically I'm trying to solve this problem :
Given N unit cube blocks, find the smallest number of piles to make in order to use all the blocks. A pile is either a cube or a pyramid. For example, two valid piles are the cube 4*4*4 = 64, using 64 blocks, and the pyramid 1²+2²+3²+4² = 30, using 30 blocks.
However, I can't find the right angle from which to approach it. It feels similar to the knapsack problem, but I couldn't find an implementation that fits.
Any help would be much appreciated !
First I will give a recurrence relation which permits solving the problem recursively. Given N, let
CUBE-NUMS
PYRAMID-NUMS
be the subsets of cube numbers and square pyramidal numbers (the two permitted pile sizes from the question) in {1,...,N}, respectively. Let PERMITTED_SIZES be their union. Note that, as 1 occurs in PERMITTED_SIZES, any instance is feasible and yields a nonnegative optimum.
The following function in pseudocode solves the problem in the question recursively.
int MinimumNumberOfPiles(int N)
{
    if (N == 0) return 0;
    int Result = 1 + min { MinimumNumberOfPiles(N-i) :
                           i in PERMITTED_SIZES and i not larger than N };
    return Result;
}
The idea is to choose a permitted pile size, remove that many blocks (which makes the problem instance smaller) and solve recursively for the smaller instance. To use dynamic programming in order to circumvent multiple evaluation of the same subproblem, one would use a one-dimensional state space, namely an array A[0..N] where A[i] is the minimum number of piles needed for i unit blocks. Using this state space, the problem can be solved iteratively as follows.
for (int i = 0; i <= N; i++)
{
    if i is 0, set A[i] to 0;
    else if i occurs in PERMITTED_SIZES, set A[i] to 1;
    otherwise, set A[i] to positive infinity;
}
This initializes the states which are known beforehand and correspond to the base cases in the above recursion. Next, the missing states are filled using the following loop.
for (int i = 0; i <= N; i++)
{
    if (A[i] is positive infinity)
    {
        A[i] = 1 + min { A[i-j] : j is in PERMITTED_SIZES and j is smaller than i }
    }
}
The desired optimal value will be found in A[N]. Note that this algorithm only calculates the minimum number of piles, but not the piles themselves; if a suitable partition is needed, it has to be found either by backtracking or by maintaining additional auxiliary data structures.
In total, provided that PERMITTED_SIZES is known, the problem can be solved in O(N^2) steps, as PERMITTED_SIZES contains at most N values.
The problem can be seen as an adaptation of the Rod Cutting Problem in which only the permitted sizes are allowed as piece lengths, every piece has value 1, and the objective is to minimize (rather than maximize) the total value.
In addition, some computation is needed to generate PERMITTED_SIZES from the input.
More precisely, the corresponding choice of piles, once A is filled, can be generated using backtracking as follows.
int i = N; // i is the total amount still to be distributed
while ( i > 0 )
{
    // find out how the value in A[i] was generated
    choose j such that
        j is in PERMITTED_SIZES and j is not larger than i
        and
        A[i] == 1 + A[i-j];
    output "Take a pile of size " + j; // or just output j, which is the pile size
    set i = i-j; // decrease the amount still to be distributed
}
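For concreteness, here is a compact C++ sketch of the whole procedure (generation of PERMITTED_SIZES, table fill, and backtracking). The function name and signature are made up for illustration, and the permitted sizes are taken to be the cube numbers and square pyramidal numbers from the question's examples.

#include <algorithm>
#include <limits>
#include <vector>

// Minimum number of piles for N unit blocks; the chosen pile sizes are
// appended to piles_out. A pile is either a cube (k^3 blocks) or a
// pyramid (1^2 + 2^2 + ... + k^2 blocks).
int minimumNumberOfPiles(int N, std::vector<int> &piles_out)
{
    const int INF = std::numeric_limits<int>::max();

    // PERMITTED_SIZES: cube numbers and square pyramidal numbers <= N.
    std::vector<int> permitted;
    for (int k = 1; k * k * k <= N; ++k)
        permitted.push_back(k * k * k);
    for (int k = 1, s = 0; s + k * k <= N; ++k)
    {
        s += k * k;
        permitted.push_back(s);
    }

    // A[i] = minimum number of piles for i blocks.
    std::vector<int> A(N + 1, INF);
    A[0] = 0;
    for (int i = 1; i <= N; ++i)
        for (int j : permitted)
            if (j <= i && A[i - j] != INF)
                A[i] = std::min(A[i], 1 + A[i - j]);

    // Backtrack to recover one optimal choice of piles.
    for (int i = N; i > 0; )
    {
        for (int j : permitted)
        {
            if (j <= i && A[i - j] != INF && A[i] == 1 + A[i - j])
            {
                piles_out.push_back(j);
                i -= j;
                break;
            }
        }
    }
    return A[N];
}

Since 1 is always a permitted size, A[N] is guaranteed to be finite and the backtracking loop always finds a matching j.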

Parallel random number sequence independent of number of threads

There have been a large number of parallel RNG questions here, but I couldn't find one which addresses my variant.
I'm writing a function that given a seed fills a long array with random numbers based on that seed. I currently do that serially, but I find that the RNG is taking a significant amount of the running time of my program. I therefore want to speed up my function by using multiple threads. But I want this to be transparent to the user. That is, given a seed, one should get the same random number sequence out independent of the number of threads the function uses internally.
My current idea for doing this is to divide the array into chunks (independently of the number of threads), and generate a new RNG for every chunk, for example by seeding each RNG with the seed+chunk_id. The chunks could then be processed independently, and it would not matter which thread processes which chunk. But I am worried that this might reduce the quality of the RNG. Is this a safe way of doing this for a high-quality RNG like mersenne twister?
To illustrate, here is some pseudo-code for the process:
function random(array, seed, blocksize=100000)
    for each block i of size blocksize in array
        rng[i] = create_rng(seed + i)
    parallel for each block i in array
        for each sample in block i
            array[sample] = call_rng(rng[i])
This should produce the same values for each (seed,blocksize) combination. But is this the best way of doing this?
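For illustration, here is a minimal C++ sketch of that scheme using one std::mt19937 per block, seeded with seed + block index. The element type, block size, and thread count are arbitrary choices for the example, not part of the question.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Fill 'data' with random numbers reproducibly: the output depends only on
// 'seed' and 'blocksize', not on how many threads are used.
void fill_random(std::vector<std::uint32_t> &data,
                 std::uint32_t seed,
                 std::size_t blocksize = 100000)
{
    const std::size_t nblocks = (data.size() + blocksize - 1) / blocksize;
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t)
    {
        threads.emplace_back([&, t] {
            // Each thread handles a strided subset of the blocks.
            for (std::size_t b = t; b < nblocks; b += nthreads)
            {
                std::mt19937 rng(seed + static_cast<std::uint32_t>(b));
                const std::size_t begin = b * blocksize;
                const std::size_t end = std::min(begin + blocksize, data.size());
                for (std::size_t i = begin; i < end; ++i)
                    data[i] = rng();
            }
        });
    }
    for (auto &th : threads) th.join();
}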
I tested the effective RNG quality of this approach using the TestU01 random number generator test suite by constructing a custom RNG which is reseeded with a new sequential seed every 0x1000 steps:
#include <stdlib.h>
#include "ulcg.h"
#include "ugfsr.h"     /* ugfsr_CreateMT19937_02, ugfsr_DeleteGen */
#include "unif01.h"
#include "bbattery.h"

long long i = 1, j = 0;
unif01_Gen *gen;

/* Wraps a Mersenne Twister that is reseeded with the next sequential
   seed every 0x1000 calls, mimicking the per-chunk reseeding scheme. */
unsigned long myrand()
{
    if ((++i & 0xfff) == 0)   /* parentheses needed: == binds tighter than & */
    {
        ugfsr_DeleteGen(gen);
        gen = ugfsr_CreateMT19937_02(++j, NULL, 0);
    }
    return gen->GetBits(gen->param, gen->state);
}

int main()
{
    unif01_Gen *gen2 = unif01_CreateExternGenBitsL("foo", myrand);
    gen = ugfsr_CreateMT19937_02(1, NULL, 0);
    bbattery_Crush(gen2);
    return 0;
}
Result (after waiting 40 minutes for the tests to complete):
 Test                          p-value
----------------------------------------------
 71  LinearComp, r = 0         1 - eps1
 72  LinearComp, r = 29        1 - eps1
----------------------------------------------
All other tests were passed
These are the same tests Mersenne Twister fails even when used normally, when not reseeding. So the TestU01 Crush test could not distinguish the sequential seeding scenario from normal usage.
I also tested the approach of reseeding with the output from another Mersenne Twister instead of using sequential integers. The result was exactly the same.
While I did not try the most time-consuming "BigCrush" test (which takes 8 hours), I think it's safe to say that the quality of MT is not significantly impaired by generating sub-RNGs with sequential seeds, as suggested in the question.

Parallel For to sum an array of ushorts (big array 18M)

I would like to use parallel processing for taking array statistics for large arrays of unsigned short (16 bit) values.
ushort[] array = new ushort[2560 * 3072]; // x = rows(2560) y = columns(3072)
double avg = Parallel.For (0, array.Length, WHAT GOES HERE);
The same goes for the standard deviation and the standard deviation of the row means.
I have normal for-loop versions of these functions, and they take too long when combined with the median filter methods.
The end goal is a median filter for the array, but the first steps are important as well. So if you have the whole solution, great, but help with just the first parts is appreciated too.
Have you tried PLINQ?
double average = array.AsParallel().Average(n => n);
I'm not sure how performant it will be with a large array of ushort values, but it's worth testing to see if it meets your needs.

What is an efficient way to compute the Dice coefficient between 900,000 strings?

I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can avoid computing the Dice coefficient of both (A, B) and (B, A), but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
    Set<String> nx = new HashSet<String>();
    Set<String> ny = new HashSet<String>();
    for (int i = 0; i < s1.length() - 1; i++) {
        char x1 = s1.charAt(i);
        char x2 = s1.charAt(i + 1);
        String tmp = "" + x1 + x2;
        nx.add(tmp);
    }
    for (int j = 0; j < s2.length() - 1; j++) {
        char y1 = s2.charAt(j);
        char y2 = s2.charAt(j + 1);
        String tmp = "" + y1 + y2;
        ny.add(tmp);
    }
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();
    return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
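To make the two passes concrete, here is a single-threaded sketch (written in C++ rather than Java, purely for illustration); the container choices and the packing of an index pair into a 64-bit key are assumptions of the example.

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Pass 1: map each bigram to the indexes of the strings that contain it.
// Pass 2: count common bigrams for every pair that shares at least one bigram.
std::unordered_map<std::uint64_t, int>
commonBigramCounts(const std::vector<std::string> &strings)
{
    std::unordered_map<std::string, std::vector<int>> bigramToIds;
    for (int id = 0; id < (int)strings.size(); ++id)
    {
        const std::string &s = strings[id];
        std::unordered_set<std::string> bigrams;
        for (std::size_t i = 0; i + 1 < s.size(); ++i)
            bigrams.insert(s.substr(i, 2));
        for (const std::string &bg : bigrams)
            bigramToIds[bg].push_back(id);
    }

    // The key encodes the pair (i, j) with i < j, so [1,2] and [2,1] are
    // never stored redundantly.
    std::unordered_map<std::uint64_t, int> pairCounts;
    for (const auto &entry : bigramToIds)
    {
        const std::vector<int> &ids = entry.second;
        for (std::size_t a = 0; a < ids.size(); ++a)
            for (std::size_t b = a + 1; b < ids.size(); ++b)
                ++pairCounts[(std::uint64_t)ids[a] << 32 | (std::uint32_t)ids[b]];
    }
    return pairCounts;
}

Dice's coefficient for a pair (i, j) is then 2 * pairCounts[(i,j)] divided by the sum of the two strings' bigram-set sizes. Keep in mind the note that follows: with strings this long, most posting lists will contain nearly every index, so this approach only pays off when the bigram vocabulary is reasonably large.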
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
You should come up with some kind of inequality like: if D(X1,X2) > 1-p and D(X1,X3) < 1-q, then D(X2,X3) < 1-q+p, or something like that. Now, if 1-q+p < 0.9, then you probably don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that this might be right (but I do not have enough time to actually do the derivations now). Look for some of the inequalities with other similarity measures and see if any of them are valid for Dice co-efficient.
=== Also ===
If there are a elements in set A, and if your threshold is r (= 0.9), then set B should have a number of elements b such that r*a/(2-r) <= b <= (2-r)*a/r. This should eliminate the need for lots of comparisons, IMHO. You can probably sort the strings according to length and use the window described above to limit comparisons.
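For reference, that window follows directly from the definition of the coefficient: D(A,B) = 2|A ∩ B| / (a + b) and |A ∩ B| <= min(a, b), so D(A,B) <= 2*min(a, b) / (a + b). If D(A,B) >= r and b <= a, then 2b >= r(a + b), i.e. b >= r*a/(2-r); the symmetric case a <= b gives b <= (2-r)*a/r.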
Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just do it once for each string and keep the result around.
2) Since you only care about whether a particular bigram is present in the string, you could get away with a BitSet with a bit for each possible bigram, rather than a full HashSet. Coefficient calculation would then be simplified to ANDing two bit sets and counting the number of set bits in the result (see the sketch after this list).
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?) - or monotonous strings with only a handful of bigrams each - a sorted array of bigrams might provide faster, more space-efficient comparisons.
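A sketch of the bit-set variant (again in C++ for illustration): one bit per possible byte-pair bigram, so the coefficient reduces to an AND plus two population counts. The 65536-bit width is an assumption; for plain lowercase text, 26*26 = 676 bits would suffice.

#include <bitset>
#include <cstddef>
#include <string>

// One bit per possible byte-pair bigram (256 * 256 = 65536).
using BigramSet = std::bitset<65536>;

BigramSet bigrams(const std::string &s)
{
    BigramSet bits;
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
        bits.set((unsigned char)s[i] << 8 | (unsigned char)s[i + 1]);
    return bits;
}

// Dice coefficient from two precomputed bigram sets: AND, then popcount.
double dice(const BigramSet &a, const BigramSet &b)
{
    const std::size_t common = (a & b).count();
    return 2.0 * common / (a.count() + b.count());
}

Per point 1 above, each BigramSet would be built once per string and kept around; at 8 KB per 65536-bit set, narrowing the bigram alphabet matters when you have 900,000 strings.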
Is their character set limited somehow? If it is, you can compute character counts by character code in each string and compare these numbers. Such a pre-computation will occupy 2*900K*S bytes of memory (assuming no character occurs more than 65K times in the same string), where S is the number of distinct characters. After that, computing the coefficient would take O(S) time. Sure, this would only be helpful if S < 4500.

How can I determine if this write access is coalesced?

How can I determine if the following memory access is coalesced or not:
// Thread ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;

// Grid-stride offset
int offset = gridDim.x * blockDim.x;

while ( idx < NUMELEMENTS )
{
    // Do something
    // ....

    // Write the result of the calculation to the output array
    results[ idx ] = df2;

    // Next element
    idx += offset;
}
NUMELEMENTS is the total number of individual data elements to process. The array results is passed as a pointer to the kernel function and was allocated beforehand in global memory.
My Question: Is the write access in the line results[ idx ] = df2; coalesced?
I believe it is, since each thread processes consecutively indexed items, but I'm not completely sure about it, and I don't know how to tell.
Thanks!
It depends on whether the length of the lines of your matrix is a multiple of half the warp size (for devices of compute capability 1.x) or of the full warp size (for devices of compute capability 2.x). If it is not, you can use padding to make the accesses fully coalesced. The function cudaMallocPitch can be used for this purpose.
edit:
Sorry for the confusion. You write 'offset' elements at a time, which I interpreted as the lines of a matrix.
What I mean is: after each iteration of your loop you increase idx by offset. If offset is a multiple of half the warp size (compute capability 1.x) or of the warp size (compute capability 2.x), then the writes are coalesced; if not, you need padding to make them so.
Probably it is already coalesced, because you should choose the number of threads per block, and thus blockDim, as a multiple of the warp size.
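As a small illustrative sketch (the kernel name and block size are placeholders, not from the question): with a block size that is a multiple of the warp size, the 32 threads of each warp write 32 consecutive elements of results in every iteration of the grid-stride loop, which is the coalesced pattern described above.

const int threadsPerBlock = 256;  // a multiple of the warp size (32)
const int numBlocks = (NUMELEMENTS + threadsPerBlock - 1) / threadsPerBlock;
// Hypothetical kernel containing the grid-stride loop from the question.
myKernel<<<numBlocks, threadsPerBlock>>>(results, NUMELEMENTS);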

Resources