Multiple Threads processing a Lists data - multithreading

I have a list of 300,000 + items.
What i am currently doing with the list is validating the Address and writing both the original address and corrected address to a specified file.
What i'd like to do is split the list evenly among a given number of threads and do processes on them concurrently.
Can anyone help me with an example on how i can go about doing something like this?
Thanks

If you're working in 2.0 and the list is only being used in a read only fashion (not being changed while this processing is occuring) then you can simply divide up the indexes. For instance ...
public void Process(List<Item> list, int threadCount) {
int perThread = list.Count < threadCount ? list.Count : list.Count / threadCount;
int index = 0;
while ( index < list.Count ) {
int start = index;
int count = Math.Min(perThread,list.Count-start);
WaitCallBack del = delegate(object state) { ProcessCore(list, start, count); };
ThreadPool.QueueUserWorkItem(del);
index += count;
}
}
private void ProcessCore(List<Item> list, int startIndex, int count) {
// Do work here
}

Conceptually, it's fairly simple, given a couple of assumptions:
You're not changing the list during
processing.
You know the number of threads ahead of time.
Basically, your algorithm is this:
Split the list into mostly-even sections based on the number of threads.
Give each thread the start and end index of its section, and the output file.
In each thread:
a. Process an item
b. Lock access to the output file
c. Write the original and corrected address
d. Unlock access to the output file

Related

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!

C# - Matrix Printing Incorrectly on a Separate Thread

I'm currently making a Console Application game, and am working on the multiplayer version.
When I am receiving packets, I create a new thread to handle the packets, in order to return to the ReceiveFrom command as soon as possible.
The newly created thread should reprint the whole Matrix, using the new information it got regarding the changes in the matrix (it should print the Matrix with updated player positions).
The problem is, when the Print method is called on the new thread, it is badly performed. It prints the matrix very inaccurately, and many characters of the matrix are just in a mess on the Console Screen.
Here are the methods:
void Receive()
{
while (true)
{
byte[] msg = new byte[1024];
client.Receive(msg);
Thread handle = new Thread(() => HandleInput(msg));
handle.Start();
}
}
void HandleInput(byte[] msgX)
{
string data = Encoding.ASCII.GetString(msgX);
data = data.Replace("\0", "");
if (data.Contains('*') || data.Contains('\0') || data.Contains(' ')) // Move Packet (moving objects in a matrice)
{
// for example, a move packet can be: '*'!12!10
char ToMove = char.Parse(data.Split('!')[0]);
int X = int.Parse(data.Split('!')[1]);
int Y = int.Parse(data.Split('!')[2]);
grid[X, Y] = ToMove;
Print();
}
}
void Print() // print a 20x50 matrix
{
Console.Clear();
for (int y = 0; y < 52; y++)
{
Console.Write('-');
}
Console.WriteLine();
for (int i = 0; i < 20; i++)
{
Console.Write('|');
for (int j = 0; j < 50; j++)
{
Console.Write(grid[i,j]);
}
Console.WriteLine('|');
}
for (int x = 0; x < 52; x++)
{
Console.Write('-');
}
}
But, when I tried to treat the input and print the matrix on the same thread of the Receive method, it worked fine. The problem printing it only came when I printed the matrix on a separate thread.
So why does this happen? Why can't a separate thread just print the matrix correctly?
You create a new thread for every received kilobyte, but those threads are not guaranteed to start running in sequence, to wait one another, nor to finish running in sequence.
There is also no guarantee that network client will receive all needed bytes in one call of Receive method. It may happen in one, but it may happen in several calls. It may also happen that two Send methods from the other side of the connection are merged into one message. All that is guaranteed in network communication is the ordering of the bytes in the stream, and nothing else.
Your next question will probably be "so how do I make it work"? I can't do the job for you, but this is in general how I would do it:
Read chunk of bytes from the stream and store it in some buffer.
Check in the loop for existence of a complete message in the buffer. If there is such a message remove that message from the buffer, and process the message. I would process it in the same thread, and only if there are problems with that approach I would consider additional threads (or rather just one additional thread to whom I would be sending all the messages).
Repeat the loop until all complete messages are removed from the buffer, and continue reading the stream.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

Longest Subsequence with all occurrences of a character at 1 place

In a sequence S of n characters; each character may occur many times in the sequence. You want to find the longest subsequence of S where all occurrences of the same character are together in one place;
For ex. if S = aaaccaaaccbccbbbab, then the longest such subsequence(answer) is aaaaaaccccbbbb i.e= aaa__aaacc_ccbbb_b.
In other words, any alphabet character that appears in S may only appear in one contiguous block in the subsequence. If possible, give a polynomial time
algorithm to determine the solution.
Design
Below I give a C++ implementation of a dynamic programming algorithm that solves this problem. An upper bound on the running time (which is probably not tight) is given by O(g*(n^2 + log(g))), where n is the length of the string and g is the number of distinct subsequences in the input. I don't know a good way to characterise this number, but it can be as bad as O(2^n) for a string consisting of n distinct characters, making this algorithm exponential-time in the worst case. It also uses O(ng) space to hold the DP memoisation table. (A subsequence, unlike a substring, may consist of noncontiguous character from the original string.) In practice, the algorithm will be fast whenever the number of distinct characters is small.
The two key ideas used in coming up with this algorithm were:
Every subsequence of a length-n string is either (a) the empty string or (b) a subsequence whose first element is at some position 1 <= i <= n and which is followed by another subsequence on the suffix beginning at position i+1.
If we append characters (or more specifically character positions) one at a time to a subsequence, then in order to build all and only the subsequences that satisfy the validity criteria, whenever we add a character c, if the previous character added, p, was different from c, then it is no longer possible to add any p characters later on.
There are at least 2 ways to manage the second point above. One way is to maintain a set of disallowed characters (e.g. using a 256-bit array), which we add to as we add characters to the current subsequence. Every time we want to add a character to the current subsequence, we first check whether it is allowed.
Another way is to realise that whenever we have to disallow a character from appearing later in the subsequence, we can achieve this by simply deleting all copies of the character from the remaining suffix, and using this (probably shorter) string as the subproblem to solve recursively. This strategy has the advantage of making it more likely that the solver function will be called multiple times with the same string argument, which means more computation can be avoided when the recursion is converted to DP. This is how the code below works.
The recursive function ought to take 2 parameters: the string to work on, and the character most recently appended to the subsequence that the function's output will be appended to. The second parameter must be allowed to take on a special value to indicate that no characters have been appended yet (which happens in the top-level recursive case). One way to accomplish this would be to choose a character that does not appear in the input string, but this introduces a requirement not to use that character. The obvious workaround is to pass a 3rd parameter, a boolean indicating whether or not any characters have already been added. But it's slightly more convenient to use just 2 parameters: a boolean indicating whether any characters have been added yet, and a string. If the boolean is false, then the string is simply the string to be worked on. If it is true, then the first character of the string is taken to be the last character added, and the rest is the string to be worked on. Adopting this approach means the function takes only 2 parameters, which simplifies memoisation.
As I said at the top, this algorithm is exponential-time in the worst case. I can't think of a way to completely avoid this, but some optimisations can help certain cases. One that I've implemented is to always add maximal contiguous blocks of the same character in a single step, since if you add at least one character from such a block, it can never be optimal to add fewer than the entire block. Other branch-and-bound-style optimisations are possible, such as keeping track of a globally best string so far and cutting short the recursion whenever we can be certain that the current subproblem cannot produce a longer one -- e.g. when the number of characters added to the subsequence so far, plus the total number of characters remaining, is less than the length of the best subsequence so far.
Code
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <map>
using namespace std;
class RunFinder {
string s;
map<string, string> memo[2]; // DP matrix
// If skip == false, compute the longest valid subsequence of t.
// Otherwise, compute the longest valid subsequence of the string
// consisting of t without its first character, taking that first character
// to be the last character of a preceding subsequence that we will be
// adding to.
string calc(string const& t, bool skip) {
map<string, string>::iterator m(memo[skip].find(t));
// Only calculate if we haven't already solved this case.
if (m == memo[skip].end()) {
// Try the empty subsequence. This is always valid.
string best;
// Try starting a subsequence whose leftmost position is one of
// the remaining characters. Instead of trying each character
// position separately, consider only contiguous blocks of identical
// characters, since if we choose one character from this block there
// is never any harm in choosing all of them.
for (string::const_iterator i = t.begin() + skip; i != t.end();) {
if (t.end() - i < best.size()) {
// We can't possibly find a longer string now.
break;
}
string::const_iterator next = find_if(i + 1, t.end(), bind1st(not_equal_to<char>(), *i));
// Just use next - 1 to cheaply give us an extra char at the start; this is safe
string u(next - 1, t.end());
u[0] = *i; // Record the previous char for the recursive call
if (skip && *i != t[0]) {
// We have added a new segment that is different from the
// previous segment. This means we can no longer use the
// character from the previous segment.
u.erase(remove(u.begin() + 1, u.end(), t[0]), u.end());
}
string v(i, next);
v += calc(u, true);
if (v.size() > best.size()) {
best = v;
}
i = next;
}
m = memo[skip].insert(make_pair(t, best)).first;
}
return (*m).second;
}
public:
RunFinder(string s) : s(s) {}
string calc() {
return calc(s, false);
}
};
int main(int argc, char **argv) {
RunFinder rf(argv[1]);
cout << rf.calc() << '\n';
return 0;
}
Example results
C:\runfinder>stopwatch runfinder aaaccaaaccbccbbbab
aaaaaaccccbbbb
stopwatch: Terminated. Elapsed time: 0ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch runfinder abbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,mnnsdbbbf
stopwatch: Terminated. Elapsed time: 609ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch -v runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop
stopwatch: Command to be run: <runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop>.
stopwatch: Global memory situation before commencing: Used 2055507968 (49%) of 4128813056 virtual bytes, 1722564608 (80%) of 2145353728 physical bytes.
stopwatch: Process start time: 21/11/2012 02:53:14
abcdefghijklmnopqrstuvwxyz123456
stopwatch: Terminated. Elapsed time: 8062ms, CPU time: 7437ms, User time: 7328ms, Kernel time: 109ms, CPU usage: 92.25%, Page faults: 35473 (+35473), Peak working set size: 145440768, Peak VM usage: 145010688, Quota peak paged pool usage: 11596, Quota peak non paged pool usage: 1256
stopwatch: Process completed with exit code 0.
stopwatch: Process completion time: 21/11/2012 02:53:22
The last run, which took 8s and used 145Mb, shows how it can have problems with strings containing many distinct characters.
EDIT: Added in another optimisation: we now exit the loop that looks for the place to start the subsequence if we can prove that it cannot possibly be better than the best one discovered so far. This drops the time needed for the last example from 32s down to 8s!
EDIT: This solution is wrong for OP's problem. I'm not deleting it because it might be right for someone else. :)
Consider a related problem: find the longest subsequence of S of consecutive occurrences of a given character. This can be solved in linear time:
char c = . . .; // the given character
int start = -1;
int bestStart = -1;
int bestLength = 0;
int currentLength = 0;
for (int i = 0; i < S.length; ++i) {
if (S.charAt(i) == c) {
if (start == -1) {
start = i;
}
++currentLength;
} else {
if (currentLength > bestLength) {
bestStart = start;
bestLength = currentLength;
}
start = -1;
currentLength = 0;
}
}
if (bestStart >= 0) {
// longest sequence of c starts at bestStart
} else {
// character c does not occur in S
}
If the number of distinct characters (call it m) is reasonably small, just apply this algorithm in parallel to each character. This can be easily done by converting start, bestStart, currentLength, bestLength to arrays m long. At the end, scan the bestLength array for the index of the largest entry and use the corresponding entry in the bestStart array as your answer. The total complexity is O(mn).
import java.util.*;
public class LongestSubsequence {
/**
* #param args
*/
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String str = sc.next();
execute(str);
}
static void execute(String str) {
int[] hash = new int[256];
String ans = "";
for (int i = 0; i < str.length(); i++) {
char temp = str.charAt(i);
hash[temp]++;
}
for (int i = 0; i < hash.length; i++) {
if (hash[i] != 0) {
for (int j = 0; j < hash[i]; j++)
ans += (char) i;
}
}
System.out.println(ans);
}
}
Space: 256 -> O(256), I don't if it's correct to say this way..., cause O(256) I think is O(1)
Time: O(n)

How to read large number of files using multiple threads,help me please!

in my application there is a small part of function,in which it will read files to get some information,the number of filecount would be utleast 50,So I thought of implementing threading.Say if the user is giving 50 files,I wanted to separate it as 5 *10, 5 thread should be created,so that each thread can handle 10 files which can speed up the process.And also from the below code you can see that some variables are common.I read some articles about threading and I am aware that only one thread should access a variable/contorl at a me(CCriticalStiuation can be used for that).For me as a beginner,I am finding hard to imlplement what I have learned about threading.Somebody please give me some idea with code shown below..thanks in advance
file read function://
void CMyClass::GetWorkFilesInfo(CStringArray& dataFilesArray,CString* dataFilesB,
int* check,DWORD noOfFiles,LPWSTR path)
{
CString cFilePath;
int cIndex =0;
int exceptionInd = 0;
wchar_t** filesForWork = new wchar_t*[noOfFiles];
int tempCheck;
int localIndex =0;
for(int index = 0;index < noOfFiles; index++)
{
tempCheck = *(check + index);
if(tempCheck == NOCHECKBOX)
{
*(filesForWork+cIndex) = new TCHAR[MAX_PATH];
wcscpy(*(filesForWork+cIndex),*(dataFilesB +index));
cIndex++;
}
else//CHECKED or UNCHECKED
{
dataFilesArray.Add(*(dataFilesB+index));
*(check + localIndex) = *(check + index);
localIndex++;
}
}
WorkFiles(&cFilePath,dataFilesArray,filesForWork,
path,
cIndex);
dataFilesArray.Add(cFilePath);
*(check + localIndex) = CHECKED;
}
I think you would be better off having just one thread to read all the files. The context switch between the threads together with the synchronization issues is really not worth the price in your example. The hard drive is one resource so imagine all five threads taking turns moving the hard drive read heads to various positions on the hard drive == not very effective.

Resources