In the multithreaded versions of merge sort that I've seen, the multithreading is typically done during the recursion on the left and right subarrays (i.e., each thread is assigned its own subarray to work on), and the merge operation is done by the master thread after each thread completes its work.
I am wondering if there's a nice way to multithread the final merge operation, where you're merging two sorted subarrays? If so, how can this be done?
Actually, there is a way to split the merging task between two concurrent threads:
once both subarrays are sorted,
assign one thread the task of merging elements from the beginning of the sorted subarrays into the first half of the target array, and
assign the other thread the complementary task: merging from the end of the sorted subarrays into the second half of the target array, starting from the end.
You must write these merging functions carefully so the sort stays stable. Each thread will write only its half of the target array, potentially reading the same elements from the sorted subarrays but selecting different ones.
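For illustration, here is a minimal sketch of this two-ended merge in Java, assuming plain int arrays (all class and method names are mine). The forward pass takes ties from the left run and the backward pass takes ties from the right run, which is what keeps the two halves complementary and the sort stable:

public class TwoEndedMerge {

    // Merge from the fronts of left/right into target[0 .. count-1].
    static void mergeFromFront(int[] left, int[] right, int[] target, int count) {
        int i = 0, j = 0;
        for (int k = 0; k < count; k++) {
            // <= keeps the merge stable: ties are taken from the left run
            if (j >= right.length || (i < left.length && left[i] <= right[j])) {
                target[k] = left[i++];
            } else {
                target[k] = right[j++];
            }
        }
    }

    // Merge from the backs of left/right into the last count slots of target.
    static void mergeFromBack(int[] left, int[] right, int[] target, int count) {
        int i = left.length - 1, j = right.length - 1;
        for (int k = target.length - 1; k >= target.length - count; k--) {
            // >= mirrors the <= above: ties are taken from the right run,
            // so both threads agree on which elements belong to which half
            if (i < 0 || (j >= 0 && right[j] >= left[i])) {
                target[k] = right[j--];
            } else {
                target[k] = left[i--];
            }
        }
    }

    public static void merge(int[] left, int[] right, int[] target) throws InterruptedException {
        int half = target.length / 2;
        Thread front = new Thread(() -> mergeFromFront(left, right, target, half));
        Thread back = new Thread(() -> mergeFromBack(left, right, target, target.length - half));
        front.start();
        back.start();
        front.join();
        back.join();
    }
}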
I have not seen this approach mentioned in the literature about multithreaded merge sort. I wonder if it performs better than classic implementations.
I have a list of several thousand items. Each item has an attribute called "address range". I have a function that verifies the correctness of the items in the list by making sure that none of their address ranges overlap with the address ranges of any other item in the list (each item has precisely one address range). If N is the number of entries in the list, I essentially have to run N*(N-1)/2 address range overlap checks. In other words, if the number of items in the list doubles, the number of overlap checks roughly quadruples.
Months ago, such a list would only have a few thousand items, and the whole operation would finish relatively quickly, but over time the number of items has grown, and now it takes several minutes to run all the cross-checks.
I've been trying to parallelize the cross-checks, but I have yet to think of a feasible approach. My problem is that if I want to distribute the cross-checks to perform over say 8 threads (to fully exploit the CPUs on the computer), I would have to split the possible cross-check combinations into 8 independent chunks.
To use an example, say we have 5 items in our list: ( A, B, C, D, E ). Using the formula N*(N-1)/2, we can see that this requires 5*4/2 = 10 cross-checks:
A vs B
A vs C
A vs D
A vs E
B vs C
B vs D
B vs E
C vs D
C vs E
D vs E
The only way I can think of to distribute the cross-check combinations across a given number of threads is to first create a list of all cross-check pairs and then split that list into evenly sized chunks. That would work in principle, but even for just 20,000 items that list would already contain 20,000*19,999/2 = 199,990,000 entries!
So my question is: is there some super-sophisticated algorithm that would allow me to pass the entire list of items to each thread and have each individual thread figure out by itself which cross-checks it should run, so that no two threads repeat the same cross-checks?
I'm programming this in Perl, but really the problem is independent from any particular programming language.
EDIT: Hmmm, I'm now wondering if I've been going about this the wrong way altogether. If I could sort the items by their address ranges, I could just walk through the sorted list and check if any item overlaps with its successor item. I'll try that and see if that speeds things up.
UPDATE: Oh my God, this actually works!!! :D Using a pre-sorted list, the entire operation takes 0.7 seconds for 11,700 items, where my previous naive implementation would take 2 to 3 minutes!
UPDATE AFTER usr's comment: As usr has noted, just checking each item against its immediate successor is not enough. As I walk through the sorted list, I drag along an additional (initially empty) list in which I keep track of all items involved in the current overlap. Each time an item overlaps its successor, the successor is added to the list (if the list was previously empty, the current item itself is added as well). As soon as an item does NOT overlap its successor, I locally cross-check all items in the additional list against each other and then clear it (and I do the same for any items still in the additional list after I've finished walking the full list).
My unit tests seem to confirm that this algorithm works; at least with all the examples I've fed it so far.
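Since the problem is language-independent, here is a minimal sketch of the sort-and-sweep in Java (the Item type and its fields are my own stand-ins; records need Java 16+). One deliberate tweak: the sweep tracks the largest range end seen in the current overlap group rather than only the immediate predecessor's, so a long early range such as [0,10] followed by [1,2] and [5,6] is not missed:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical item with exactly one address range [start, end], inclusive.
record Item(String name, long start, long end) {
    boolean overlaps(Item other) {
        return start <= other.end && other.start <= end;
    }
}

public class OverlapChecker {

    static void check(List<Item> items) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingLong(Item::start));

        List<Item> cluster = new ArrayList<>();   // items involved in the current overlap group
        long maxEnd = Long.MIN_VALUE;             // largest end seen in that group
        for (Item item : sorted) {
            if (!cluster.isEmpty() && item.start() <= maxEnd) {
                cluster.add(item);                // still inside the current group
                maxEnd = Math.max(maxEnd, item.end());
            } else {
                crossCheck(cluster);              // group ended: cross-check it locally
                cluster.clear();
                cluster.add(item);
                maxEnd = item.end();
            }
        }
        crossCheck(cluster);                      // flush the final group
    }

    // Pairwise checks within one group only; groups stay tiny in practice.
    static void crossCheck(List<Item> cluster) {
        for (int i = 0; i < cluster.size(); i++)
            for (int j = i + 1; j < cluster.size(); j++)
                if (cluster.get(i).overlaps(cluster.get(j)))
                    System.out.println(cluster.get(i) + " overlaps " + cluster.get(j));
    }
}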
It seems like you could create N threads, where N = the number of cores on the computer, plus one shared work queue. Each of these threads is identical and consumes items from the queue until there are no more items. Each item is the comparison pair that the thread should work on. Since an item can only be consumed once, you will get no duplicate work.
On the producer side, simply send every valid combination to the queue (just the pairs of items); the threads do the work on each item. Thus there is no need to split the items into chunks.
It would be great if each thread could be pinned to a core, but whatever OS you're running on will most likely do a good enough job at scheduling that you won't need to worry about that.
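A minimal sketch of this in Java, reusing the hypothetical Item type from the sketch above; the queue is pre-filled, so a worker that polls null knows it is done:

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PairCheckRunner {

    public static void run(List<Item> items, int threads) throws InterruptedException {
        BlockingQueue<int[]> queue = new LinkedBlockingQueue<>();

        // Producer: enqueue every unordered pair (i, j) with i < j.
        for (int i = 0; i < items.size(); i++)
            for (int j = i + 1; j < items.size(); j++)
                queue.add(new int[] { i, j });

        // Identical consumers: each takes pairs until the queue is empty.
        Runnable worker = () -> {
            int[] pair;
            while ((pair = queue.poll()) != null) {
                Item a = items.get(pair[0]);
                Item b = items.get(pair[1]);
                if (a.overlaps(b))
                    System.out.println(a + " overlaps " + b);
            }
        };

        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) (pool[t] = new Thread(worker)).start();
        for (Thread t : pool) t.join();
    }
}

Note that this still materializes every pair in memory. A variant that avoids that is to have each worker atomically claim a row index i (say, from an AtomicInteger) and check item i against every item j > i; the queue then never needs to hold the pairs explicitly.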
I have a ConcurrentLinkedQueue and I want to split it into two halves and let two separate threads handle each half. I have tried using Spliterator, but I do not understand how to get the partitioned queues.
ConcurrentLinkedQueue<int[]> q = new ConcurrentLinkedQueue<>(); // contains a large number of elements
Spliterator<int[]> p1 = q.spliterator();
Spliterator<int[]> p2 = p1.trySplit();
// What I'd like to do, but no such method exists:
// p1.getQueue();
// p2.getQueue();
I want something like p1.getQueue() and p2.getQueue(), but I cannot find a way to do it.
Please let me know the correct way to do it.
You can't split it in half, in general: to split it in half the queue would need a well-defined size at each point in time. And while CLQ does have a size() method, its documentation is pretty clear that it requires O(n) traversal time, and because this is a concurrent queue its size might not be accurate at all (it is named concurrent for a reason, after all). The current Spliterator from CLQ splits it in batches, from what I can see.
If you want to split it in half logically and process the elements, then I would suggest moving to a BlockingQueue implementation that has a drainTo method; this way you could drain the elements to an ArrayList, for example, which will split much better (in half, then half again, and so on).
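A minimal sketch of that, assuming a LinkedBlockingQueue on the producing side and a hypothetical process method:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class QueueSplitter {

    public static void splitAndProcess(BlockingQueue<int[]> q) throws InterruptedException {
        List<int[]> snapshot = new ArrayList<>();
        q.drainTo(snapshot);                        // removes everything currently queued

        int mid = snapshot.size() / 2;              // an ArrayList splits cleanly in half
        List<int[]> first = snapshot.subList(0, mid);
        List<int[]> second = snapshot.subList(mid, snapshot.size());

        Thread t1 = new Thread(() -> first.forEach(QueueSplitter::process));
        Thread t2 = new Thread(() -> second.forEach(QueueSplitter::process));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }

    static void process(int[] element) { /* whatever the real work is */ }
}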
On a side note, why would you want to do the processing in different threads yourself? This seems very counter-intuitive: the Spliterator is designed to work with parallel streams. And calling trySplit once is probably not even enough - you have to call it until it returns null... Either way, doing these things on your own sounds like a very bad idea to me.
I was asked in an interview to reverse a singly linked list with as many as 7 million nodes, using threads efficiently. Recursion doesn't look feasible with so many nodes, so I opted for divide and conquer: each thread is given a chunk of the linked list, which it reverses by making each node point back to its previous node (keeping references to the previous, current, and next nodes), and the reversed chunks from the threads are later joined together. But the interviewer insisted that the size of the linked list is not known, and that it can be done efficiently without finding the size. Well, I couldn't figure it out; how would you go about it?
Such questions I like to implement "top-down":
Assume that you already have a class that implements Runnable or extends Thread, out of which you can create and run instances. Each instance receives two parameters: a reference to a node in the list and the number of nodes to reverse.
Your main thread traverses all 7 million nodes and "marks" the starting points for your threads; say we have 7 threads, then the marked points will be 1, 1,000,000, 2,000,000, ... Save the marked nodes in an array or whichever data structure you like.
After you have finished "marking" the starting points, create the threads and give each one of them its starting point and the counter 1,000,000.
After all the threads are done, "glue" each of the marked points to point back to the last node of the previous thread's chunk (these last nodes should be saved in another shared, ordered data structure).
Now that we have a plan, all that's left to do is implement a (considerably easy) algorithm that, given a number N and a node x, reverses the next N nodes (including x) in a singly linked list :)
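A sketch of that helper, assuming a minimal hand-rolled Node type (the names are mine). It returns the new head of the reversed chunk, which is exactly the "last node of the previous thread" that the next chunk's marked starting point gets glued to:

class Node {
    int value;
    Node next;
}

class Reverser {

    // Reverses the n nodes starting at x (inclusive) in place. Afterwards x
    // is the tail of the reversed chunk (x.next == null until the chunks are
    // glued) and the returned node is the chunk's new head.
    static Node reverseChunk(Node x, int n) {
        Node prev = null, current = x;
        for (int i = 0; i < n && current != null; i++) {
            Node next = current.next;   // remember the rest of the chunk
            current.next = prev;        // flip the pointer back
            prev = current;
            current = next;
        }
        return prev;
    }
}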
I am building something like a delay-line: one process RPUSHes objects into a list, another LPOPs them out in the same order.
The trick is that objects should only be popped from the list one hour after they have been added. I could use a time-stamp for that. The delay is the same for all items and never changes.
Now how do I implement the pop in a concurrency-friendly way (so that it still works when several workers access that list)? I could take out an item, check the timestamp, and put it back into the list if it's still too early; but if several workers do that simultaneously, it may mess up the order of items. I could check the first item and only pop it if it's due; but another worker might have popped it by then, so I would pop the wrong one.
Should I use the WATCH command? How? Should I use sorted sets instead of a list? Help appreciated!
I'd suggest using a sorted set. Entries go into the zset with the normal identifier as the member and a Unix-style timestamp as the score. Make the timestamps the date+time after which each entry is ready for parsing. Workers do a ZPOP, which isn't a built-in command but can be emulated with:
MULTI
ZRANGE <key> 0 0 WITHSCORES
ZREMRANGEBYRANK <key> 0 0
EXEC
Capture the results of the ZRANGE and you have the element with the lowest score at that time, already removed from the set, along with its score. Put it back if it's not due yet with ZADD <key> <score> <item>.
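To address the WATCH part of the question: here is a minimal sketch of a "pop only if due" loop, assuming the Jedis Java client and a zset named delayline (both names are mine). It never removes an entry before its timestamp, and the WATCH/MULTI pair makes the check-then-remove atomic:

import java.util.Collection;
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class DelayLineWorker {

    // Returns the oldest due entry, or null if nothing is due yet.
    static String popDue(Jedis jedis) {
        while (true) {
            long now = System.currentTimeMillis() / 1000L;
            jedis.watch("delayline");    // abort the MULTI if the set changes under us
            Collection<String> due = jedis.zrangeByScore("delayline", 0, now, 0, 1);
            if (due.isEmpty()) {
                jedis.unwatch();
                return null;             // nothing has reached its timestamp yet
            }
            String item = due.iterator().next();
            Transaction t = jedis.multi();
            t.zrem("delayline", item);
            List<Object> result = t.exec();
            if (result != null && !result.isEmpty()) {
                return item;             // a null/empty result means another worker
            }                            // raced us, so loop and try again
        }
    }
}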
Putting the item back in the list after checking it won't mess up the ordering enough to matter - by having any sort of concurrency you accept that the order won't be strictly defined. However, that approach won't be particularly efficient as you can't check the timestamp without some processing of the item.
With the scripting build you could implement a conditional pop command for sorted sets, though I don't think that is the easiest solution.
There are a couple of ways to implement basic scheduling using multiple lists:
Add tasks to a sorted set and have a single worker responsible for moving them from the sorted set to a list that multiple workers access with pop (sketched after this list).
Have a list for each minute, and have the workers read from past lists. You can minimize the number of lists checked by using another key that tracks the oldest non-empty queue.
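A sketch of the first option, again assuming Jedis (the key names delayline and ready are mine): a single scheduler thread moves due entries onto a plain list, and any number of workers simply LPOP or BLPOP from that list:

import java.util.Collection;
import redis.clients.jedis.Jedis;

public class Scheduler {
    public static void main(String[] args) throws InterruptedException {
        Jedis jedis = new Jedis("localhost");
        while (true) {
            long now = System.currentTimeMillis() / 1000L;
            // Everything whose score (due time) has passed:
            Collection<String> due = jedis.zrangeByScore("delayline", 0, now);
            for (String item : due) {
                jedis.rpush("ready", item);     // workers BLPOP "ready"
                jedis.zrem("delayline", item);
            }
            Thread.sleep(1000);                 // poll once a second
        }
    }
}

Because only one mover touches the sorted set, there is no race between checking the score and removing the entry.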
I have two threaded methods running in two separate places but sharing access at the same time to a list object (let's call it PriceArray). The first thread adds and removes items from PriceArray when necessary (the contents of the array get updated from a third-party data provider), and the average update rate is between 0.5 and 1 second.
The second thread only reads - for now - the contents of the array every 3 seconds using a foreach loop (it takes most items, but not all of them).
To avoid the nasty "Collection was modified; enumeration operation may not execute" exception when the second thread loops through the array, I have wrapped the add and remove operations in the first thread with lock(PriceArray) to ensure exclusive access. The problem is that I have noticed a performance issue: when the second method tries to loop through the array items, most of the time the array is locked by the add/remove thread.
Having the scenario running this way, do you have any suggestions how to improve the performance using other thread-safety/exclusive access tactics in C# 4.0?
Thanks.
Yes, there are many alternatives.
The best/easiest would be to switch to an appropriate collection in System.Collections.Concurrent. These are all thread-safe collections, and they will allow you to use them without managing your own locks. They are typically either lock-free or use very fine-grained locking, so they will likely dramatically reduce the performance impact you're seeing from the synchronization.
Another option would be to use ReaderWriterLockSlim so that your readers do not block each other. Since a third-party library is writing this array, this may be a more appropriate solution. It would block everything during writes, but readers would not need to block each other during reads.
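The pattern looks like this; I'm sketching it in Java with ReentrantReadWriteLock because the shape is the same as C#'s ReaderWriterLockSlim (read lock around the enumeration, write lock around add/remove):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PriceStore {
    private final List<double[]> prices = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void update(double[] price) {
        lock.writeLock().lock();      // writers are exclusive
        try {
            prices.add(price);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void readAll() {
        lock.readLock().lock();       // readers block writers, but not each other
        try {
            for (double[] p : prices) {
                // ... consume p ...
            }
        } finally {
            lock.readLock().unlock();
        }
    }
}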
My suspicion is that ArrayList.Remove() takes most of the time, because in order to perform a deletion it does two costly things:
a linear search: it takes elements one by one and compares them with the element being removed;
once the index of the element being removed is found, it shifts everything after it one position to the left.
Thus every deletion takes time proportional to the number of elements currently in the collection.
So you should try to replace the ArrayList with a structure more appropriate for this task. I'd need more information about your case to suggest which one to choose.