Cluster tiny tasks in large DAG to large tasks in tiny DAG - multithreading

I have a task DAG (Directed Acyclic Graph) of 3000+ vertices (i.e.: tasks). For parallelisation, I want to group/cluster the tiny tasks into a smaller number of large tasks, such that each clustered task is suitable for running as a job on a thread. This DAG is executed millions of times, so finding an efficient schedule for multicore processing would be very desirable.
(Optional) Context: As a first and simple step, I search for disconnected components in the DAG and launch each component on a thread. However, I have one very large connected component of 900 vertices, whereas most other components have only 1 to 20 vertices. This 900-vertex component still takes most of the time: one processor core works through this component while my other cores deal with the remaining 2100 nodes in parallel. Those 2100 tasks finish before the 900 do, hence this large component is the bottleneck.
Question: I'm thus looking for an algorithm to cluster these tiny tasks of a DAG into larger tasks, while preserving the semantics. Am I missing some terminology to search for this?
Simple example: Consider this task-DAG:
   .--> B --> C --> D -.
  /                     \
 /                       |
/                        v
A --> E --> F --> G ---> H --> I --> J
which can be reduced to:
   .--> (BCD) -.
  /             \
 /               |
/                v
A ----> (EFG) --> H --> (IJ)
which can be reduced to:
   .--> (BCD) -.
  /             \
 /               |
/                v
A ----> (EFG) --> (HIJ)
which can be reduced even further to:
(ABCDEFGHIJ)
Note that:
the last situation is the clustering of all tasks onto a single thread;
the first DAG has too many tiny tasks and will incur a lot of thread-synchronisation overhead;
the second DAG is a fine reduction, as it can schedule two things in parallel. However, it still has the disadvantage that H and (IJ) are two separate tasks, which could be merged to avoid synchronisation overhead;
the third DAG is a good reduction, as it can schedule two things in parallel (a sketch of this chain-merging reduction follows below).
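For instance, here is a minimal sketch of the chain-merging reduction, assuming networkx for the graph plumbing (the library choice and the helper name merge_linear_chains are illustrative, not from the question):

import networkx as nx

def merge_linear_chains(dag):
    # Merge an edge u -> v whenever u has exactly one successor and v has
    # exactly one predecessor; such a merge can never lose parallelism.
    g = dag.copy()
    changed = True
    while changed:
        changed = False
        for u, v in list(g.edges()):
            if g.out_degree(u) == 1 and g.in_degree(v) == 1:
                g = nx.relabel_nodes(
                    nx.contracted_nodes(g, u, v, self_loops=False),
                    {u: u + v})  # 'B' + 'C' -> task 'BC'
                changed = True
                break
    return g

g = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'H'),
                ('A', 'E'), ('E', 'F'), ('F', 'G'), ('G', 'H'),
                ('H', 'I'), ('I', 'J')])
print(sorted(merge_linear_chains(g).nodes()))  # ['A', 'BCD', 'EFG', 'HIJ']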
More complex example: Somewhat more realistic in my scenario is the following DAG structure. Consider 100 tasks Ai and Bi, where Bi depends on Ai and these 100 instantiations are independent of each other. One task C then combines the results of all 100 Bi tasks.
A_0 --> B_0 \
             \
A_1 --> B_1 -+
             |
A_2 --> B_2 -+--> C
             |
A_3 --> B_3 -+
             |
    ...      |
             |
A_99 -> B_99 /
The first reduction would cluster the Ai and Bi together into an (AB)i.
AB_0 \
      \
AB_1 -+
      |
AB_2 -+--> C
      |
AB_3 -+
      |
 ...  |
      |
AB_99 /
However, we still have 100 parallel paths in the DAG. Since I have only n (e.g. 4) processors, I would like to cluster these 100 parallel paths into n jobs (e.g. of 25 paths each), such that I save on the synchronisation overhead of the 100 paths and only have n paths to synchronise.
AB_[00..24] \
             \
AB_[25..49] -+
             |
AB_[50..74] -+--> C
             /
AB_[75..99] /
So in this example, we started from 201 tasks, and merged them into 5 tasks suitable to run on 4 threads. This kind of intelligent merging of tasks is what I'm after.
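A hypothetical sketch of this packing step (plain Python, names invented): split the list of independent chains into n roughly equal jobs, where each job runs its chains sequentially and task C waits on the n jobs:

def chunk_chains(chains, n_threads):
    # chains: a list of task lists, e.g. [['A_0', 'B_0'], ['A_1', 'B_1'], ...]
    size = -(-len(chains) // n_threads)  # ceiling division
    return [sum(chains[i:i + size], [])  # flatten each slice into one job
            for i in range(0, len(chains), size)]

chains = [['A_%d' % i, 'B_%d' % i] for i in range(100)]
jobs = chunk_chains(chains, 4)
print(len(jobs), len(jobs[0]))  # 4 50  (each job = 25 A-tasks + 25 B-tasks)

Each job is then submitted as one unit to a thread pool, so only n join points remain before C.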

Related

Spark Geolocated Points Clustering

I have a dataset of points of interest on the maps like the following:
ID  latitude    longitude
1   48.860294   2.338629
2   48.858093   2.294694
3   48.8581965  2.2937403
4   48.8529717  2.3477134
...
The goal is to find those clusters of points that are very close to each other (distance less than 100m).
So the output I expect for this dataset would be:
(2, 3)
The point 2 and 3 are very close to each other with a distance less than 100m, while the others are far away so they should be ignored.
Since the dataset is huge with all the points of interest in the world, I need to do it with Spark with some parallel processing.
What approach should I take for this case?
I actually solved this problem using the following 2 approaches:
DBSCAN algorithm implemented as Spark job with partitioning
https://github.com/irvingc/dbscan-on-spark
GeoSpark with spacial distance join
https://github.com/DataSystemsLab/GeoSpark
Both of them are based on Spark, so they work well with large-scale data.
However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with the distance join.
I would love to do a cross join here; however, that probably won't work since your data is huge.
One approach is to partition the data by region. That means you can transform the input data into:
ID  latitude    longitude  latitude_int  longitude_int  group_unique_id
1   48.860294   2.338629   48            2              48_2
2   48.858093   2.294694   48            2              48_2
3   48.8581965  2.2937403  48            2              48_2
4   48.8529717  2.3477134  48            2              48_2
The assumption here is that if the integer portion of the lat/long changes, that implies a deviation of more than 100 m.
Now you can partition the data w.r.t group_unique_id and then do a cross join per partition.
This will probably reduce the workload.
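For illustration, here is a minimal PySpark sketch of this grid-partitioning idea (the file name points.csv and the equirectangular distance formula are my assumptions; the ID/latitude/longitude columns follow the example above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('points.csv', header=True, inferSchema=True)  # ID, latitude, longitude

# Bucket points by the integer part of their coordinates.
df = (df
      .withColumn('lat_int', F.floor('latitude'))
      .withColumn('lon_int', F.floor('longitude'))
      .withColumn('group_unique_id', F.concat_ws('_', 'lat_int', 'lon_int')))

# Cross join only within each bucket, keeping each unordered pair once.
a, b = df.alias('a'), df.alias('b')
pairs = a.join(b, 'group_unique_id').where(F.col('a.ID') < F.col('b.ID'))

# Equirectangular approximation of the distance in metres (fine at the 100 m scale).
dlat = F.radians(F.col('b.latitude') - F.col('a.latitude'))
dlon = F.radians(F.col('b.longitude') - F.col('a.longitude')) * F.cos(F.radians(F.col('a.latitude')))
close = pairs.where(F.sqrt(dlat * dlat + dlon * dlon) * 6371000.0 < 100.0)
close.select('a.ID', 'b.ID').show()

As with the assumption above, pairs that straddle a cell border are missed; checking neighbouring cells as well would close that gap.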

Scala parallel collections workload balancing strategies

I've been toying around with the Scala parallel collections and I was wondering if there was a way to easily define what workload balancing strategy to use.
For instance, let's say we're calculating how many prime numbers we have between 1 and K = 500 000:
def isPrime(k: Int) = (2 to k/2).forall(k % _ != 0)
Array.range(1, 500*1000).par.filter(isPrime).length
If all .par is doing is dividing the data to be processed into contiguous blocks, then there's not much advantage in parallelizing this algorithm, as the last blocks would dominate the total running time anyway.
On the other hand, running this algorithm such that each thread gets an evenly distributed share of the work would solve the issue (by having each of the N threads start at an index x ∈ {0, ..., N-1} and then work only on the elements at indices x + kN).
I would like to avoid having to write such boilerplate code. Is there some parameter that would allow me to easily tell Scala's library how to do this?
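Scala specifics aside, the strided scheme itself is easy to express. A hypothetical Python sketch of the same decomposition (names invented; processes rather than threads, since the work is CPU-bound):

from concurrent.futures import ProcessPoolExecutor

def is_prime(k):
    # Plain trial division, close to the Scala isPrime above.
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def count_primes_strided(worker, n_workers, limit):
    # Worker x handles indices x+1, x+1+N, x+1+2N, ..., so cheap and
    # expensive elements are spread evenly across workers.
    return sum(1 for k in range(1 + worker, limit, n_workers) if is_prime(k))

if __name__ == '__main__':
    N, LIMIT = 4, 500 * 1000
    with ProcessPoolExecutor(max_workers=N) as pool:
        counts = pool.map(count_primes_strided, range(N), [N] * N, [LIMIT] * N)
    print(sum(counts))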

Calculating the size of a full outer join in pandas

tl;dr
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of a full outer merge when using Pandas DataFrames as part of a combinatorics graph.
Questions (repeated below).
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)? Databases are out of the question.
The problem
Recently I re-wrote the py-upset graphing library to make it more efficient in terms of time when calculating combinations across DataFrames. I'm not looking for a review of this code; it works perfectly well in most instances, and I'm happy with the approach. What I am looking for now is the answer to a very specific problem, uncovered when working with large data-sets.
The approach I took with the re-write was to formulate an in-memory merge of all provided dataframes on a full outer join as seen on lines 480 - 502 of pyupset.resources
for index, key in enumerate(keys):
    frame = self._frames[key]
    frame.columns = [
        '{0}_{1}'.format(column, key)
        if column not in self._unique_keys
        else
        column
        for column in self._frames[key].columns
    ]
    if index == 0:
        self._merge = frame
    else:
        suffixes = (
            '_{0}'.format(keys[index-1]),
            '_{0}'.format(keys[index]),
        )
        self._merge = self._merge.merge(
            frame,
            on=self._unique_keys,
            how='outer',
            copy=False,
            suffixes=suffixes
        )
For small to medium dataframes, using joins works incredibly well. In fact, recent performance tests have shown that it'll handle 5 or 6 data-sets containing tens of thousands of lines each in less than a minute, which is more than ample for the application structure I require.
The problem now moves from time based to memory based.
Given datasets of potentially 100s of thousands of records, the library very quickly runs out of memory even on a large server.
To put this in perspective, my test machine for this application is an 8-core VMWare box with 128GiB RAM running Centos7.
Given the following dataset sizes, when adding the 5th dataframe, memory usage spirals exponentially. This was pretty much anticipated but underlines the heart of the problem I am facing.
  Rows  | Dataframe
--------+-----------------
  13963 | dataframe_one
  48346 | dataframe_two
  52356 | dataframe_three
 337292 | dataframe_four
  49936 | dataframe_five
  24542 | dataframe_six
 258093 | dataframe_seven
  16337 | dataframe_eight
These are not "small" dataframes in terms of the number of rows although the column count for each is limited to one unique key + 4 non-unique columns. The size of each column in pandas is
column | type | unique
--------------------------
X | object | Y
id | int64 | N
A | float64 | N
B | float64 | N
C | float64 | N
This merge can cause problems as memory is eaten up. Occasionally it aborts with a MemoryError (great, I can catch and handle those), other times the kernel takes over and simply kills the application before the system becomes unstable, and occasionally, the system just hangs and becomes unresponsive / unstable until finally the kernel kills the application and frees the memory.
Sample output (memory sizes approximate):
[INFO] Creating merge table
[INFO] Merging table dataframe_one
[INFO] Data index length = 13963 # approx memory <500MiB
[INFO] Merging table dataframe_two
[INFO] Data index length = 98165 # approx memory <1.8GiB
[INFO] Merging table dataframe_three
[INFO] Data index length = 1296665 # approx memory <3.0GiB
[INFO] Merging table dataframe_four
[INFO] Data index length = 244776542 # approx memory ~13GiB
[INFO] Merging table dataframe_five
Killed # > 128GiB
When the merge table has been produced, it is queried in set combinations to produce graphs similar to https://github.com/mproffitt/py-upset/blob/feature/ISSUE-7-Severe-Performance-Degradation/tests/generated/extra_additional_pickle.png
The approach I am trying to build for solving the memory issue is to look at the sets being offered for merge, pre-determine how much memory the merge will require, then if that combination requires too much, split it into smaller combinations, calculate each of those separately, then put the final dataframe back together (divide and conquer).
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of the merge.
Questions (repeated from above)
1. The ideal solution would be to not require the merge and to query panel objects. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or write it at a lower level (C, Cython)?
Apologies for the lengthy question. I'm happy to provide more information if required or possible.
Can anybody shed some light on what might be the reason for this?
Thank you.
Question 1.
Dask shows a lot of promise in being able to calculate the merge table "out of memory" by using hdf5 files as a temporary store.
By using multi-processing to create the merges, dask also offers a performance increase over pandas. Unfortunately this is not carried through to the query method so performance gains made on the merge are lost on querying.
It is still not a completely viable solution as dask may still run out of memory on large, complex merges.
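For reference, the dask pattern is roughly the following minimal sketch (not the actual py-upset code; left_pdf, right_pdf and the file name are placeholders):

import dask.dataframe as dd

# left_pdf / right_pdf are ordinary pandas DataFrames (placeholders).
left = dd.from_pandas(left_pdf, npartitions=8)
right = dd.from_pandas(right_pdf, npartitions=8)

merged = dd.merge(left, right, on='X', how='outer')  # lazy, partitioned merge
merged.to_hdf('merge-*.h5', key='merge')             # spill to hdf5 instead of RAM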
Question 2.
Pre-calculating the size of the merge is entirely possible using the following method.
1. Group each dataframe by the unique key and calculate the size of each group.
2. Create a set of key names for each dataframe.
3. Create the intersection of the sets from 2.
4. Create the set differences: left keys minus the intersection, and right keys minus the intersection.
5. To accommodate np.nan stored in the unique key, count the NaN values on each side. If one frame contains NaN and the other doesn't, write the other as 1.
6. For keys in the intersection, multiply the counts from each groupby('...').size().
7. Add the counts from the set differences.
8. Add the product of the NaN counts.
In Python this could be written as:
def merge_size(left_frame, right_frame, group_by):
    # Group sizes per key on each side (groupby drops NaN keys).
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)

    # Keys present on both sides, and keys unique to one side.
    intersection = right_keys & left_keys
    left_sub_right = left_keys - intersection
    right_sub_left = right_keys - intersection

    # 'X != X' is true only for NaN, so this counts NaN keys per side.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    # Matched keys multiply; unmatched keys contribute their own counts.
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_sub_right]
    sizes += [right_groups[group_name] for group_name in right_sub_left]
    sizes += [left_nan * right_nan]
    return sum(sizes)
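A small, hypothetical sanity check of merge_size against an actual outer merge (the frames and values are invented for illustration):

import numpy as np
import pandas as pd

left = pd.DataFrame({'X': ['a', 'a', 'b', np.nan], 'A': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'X': ['a', 'c', np.nan], 'B': [5.0, 6.0, 7.0]})

predicted = merge_size(left, right, 'X')
actual = len(left.merge(right, on='X', how='outer'))
# 'a' matches 2 x 1 rows, 'b' and 'c' are unmatched singletons,
# and pandas pairs the NaN join keys with each other.
print(predicted, actual)  # 5 5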
Question 3.
This method is fairly computation-heavy and would be better written in Cython for performance gains.

How to define a good partition plan to ensure CPU balance in JSR 352?

JSR 352 - Batch Applications for the Java Platform - provides a parallelism feature using partitions. The batch runtime can execute a step in multiple partitions in order to accelerate progress. JSR 352 also introduces the threads attribute: we can define the number of threads to use, such as
<step id="Step1">
    <chunk .../>
    <partition>
        <plan partitions="3" threads="2"/>
    </partition>
</step>
Then I feel confused: how do I devise an appropriate partition plan so that each thread is occupied and the CPU load is balanced?
For example, there are tables A, B, and C to process, with respectively 1 billion, 1 million, and 1 thousand rows. The step aims to process these entities into documents, one entity to one document. The order of document production is not important. The CPU time per entity for these tables is respectively 1 s, 2 s, and 5 s. The number of threads is 4.
If there are 3 partitions, one per table type, then the step will take 1 * 10^9 seconds to finish, because:
Partition A will take 1 * 10^9 * 1 s = 1 * 10^9 s, running on thread 2
Partition B will take 1 * 10^6 * 2 s = 2 * 10^6 s, running on thread 3
Partition C will take 1 * 10^3 * 5 s = 5 * 10^3 s, running on thread 4
However, while thread 2 is still occupied, thread 3 has been idle since 2 * 10^6 s and thread 4 since 5 * 10^3 s. So obviously, this is not a good partition plan.
My questions are:
Is there a better partition plan for the above example?
Can I consider partitions as a queue that the threads consume?
In general, how many threads can I / should I use? Is that the same as the number of CPU cores?
In general, how do I devise an appropriate partition plan so that each thread is occupied and the CPU load is balanced?
Answers...
Is there a better partition plan for the above example?
Yes, there is. See answer 4...
Can I consider partitions as a queue that the threads consume?
That is exactly what happens!
In general, how many threads can I / should I use? Is that the same as the number of CPU cores?
It depends. This question has many perspectives... From the JSR 352 Specification View, "threads":
Specifies the maximum number of threads on which to execute the partitions
of this step. Note the batch runtime cannot guarantee the requested number of threads are available; it will use as many as it can up to the requested maximum. This is an optional attribute. The default is the number of partitions.
So, based only on this perspective, you should set this value as high as you want (the batch runtime will set the real limit, according to its resources!).
From the Batch Runtime Perspective (the JSR 352 implementation): any decent implementation will use a thread pool to execute the partitioned steps. So, if such a pool has a fixed size of N, no matter how high you set your threads value, you will never execute more than N partitions concurrently.
JBeret is an implementation of the JSR 352 specification, used by the WildFly server (it is the implementation that I've used). In WildFly, it has a default thread pool setting of max 10 threads. This pool is not only shared between partitioned steps, it is also shared between batch jobs. So, if you're running 2 jobs at the same time, you will have 2 fewer threads available. On top of that, when you partition, one thread takes the role of coordinator, assigning partitions to the other threads and waiting for results... so if your partition plan says it uses 2 threads, it will in fact use 3 (two as workers, one as coordinator), and all these resources (threads) are taken from the same pool!
Anyway, the important thing in all this is: investigate which implementation of JSR 352 you are using and configure it accordingly.
From the Hardware View, your CPU has a maximum thread count. Under this perspective (and as a rule of thumb), set the "threads" value equal to that limit.
From the Performance View, analyze the work you are doing. If you're accessing a shared resource (like a DB) from many threads, you can produce a bottleneck causing thread blocking. If you face that kind of problem, consider lowering the "threads" value.
In summary, set the "threads" value as high as the CPU's maximum thread limit. Then check that this value does not cause blocking issues; if it does, reduce it. Also, verify that the batch runtime is configured accordingly and allows you to execute as many threads as you desire.
In general, how do I devise an appropriate partition plan so that each thread is occupied and the CPU load is balanced?
Avoid the use of static partition plans (at least for your case). Instead, use a Partition Mapper. A Partition Mapper is a class that implements the javax.batch.api.partition.PartitionMapper interface and allows you to define a partition plan (how many partitions, how many threads, the properties of each partition) programmatically. So for your case, take your tables (A, B, C) and split them into blocks of N (where N = 1000); each block becomes a partition. Start with the partitions of type C and round-robin between your entity partitions (tables): C0, B0, A0, B1, A1, ..., B999, A999, A1000, ..., A999999. Using this scheme, entity C will finish first, leaving one thread free to work on more A and B partitions; later, B will finish, leaving more resources to attack the remaining A partitions. A small sketch of this interleaving follows.
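The PartitionMapper itself is Java, but the interleaving order is easy to sketch. A hypothetical Python illustration (the block size and table sizes are invented; each yielded tuple would become one partition's properties in the mapper):

from itertools import zip_longest

def round_robin_partitions(table_sizes, block_size=1000):
    # One list of (table, start_row, end_row) blocks per table,
    # smallest table first so it drains earliest.
    per_table = [
        [(name, start, min(start + block_size, rows))
         for start in range(0, rows, block_size)]
        for name, rows in sorted(table_sizes.items(), key=lambda kv: kv[1])
    ]
    # Emit one block per table in turn: C0, B0, A0, B1, A1, ...
    for wave in zip_longest(*per_table):
        for block in wave:
            if block is not None:
                yield block

# Tiny demo (the real sizes would be 10**9, 10**6, 10**3):
for part in round_robin_partitions({'A': 5000, 'B': 3000, 'C': 1000}):
    print(part)  # ('C', 0, 1000), ('B', 0, 1000), ('A', 0, 1000), ('B', 1000, 2000), ...

The batch runtime's worker threads then drain this queue of partitions, which is exactly the queue-consumption behaviour described in answer 2.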
Hope this helps...

Resource Optimization: which one to use Knapsack, brute-force or any other approach?

In a resource allocation problem I have n bucket sizes and m resources. Resources should be allocated to buckets in such a way that utilization is maximised. I need to write the algorithm in Node.js.
Here's the problem: let's say I have 2 buckets of sizes 50 and 60 respectively, and resource sizes 20, 25, and 40. Following is a more complete representation with possible solutions:
Solution 1:
| Bucket Size | Resource(s) allocated | Utilization   |
| 50          | 20, 25                | 45/50 = 0.9   |
| 60          | 40                    | 40/60 = 0.667 |
Total utilization in this case is >1.5
Solution 2:
| Bucket Size | Resource(s) allocated | Utilization   |
| 50          | 25                    | 25/50 = 0.5   |
| 60          | 20, 40                | 60/60 = 1.0   |
Total utilization in this case is 1.5
Inference:
-- The knapsack approach will return Solution 2, because it optimizes for the larger bucket size.
-- The brute-force approach will return both solutions. One concern I have with this approach: given that I have to use Node.js, which is single-threaded, I am a little skeptical about performance when n (buckets) and m (resources) get very large.
Would brute force do just fine, or is there a better way/algorithm to solve this problem? Also, is the concern I've cited above valid in any sense?
The knapsack problem (and this is a knapsack problem) is NP-complete, which means you can find the optimal solution only by brute force or by algorithms whose worst-case complexity is the same as brute force but which can do better in the average case...
it is single threaded, i am little skeptic about performance when n
(buckets) and m (resources) will be very large.
I am not sure you know how things work. If you do not create child threads and manage them (which is not that easy), every standard language will run in one thread, and therefore on one processor. And if you want more processors that badly, you can create child threads even in Node.js.
Also, for complexity problems it does not matter if the solution takes a constant multiple longer. In your case, I suppose the "multiple" would be 4, if you have a quad-core.
There are two good solutions:
1) Backtracking - basically an advanced brute-force mechanism, which can in some cases return a solution much faster.
2) Dynamic programming - if the item values are relatively small, then while classic brute force is not able to find a solution for 200 items within the expected lifetime of the universe itself, the dynamic approach can give you a solution in (milli)seconds. A small sketch follows below.
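For illustration, here is the classic 0/1-knapsack dynamic program applied to a single bucket (in Python for brevity; a Node.js port is mechanical). Note that filling several buckets at once, as in the question, is a harder assignment problem and needs an outer search over buckets:

def best_fill(capacity, resources):
    # dp[c] = largest total resource size achievable within capacity c.
    # Runs in O(len(resources) * capacity), which is why small sizes matter.
    dp = [0] * (capacity + 1)
    for size in resources:
        # Iterate capacities downwards so each resource is used at most once.
        for c in range(capacity, size - 1, -1):
            dp[c] = max(dp[c], dp[c - size] + size)
    return dp[capacity]

print(best_fill(50, [20, 25, 40]))  # 45 -> utilization 45/50 = 0.9
print(best_fill(60, [20, 25, 40]))  # 60 -> utilization 60/60 = 1.0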
