There is a graph where vertices represent pieces of code and edges represent dependencies between them. Additionally, each vertex has two numbers: how many threads the corresponding piece of code can use (1, 2, ..., or "as many as there are cores"), and how much time it is estimated to take if it gets that many threads (relative to the others - for example, 1, 0.1 or 10). The idea is to run the pieces of code in parallel, respecting their dependencies, and to give them numbers of threads such that the total execution time is the smallest.
Is there some existing algorithm which would do that or which I could use as a base?
So far I have been thinking as follows. For example, suppose we have 8 threads total (so NT = 8T) and the following graph.
+----------------+ +----------------+
+-+ A: 0.2x, 1T +----+ | F: 0.1x, 1T |
| +---+------------+ | +---+------------+
| | | |
| +---v------------+ | +---v------------+
| | B: 0.1x, 2T +-+ | | G: 0.3x, NT +-+
| +----------------+ | | +----------------+ |
| | | |
| +----------------+ | | +----------------+ |
+-> C: 0.4x, 1T | | +----> H: 0.1x, 1T | |
+--+-------------+ | +--+-------------+ |
+----+ | | |
| +----------------+ | +--v-------------+ |
| | D: 0.1x, 1T <-+ | J: 1.5x, 4T <-+
| +--+-------------+ +-------+--------+
| | |
| +--v-------------+ |
+-> E: 1.0x, 4T +------------+ |
+----------------+ | |
+--v----v--------+
| I: 0.01x, 1T |
+----------------+
At task I we have 2 dependencies, E and J. As J's dependencies, we have F-G and A-H. For E, A-C and A-B-D. To get to J, we need 0.3x on A-H and 0.4x on F-G, but G needs many threads for that. We could first run A and F in parallel (each with a single thread). Then we would run G with 7 threads and, as A finishes, H with 1 thread. However, there's also the E branch. Ideally, we would like it to be ready to start no more than 0.5 later than J (E takes 1.0 against J's 1.5, so they would then finish together). In this case, it's quite easy: once A has been processed, the longest path to E takes 0.4 using one thread, and the other path takes less than that and uses just 2 threads - so we can run these calculations while J is running. But if, say, D took 0.6x, we would probably need to run it in parallel with G as well.
So I think I could start with the sink vertex and balance the weights of subgraphs on which it depends. But given these "N-thread" tasks, it's not particularly clear how. And considering that the x-numbers are just estimates, it would be good if it could make adjustments if particular tasks took more or less time than anticipated.
You can model this problem as a job shop scheduling problem (a flexible job shop problem in particular, where the machines are processors and the jobs are slices of programs to be run).
First, you have to modify your DAG a bit, in order to transform it into another DAG: the disjunctive graph representing your problem.
This transformation is very simple. For any node (i, t, nb_t) representing job i, which needs t seconds to be performed with 1 thread and can be parallelized across nb_t threads, do the following:
Replace (i, t, nb_t) with nb_t vertices (i_1, t/nb_t), ..., (i_nb_t, t/nb_t). For each incoming/outgoing edge of node i, create an incoming/outgoing edge from/to all of the newly created nodes. Basically, we just split each job that can be parallelized into smaller jobs that can be handled by several processors (machines) simultaneously.
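A minimal sketch of that transformation, assuming the DAG is held in a networkx DiGraph whose nodes carry t and nb_t attributes (the attribute names are just illustrative):

import networkx as nx

def split_parallel_jobs(g):
    # Each job (i, t, nb_t) becomes nb_t sub-jobs (i, 0), ..., (i, nb_t - 1),
    # each of duration t / nb_t.
    h = nx.DiGraph()
    for node, data in g.nodes(data=True):
        for k in range(data['nb_t']):
            h.add_node((node, k), t=data['t'] / data['nb_t'])
    # Every original edge i -> j fans out to all sub-job pairs, so the
    # original dependencies are preserved.
    for u, v in g.edges():
        for i in range(g.nodes[u]['nb_t']):
            for j in range(g.nodes[v]['nb_t']):
                h.add_edge((u, i), (v, j))
    return h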
You then have your disjunctive graph, which is the input to the job shop problem.
Then, all you need to do is solve this well-known problem; there are different options available.
I would advise using a MILP solver, but from the small search I just did, it seems like many metaheuristics can also tackle the problem (simulated annealing, genetic programming, ...).
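If you want to prototype the schedule directly rather than write the MILP by hand, a constraint solver works too. Below is a sketch using Google OR-Tools' CP-SAT solver, under two simplifying assumptions: each task is pinned to its maximum thread count, and the x-durations from the question are scaled by 100 to integers. The 8 threads are modeled as a cumulative resource.

from ortools.sat.python import cp_model

# (duration scaled by 100, threads); the "NT" task G gets all 8 threads here.
tasks = {
    'A': (20, 1), 'B': (10, 2), 'C': (40, 1), 'D': (10, 1), 'E': (100, 4),
    'F': (10, 1), 'G': (30, 8), 'H': (10, 1), 'J': (150, 4), 'I': (1, 1),
}
deps = [('A', 'B'), ('A', 'C'), ('A', 'H'), ('B', 'D'), ('C', 'E'),
        ('D', 'E'), ('F', 'G'), ('G', 'J'), ('H', 'J'), ('E', 'I'),
        ('J', 'I')]
TOTAL_THREADS = 8

model = cp_model.CpModel()
horizon = sum(d for d, _ in tasks.values())
start, end, interval = {}, {}, {}
for name, (dur, _) in tasks.items():
    start[name] = model.NewIntVar(0, horizon, 'start_' + name)
    end[name] = model.NewIntVar(0, horizon, 'end_' + name)
    interval[name] = model.NewIntervalVar(start[name], dur, end[name], name)

# A task may start only once all of its dependencies have finished.
for before, after in deps:
    model.Add(start[after] >= end[before])

# At any instant, the running tasks may use at most 8 threads in total.
model.AddCumulative([interval[n] for n in tasks],
                    [tasks[n][1] for n in tasks], TOTAL_THREADS)

# Minimize the makespan.
makespan = model.NewIntVar(0, horizon, 'makespan')
model.AddMaxEquality(makespan, [end[n] for n in tasks])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for name in tasks:
        print(name, solver.Value(start[name]), solver.Value(end[name]))

Variable thread counts could be modeled by giving each task one optional interval per allowed thread count and requiring exactly one of them to be performed; re-solving with updated duration estimates as tasks finish would also give you the runtime adjustment asked about in the question.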
Related
I have a "master" pipeline in Azure Data factory, which looks like this:
One rectangle is Execute pipeline activity for 1 destination (target) Table, so this "child" pipeline takes some data, transform it and save as a specified table. Essentialy this means that before filling table on the right, we have to fill previous (connected with line) tables.
The problem is that this master pipeline contains more than 100 activities and the limit for data factory pipeline is 40 activities.
I was thinking about dividing pipeline into several smaller pipelines (i.e. first layer (3 rectangles on the left), then second layer etc.), however this could cause pipeline to run a lot longer as there could be some large table in each layer.
How to approach this? What is the best practice here?
Had a similar issue at work, but I didn't use Execute Pipeline because it is a terrible approach in my case. I have more than 800 PLs to run, with multiple parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 regions in the US while reusing PLs). A simplified diagram of one of many cases I have can easily look like this:
The solution:
A master dependency table where to store all the dependencies:
| Job ID | dependency ID | level | PL_name |
|--------|---------------|-------|--------------|
| Token1 | | 0 | |
| L1Job1 | Token1 | 1 | my_PL_name_1 |
| L1Job2 | Token1 | 1 | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 | 2 | my_PL_name_3 |
| ... | ... | ... | ... |
From here it is a tree problem:
There are ways of mapping trees in SQL. Once you have all the dependencies mapped from the tree, put them in a stage or tracker table:
| Job ID | dependency ID | level | status | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 | | 0 | | | |
| L1Job1 | Token1 | 1 | Running | | |
| L1Job2 | Token1 | 1 | Succeeded | | |
| L2Job1 | L1Job1,L1Job2 | 2 |           |            |          |
| ... | ... | ... | ... | ... | ... |
We can easily query this table using a Lookup activity to get the level-1 PLs to run, use a ForEach activity to trigger each target PL with a dynamic Web activity, and then update the tracker table's status, start_date, end_date, etc. accordingly per PL.
There are only two PLs orchestrating:
one for mapping the tree and assigning some type of unique ID to that batch,
two for validation (it verifies the status of parent PLs and controls which PL to run next).
Note: both call a stored procedure with some logic depending on the case.
I have a recursive call to the validation PL each time a target pipeline ends:
Let's assume L1Job1 and L1Job2 are running in parallel:
L1Job1 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if L1Job1 and L1Job2 both have a Succeeded status.
If L1Job2 hasn't ended, the validation PL ends without triggering L2Job1.
Then L1Job2 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if L1Job1 and L1Job2 both have a Succeeded status.
L2Job1 starts running after passing the validations.
Repeat for each level.
This works because we have already mapped all the PL dependencies in the job tracker, so we know exactly which PLs should run.
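To make that concrete, here is a rough sketch of the gate in Python, with an in-memory dict standing in for the tracker table (in ADF this logic lives in the stored procedure; all names are hypothetical):

# Rows of the tracker table, keyed by Job ID.
tracker = {
    'L1Job1': {'deps': [],                   'status': 'Succeeded'},
    'L1Job2': {'deps': [],                   'status': 'Running'},
    'L2Job1': {'deps': ['L1Job1', 'L1Job2'], 'status': None},
}

def jobs_ready_to_run(tracker):
    # A job may be triggered once every dependency has Succeeded
    # and the job itself has not started yet.
    return [job for job, row in tracker.items()
            if row['status'] is None
            and all(tracker[d]['status'] == 'Succeeded' for d in row['deps'])]

print(jobs_ready_to_run(tracker))  # [] until L1Job2 succeeds, then ['L2Job1']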
I know this looks complicated, and maybe it can't be applied to your case, but I hope it gives you or others a clue on how to solve complex data workflows in Azure Data Factory.
Yes, as per the documentation, the maximum number of activities per pipeline (which includes inner activities for containers) is 40.
So the only option left is to split your pipeline into multiple smaller pipelines.
Please check the link below for the limits on ADF:
https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md
I have a big dataset, grouped by a certain field, on which I need to run descriptive statistics for each field.
Let's say the dataset is 200M+ records, and there are about 15 stat functions that I need to run - sum/avg/min/max/stddev, etc. The problem is that this task is very hard to scale, since there's no clear way to partition the dataset.
Example dataset:
+------------+----------+-------+-----------+------------+
| Department | PartName | Price | UnitsSold | PartNumber |
+------------+----------+-------+-----------+------------+
| Texas | Gadget1 | 5 | 100 | 5943 |
| Florida | Gadget3 | 484 | 2400 | 4233 |
| Alaska | Gadget34 | 44 | 200 | 4235 |
+------------+----------+-------+-----------+------------+
Right now I am doing it this way (example):
from collections import namedtuple
import pyspark.sql.functions as F

# Pairs an aggregate function with the alias used for the output column.
Function = namedtuple('Function', ['function', 'alias'])

columns_to_profile = ['Price', 'UnitsSold', 'PartNumber']
functions = [
    Function(F.mean, 'mean'),
    Function(F.min, 'min_value'),
    Function(F.max, 'max_value'),
    Function(F.variance, 'variance'),
    Function(F.kurtosis, 'kurtosis'),
    Function(F.stddev, 'std'),
    Function(F.skewness, 'skewness'),
    Function(count_zeros, 'n_zeros'),   # custom helper, not shown
    Function(F.sum, 'sum'),
    Function(num_hist, 'hist_data'),    # custom helper, not shown
]

# get_functions picks the applicable subset of `functions` per column type.
functions_to_apply = [f.function(c).alias(f'{c}${f.alias}')
                      for c in columns_to_profile
                      for f in get_functions(column_types, c)]

df.groupby('Department').agg(*functions_to_apply).toPandas()
The problem here is that the list of functions applied to each column is bigger than this (there are about 16-20 of them), yet the cluster spends most of its time shuffling and the CPU load is about 5-10%.
How should I partition this data? Or is my approach itself incorrect?
If the departments are skewed (i.e. Texas has 90% of the volume), what should my approach be?
This is my Spark DAG for this job:
I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists of grouping only by Cat A, to calculate a new aggregation, for example:
+-------+-----+
| Cat A | Cnt |
+-------+-----+
| A     | 3   |
| B     | 2   |
+-------+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume this is more efficient than aggregating the first DF again. Is that a good assumption? Or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking, it looks like a good approach and should be more efficient than aggregating the data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once. So when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce the memory requirements for the aggregation buffer during the second agg.
Unfortunately DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means that you cannot compute both data frames using a single shuffle based on the value of catA, and that the second aggregation will require a separate exchange (hash partitioning). I doubt it justifies switching to RDDs.
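For reference, a minimal pyspark rendering of the same two-step pattern, with the elided aggregations filled in as a count and a sum so that it reproduces the example tables (the session setup and column names are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'A', 'B'), (2, 'B', 'C'), (5, 'A', 'B'), (7, 'B', 'C'), (8, 'A', 'C')],
    ['ID', 'catA', 'catB'])

# First aggregation: group by both categorical columns.
df2 = df.groupBy('catA', 'catB').agg(F.count('*').alias('cnt'))
df2.show()

# Second aggregation: reuses df2, so after the action on df2 above the
# corresponding stages should show up as skipped in the Spark UI.
df3 = df2.groupBy('catA').agg(F.sum('cnt').alias('cnt'))
df3.show()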
There is an illustration in the kernel source's Documentation/memory-barriers.txt, like this:
CPU 1                           CPU 2
=======================         =======================
        { B = 7; X = 9; Y = 8; C = &Y }
STORE A = 1
STORE B = 2
<write barrier>
STORE C = &B                    LOAD X
STORE D = 4                     LOAD C (gets &B)
                                LOAD *C (reads B)
Without intervention, CPU 2 may perceive the events on CPU 1 in some
effectively random order, despite the write barrier issued by CPU 1:
+-------+ : : : :
| | +------+ +-------+ | Sequence of update
| |------>| B=2 |----- --->| Y->8 | | of perception on
| | : +------+ \ +-------+ | CPU 2
| CPU 1 | : | A=1 | \ --->| C->&Y | V
| | +------+ | +-------+
| | wwwwwwwwwwwwwwww | : :
| | +------+ | : :
| | : | C=&B |--- | : : +-------+
| | : +------+ \ | +-------+ | |
| |------>| D=4 | ----------->| C->&B |------>| |
| | +------+ | +-------+ | |
+-------+ : : | : : | |
| : : | |
| : : | CPU 2 |
| +-------+ | |
Apparently incorrect ---> | | B->7 |------>| |
perception of B (!) | +-------+ | |
| : : | |
| +-------+ | |
The load of X holds ---> \ | X->9 |------>| |
up the maintenance \ +-------+ | |
of coherence of B ----->| B->2 | +-------+
+-------+
: :
I don't understand. Since we have a write barrier, any store must have taken effect by the time C = &B is executed, which means B would equal 2. For CPU 2, B should have been 2 when it got the value of C, which is &B; why would it perceive B as 7? I am really confused.
The key missing point is the mistaken assumption that for the sequence:
LOAD C (gets &B)
LOAD *C (reads B)
the first load has to precede the second load. A weakly ordered architecture can act "as if" the following happened:
LOAD B (reads B)
LOAD C (reads &B)
if (C != &B)
    LOAD *C
else
    Congratulate self on having already loaded *C
The speculative "LOAD B" can happen, for example, because B was on the same cache line as some other variable of earlier interest, or because hardware prefetching grabbed it.
From the section of the document titled "WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?":
There is no guarantee that any of the memory accesses specified before a
memory barrier will be complete by the completion of a memory barrier
instruction; the barrier can be considered to draw a line in that CPU's
access queue that accesses of the appropriate type may not cross.
and
There is no guarantee that a CPU will see the correct order of effects
from a second CPU's accesses, even if the second CPU uses a memory
barrier, unless the first CPU also uses a matching memory barrier (see
the subsection on "SMP Barrier Pairing").
What memory barriers do (in a very simplified way, of course) is make sure neither the compiler nor in-CPU hardware perform any clever attempts at reordering load (or store) operations across a barrier, and that the CPU correctly perceives changes to the memory made by other parts of the system. This is necessary when the loads (or stores) carry additional meaning, like locking a lock before accessing whatever it is we're locking. In this case, letting the compiler/CPU make the accesses more efficient by reordering them is hazardous to the correct operation of our program.
When reading this document we need to keep two things in mind:
That a load means transmitting a value from memory (or cache) to a CPU register.
That unless the CPUs share a cache (or have no cache at all), it is possible for their cache systems to be momentarily out of sync.
Fact #2 is one of the reasons why one CPU can perceive data differently from another. Cache systems are designed to provide good performance and coherence in the general case, but they might need some help in specific cases like the ones illustrated in the document.
In general, as the document suggests, barriers in systems involving more than one CPU should be paired to force the system to synchronize the perception of both (or all participating) CPUs. Picture a situation in which one CPU completes loads or stores and the main memory is updated, but the new data has yet to be transmitted to the second CPU's cache, resulting in a lack of coherence across the CPUs.
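For instance, the stale read of B in the example above goes away once CPU 2 pairs CPU 1's write barrier with a data dependency barrier between its two loads, along the lines of the paired example in memory-barriers.txt:

CPU 1                           CPU 2
=======================         =======================
        { B = 7; X = 9; Y = 8; C = &Y }
STORE A = 1
STORE B = 2
<write barrier>
STORE C = &B                    LOAD X
STORE D = 4                     LOAD C (gets &B)
                                <data dependency barrier>
                                LOAD *C (now guaranteed to read B == 2)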
I hope this helps. I'd suggest reading memory-barriers.txt again with this in mind and particularly the section titled "THE EFFECTS OF THE CPU CACHE".
I have noticed that the sums of squares in my models can change fairly radically with even the slightest adjustment to my models. Is this normal? I'm using SPSS 16, and both models presented below used the same data and variables with only one small change - categorizing one of the variables as either a 2-level or a 3-level variable.
Details: using a 2 x 2 x 6 mixed-model ANOVA, with the 6 being the repeated measure, I get the following in the between-group analysis:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 4086.46 | 1 | 4086.46 | 104.93 | .000
X | 224.61 | 1 | 224.61 | 5.77 | .019
Y | 2.60 | 1 | 2.60 | .07 | .80
X by Y | 19.25 | 1 | 19.25 | .49 | .49
Error | 2570.40 | 66 | 38.95 |
Then, when I use the exact same data but a slightly different model, in which variable Y has 3 levels instead of 2, I get the following:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 3603.88 | 1 | 3603.88 | 90.89 | .000
X | 171.89 | 1 | 171.89 | 4.34 | .041
Y | 19.23 | 2 | 9.62 | .24 | .79
X by Y | 17.90 | 2 | 17.90 | .80 | .80
Error | 2537.76 | 64 | 39.65 |
I don't understand why variable X would have a different sum of squares simply because variable Y gets divided up into 3 levels instead of 2. This is also the case in the within-groups analysis.
Please help me understand :D
Thank you in advance
Pat
The Type III sum of squares for X tells you how much you gain when you add X to a model that already includes all the other terms. It appears that the 3-level Y variable is a much better predictor than the 2-level one: its SS went from 2.60 to 19.23. (This can happen, for example, if the effect of Y is quadratic: a cut at the vertex is not very predictive, but cutting into three groups would be better.) Thus there is less left for X to explain - its SS decreases.
Just adding to what Aniko has said: the reason variable X has a different sum of squares simply because variable Y gets divided up into 3 levels instead of 2 is that the SS formula for each factor depends on the number of samples in each treatment. When you change the number of levels in one factor, you change the number of samples per treatment, and this has an impact on the SS value for all the other factors.
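A quick way to see both effects is to simulate them. Here is a sketch in Python with statsmodels rather than SPSS (the data are made up: the outcome depends quadratically on an underlying continuous Y, so a 2-level cut of Y captures little while a 3-level cut captures more, and the Type III SS for X shifts accordingly):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 72
x = rng.integers(0, 2, n)                    # 2-level factor X
y = rng.uniform(0, 1, n)                     # underlying continuous Y
out = 2 * x + 4 * (y - 0.5) ** 2 + rng.normal(0, 1, n)  # quadratic in Y

d = pd.DataFrame({'X': x, 'out': out,
                  'Y2': pd.cut(y, 2, labels=False),   # Y cut into 2 levels
                  'Y3': pd.cut(y, 3, labels=False)})  # Y cut into 3 levels

# Type III SS (with sum-to-zero contrasts) for the two categorizations of Y.
m2 = ols('out ~ C(X, Sum) * C(Y2, Sum)', data=d).fit()
m3 = ols('out ~ C(X, Sum) * C(Y3, Sum)', data=d).fit()
print(sm.stats.anova_lm(m2, typ=3))
print(sm.stats.anova_lm(m3, typ=3))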