Conditional iteration in a single job in Apache Spark - apache-spark

I am working on an iterative algorithm using Apache Spark, which claims to be perfect for just that. The examples I have found so far creates a single job with a hardcoded number of iterations. I need the algorithm to run until a certain condition is met.
My current implementation launches a new job for each iteration something like this:
var data = sc.textFile(...).map().cache()
while(data.filter(...).isEmpty()) {
// Run the Algorithm (also handles caching)
val data = performStep(data)
}
This is pretty inefficient. Between each iteration I wait a long time for the next job to start. For four servers I wait around 10 seconds in between each job, for 32 servers is almost 100 seconds. In total I end up spending at least half of the runtime waiting in between jobs.
I find conditional iterations quite common in certain types of algorithms, for example early stopping criteria in machine learning. So I am hoping this can be improved.
Is there a more efficient way of doing this? For example away to run this conditional repetition in a single job? Thanks!

Related

Invisible Delays between Spark Jobs

There are 4 major actions(jdbc write) with respect to application and few counts which in total takes around 4-5 minutes for completion.
But the total uptime of Application is around 12-13minutes.
I see there are certain jobs by name run at ThreadPoolExecutor.java : 1149. Just before this job being reflected on Spark UI, the invisible long delays occur.
I want to know what are the possible causes for these delays.
My application is reading 8-10 CSV files, 5-6 VIEWs from table. Number of joins are around 59, few groupBy with agg(sum) are there and 3 unions are there.
I am not able to reproduce the issue in DEV/UAT env since the data is not that much.
It's in the production where I get the app. executed run by my Manager.
If anyone has come across such delays in their job, please share your experience what could be the potential cause for this, currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of cache in the coming union(yet to test, if union is the reason for delays)
Similarly, I tried the break the long chain of transformations with cache and count in between to break the long lineage.
The time reduced from initial 18 minutes to 12 minutes but the issue with invisible delays still persist.
Thanks in advance
I assume you don't have a CPU or IO heavy code between your spark jobs.
So it really sparks, 99% it is QueryPlaning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planing performance.

How to reduce white space in the task stream?

I have obtained task stream using distributed computing in Dask for different number of workers. I can observe that as the number of workers increase (from 16 to 32 to 64), the white spaces in task stream also increases which reduces the efficiency of parallel computation. Even when I increase the work-load per worker (that is, more number of computation per worker), I obtain the similar trend. Can anyone suggest how to reduce the white spaces?
PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.
Image for: No. of workers = 16
Image for: No. of workers = 32
Image for: No. of workers = 64
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can be caused by many reasons. I'll list a few below:
Very short tasks (sub millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images I don't think that any of these apply to you.
Instead, I see that there are gaps of inactivity followed by gaps of activity. My guess is that this is caused by some code that you are running locally. My guess is that your code looks like the following:
for i in ...:
results = dask.compute(...) # do some dask work
next_inputs = ... # do some local work
So you're being blocked by doing some local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or maybe it's the fault of your code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex software which performs really complex SQL queries (well not queries, Spark plans you know). <-- The plans are dynamic, they change based on user input so I can't "cache" them.
I've got a phase in which spark takes 1.5-2min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY" and it takes 1minute 20 seconds for the explain to execute.
I've trying breaking the lineage but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution is longer so :).
Thanks!,
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause for the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no-one provides more insights).
As I couldn't manage to reduce optimizer times (and well, not sure if reducing optimizer times would be good, as I may lose execution time).
One of the latest parts of my plan was scanning two big tables and getting one column from each one of them (using windows, aggregations etc...).
So I splitted my code in two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)
And added one more part:
3- Left Join/enrich the big plan with the output of "2" (this takes like 10seconds, the dataset is not so big) and finish the remainder computation.
Now I launch both actions (1,2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames and then wait+ afterwards perform 3.
With this, while Spark driver (thread 1) is calculating the big plan (~2minutes) the executors will be executing part "2" (which has a small plan, but big scans/shuffles) and then both get "mixed" in like 10-15seconds, which a good improvement in execution time over the 1:30 I save while calculating the plan.
Comparing times:
Before I would have
1:30 Spark optimizing time + 6 minutes execution time
Now I have
max
(
1:30 Spark Optimizing time + 4 minutes execution time,
0:02 Spark Optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not so much, but quite a few "expensive" people will be waiting for it to finish :)

Using Java 8 parallelStream inside Spark mapParitions

I am trying to understand the behavior of Java 8 parallel stream inside spark parallelism. When I run the below code, I am expecting the output size of listOfThings to be the same as input size. But that's not the case, I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine. Count matches every time.
// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
List listOfThings = IteratorUtils.toList(iterator);
return listOfThings.parallelStream.map(
//some stuff here
).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7
it's a very good question.
Whats going on here is Race Condition. when you parallelize the stream then stream split the full list into several equal parts [Based on avaliable threads and size of list] then it tries to run subparts independently on each avaliable thread to perform the work.
But you are also using apache spark which is famous for computing the work faster i.e. general purpose computation engine. Spark uses the same approach [parallelize the work] to perform the action.
Now Here in this Scenerio what is happening is Spark already parallelized the whole work then inside this you are again parallelizing the work due to this the race condition starts i.e. spark executor starts processing the work and then you parallelized the work then stream process aquires other thread and start processing IF THE THREAD THAT WAS PROCESSING STREAM WORK FINISHES WORK BEFORE THE SPARK EXECUTOR COMPLETE HIS WORK THEN IT ADD THE RESULT OTHERWISE SPARK EXECUTOR CONTINUES TO REPORT RESULT TO MASTER.
This is not a good approach to re-parallelize the work it will always gives you the pain let the spark do it for you.
Hope you understand whats going on here
Thanks

Spark: rdd.countApprox() vs rdd.count()

Could someone please explain the difference between RDD countApprox() vs count() and also if possible can answer which is the fastest ? it would be of great help we have a requirement where count() is very slow takes about 30 min's ** ...tried countApprox() it was **fast for the first run (**About 1.2 min) and then slowed to 30 min's .....
this is how we used it not sure if it's the best way to use
rdd.countApprox(timeout=800, confidence=0.5)
Count() - Returns you the number of elements in an RDD.
CountApprox - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the spark source code, support for countApprox is marked 'Experimental'.
With timeout=800, you should have seen an approximate count in <1min.
Are you sure nothing else is causing this slowdown of 30mins.
Share your code/code-snippet to get more accurate inputs from other members.
Not my answer, but there is a very useful and important answer here.
In very short, countApprax.getFinalValue blocks even if this is longer than the timeout.
getInitialValue does not block and so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitalValue the process will continue to final value.
If you are repeating this in a loop, the getFinalValue will be running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose
rdd.count() is an action, which is an eager operation.
This means that all the other transformations that you had written before that will start executing now because of Spark's lazy evaluation. So, essentially its not only Count() operation that's taking all the time but, all the other operations which were waiting to get executed.
Now coming back to the question of count() vs countApprox().
Count is just like doing a select count(*) from Table. countApprox can have a timeout and confidence level which returns back a result which is approximately correct and a number you can live with.
We should use countApprox when we are more interested in knowing an approximate number and save time for example in a streaming application.
Count() should be used when you need the exact count for example to log something or for auditing.

Resources