Spark: rdd.countApprox() vs rdd.count() - apache-spark

Could someone please explain the difference between RDD countApprox() and count(), and if possible tell me which is the fastest? It would be of great help - we have a requirement where count() is very slow, taking about 30 minutes. We tried countApprox(); it was fast on the first run (about 1.2 minutes) but then slowed back down to 30 minutes.
This is how we used it; not sure if it's the best way:
rdd.countApprox(timeout=800, confidence=0.5)

count() - Returns the number of elements in an RDD.
countApprox() - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the Spark source code, countApprox is marked 'Experimental'.
With timeout=800 (the timeout is in milliseconds), you should have seen an approximate count in well under a minute.
Are you sure nothing else is causing this 30-minute slowdown?
Share your code/code-snippet to get more accurate input from other members.
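For reference, a minimal Scala sketch of the two calls (the RDD contents are illustrative):
import org.apache.spark.partial.{BoundedDouble, PartialResult}

val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

val exact: Long = rdd.count()  // blocks until every task has finished

// Returns a PartialResult within ~800 ms of the job starting.
val approx: PartialResult[BoundedDouble] = rdd.countApprox(timeout = 800, confidence = 0.95)
println(approx.initialValue.mean)  // the estimate available within the timeout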

Not my answer, but there is a very useful and important answer here.
In short, countApprox(...).getFinalValue blocks even if this takes longer than the timeout.
getInitialValue does not block, so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitialValue, the job will continue running toward the final value.
If you are repeating this in a loop, the getFinalValue computation will be running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose.
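A short Scala sketch of that difference (countApprox returns a PartialResult):
val partial = rdd.countApprox(timeout = 800, confidence = 0.95)

// Available within the timeout -- does not wait for all tasks.
val quickEstimate = partial.initialValue

// Blocks until every task has finished, ignoring the timeout -- and, as
// noted above, the underlying job keeps running toward this value even
// if you never call it, which is what bites you in a loop.
val exactValue = partial.getFinalValue()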

rdd.count() is an action, which is an eager operation.
This means that all the transformations you wrote before it will start executing now, because of Spark's lazy evaluation. So essentially it's not only the count() operation that's taking all the time, but all the other operations that were waiting to be executed.
Now coming back to the question of count() vs countApprox().
count() is just like doing a select count(*) from a table. countApprox() takes a timeout and a confidence level and returns a result which is approximately correct - a number you can live with.
We should use countApprox() when we are more interested in knowing an approximate number and saving time, for example in a streaming application.
count() should be used when you need the exact count, for example to log something or for auditing.
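To illustrate the lazy-evaluation point, a sketch (the file name and functions are made up):
val valid = sc.textFile("events.log")  // lazy -- nothing runs yet
  .map(parseRecord)                    // lazy
  .filter(isValid)                     // lazy

valid.count()  // action: the whole chain above executes here, so a
               // 'slow count' is often really slow upstream transformations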

Related

Invisible Delays between Spark Jobs

There are 4 major actions (JDBC writes) in the application, plus a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see there are certain jobs named run at ThreadPoolExecutor.java:1149. The invisible long delays occur just before these jobs show up on the Spark UI.
I want to know what the possible causes of these delays are.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since the data is not that large.
It's in production, where my manager runs the application for me.
If anyone has come across such delays in their job, please share your experience of what the potential cause could be. Currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of the cache in the coming union (yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count in between, to break the long lineage.
The time reduced from the initial 18 minutes to 12 minutes, but the issue with the invisible delays still persists.
Thanks in advance
I assume you don't have CPU- or I/O-heavy code between your Spark jobs.
If it really is Spark, then 99% of the time it is query-planning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query-planning performance.
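A minimal sketch of such a listener (Scala; the println is just a placeholder for whatever metric sink you use):
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")  // inspect qe for plan-level details
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})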

Pyspark FP growth implementation running slow

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.
The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.
Is this the real reason? Is there any solution for this?
I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing columns isn't a solution either.
I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If there are a couple of transactions that have far more products than the average, the tree computed by FP-Growth blows up. This causes the runtime to increase significantly and makes the risk of memory errors much higher.
Hence, removing outlier transactions with a disproportionate number of products might do the trick.
Hope this helps you out a bit :)
Late answer, but I also had an issue with long FPGrowth wait times, and the above answer really helped. I implemented it as follows, filtering out anything above one standard deviation (this runs after the transactions have been grouped):
from pyspark.sql.functions import col, size
from pyspark.sql.functions import mean as _mean, stddev as _stddev

def clean_transactions(df):
    # "basket" is the grouped array column of products per transaction
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    max_ct = mean + std  # keep baskets within one standard deviation of the mean
    print("--filtering out outliers")
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex piece of software which performs really complex SQL queries (well, not queries - Spark plans, you know). <-- The plans are dynamic; they change based on user input, so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY", and the explain takes 1 minute 20 seconds to execute.
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guidance on how to improve the plan builder in Spark? (like, for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules and setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution got longer, so :).
Thanks!
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause of the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no one provides more insight).
I couldn't manage to reduce optimizer times (and I'm not sure reducing them would even be good, as I might lose execution time).
One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations etc...).
So I split my code in two parts:
1 - The big plan (cached)
2 - The small plan which scans and aggregates the two big tables (cached)
And added one more part:
3 - Left join/enrich the big plan with the output of "2" (this takes like 10 seconds; the dataset is not that big) and finish the remaining computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both, and afterwards perform 3.
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in like 10-15 seconds, which is a good improvement in execution time given the 1:30 I save while calculating the plan.
Comparing times:
Before I would have:
1:30 Spark optimizing time + 6 minutes execution time
Now I have:
max(
  1:30 Spark optimizing time + 4 minutes execution time,
  0:02 Spark optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not a huge win, but quite a few "expensive" people will be waiting for it to finish :)
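For reference, a minimal sketch of the driver-side parallelism described above (the DataFrame names and join key are placeholders):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each cache+count is an independent Spark job, so the executors work on
// the small plan while the driver thread is still optimizing the big one.
val bigF   = Future { bigPlanDf.cache().count(); bigPlanDf }     // part 1
val smallF = Future { smallAggDf.cache().count(); smallAggDf }   // part 2

val big   = Await.result(bigF, Duration.Inf)
val small = Await.result(smallF, Duration.Inf)

val enriched = big.join(small, Seq("key"), "left")               // part 3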

Spark cores & tasks concurrency

I have a very basic question about Spark. I usually run Spark jobs using 50 cores. While viewing the job progress, most of the time it shows 50 processes running in parallel (as it is supposed to), but sometimes it shows only 2 or 4 Spark processes running in parallel. Like this:
[Stage 8:================================> (297 + 2) / 500]
The RDDs being processed are repartitioned into more than 100 partitions, so that shouldn't be the issue.
I have an observation, though. I've noticed that most of the time this happens, the data locality in the Spark UI shows NODE_LOCAL, while at other times, when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me suspect that it happens because the data is cached on one node before processing, to avoid network overhead, and this slows down the further processing.
If this is the case, what's the way to avoid it? And if this isn't the case, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, a good place to start is checking whether the Spark instance is configured properly. There is a great Cloudera blog post about it.
However, if the problem isn't with the configuration (as was the case for me), then the problem is somewhere within your code. The issue is that sometimes, for various reasons (skewed joins, uneven partitions in data sources, etc.), the RDD you are working on gets a lot of data on 2-3 partitions while the rest of the partitions have very little data.
In order to reduce data shuffling across the network, Spark tries to have each executor process the data residing locally on that node. So 2-3 executors work for a long time, while the rest are done with their data in a few milliseconds. That's why I was experiencing the issue described in the question above.
The way to debug this problem is first of all to check the partition sizes of your RDD. If one or a few partitions are very big in comparison to the others, the next step is to look at the records in the large partitions, so that you can tell, especially in the case of skewed joins, which key is getting skewed. I've written a small function to debug this:
from itertools import islice

def check_skewness(df):
    sampled_rdd = df.sample(False, 0.01).rdd.cache()  # take a 1% sample for fast processing
    part_counts = sampled_rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
    max_part = max(part_counts, key=lambda item: item[1])
    min_part = min(part_counts, key=lambda item: item[1])
    if max_part[1] > 5 * max(min_part[1], 1):  # flag a >5x difference; guard against empty partitions
        print('Partitions Skewed: Largest Partition', max_part, 'Smallest Partition', min_part,
              '\nSample Content of the largest Partition:')
        print(sampled_rdd.mapPartitionsWithIndex(
            lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
    else:
        print('No Skewness: Largest Partition', max_part, 'Smallest Partition', min_part)
It gives me the smallest and largest partition sizes, and if the difference between the two is more than 5x, it prints 5 elements of the largest partition, which should give you a rough idea of what's going on.
Once you have figured out that the problem is a skewed partition, you can find a way to get rid of the skewed key, or you can repartition your dataframe, which will force the data to be equally distributed. You'll then see that all the executors work for roughly equal time, you'll see far fewer of the dreaded OOM errors, and processing will be significantly faster too.
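For example, a round-robin repartition (Scala here; the DataFrame name and partition count are illustrative) spreads rows evenly regardless of key:
val balanced = skewedDf.repartition(400)  // even partitions, at the price of a full shuffle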
These are just my two cents as a Spark novice. I hope Spark experts can add more to this issue, as I think a lot of newbies in the Spark world face this kind of problem far too often.

Conditional iteration in a single job in Apache Spark

I am working on an iterative algorithm using Apache Spark, which claims to be perfect for just that. The examples I have found so far create a single job with a hardcoded number of iterations. I need the algorithm to run until a certain condition is met.
My current implementation launches a new job for each iteration, something like this:
var data = sc.textFile(...).map(...).cache()
while (data.filter(...).isEmpty()) {
  // Run the algorithm (also handles caching)
  data = performStep(data)  // reassign the outer var instead of declaring a new val
}
This is pretty inefficient. Between each iteration I wait a long time for the next job to start. With four servers I wait around 10 seconds between jobs; with 32 servers it's almost 100 seconds. In total I end up spending at least half of the runtime waiting between jobs.
I find conditional iteration quite common in certain types of algorithms, for example early-stopping criteria in machine learning. So I am hoping this can be improved.
Is there a more efficient way of doing this? For example, a way to run this conditional repetition in a single job? Thanks!
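One commonly suggested mitigation (an assumption on my part, not something confirmed in this thread) is to checkpoint periodically inside the loop so the lineage, and with it the per-job planning overhead, stays bounded:
sc.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

var data = sc.textFile(...).map(...).cache()
var iter = 0
while (data.filter(...).isEmpty()) {
  data = performStep(data)
  iter += 1
  if (iter % 10 == 0) {   // every 10 iterations, cut the lineage
    data.checkpoint()
    data.count()          // force the checkpoint to materialize now
  }
}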
