Spark join always stuck on the same task, how can I debug? - apache-spark

I am using pyspark to run a join of this sort:
rdd1=sc.textFile(hdfs_dir1).map(lambda row: (getKey1(row),getData1(row)))
rdd2=sc.textFile(hdfs_dir2).map(lambda row: (getKey2(row),getData2(row)))
result=rdd1.join(rdd2).collect()
The job executes the first 300 tasks quite fast (~seconds each), and hangs when reaching task 301/308, even when I let it run for days.
I tried running the pyspark shell with different configurations (number of workers, memory, CPUs, cores, shuffle settings) and the result is always the same.
What could be the cause, and how can I debug it?

Has anyone been able to solve this problem? My guess is that the issue is caused by shuffling data between executors. Ridiculously, I used two small datasets (10 records each) with no missing keys and the join operation still got stuck; I eventually had to kill the instance. The only thing that helped in my case was cache().
If we take the above example:
rdd1=sc.textFile(hdfs_dir1).map(lambda row: (getKey1(row),getData1(row)))
rdd2=sc.textFile(hdfs_dir2).map(lambda row: (getKey2(row),getData2(row)))
# cache it
rdd1.cache()
rdd2.cache()
# I also tried rdd1.collect() and rdd2.collect() to get data cached
# then try the joins
result=rdd1.join(rdd2)
# I would get the answer
result.collect() # it works
I am not able to figure out why caching works, though (apparently it should have worked without cache() too).

collect() will try to fetch the result of your join onto the application driver node, and you will run into memory issues there.
The join operation will cause a lot of shuffling, but you can reduce this by using a Bloom filter. Construct a Bloom filter over the keys of one side, broadcast it, and use it to filter the other side. After applying this operation you should end up with smaller RDDs (if the two sides do not share exactly the same keys) and your join should be much faster.
The Bloom filter can be built and collected efficiently, because the bits set by one element can be combined with the bits set by another element using OR, which is associative and commutative.
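A rough sketch of this idea in PySpark, assuming rdd1 is the smaller side; the filter size, hash count and the pure-Python implementation are illustrative, not a tuned solution:
import hashlib

# Hypothetical sketch (not the answer's actual code): build a tiny pure-Python
# Bloom filter from rdd1's keys, broadcast it, and pre-filter rdd2 before the join.
M = 1 << 20   # number of bits in the filter (illustrative size)
K = 3         # number of hash functions

def positions(key):
    # Derive K bit positions for a key from a SHA-1 digest.
    digest = hashlib.sha1(str(key).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(bits, key):
    for p in positions(key):
        bits[p // 8] |= 1 << (p % 8)
    return bits

def merge(a, b):
    # OR-ing two filters is associative and commutative, so aggregate() can combine them.
    return bytearray(x | y for x, y in zip(a, b))

def might_contain(bits, key):
    return all(bits[p // 8] & (1 << (p % 8)) for p in positions(key))

bloom = rdd1.keys().aggregate(bytearray(M // 8), add, merge)
bloom_bc = sc.broadcast(bloom)
filtered_rdd2 = rdd2.filter(lambda kv: might_contain(bloom_bc.value, kv[0]))
result = rdd1.join(filtered_rdd2).collect()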

You can narrow down whether the problem is the collect() call by calling count() instead, to see whether the issue is pulling the results into the driver:
result=rdd1.join(rdd2).count()
If the count works, it might be best to add a sample or limit and then call collect() if you're attempting to view the results.
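For example, a quick sketch reusing the RDDs from the question (the sample fraction is just a placeholder):
joined = rdd1.join(rdd2)
print(joined.count())                         # does the join itself complete?
print(joined.take(10))                        # peek at a few rows instead of collecting everything
print(joined.sample(False, 0.001).collect())  # or collect only a small random sample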
You can also look at the task in the Spark UI to see which executor it has been assigned to, and then use the UI to look at that executor's logs. Within the Executors tab you can take a thread dump of the executor handling the task; if you take a few thread dumps and compare them, check whether there is a thread that is hung.
Also look at the driver's log4j logs and stdout/stderr logs for any additional errors.

Related

Pyspark Order By 1.5B Records With 20 Distinct Values - Performance Issue

I have the following code:
cache_df = cache_df.orderBy(f.col('last_update').asc()).limit(10000000)
cache_df contains 350M records and I want to get the 10M rows with the oldest last_update values.
It seems like the reduce step of the orderBy moves all the data onto one executor, so the operation is not being executed in parallel.
Any idea how to solve it?
This article explains the orderBy:
"orderBy() collects all the data into a single executor and then sorts them. This means that the order of the output data is guaranteed but this is probably a very costly operation."

Why is execution time of spark sql query different between first time and second time of execution?

I am using spark sql to run some aggregated query on the parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the spark command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time, I ran a similar command (only changing the columns):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(counter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36)+sum(counter_41)+sum(counter_46) from parquet.`/home/hungphan227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: Why are they different? I know it is not data caching because of the parquet format. Is it about reusing something like query planning?
I did another test: The first command is
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I change the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(cou
nter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35 )+avg(counter_40)+avg(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: Why is the second command faster than the first, and why is the speed-up slightly smaller than in the first scenario?
Finally, I have a problem related to the above scenarios. There are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(counter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35)+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX, formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value2, ...) group by location
My third question is: Is there any way to optimize performance so that a query used once is faster when it is used again in the future? Does Spark optimize this by itself, or do I have to write some code or change the config?
It's called Exchange Reuse. When Spark runs a shuffle (i.e. for an aggregation or a join), it stores a copy of the shuffle data on the local worker nodes for potential reuse. This is internally controlled behaviour and cannot be directly influenced by the end user. If you find you keep reusing a particular portion of data (or query outcome), you could consider explicitly caching it with cache(). However, bear in mind that although this allows Spark to reuse the cached result for potentially faster query performance (if, and only if, the analyzed plan of your cached query matches the new query), overusing CACHE can cause a whole lot of different performance problems.
A bad example is when your dataset is very large: it may cause a disk-spill problem, meaning the dataset doesn't fit into your cluster's available memory and needs to be written to slower hard disks.
Another bad example is when your query only needs to access a subset of the cached data. By caching the entire dataset in memory, Spark is forced to perform a full in-memory table scan. Not only is that a waste of resources, it also results in slower query performance compared to not using the cache at all.
The best thing to do is trial and error with a few of your own example queries: look at the Spark UI and check whether there are signs of disk spill or large amounts of input data being scanned.
Every query/data combination is unique, so you'll need to experiment a bit to find the best performance-tuning method for your own workload.
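A minimal sketch of the explicit-caching suggestion, shown in PySpark for consistency with the rest of the page and reusing the parquet path and column names from the question:
counters = spark.read.parquet("/home/hungphan227/spark_data/counters")
counters.cache()    # mark the DataFrame for in-memory caching
counters.count()    # run a cheap action so the cache is actually materialized
counters.createOrReplaceTempView("counters")

# Later formula queries now read from the cache instead of re-scanning parquet.
spark.sql("""
    SELECT location,
           sum(counter_1) + sum(counter_5) + sum(counter_10) AS formula_sample
    FROM counters
    GROUP BY location
""").show()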
When doing an aggregation, Spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files, which are stored locally on the workers' file systems. Unfortunately you can't rely on them always being there, because eventually the file handles get GC'd. If you're going to run 10 queries on the same dataset, cache it or use Databricks.

What does Spark do if a node running a .foreach() fails?

We have a large RDD with millions of rows. Each row needs to be processed with a third-party optimizer that is licensed (Gurobi). We have a limited number of licenses.
We have been calling the optimizer in the Spark .map() function. The problem is that Spark will run many more mappers than it needs and throw away the results. This causes a problem with license exhaustion.
We're looking at calling Gurobi inside the Spark .foreach() method. This works, but we have two problems:
Getting the data back from the optimizer into another RDD. Our tentative plan for this is to write the results into a database (e.g. MongoDB or DynamoDB).
What happens if the node on which the .foreach() is running dies? Spark guarantees that each foreach only runs once; does it detect that the node died and restart the work elsewhere, or does something else happen?
In general, if a task executed with foreachPartition dies, the whole job dies.
This means that, if no additional steps are taken to ensure correctness, a partial result may already have been acknowledged by an external system, leading to an inconsistent state.
Considering the limited number of licenses, map or foreachPartition shouldn't make any difference. Without getting into whether using Spark makes sense in this case at all, the best way to resolve it is to limit the number of executor cores to the number of licenses you own.
If the goal here is just to limit the number of concurrent calls to X, I would repartition the RDD to X partitions and then run a partition-level operation, as sketched below. I think that should keep you from exhausting the licenses.
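A hypothetical sketch of that idea (run_gurobi, large_rdd and the license count are placeholders for the setup described in the question):
NUM_LICENSES = 4   # assumption: however many Gurobi licenses you actually own

def solve_partition(rows):
    # With NUM_LICENSES partitions there are at most NUM_LICENSES concurrent tasks,
    # so at most that many optimizer calls run at once.
    for row in rows:
        yield run_gurobi(row)   # placeholder for the call into the licensed optimizer

results = (large_rdd
           .repartition(NUM_LICENSES)
           .mapPartitions(solve_partition))
results.saveAsTextFile("hdfs:///optimizer_results")   # or write to MongoDB/DynamoDB instead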

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I just do a sort or a top on its own, this problem does not happen; it only happens when sort and top are combined.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of transforms? How does Spark evaluate a query with both sorting and a limit, and what is the corresponding execution plan?
Also, just curious: does Spark handle sort-and-top differently between DataFrames and RDDs?
EDIT:
Sorry, I didn't mean collect.
What I originally meant is that the problem happens when I call any action to materialize the data, regardless of whether it is collect (or any other action that sends data back to the driver) or not, so the problem is definitely not about the output size.
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all the data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day, driver memory will be the limiting factor, but there are still some possible improvements:
First of all don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method); see the sketch below.
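A rough sketch of both options, assuming a DataFrame df, an ordering column colX (ColX from the question) and a target of n rows:
n = 1000000

top_n_rdd = (df.orderBy("colX")
               .rdd
               .zipWithIndex()
               .filter(lambda pair: pair[1] < n)   # keep only the first n rows
               .map(lambda pair: pair[0]))
top_n_df = spark.createDataFrame(top_n_rdd, df.schema)

# Spark 2.0.0+: if an approximate cut is acceptable, filter on an estimated quantile.
cutoff = df.approxQuantile("colX", [min(float(n) / df.count(), 1.0)], 0.01)[0]
approx_top_n_df = df.filter(df["colX"] <= cutoff)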

Spark performs poorly when generating non-associative features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key, (time, quantity, user-id, ...))_j] => ... => a bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key, (t1, t2))_j]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features:
associative features: e.g. the number of items
non-associative features: e.g. the number of unique users
For each feature key, I need to see which events fall in the interval (t1, t2) and then aggregate them. So I have a join followed by a reduce operation:
`RDD1.join(RDD2).map((key,(v1,v2))=>(key,featureObj)).reduceByKey(...)`
The initial value for my feature would be featureObj = (0, set()), where the first element keeps the number of items and the second stores the set of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
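For reference, a hedged PySpark reconstruction of that pattern (the field layout and helper names are assumed from the description above, not the actual project code):
def to_feature(value_pair):
    v1, (t1, t2) = value_pair          # v1 = (time, quantity, user_id, ...)
    time, user_id = v1[0], v1[2]
    if t1 <= time <= t2:
        return (1, {user_id})          # one item, one (possibly new) unique user
    return (0, set())

def merge(a, b):
    # The set union for the "unique users" feature is what makes the
    # non-associative features shuffle-heavy as cardinality grows.
    return (a[0] + b[0], a[1] | b[1])

features = RDD1.join(RDD2).mapValues(to_feature).reduceByKey(merge)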
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge nodes, in only 3 minutes. The minute I add the second one, the computation time jumps to 5 minutes. I tried adding a couple of other non-associative features and, every time, the run time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of them non-associative. I also tried to use the KryoSerializer and to persist RDDs in serialized form, but nothing special happened. Since I will be implementing more features, this issue seems to be becoming a bottleneck.
PS. I tried to do the same task on a single big host (128GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is the JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows "2.6GB" shuffle read and "15.6GB" shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for a wrong application? Thanks for the comments in advance.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations; even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stable.
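Expressed as a PySpark configuration this might look roughly like the following (only the baseline G1 flag and GC logging are shown; the exact tuning depends on your JVM version):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")   # G1 GC plus GC logging
        .set("spark.shuffle.io.preferDirectBufs", "false"))    # turn off netty direct buffers
sc = SparkContext(conf=conf)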
