Spark performs poorly when generating non-associate features - apache-spark

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key,(time,quantity,user-id,...)j] => ... => bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key,(t1,t2)j)]
In RDD1, time denotes the time-stamp where the event has happened and, in RDD2, denotes the acceptable time-interval for each feature. The feature-key is "key". I have two types of features as follows:
associative features: number of items
non-associative features: Example: unique number of users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
`RDD1.join(RDD2).map((key,(v1,v2))=>(key,featureObj)).reduceByKey(...)`
The initial value for my feature would be featureObj=(0,set([])) where the first argument keeps number of items and the second stores number of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
Now, when I run the job to just calculate the associative feature, it runs very fast on a cluster of 16 m2.xlarge, in only 3 minutes. The minute I add the second one, the computation time jumps to 5min. I tried to add a couple of other non-associate features and, every time, the run-time increases fast. Right now, my job runs in 15minutes for 15 features 10 of them are non-associative. I also tried to use KyroSerializer and persist RDDs in a serialized form but nothing special happened. Since I will be moving to implement more features, this issue seems to become a bottleneck.
PS. I tried to do the same task on a single big host (128GB of Ram and 16 cores). With 145 features, the whole job was done in 10minutes. I am under the impression that the main Spark bottleneck is JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, spark web-console still shows "2.6GB" shuffle-read and "15.6GB" shuffle-write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for a wrong application? Thanks for the comments in advance.
With best regards,
Ali

I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations, even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires java 8u45 or higher.
In my case the biggest problem was memory allocation in netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stable.

Related

Write spark dataframe to single parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation.
I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in s3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.coalesce(1).write.saveAsTable("db.tiny_table")
and then when that didn't work I tried this, which I thought should be the same, but I wasn't sure. (I added the print's in an effort to debug.)
tiny = spark.table("db.big_table").limit(500).coalesce(1)
print(tiny.count())
print(tiny.show(10))
tiny.write.saveAsTable("db.tiny_table")
When I watch the Yarn UI, both print statements and the write are using 25k mappers. The count took 3 mins, the show took 25 mins, and the write took ~40 mins, although it finally did write the single file table I was looking for.
It seems to me like the first line should take the top 500 rows and coalesce them to a single partition, and then the other lines should happen extremely fast (on a single mapper/reducer). Can anyone see what I'm doing wrong here? I've been told maybe I should use sample instead of limit but as I understand it limit should be much faster. Is that right?
Thanks in advance for any thoughts!
I’ll approach the print functions issue first, as it’s something fundamental to understanding spark. Then limit vs sample. Then repartition vs coalesce.
The reasons the print functions take so long in this manner is because coalesce is a lazy transformation. Most transformations in spark are lazy and do not get evaluated until an action gets called.
Actions are things that do stuff and (mostly) dont return a new dataframe as a result. Like count, show. They return a number, and some data, whereas coalesce returns a dataframe with 1 partition (sort of, see below).
What is happening is that you are rerunning the sql query and the coalesce call each time you call an action on the tiny dataframe. That’s why they are using the 25k mappers for each call.
To save time, add the .cache() method to the first line (for your print code anyway).
Then the data frame transformations are actually executed on your first line and the result persisted in memory on your spark nodes.
This won’t have any impact on the initial query time for the first line, but at least you’re not running that query 2 more times because the result has been cached, and the actions can then use that cached result.
To remove it from memory, use the .unpersist() method.
Now for the actual query youre trying to do...
It really depends on how your data is partitioned. As in, is it partitioned on specific fields etc...
You mentioned it in your question, but sample might the right way to go.
Why is this?
limit has to search for 500 of the first rows. Unless your data is partitioned by row number (or some sort of incrementing id) then the first 500 rows could be stored in any of the the 25k partitions.
So spark has to go search through all of them until it finds all the correct values. Not only that, it has to perform an additional step of sorting the data to have the correct order.
sample just grabs 500 random values. Much easier to do as there’s no order/sorting of the data involved and it doesn’t have to search through specific partitions for specific rows.
While limit can be faster, it also has its, erm, limits. I usually only use it for very small subsets like 10/20 rows.
Now for partitioning....
The problem I think with coalesce is it virtually changes the partitioning. Now I’m not sure about this, so pinch of salt.
According to the pyspark docs:
this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
So your 500 rows will actually still sit across your 25k physical partitions that are considered by spark to be 1 virtual partition.
Causing a shuffle (usually bad) and persisting in spark memory with .repartition(1).cache() is possibly a good idea here. Because instead of having the 25k mappers looking at the physical partitions when you write, it should only result in 1 mapper looking at what is in spark memory. Then write becomes easy. You’re also dealing with a small subset, so any shuffling should (hopefully) be manageable.
Obviously this is usually bad practice, and doesn’t change the fact spark will probably want to run 25k mappers when it performs the original sql query. Hopefully sample takes care of that.
edit to clarify shuffling, repartition and coalesce
You have 2 datasets in 16 partitions on a 4 node cluster. You want to join them and write as a new dataset in 16 partitions.
Row 1 for data 1 might be on node 1, and row 1 for data 2 on node 4.
In order to join these rows together, spark has to physically move one, or both of them, then write to a new partition.
That’s a shuffle, physically moving data around a cluster.
It doesn’t matter that everything is partitioned by 16, what matters is where the data is sitting on he cluster.
data.repartition(4) will physically move data from each 4 sets of partitions per node into 1 partition per node.
Spark might move all 4 partitions from node 1 over to the 3 other nodes, in a new single partition on those nodes, and vice versa.
I wouldn’t think it’d do this, but it’s an extreme case that demonstrates the point.
A coalesce(4) call though, doesn’t move the data, it’s much more clever. Instead, it recognises “I already have 4 partitions per node & 4 nodes in total... I’m just going to call all 4 of those partitions per node a single partition and then I’ll have 4 total partitions!”
So it doesn’t need to move any data because it just combines existing partitions into a joined partition.
Try this, in my empirical experience repartition works better for this kind of problems:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.saveAsTable("db.tiny_table")
Even better if you are interested in the parquet you don't need to save it as a table:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.parquet(your_hdfs_path+"db.tiny_table")

Spark 'limit' does not run in parallel?

I have a simple join where I limit on of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, indeed I see that at this stage there is only one task running in the cluster.
This of course affects performance dramatically (removing the limit removes the single task bottleneck but lengthens the join as it works on a much larger dataset).
Is limit truly not parallelizable? and if so- is there a workaround for this?
I am using spark on Databricks cluster.
Edit: regarding the possible duplicate. The answer does not explain why everything is shuffled into a single partition. Also- I asked for advice to work around this issue.
Following the advice given by user8371915 in the comments, I used sample instead of limit. And it uncorked the bottleneck.
A small but important detail: I still had to put a predictable size constraint on the result set after sample, but sample inputs a fraction, so the size of the result set can very greatly depending on the size of the input.
Fortunately for me, running the same query with count() was very fast. So I first counted the size of the entire result set and used it to compute the fraction I later used in the sample.
Workaround for parallelization after limit:
.repartition(200)
This redistributes the data again so that you can work in parallel.

What does Spark do if a node running a .foreach() fails?

We have a large RDD with millions of rows. Each row needs to be processed with a third-party optimizer that is licensed (Gurobi). We have a limited number of licenses.
We have been calling the optimizer in the Spark .map() function. The problem is that Spark will run many more mappers than it needs and throw away the results. This causes a problem with license exhaustion.
We're looking at calling Gurobi inside the Spark .foreach() method. This works, but we have two problems:
Getting the data back from the optimizer into another RDD. Our tentative plan for this is to write the results into a database (e.g. MongoDB or DynamoDB).
What happens if the node on which the .foreach() method dies? Spark guarantees that each foreach only runs once. Does it detect that it dies and restart it elsewhere? Or does something else happen?
In general if task executed with foreachPartition dies a whole job dies.
This means that, if not additional steps are taken to ensure correctness, partial result might have been acknowledged by an external system, leading to inconsistent state.
Considering limited number of licenses map or foreachPartition shouldn't make any difference. Not going into discussion if using Spark in this case makes any sense, the best way to resolve it, is to limit number of executor cores, to the number of licenses you own.
If the goal here is to limit just X number of concurrent calls, I would repartition the RDD with x, and then run a partition level operation. I think that should keep you from exhausting the licenses.

Unfair split of workload among spark executors

I am currently using spark to process documents. I have two servers at my disposal (innov1 and innov2) and I am using yarn as the resource manager.
The first step is to gather the paths of the files from a database, filter them, repartition them and persist them in a RDD[String]. However, I can't manage to have a fair sharing of the persist among all the executors:
persisted RDD memory taken among executors
and this lead to the executors not doing the same amount of work after that:
Work done by each executors (do not care about the 'dead' here, it's another problem)
And this happens randomly, sometimes it's innov1 that takes all the persist, and then only executors on innov1 work (but it tends to be innov2 in general). Right now, each time two executors are on innov1, I just kill the job to relaunch, and I pray for them to be on innov2 (which is utterly stupid, and break the goal of using spark).
What I have tried so far (and that didn't work):
make the driver sleep 60 seconds before the loading from the database (maybe innov1 takes more time to wake up?)
add spark.scheduler.minRegisteredResourcesRatio=1.0 when I submit the job (same idea than above)
persist with replication x2 (idea from this link), hoping that some of the block would be replicated on innov1
Note for point 3, sometimes it was persisting a replication on the same executor (which is a bit counter intuitive), or even weirder, not replicated at all (innov2 is not able to communicate with innov1?).
I am open to any suggestion, or link to similar problems I would have missed.
Edit:
I can't really put code here, as it's part of my company's product. I can give a simplified version however:
val rawHBaseRDD : RDD[(ImmutableBytesWritable, Result)] = sc
.newAPIHadoopRDD(...)
.map(x => (x._1, x._2)) // from doc of newAPIHadoopRDD
.repartition(200)
.persist(MEMORY_ONLY)
val pathsRDD: RDD[(String, String)] = rawHBaseRDD
.mapPartitions {
...
extract the key and the path from ImmutableBytesWritable and
Result.rawCells()
...
}
.filter(some cond)
.repartition(200)
.persist(MEMORY_ONLY)
For both persist, everything is on innov2. Is it possible that it's because the data are only on innov2? even if it's the case, I would assume that repartition help to share the rows between innov1 and innov2, but it doesn't happen here.
Your persisted data set is not very big - some ~100MB according to your screenshot. You have allocated 10 cores with 20GB of memory, so the 100MB fits easily into the memory of a single executor and that is basically what is happening.
In other words, you have allocated many more resources than are actually needed, so Spark just randomly picks the subset of resources that it needs to complete the job. Sometimes those resources happen to be on one worker, sometimes on another and sometimes it uses resources from both workers.
You have to remember that to Spark, it makes no difference if all resources are placed on a single machine or on a 100 different machines - as long as you are not trying to use more resources than available (in which case you would get an OOM).
Unfortunately (fortunately?) the problem solved by itself today. I assume it is not spark related as I hadn't modified the code until the resolution.
It's probably due to the complete reboot of all services with Ambari (even if I am not 100% sure, because I already tried this before), as it's the only "major" change that happened today.

How to deal with strongly varying data sizes in spark

I'm wondering about the best practice in designing spark-jobs where the volume of data is not known in advance (or is strongly varying). In my case, the application should both handle initial loads and later on incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions in order to avoid OOM excpetion in the executors (giving fixed amount of allocated memory per executor). I could
define a very high number of partition to make sure that even on very high workloads, the job does not fail
Set number of partitions at runtime depending on the size of source-data
Introduce an iteration over independent chunks of data (i.e. looping)
In all option, I see issues:
1: I imagine this to be inefficient for small data sizes as taks get very small
2: Needs additional querys (e.g. count) and e.g. for setting spark.sql.shuffle.partitions, SparkContext needs to be restartet which I would like to avoid
3: Seems to contradict the spirit of Spark
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions, this can be set at runtime woutout restarting spark context
Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, you introduce an extra step which is to count your data, which at first glance seems wrong. However, you shouldn't think of this as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do further processing if you partition the data badly. Think of the count operation as an investment, it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following:
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions Then coalesce
Else If #targetPartitions >> #actualPartitions Then repartition
Else #targetPartitions ~= #actualPartitions Then do nothing
The coalesce operation will re-partition your data without shuffling, and so is much more efficient when it is available.
Ideally you can estimate the number of rows your will generate, rather than count them. Also, you will need to think carefully about when it is appropriate to perform this operation. With a long RDD lineage you can kill performance, because you may inadvertently reduce the number of cores which can execute complex code, due to scala lazy execution. Look into checkpointing to mitigate this problem.

Resources