Unfair split of workload among spark executors - apache-spark

I am currently using Spark to process documents. I have two servers at my disposal (innov1 and innov2), and I am using YARN as the resource manager.
The first step is to gather the paths of the files from a database, filter them, repartition them, and persist them in an RDD[String]. However, I can't get the persisted data to be shared fairly among all the executors:
(screenshot: persisted RDD memory taken among executors)
and this leads to the executors not doing the same amount of work after that:
(screenshot: work done by each executor; ignore the 'dead' executors here, that's another problem)
This happens randomly: sometimes it's innov1 that takes all the persisted data, and then only the executors on innov1 work (but it tends to be innov2 in general). Right now, each time two executors end up on innov1, I kill the job and relaunch it, praying for them to land on innov2 (which is utterly stupid and defeats the purpose of using Spark).
What I have tried so far (none of which worked):
1. make the driver sleep 60 seconds before loading from the database (maybe innov1 takes more time to wake up?)
2. add spark.scheduler.minRegisteredResourcesRatio=1.0 when I submit the job (same idea as above)
3. persist with replication x2 (idea from this link), hoping that some of the blocks would be replicated on innov1 (see the sketch after the note below)
Note on point 3: sometimes the replica was persisted on the same executor (which is a bit counter-intuitive), or, even weirder, not replicated at all (is innov2 not able to communicate with innov1?).
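For clarity, the replication attempt in point 3 boils down to something like the following sketch (the data and names are illustrative, not my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-replication-sketch"))
val paths = sc.parallelize(Seq("/some/path/a", "/some/path/b"))
  .repartition(200)
  .persist(StorageLevel.MEMORY_ONLY_2) // two in-memory replicas, ideally on different executors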
I am open to any suggestions, or links to similar problems I may have missed.
Edit:
I can't really post the code here, as it's part of my company's product. I can give a simplified version, however:
val rawHBaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc
  .newAPIHadoopRDD(...)
  .map(x => (x._1, x._2)) // from doc of newAPIHadoopRDD
  .repartition(200)
  .persist(MEMORY_ONLY)

val pathsRDD: RDD[(String, String)] = rawHBaseRDD
  .mapPartitions {
    ...
    // extract the key and the path from ImmutableBytesWritable
    // and Result.rawCells()
    ...
  }
  .filter(some cond)
  .repartition(200)
  .persist(MEMORY_ONLY)
For both persists, everything ends up on innov2. Is it possible that this is because the data is only on innov2? Even if that's the case, I would assume that repartition helps share the rows between innov1 and innov2, but that doesn't happen here.

Your persisted data set is not very big - some ~100MB according to your screenshot. You have allocated 10 cores with 20GB of memory, so the 100MB fits easily into the memory of a single executor and that is basically what is happening.
In other words, you have allocated many more resources than are actually needed, so Spark just randomly picks the subset of resources that it needs to complete the job. Sometimes those resources happen to be on one worker, sometimes on another and sometimes it uses resources from both workers.
You have to remember that, to Spark, it makes no difference whether all resources are placed on a single machine or on 100 different machines - as long as you are not trying to use more resources than are available (in which case you would get an OOM).
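If you still want to pin down the allocation rather than leave that slack, one thing to look at is requesting only what the job needs. A rough sketch (the numbers are purely illustrative, and whether YARN spreads the executors across both workers is still up to its scheduler):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("right-sized-allocation-sketch")
  .set("spark.executor.instances", "4") // e.g. a couple of modest executors per server
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)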

Unfortunately (fortunately?) the problem resolved itself today. I assume it is not Spark-related, as I hadn't modified the code before the resolution.
It's probably due to the complete reboot of all services with Ambari (although I am not 100% sure, because I had already tried this before), as it's the only "major" change that happened today.

Related

What is the difference between memory and network in the 'Input Size/Records' column on the Spark UI?

I have a problem while running Spark Streaming on my cluster.
First, I know that speculative tasks are caused by the slow execution of some executors, but some tasks that are not speculative are also running slowly, with the 'Input Size/Records' column showing network while the others show memory. Here is a screenshot:
So can someone tell me what the difference is between memory and network in the 'Input Size/Records' column? Thanks!
The size of the data is not a problem here. Based on the screenshot, all partitions are more or less the same size.
The real issue is data locality. The majority of the data can be accessed locally, but the problematic tasks are forced to use RACK_LOCAL, and since that takes much longer than expected, speculative execution kicks in and tries ANY.
There is not enough information here to fully diagnose the issue, but one thing you can try is increasing the spark.locality.wait property (the default is 3 seconds).
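For reference, setting it programmatically looks something like this (10s is just an example value; the same key can also be passed with --conf on spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// The default spark.locality.wait is 3s; there are also finer-grained variants
// (spark.locality.wait.process / .node / .rack) if you only want to relax one level.
val conf = new SparkConf()
  .setAppName("locality-wait-sketch")
  .set("spark.locality.wait", "10s")
val sc = new SparkContext(conf)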

Spark Parquet Loader: Reduce number of jobs involved in listing a dataframe's files

I'm loading parquet data into a dataframe via
spark.read.parquet('hdfs:///path/goes/here/...')
There are around 50k files in that path due to parquet partitioning. When I run that command, Spark spawns dozens of small jobs that as a whole take several minutes to complete. Here's what the jobs look like in the Spark UI:
As you can see, although each job has ~2100 tasks, they execute quickly, in about 2 seconds. Starting so many 'mini jobs' is inefficient and causes this file-listing step to take about 10 minutes (during which the cluster's resources are mostly idle, and the cluster is mostly dealing with straggling tasks or the overhead of managing jobs/tasks).
How can I consolidate these tasks into fewer jobs, each with more tasks?
Bonus points for a solution that also works in pyspark.
I'm running Spark 2.2.1 via pyspark on Hadoop 2.8.3.
I believe you encountered a bug for which a former colleague of mine has filed a ticket and opened a pull request. You can check it out here. If it fits your issue, your best shot is probably voting the issue up and making some noise on the mailing list about it.
What you might want to do is tweak the spark.sql.sources.parallelPartitionDiscovery.threshold and spark.sql.sources.parallelPartitionDiscovery.parallelism configuration parameters (with the former being cited in the linked ticket) in a way that suits your job.
You can have a look here and here to see how the configuration key is used. I'll share the related snippets here for completeness.
spark.sql.sources.parallelPartitionDiscovery.threshold
// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
  return paths.map { path =>
    (path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
  }
}
spark.sql.sources.parallelPartitionDiscovery.parallelism
// Set the number of parallelism to prevent following file listing from generating many tasks
// in case of large #defaultParallelism.
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
The default values for these configurations are 32 for the threshold and 10000 for the parallelism (related code here).
In your case, I'd say that what you probably want to do is set the threshold so that the process runs without spawning parallel jobs.
Note
The linked sources are from the latest available tagged release at the time of writing, 2.3.0.
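As a sketch of setting the threshold as suggested above (the 100000 value is arbitrary, just comfortably above the number of paths being listed; the same config key can be set the same way from pyspark's SparkSession.builder):

import org.apache.spark.sql.SparkSession

// Keeping the threshold above the number of paths keeps the discovery on the
// driver, so no extra listing jobs are spawned.
val spark = SparkSession.builder()
  .appName("parquet-listing-sketch")
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "100000")
  .getOrCreate()

val df = spark.read.parquet("hdfs:///path/goes/here/...")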
Against an object store, even the listing and the calls to getFileStatus are pretty expensive, and as this is done during partitioning, it can extend the job a lot.
Play with mapreduce.input.fileinputformat.list-status.num-threads to see if adding more threads speeds things up, say with a value of 20-30.
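A sketch of what that could look like from a Spark session (20 is just one of the suggested values; it is a Hadoop-level setting, so it goes on the underlying Hadoop configuration, or can be passed as spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("list-status-threads-sketch").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.list-status.num-threads", "20")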

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like this:
root_dir
    web_logs
        \input (subdirectory)
        \output (subdirectory)
    network_logs (same subdirectories as web_logs)
    system_logs (same subdirectories as web_logs)
The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
My current plan is to split the entire wrangling task into about 2000 jobs and use the cloud dataproc api to spin up a cluster, do the job, and close down. Another option would be to create a smaller number of very large clusters and just send jobs to the larger clusters instead.
The first approach is being considered because each individual job is taking about an hour to complete. Simply waiting for one job to finish before starting the other will take too much time.
My questions are: 1) besides the cluster startup costs, are there any downsides to taking the first approach? and 2) is there a better alternative?
Thanks so much in advance!
Besides startup overhead, the main other consideration when using single-use clusters per job is that some jobs might be more prone to "stragglers" where data skew leads to a small number of tasks taking much longer than other tasks, so that the cluster isn't efficiently utilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, combined with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes but there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
That said, in many cases, simply tuning the size/number of partitions to occur in several "waves" (i.e. if you have 100 cores working, carving the work into something like 1000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the downside is on par with startup overhead.
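To make the "waves" idea concrete, here is a sketch of the kind of tuning meant (the numbers are arbitrary, not a recommendation for your specific jobs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("waves-sketch").getOrCreate()
// With ~100 cores, a few thousand partitions means each core works through many
// short tasks, so a straggler only delays a small slice of the stage.
spark.conf.set("spark.sql.shuffle.partitions", "2000") // shuffle partitions for DataFrame/SQL jobs
val wrangled = spark.range(1000000L).repartition(2000)  // or repartition a dataset explicitly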
Despite the overhead of startup and stragglers, though, the pros of using fresh ephemeral clusters per job usually vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of ephemeral clusters include vastly improved agility and scalability, letting you optionally adopt new software versions, switch regions, switch machine types, incorporate brand-new hardware features (like GPUs) if they become needed, etc. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.
A slightly different architecture, if your jobs are very short (i.e. if each one only runs a couple of minutes and thus amplifies the downside of startup overhead) or if the straggler problem is unsolvable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters where you still tear down/create clusters regularly to ensure agility of version updates, adopting new hardware, etc.
You might want to explore my solution for Autoscaling Google Dataproc Clusters
The source code can be found here

Spark performs poorly when generating non-associative features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key, (time, quantity, user-id, ...))_j] => ... => bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key, (t1, t2))_j]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features, as follows:
associative features: e.g. number of items
non-associative features: e.g. number of unique users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
`RDD1.join(RDD2).map((key,(v1,v2))=>(key,featureObj)).reduceByKey(...)`
The initial value for my feature would be featureObj=(0, set([])), where the first element keeps the number of items and the second stores the set of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
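To make the shapes concrete, here is a stripped-down sketch of the pipeline described above (toy data and names, not my real feature code):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("feature-sketch"))

// Toy data in the shapes above: (key, (time, quantity, userId)) and (key, (t1, t2)).
val rdd1 = sc.parallelize(Seq(("k1", (5L, 2, "u1")), ("k1", (9L, 1, "u2")), ("k2", (3L, 4, "u1"))))
val rdd2 = sc.parallelize(Seq(("k1", (0L, 10L)), ("k2", (4L, 8L))))

// featureObj = (item count, set of unique user ids); the set is what makes the
// second feature non-associative in the sense described above.
val features = rdd1.join(rdd2)
  .filter { case (_, ((time, _, _), (t1, t2))) => t1 <= time && time <= t2 }
  .map { case (key, ((_, _, userId), _)) => (key, (1L, Set(userId))) }
  .reduceByKey { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 ++ s2) }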
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge, in only 3 minutes. The minute I add the second one, the computation time jumps to 5 minutes. I tried to add a couple of other non-associative features and, every time, the run time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of them non-associative. I also tried to use the KryoSerializer and to persist the RDDs in serialized form, but nothing special happened. Since I will be moving on to implement more features, this issue seems to become a bottleneck.
PS. I tried to do the same task on a single big host (128 GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows "2.6GB" shuffle read and "15.6GB" shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for a wrong application? Thanks for the comments in advance.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations; even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stably.
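For reference, the preferDirectBufs setting looks like this when set programmatically; the GC flags go into the executor Java options the same way (only the generic G1 switch is shown here, since the exact humongous-object option depends on your JVM version):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  // Use heap buffers instead of direct buffers for netty shuffle I/O.
  .set("spark.shuffle.io.preferDirectBufs", "false")
  // Run executors with the G1 collector; add further G1 flags as needed.
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")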

How does partitions map to tasks in Spark?

If I partition an RDD into say 60 and I have a total of 20 cores spread across 20 machines, i.e. 20 instances of single core machines, then the number of tasks is 60 (equal to the number of partitions). Why is this beneficial over having a single partition per core and having 20 tasks?
Additionally, I have run an experiment where I set the number of partitions to 2; checking the UI shows 2 tasks running at any one time. However, what surprised me is that it switches instances on completion of the tasks, e.g. node1 and node2 do the first 2 tasks, then node6 and node8 do the next set of 2 tasks, etc. I thought that by setting the number of partitions to fewer than the cores (and instances) in the cluster, the program would just use the minimum number of instances required. Can anyone explain this behaviour?
For the first question: you might want to have more granular tasks than strictly necessary in order to load less into memory at the same time. It can also help with fault tolerance, as less work needs to be redone in case of failure. It is nevertheless a parameter to tune. In general the answer depends on the kind of workload (IO-bound, memory-bound, CPU-bound).
As for the second one, I believe version 1.3 has some code to dynamically request resources. I'm unsure in which version the break is, but older versions just request the exact resources you configure your driver with. As for how a partition moves from one node to another: AFAIK it will pick the data for a task from the node that has a local copy of that data on HDFS. Since HDFS has multiple copies (3 by default) of each block of data, there are multiple options to run any given piece.
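A quick way to see the partition-to-task relationship being discussed (local mode and toy numbers, just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[4]"))
val rdd = sc.parallelize(1 to 1000, numSlices = 60)
println(rdd.getNumPartitions)  // 60 partitions => 60 tasks per stage over this RDD
println(sc.defaultParallelism) // cores available to the application (4 here)
rdd.count()                    // this action runs 60 tasks, at most 4 at a time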
