Spark 'limit' does not run in parallel? - apache-spark

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at that stage only one task is running in the cluster.
This of course affects performance dramatically (removing the limit removes the single task bottleneck but lengthens the join as it works on a much larger dataset).
Is limit truly not parallelizable? And if so, is there a workaround for this?
I am using Spark on a Databricks cluster.
Edit: regarding the possible duplicate: that answer does not explain why everything is shuffled into a single partition. Also, I asked for advice on how to work around this issue.

Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.
A small but important detail: I still had to put a predictable size constraint on the result set after sample, but sample takes a fraction, so the size of the result set can vary greatly depending on the size of the input.
Fortunately for me, running the same query with count() was very fast. So I first counted the size of the entire result set and used it to compute the fraction I later passed to sample.
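A hedged sketch of that approach; the DataFrame names, join key, and target row count are illustrative:

```scala
// Hypothetical example: cap the sampled side at roughly `targetRows` rows
// without forcing everything into a single partition, as `limit` would.
val targetRows = 10000L

val totalRows = rightDf.count()            // fast in this case, per the answer above
val fraction  = math.min(1.0, targetRows.toDouble / totalRows)

// `sample` keeps the existing partitioning, so the join stays parallel.
val sampledRight = rightDf.sample(withReplacement = false, fraction)

val joined = leftDf.join(sampledRight, Seq("key"))
```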

Workaround for parallelization after limit:
.repartition(200)
This redistributes the data again so that you can work in parallel.
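A hedged sketch of what that looks like; the DataFrame names and numbers are illustrative:

```scala
// The limit itself still runs through a single partition, but repartitioning
// right after it lets the downstream work run in parallel again.
val limited = df.limit(100000).repartition(200)
val joined  = limited.join(otherDf, Seq("key"))   // join now runs on 200 partitions
```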

Related

large gap between tasks in same job/stage in spark

I have a job that should take less than 1 sec.
In this case it takes around 10-12 sec. Drilling down into one stage shows that the tasks are running fine; you can see that the longest-running task took 0.4 sec:
However, when looking at the timeline, you can see that there is a large gap (~10 sec.) between some tasks within the same stage:
Is there anything I'm missing?
What should I configure in order to avoid that long gap?
Edit:
Here is the entire list of tasks in the timeline; it seems pretty balanced.
Try to repartition your RDDs so that each partition contains roughly the same volume of data. This kind of problem often happens when partitions hold heavily imbalanced data volumes. Check this article; it may help you understand partitioning and its effects on performance:
https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel
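A hedged sketch of the repartitioning suggestion; the RDD names and partition counts are illustrative:

```scala
import org.apache.spark.HashPartitioner

// Redistribute a skewed RDD across a fixed number of partitions so that
// each task gets roughly the same amount of data.
val balanced = skewedRdd.repartition(sc.defaultParallelism)

// For key-value RDDs, a partitioner can also be applied explicitly.
val balancedByKey = pairRdd.partitionBy(new HashPartitioner(200))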

What does the hint USE_ADDITIONAL_PARALLELISM do in Cloud Spanner

In the doc we can find a query hint named USE_ADDITIONAL_PARALLELISM here: https://cloud.google.com/spanner/docs/query-syntax#statement-hints
However the documentation is very short for it.
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
In what scenario would we use it?
What is its impact on the infrastructure?
How does it scale with number of nodes?
Does it need a query that picks data from different splits, or does it work on a single split?
Any meaningful information about it is welcome.
PS: I was originally introduced to the hint in this thread
A Spanner query may be executed on multiple remote servers.
Source: An illustration of the life of a query from the Cloud Spanner "Query execution plans" documentation
The root node coordinates the query execution.
If the execution plan expects rows on multiple splits to satisfy the query predicate(s), multiple subplans are executed on the respective remote servers.
Due to the distributed nature of Spanner these subplans can sometimes be executed in parallel; for example, the right subplan execution is not dependent on the left subplan results.
If the USE_ADDITIONAL_PARALLELISM query hint is provided, the root node may choose to increase the number of parallel remote executions, if the execution plan includes multiple subplans.
To answer the original questions:
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
This hint does not change how a query is executed; it only makes it possible for subplans of that execution to be initiated with increased parallelism.
In what scenario would we use it?
Especially in cases where a full table scan is required, this may lead to faster query completion in wall-clock time, but the trade-offs concerning resource allocation, and the effects on other parallel operations, should also be considered.
What is its impact on the infrastructure?
If an increased number of remote executions are run in parallel, the average CPU for the instance may increase.
How does it scale with number of nodes?
An increased number of nodes provides additional capacity for parallel operations.
Does it need a query that picks data from different splits, or does it work on a single split?
Benefits will likely be significantly higher for queries which require data that resides on multiple splits.
A Cloud Spanner query may have multiple levels of distribution. The USE_ADDITIONAL_PARALLELISM query hint will cause a node executing a query to try to prefetch the results of subqueries further up in the distribution queue. This can be useful in scenarios such as queries doing full table scans, or full table scans with aggregations like COUNT(), MAX, MIN, etc., where identical subqueries can be distributed to many splits and where the individual subqueries to the splits return relatively little data (such as aggregation state). However, if the individual subqueries return significant data, then using this hint can cause memory usage on the consuming node to go up significantly due to prefetching.
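As an illustration, a hedged sketch using the Java client from Scala; the table and column names are hypothetical, and only the placement of the statement hint at the start of the query text comes from the documentation linked above:

```scala
import com.google.cloud.spanner.{DatabaseClient, Statement}

// Hypothetical aggregation over a large table; the statement hint goes at
// the very beginning of the query text.
def countRows(client: DatabaseClient): Long = {
  val stmt = Statement.of(
    "@{USE_ADDITIONAL_PARALLELISM=TRUE} SELECT COUNT(*) AS c FROM Orders")
  val rs = client.singleUse().executeQuery(stmt)
  try {
    rs.next()
    rs.getLong("c")
  } finally {
    rs.close()
  }
}
```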

What's the difference between getHits() and getGetOperationCount() in the local map statistics?

We want to analyze the performance of our Hazelcast caches. It's a distributed map on a cluster with six members and a backup count of 5.
We don't understand what getGetOperationCount() returns in contrast to getHits(). In one case we have 69,932,537 cache hits but only 1,354 get operations, which makes no sense to us.
Can someone explain the meaning of this? Thank you!
Hits increase with both read and write operations, while getGetOperationCount() increases only with IMap.get() operations.
getHits() is looking at the local member (cluster node), while the various operation counts (getGetOperationCount(), getPutOperationCount(), etc.) are for the cluster as a whole. That doesn't really explain the difference you're seeing; I'd expect the local hit count to be about one sixth of the total operation count. (Using getEventOperationCount() rather than getGetOperationCount() might give a better comparison).
The values are longs so a counter overflow doesn't seem likely, unless you're fetching the counters and storing them as ints somewhere along the way.
Edited to add: with 5 backups, if you have read-backup-data set to true, then you should always hit locally.
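For reference, a hedged sketch of where these counters come from; the map name is hypothetical, and the calls are the standard IMap/LocalMapStats accessors:

```scala
import com.hazelcast.core.Hazelcast

// Compare the two counters on the local member.
val hz  = Hazelcast.newHazelcastInstance()
val map = hz.getMap[String, String]("my-cache")

val stats = map.getLocalMapStats
println(s"hits (reads and writes): ${stats.getHits}")
println(s"get operations:          ${stats.getGetOperationCount}")
println(s"put operations:          ${stats.getPutOperationCount}")
```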

Limit max parallelism for a single RDD without decreasing the number of partitions

Is it possible to limit the max number of concurrent tasks at the RDD level without changing the actual number of partitions? The use case is to not overwhelm a database with too many concurrent connections without reducing the number of partitions. Reducing the number of partitions causes each partition to become larger and eventually unmanageable.
I'm re-posting this as an "answer" because I think it may be the least-dirty hack that might get the behavior you want:
Use a mapPartitions(...) call, and at the beginning of the mapping function, do some kind of blocking check on a globally viewable state (a REST call, maybe?) that only allows some maximum number of checks to succeed at any given time. Since that will delay the full RDD operation, you may need to increase the timeout on RDD completion to prevent an error.
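A hedged sketch of that idea in Scala; acquirePermit, releasePermit, and writeRowToDatabase are hypothetical stand-ins for whatever coordination service and database client you actually use:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helpers standing in for a REST call or distributed lock
// against some globally visible coordination service.
def acquirePermit(pool: String, maxConcurrent: Int): Unit = ???
def releasePermit(pool: String): Unit = ???
def writeRowToDatabase(row: String): String = ???

def throttledWrite(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { iter =>
    acquirePermit("db-writers", maxConcurrent = 10) // blocks until a slot is free
    try {
      // Materialize the partition before releasing the permit; otherwise the
      // lazy iterator would still be consumed after the release.
      iter.map(writeRowToDatabase).toList.iterator
    } finally {
      releasePermit("db-writers")
    }
  }
```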
The primary significance of partitioning in Spark is to provide parallelism, and your requirement is to reduce parallelism! But the requirement is genuine :)
What is the real problem with a smaller number of partitions? Is writing too much data at once creating the problem? If that is the case, you could break down the per-partition writing, as sketched below.
Can you put the data in some intermediate queue and process it at a controlled rate?
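A hedged sketch of the batched-writing idea; writeBatch is a hypothetical stand-in for the actual database call:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical: writes one small batch of rows to the database.
def writeBatch(rows: Seq[String]): Unit = ???

def writeInBatches(rdd: RDD[String]): Unit =
  rdd.foreachPartition { iter =>
    iter.grouped(500).foreach(writeBatch)   // 500 rows per database call
  }
```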
One approach might be to enable dynamic allocation, and set the maximum number of executors to your desired maximum parallelism.
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors <maximum>
You can read more about configuring dynamic allocation here:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#scheduling
If you are trying to control one specific computation, you could experiment with programmatically controlling the number of executors:
https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sparkcontext.adoc#dynamic-allocation
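A hedged sketch combining the two suggestions; the executor counts are illustrative, and requestTotalExecutors is a best-effort developer API:

```scala
import org.apache.spark.sql.SparkSession

// Cap parallelism for the whole application via dynamic allocation.
val spark = SparkSession.builder()
  .appName("throttled-job")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .getOrCreate()

// For one specific computation, the SparkContext developer API can shrink
// and then restore the executor pool around it (best-effort only).
val sc = spark.sparkContext
sc.requestTotalExecutors(2, 0, Map.empty)   // shrink for the database-heavy stage
// ... run the throttled stage here ...
sc.requestTotalExecutors(8, 0, Map.empty)   // restore capacity afterwards
```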

Spark performs poorly when generating non-associative features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key, (time, quantity, user-id, ...))_j] => ... => a bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key, (t1, t2))_j]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features, as follows:
associative features: number of items
non-associative features: for example, the number of unique users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
`RDD1.join(RDD2).map { case (key, (v1, v2)) => (key, featureObj) }.reduceByKey(...)`
The initial value for my feature would be featureObj = (0, Set()), where the first element keeps the number of items and the second stores the set of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
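A hedged sketch of one way this join-plus-reduce could be written in Scala; the Event case class and field names are illustrative, not the poster's actual code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical row shape matching the description above.
case class Event(time: Long, quantity: Int, userId: String)

// events:    (key, Event)      -- from Datasource1
// intervals: (key, (t1, t2))   -- from Datasource2
// Result per key: (number of items, set of unique user ids)
def buildFeatures(events: RDD[(String, Event)],
                  intervals: RDD[(String, (Long, Long))]): RDD[(String, (Int, Set[String]))] =
  events.join(intervals)
    .filter { case (_, (e, (t1, t2))) => e.time >= t1 && e.time <= t2 }
    .map { case (key, (e, _)) => (key, (1, Set(e.userId))) }
    .reduceByKey { case ((n1, users1), (n2, users2)) => (n1 + n2, users1 ++ users2) }
```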
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge nodes, in only 3 minutes. The minute I add the second one, the computation time jumps to 5 minutes. I tried to add a couple of other non-associative features and, every time, the run-time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of them non-associative. I also tried to use KryoSerializer and persist RDDs in serialized form, but nothing special happened. Since I will be implementing more features, this issue seems to become a bottleneck.
PS. I tried to do the same task on a single big host (128 GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is the join. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows 2.6 GB shuffle read and 15.6 GB shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for the wrong application? Thanks for the comments in advance.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations, even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in Netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stably.
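A hedged sketch of those settings; the GC flags shown are illustrative, and the early-reclamation flag mentioned above would be added to the same executor options:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-tuning")
  // Fall back to heap buffers in Netty instead of direct buffer memory.
  .config("spark.shuffle.io.preferDirectBufs", "false")
  // Use G1 and print GC details so stalls like the one described are visible.
  .config("spark.executor.extraJavaOptions",
          "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")
  .getOrCreate()
```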
