Identifying why data is skewed in Spark - apache-spark

I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly due to badly skewed data across the 200 partitions; most of the data ends up in a single partition.
What I'm wondering is: is there anything in the Spark UI to help me find out more about how the data is partitioned? From the UI I can't tell which columns the dataframe is partitioned on. How can I find that out, other than by looking at the code? I'm wondering if there's anything in the logs and/or UI that could help.
Additional details: this uses Spark's dataframe API on Spark 1.6, and the underlying data is stored in Parquet format.

The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything; as you can see here, it basically recycles the Java hashCode method.
I would suggest the following:
Try to debug by sampling and printing the contents of the RDD or dataframe. See if there are obvious issues with the data distribution of the key (e.g. low variance or low cardinality).
If that's ineffective, you can work back from the logs and UI to figure out how many partitions there are. You can then compute the hashCode of the keys using Spark and take the modulus to see where the collisions are, as in the sketch below.
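As a rough illustration with the dataframe API (here "df", the key column name "key", and the 200-partition count are assumptions; the exact hash used by the dataframe shuffle may differ, but a non-negative hashCode modulo the partition count, as in the RDD HashPartitioner, is usually enough to spot a hot key):

    val numPartitions = 200

    val partitionCounts = df.rdd
      .map { row =>
        val key = row.getAs[String]("key")   // assumed key column
        // same idea as the RDD HashPartitioner: non-negative hashCode modulo partition count
        val p = ((key.hashCode % numPartitions) + numPartitions) % numPartitions
        (p, 1L)
      }
      .reduceByKey(_ + _)
      .collect()
      .sortBy(-_._2)

    // The biggest counts show which partitions (and therefore which hash buckets) are hot.
    partitionCounts.take(10).foreach { case (p, n) => println(s"partition $p -> $n rows") }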
Once you find the source of the collision you can try a few techniques to remove it:
See if there's a better key you can use
See if you can improve the hashCode function of the key (the default one in Java isn't that great)
See if you can process the data in two steps by doing an initial scatter/gather step to force some parallelism and reduce the processing overhead for that one partition. This is probably the trickiest of the optimizations mentioned here to get right. Basically, partition the data once using a random number generator to force some initial parallel combining of the data, then push it through again with the natural partitioner to get the final result (see the sketch below). This requires that the operation you're applying be associative and commutative. This technique hits the network twice and is therefore very expensive unless the data really is that highly skewed.
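A minimal sketch of that two-step ("salting") approach with the dataframe API, assuming a hypothetical key column "key", a value column "value", and a sum as the aggregation (sums are associative and commutative):

    import org.apache.spark.sql.functions._

    // Step 1: add a random salt so the hot key is spread over several partitions,
    // then do a partial aggregation per (key, salt).
    val salted  = df.withColumn("salt", (rand() * 10).cast("int"))
    val partial = salted.groupBy("key", "salt").agg(sum("value").as("partial_sum"))

    // Step 2: aggregate again on the natural key to combine the partial results.
    val result  = partial.groupBy("key").agg(sum("partial_sum").as("total"))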

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
Building the execution plan was consuming more and more time, so I cached intermediate results using df.localCheckpoint().
Is this a good approach? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/#adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs as the entire set of transformations to a DataFrame or RDD have to be recomputed when fault tolerance occurs or for each additional transformation that is applied on top of an RDD or DataFrame.
Note that localCheckpoint() is not "reliable": it stores the checkpointed data on the executors' local storage rather than on a fault-tolerant file system, so it can be lost if an executor fails.
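A minimal sketch of that pattern, truncating the plan every few joins (df1, df2, ... and the join column "id" are placeholders):

    // Hypothetical chain of many joins on a common column "id".
    var acc = df1
    val others = Seq(df2, df3, df4 /* ...the remaining dataframes... */)

    others.zipWithIndex.foreach { case (other, i) =>
      acc = acc.join(other, Seq("id"))
      // Every few joins, cut the lineage so the analyzer/optimizer doesn't have to
      // work on an ever-growing plan. localCheckpoint() keeps the data on the
      // executors (fast, but not fault tolerant).
      if ((i + 1) % 5 == 0) acc = acc.localCheckpoint()
    }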
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize your join operations:
Data skewness: most of the time when I try to find out why a join takes so long, data skewness turns out to be one of the reasons. In fact, not only joins: any transformation needs an even data distribution, otherwise you end up with one skewed partition holding most of the data while everything waits on that single task. Make sure your data is well distributed.
Data broadcasting: when we do a join, data shuffling is inevitable. In some cases we use a relatively small dataframe as a reference to filter the data in a very big dataframe, and shuffling the big dataframe is a very expensive operation. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle (see the sketch after this list).
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is inevitable when you join. Therefore keep your dataframes as lean as possible, which means removing unnecessary rows and columns to reduce the amount of data that has to be moved across the network during the shuffle.
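A minimal sketch of points 2 and 3 together ("dimDf", "factDf" and the column names are hypothetical; the broadcast side must comfortably fit in memory on every executor and on the driver):

    import org.apache.spark.sql.functions.broadcast

    // Keep the join lean: select only the columns you actually need,
    // then broadcast the small side so the big dataframe is not shuffled.
    val smallSide = dimDf.select("id", "label")
    val joined    = factDf.join(broadcast(smallSide), Seq("id"))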

Memory Management Pyspark

1.) I understand that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory, compared to Hive, which repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data as evenly as possible across tasks, so that you reduce shuffling as much as possible and let each task manage its own data. If you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it holds
Send data to the other executors (and tasks) that need the same keys
Request from the other executors the data its own task needs
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent to it, plus temporary objects. All of that blows up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean: if you are loading from a database, use Spark's stride options for the JDBC reader, in particular the partitionColumn, lowerBound and upperBound attributes (see the sketch after this list). That way you create a number of partitions in the dataframe that spread the data over different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that their partitioning is similar (if not the same), which prevents shuffling over the network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit in memory. Spilling to disk is possible, but it slows down performance
If you don't have a column that makes the data evenly distributed, try to create one with n different values, where n depends on the number of tasks you have
If you are reading from a CSV, creating partitions is harder, but still possible. You can either split the CSV into multiple files and create multiple dataframes (performing a union after they are loaded), or read the big CSV and repartition it on the column you need. That creates a shuffle as well, but it only happens once if you cache the repartitioned dataframe
Reading from Parquet you may also have multiple files, but if they are not evenly distributed (because the process that generated them didn't do it well) you may still end up with OOM errors. To prevent that, you can load the dataframe and repartition it too
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the data and run a query from Spark with a DISTRIBUTE BY clause, so that Hive redistributes the data instead of Spark
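A minimal sketch of the JDBC stride approach followed by an explicit repartition (the connection details, table, bounds, partition counts and column names are all placeholders, and "spark" is assumed to be an existing SparkSession):

    import org.apache.spark.sql.functions.col

    // Strided JDBC read split into 48 partitions on a numeric column "id".
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "48")
      .load()

    // If the join key is still skewed, repartition explicitly on it and cache,
    // so the shuffle only happens once.
    val prepared = df.repartition(48, col("join_key")).cache()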
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine Hive uses in your case (MapReduce, Tez, Hive on Spark, LLAP) you can see different behaviours. With MapReduce, as the operations are mostly on disk, the chance of an OOM is much lower than on Spark; from a memory point of view, MapReduce is not that affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, as that is what prevents the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as prod. I suppose the data distribution should be similar even though volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass on the spark-submit command may need to be different in prod, so you need to invest some time finding the best approach in dev and then fine-tune it in prod
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in driver memory (see the sketch below).
Spark does a lot of work under the hood to parallelize the work; when using the structured APIs (in contrast to RDDs) the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and causes lots of garbage collection, so you need to address it, but Spark should be able to handle low memory without throwing an explicit exception.
Not really. As above, Spark should be able to recover from memory issues when using the structured APIs, though it may need intervention if you see garbage collection and a performance impact.
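On the driver point, a small sketch of the usual fix ("bigDf" and the output path are made up): bring back only a bounded sample, or keep the result distributed by writing it out.

    // Risky: materializes the whole dataset on the driver and can OOM it.
    // val everything = bigDf.collect()

    // Safer alternatives: inspect a bounded sample, or keep the result distributed.
    bigDf.take(20).foreach(println)                    // small, bounded sample on the driver
    bigDf.write.mode("overwrite").parquet("/tmp/out")  // result never leaves the cluster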

Reading files which are written using PartitionBy or BucketBy in Spark

In Spark, when we read files that were written using either partitionBy or bucketBy, how does Spark identify that they were written that way (partitionBy/bucketBy) so that the read operation becomes efficient?
Can someone please explain? Thanks in advance!
Two different things. Here https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/ is an excellent excerpt from poor little MapR; let's hope HPE makes something of it. Reading it will give you the whole context. Excellent read BTW.
Two different things in reality:
When partition filters are present, the Catalyst optimizer pushes down the partition filters from the given query. The scan reads only the directories that match the partition filters, thus reducing disk I/O. (The linked post shows the resulting performance improvement per query, in seconds.)
Bucketing is another data organization technique that groups data with the same bucket value across a fixed number of "buckets." This can improve performance in wide transformations and joins by avoiding "shuffles."

Spark: query dataframe vs join

Spark 1.5.
There is a static dataset which may range from a few hundred MB to a few GB (here I discard the option of broadcasting the dataset - too much memory needed).
I have a Spark Streaming input which I want to enrich with data from that static dataset, joining on a common key (I understand this can be done using transform over the DStream to apply RDD/PairRDD logic). Key cardinality is high, in the thousands.
Here are the options I can see:
I can do the full join, which I guess would scale well in terms of memory, but it could pose problems if too much data has to flow between nodes. I understand it may pay off to partition both the static and the input RDDs by the same key.
I am considering, though, just keeping the data loaded in a DataFrame and querying it from the input every time. Is this too much of a performance penalty? I think this would not be a proper way to use it unless the stream has low cardinality, right?
Are my assumptions correct? If so, would the full join with partitioning be the preferred option?
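For reference, the transform-based join I have in mind looks roughly like this (the key/value types, the 200-partition HashPartitioner, and the names staticRdd and inputStream are placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Assumed: staticRdd: RDD[(String, Long)] and inputStream: DStream[(String, String)],
    // both keyed by the same String key.
    val partitioner = new HashPartitioner(200)
    val staticPartitioned: RDD[(String, Long)] = staticRdd.partitionBy(partitioner).cache()

    val enriched: DStream[(String, (String, Long))] = inputStream.transform { batchRdd =>
      // Each batch is shuffled onto the same partitioner, so the (larger) static side
      // is never re-shuffled once it is partitioned and cached.
      batchRdd.partitionBy(partitioner).join(staticPartitioned)
    }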

Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. Even though I can see that there are around 50 Executors added and that the features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DesisionTree fit method takes for many hours even though the Dataset is not that big (10.000 db entries with a couple of hundreds of bytes each row).I can see that there is only one task for this so I assume this is the reason that it's been so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague, but I don't know whether the code that retrieves the data is relevant, or whether it is a parameter of the algorithm (although I didn't find anything online), or just Spark tuning.
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition or state the number of partitions you would like to use at load time (see the sketch below). I would also look into the decision tree API to see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
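A minimal sketch of that suggestion (the names featuresDf, label and features, and the partition count of 50 are placeholders), repartitioning the training DataFrame before calling fit:

    import org.apache.spark.ml.classification.DecisionTreeClassifier

    // Spread the ~10,000 rows over the cluster instead of a single partition;
    // 50 roughly matches the number of executors here.
    val training = featuresDf.repartition(50).cache()
    training.count()   // materialize the cache before fitting

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = dt.fit(training)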
