Spark: query dataframe vs join

Spark 1.5.
There is a static dataset which may range from some hundred MB to some GB (here I discard the option of broadcasting the dataset - too much memory needed).
I have a Spark Streaming input which I want to enrich with data from that static dataset, using a common key (I understand this can be done using transform over the DStream to apply RDD/PairRDD logic). Key cardinality is high, in the thousands.
Here are the options I can see:
I can do a full join, which I guess would scale well in terms of memory, but it could cause problems if too much data has to flow between nodes. I understand it may pay off to partition both the static and input RDDs by the same key.
Alternatively, I am considering just loading the data into a DataFrame and querying it for every input record. Is this too much of a performance penalty? I think this would not be a proper way to use it unless the stream has low cardinality, right?
Are my assumptions correct? If so, would the full join with partitioning be the preferred option?
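
To make the "partition both sides by the same key" option concrete, here is a minimal sketch in plain Python (not the Spark API). The function names and the CRC32-based partitioner are illustrative assumptions. The idea it models: the static dataset is partitioned once and kept, so each incoming batch only needs to be routed to the matching partition and joined locally.

```python
import zlib
from collections import defaultdict


def partition_id(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (hypothetical choice of CRC32).
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by_key(records, num_partitions):
    # Route every (key, value) pair to the partition that owns its key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_id(key, num_partitions)].append((key, value))
    return partitions


def join_partition(batch_part, static_part):
    # Local hash join: both inputs hold the same subset of keys.
    index = defaultdict(list)
    for key, value in static_part:
        index[key].append(value)
    return [(k, (bv, sv)) for k, bv in batch_part for sv in index.get(k, [])]


def copartitioned_join(batch, static_partitions):
    # Only the (small) batch is re-partitioned; the static side is reused as-is.
    num_partitions = len(static_partitions)
    batch_partitions = partition_by_key(batch, num_partitions)
    joined = []
    for batch_part, static_part in zip(batch_partitions, static_partitions):
        joined.extend(join_partition(batch_part, static_part))
    return joined
```

In Spark the corresponding step would presumably be to partition the static pair RDD once with a fixed partitioner, cache it, and join each batch against it inside transform, so that only the batch data is shuffled.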


Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
It was taking a long time to build the execution plan, so I cut the plan at intermediate stages using df.localCheckpoint().
Is this a good approach? Please share any thoughts.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
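Note that localCheckpoint() is not "reliable": the checkpointed data lives on executor-local storage, so it is lost if an executor fails.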
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize your join operations:
Data skew: Most of the time, when I try to find out why a join takes a long time, data skew is one of the reasons. In fact, not only joins but any transformation needs an even data distribution, so that you do not end up with a skewed partition that holds most of the data while everything waits on the single task processing it. Make sure your data is well distributed.
Data broadcasting: When we join, data shuffling is usually inevitable. In some cases, we use a relatively small dataframe as a reference to filter the data in a very big dataframe. In this case, shuffling the big dataframe is a very expensive operation. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle.
Keep your join inputs as lean as possible: as mentioned in point 2, data shuffling is usually inevitable when you join. Therefore, keep your dataframes as lean as possible, which means removing unnecessary rows and columns to reduce the amount of data that has to move across the network during the shuffle.
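
Points 2 and 3 can be illustrated with a small plain-Python model (not Spark code). The function name, the dict-based rows, and the assumption that keys in the small side are unique are all illustrative. Every partition of the large side probes its own copy of the small lookup table, so the large side never moves, and only the columns that are actually needed are carried into the result.

```python
def broadcast_join(large_partitions, small_rows, key, keep_columns):
    # "Broadcast": build one lookup table from the small side.
    # Assumption: keys in small_rows are unique.
    lookup = {row[key]: row for row in small_rows}

    joined = []
    for partition in large_partitions:
        # Each partition is processed where it already lives; no shuffle.
        for row in partition:
            match = lookup.get(row[key])
            if match is None:
                continue  # inner join: drop rows without a match
            # Keep the row lean: project only the columns we need.
            lean = {column: row[column] for column in keep_columns}
            joined.append({**lean, **match})
    return joined
```

In PySpark the broadcast hint is available as pyspark.sql.functions.broadcast applied to the smaller DataFrame; whether it pays off depends on that DataFrame fitting comfortably in each executor's memory.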

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider here. You may run out of memory if the data is not properly distributed. That means you need to distribute your data evenly (if possible) across the tasks, so that you reduce shuffling as much as possible and let each task manage its own data. If you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it has
Send data to other executors (and tasks) that need those keys
Request the data it needs from the others
All that data exchange may cause network bottlenecks if you have a large dataset, and it also forces every task to hold its own data in memory plus whatever has been sent to it and any temporary objects. All of that will blow up memory.
So to prevent that situation you can:
Load the data already partitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound and upperBound attributes. That way you create a number of partitions on the dataframe that assign the data to different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that their partitions are similar (if not identical), which will reduce shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible
Each partition should fit in memory. Although data can spill to disk, that slows down performance
If you don't have a column that distributes the data evenly, try to create one that has n different values, where n depends on the number of tasks you have
If you are reading from a csv, it is harder to create partitions, but it is still possible. You can either split the data (csv) into multiple files and create multiple dataframes (performing a union after they are loaded), or you can read the big csv and repartition on the column you need. That causes a shuffle as well, but it happens only once if you cache the repartitioned dataframe
When reading from parquet you may have multiple files, but if they are not evenly distributed (because the process that generated them didn't do it well) you may end up with OOM errors. To prevent that, you can load the data and repartition the dataframe too
Another trick, valid for csv, parquet, orc, etc., is to create a Hive table on top of the files and run a query from Spark with a distribute by clause, so that Hive redistributes the data instead of Spark
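
As a rough illustration of how the partitionColumn, lowerBound and upperBound attributes turn into per-partition queries, the sketch below generates one WHERE clause per partition. This is a simplified plain-Python approximation, not Spark's own code; the function name is hypothetical and integer bounds are assumed.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    # Simplified model of stride-based range partitioning on a numeric column.
    if num_partitions < 2:
        raise ValueError("this sketch assumes at least two partitions")

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = None if i == 0 else f"{column} >= {current}"
        current += stride
        upper = None if i == num_partitions - 1 else f"{column} < {current}"
        if lower is None:
            # First partition is open-ended below and also collects NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition is open-ended above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates
```

Note that the bounds only decide where the ranges are cut; rows outside them still land in the first or last partition. If the column's values are not spread evenly between the bounds, the resulting partitions will be skewed.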
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP), you can see different behaviours. With map/reduce, as the operations are mostly disk-based, the chance of an OOM is much lower than on Spark. In fact, from a memory point of view, map/reduce is not much affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as prod. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach on dev and then fine-tune in prod
The vast majority of OOM errors in Spark occur on the driver, not on the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver's memory.
Spark does a lot of work under the hood to parallelize the work. When using the structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that hurts performance and triggers a lot of garbage collection, which you need to address; however, Spark should be able to handle low memory without an explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.
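
On question 2, a back-of-the-envelope calculation may help. The sketch below is plain arithmetic, not a Spark API. It assumes the unified memory model of recent Spark versions, with roughly 300 MB of reserved heap and spark.memory.fraction at 0.6; both values, and the function name, are assumptions to adjust for your version and configuration. It also assumes no cached blocks are occupying the pool. With N tasks running concurrently in one executor, each task can count on somewhere between 1/(2N) and 1/N of the pool before it has to spill.

```python
def execution_memory_per_task_mb(heap_mb, concurrent_tasks,
                                 memory_fraction=0.6, reserved_mb=300):
    # Pool shared by all tasks of one executor (assumed defaults, see above).
    pool_mb = (heap_mb - reserved_mb) * memory_fraction
    # Each of N concurrent tasks gets between 1/(2N) and 1/N of the pool.
    guaranteed_mb = pool_mb / (2 * concurrent_tasks)
    maximum_mb = pool_mb / concurrent_tasks
    return guaranteed_mb, maximum_mb
```

For a 4096 MB heap, going from 2 to 8 concurrent tasks shrinks the per-task ceiling from about 1139 MB to about 285 MB. Adding cores without adding memory therefore leaves less room per task, which shows up as more spilling and GC pressure, or as an OOM for operations that cannot spill.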

Is it possible to outperform the Catalyst optimizer on highly skewed data using only RDDs

I am reading High Performance Spark and the author introduces a technique that can be used to perform joins on highly skewed data by selectively filtering the data to build a HashMap with the data containing the most common keys. This HashMap is then sent to all the partitions to perform a broadcast join. The resulting data are concatenated with a union operation at the very end.
Apologies in advance, but the text does not give an example of this technique using code, so I cannot share a code snippet to illustrate it.
Text follows.
Sometimes not all of our smaller RDD will fit into memory, but some keys are so overrepresented in the large dataset that you want to broadcast just the most common keys. This is especially useful if one key is so large that it can't fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast. You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, uniting it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn't otherwise process.
For those who don't know, a broadcast join is a technique where the user can avoid a shuffle incurred when joining two chunks of data by sending the smaller chunk to every single executor. Each executor then performs the join on its own. The idea is that the shuffle is so expensive that having each executor perform the join and then discard the data it doesn't need is sometimes the best way to go.
The text describes a situation where part of a chunk of data can be extracted and joined using a broadcast join. The result of the join is then unioned with the rest of the data.
The reason why this might be necessary is that excessive shuffling can usually be avoided by making sure data consisting of the same keys in the two chunks are both present in the same partition, so that the same executor handles it. However, there are situations where a single key is too large to fit on a single partition. In that case, the author suggests that separating the overrepresented key into a HashMap and performing a broadcast join on just the overrepresented key may be a good idea.
Is this a good idea? Moreover, a technique like this seems very situational, so Catalyst probably does not use it. Is that correct? If so, does that mean that on highly skewed data this technique using RDDs can beat Catalyst operating on DataFrames or Datasets?
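
Since the book gives no code, here is one possible reading of the technique as a plain-Python sketch (not Spark code, and not the author's implementation). All function names are hypothetical. An exact Counter stands in for countByKeyApprox, a dict stands in for the broadcast HashMap, and list concatenation stands in for union.

```python
from collections import Counter, defaultdict


def standard_join(left, right):
    # Stand-in for the regular (shuffled) join on the non-hot keys.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


def skewed_join(large, small, num_hot_keys):
    # 1. Find the most common keys in the large side
    #    (exact counts here; the book suggests an approximate count).
    counts = Counter(key for key, _ in large)
    hot_keys = {key for key, _ in counts.most_common(num_hot_keys)}

    # 2. Filter the small side to the hot keys and collect it into a map.
    hot_map = defaultdict(list)
    for key, value in small:
        if key in hot_keys:
            hot_map[key].append(value)

    # 3. Manual map-side join of the hot records against the map.
    hot_joined = [(k, (lv, sv))
                  for k, lv in large if k in hot_keys
                  for sv in hot_map.get(k, [])]

    # 4. Standard join on the large side with the hot keys filtered out.
    rest = [(k, v) for k, v in large if k not in hot_keys]
    rest_joined = standard_join(rest, small)

    # 5. Union of both results.
    return hot_joined + rest_joined
```

The result matches a plain inner join; what changes is that the hot keys never pass through the shuffle, so no single partition has to hold all records for one key.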

Spark dataset exceeds total ram size

I have recently started working with Spark and came across a few questions which I still couldn't resolve.
Let's say I have a dataset of 100 GB and the total RAM of my cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, as Spark will do it partition by partition. What will happen when I perform a sorting or aggregation transformation on the 100 GB of data? How will it process 100 GB in memory, since sorting needs the entire dataset?
I have gone through the link below, but it only explains what Spark does when persisting; what I am looking for is how Spark handles aggregations or sorting on a dataset larger than RAM.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO about the same topic, for example, this one.
There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks that do fit in memory, sort each chunk and write each chunk to disk. Finally, you merge the sorted chunks to get the whole dataset sorted. Spark supports external sorting as you can see here and here is the implementation.
To answer your question: your data does not need to fit in memory in order to be sorted, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
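
The external sort described above can be sketched in a few lines of plain Python. This is a toy illustration of the algorithm, not Spark's implementation; the function names and the one-integer-per-line run format are assumptions. Only chunk_size values are held in memory while sorting, and the final merge reads the sorted runs lazily.

```python
import heapq
import os
import tempfile
from itertools import islice


def _read_run(path):
    # Lazily stream one sorted run back from disk.
    with open(path) as run:
        for line in run:
            yield int(line)


def external_sort(values, chunk_size):
    # Generator yielding the integers from `values` in sorted order.
    source = iter(values)
    run_paths = []
    try:
        # Phase 1: sort chunks that fit in memory and write each to disk.
        while True:
            chunk = sorted(islice(source, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)
        # Phase 2: k-way merge of the sorted runs, one element at a time.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        # Clean up the temporary run files.
        for path in run_paths:
            os.remove(path)
```

For example, list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)) writes three runs to disk and merges them back into one sorted sequence.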
There are multiple things you need to consider. Because you have 16 GB of RAM and a 100 GB dataset, it is a good idea to persist to disk. Aggregation may be difficult if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each RDD before merging into the whole dataset. Also make sure that each partition of the RDD is smaller than the available memory (default value 0.4*container_size)
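
The "aggregate first, then merge" idea can be sketched in plain Python as follows (a model of the approach, not Spark code; the function name is hypothetical). Each partition is reduced to one partial result per key, so with low cardinality only a handful of values per partition has to be kept and merged, regardless of how many rows the partition holds.

```python
from collections import Counter


def aggregate_in_chunks(partitions):
    # `partitions` can be any iterable of (key, value) iterables,
    # so the partitions themselves may be streamed one at a time.
    total = Counter()
    for partition in partitions:
        # Partial aggregate: at most one entry per distinct key.
        partial = Counter()
        for key, value in partition:
            partial[key] += value
        # Merge the small partial result into the running total.
        total.update(partial)
    return dict(total)
```

In Spark, operations such as reduceByKey apply this kind of partial combining on each partition before the shuffle.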

Identifying why data is skewed in Spark

I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly due to badly skewed data across the 200 partitions; most of the data is in 1 partition:
What I'm wondering is whether there is anything in the Spark UI to help me find out more about how the data is partitioned. From looking at this I don't know which columns the dataframe is partitioned on. How can I find that out (other than by looking at the code; I'm wondering if there's anything in the logs and/or UI that could help me)?
Additional details, this is using Spark's dataframe API, Spark version 1.6. Underlying data is stored in parquet format.
The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything. As you can see here this basically recycles the Java hashCode method.
I would suggest the following:
Try to debug by sampling and printing the contents of the RDD or data frame. See if there are obvious issues with the data distribution (i.e. low variance or low cardinality) of the key.
If that's ineffective, you can work back from the logs and UI to figure out how many partitions there are. You can find the hashCode of the data using Spark and then take the modulus to see where the collisions are.
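
The "hashCode, then modulus" check can be reproduced offline on a sample of keys. The plain-Python sketch below re-implements Java's String.hashCode and a non-negative modulus, which is what an RDD hash partitioner does with string keys. The function names are hypothetical, the hash assumes characters in the Basic Multilingual Plane, and DataFrame shuffles may use a different hash function, so treat the partition ids as illustrative.

```python
from collections import Counter


def java_string_hash(text):
    # Java's String.hashCode: h = 31 * h + char, in 32-bit signed arithmetic.
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key, num_partitions):
    # Python's % already returns a non-negative result for a positive modulus.
    return java_string_hash(key) % num_partitions


def partition_histogram(keys, num_partitions):
    # Number of records that would land in each partition.
    return Counter(partition_for(key, num_partitions) for key in keys)
```

Running partition_histogram over a sample of the join or grouping keys shows whether the skew comes from a few very frequent keys (one partition dominated by a single key) or from many distinct keys colliding in the same partition.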
Once you find the source of the collision you can try a few techniques to remove it:
See if there's a better key you can use
See if you can improve the hashCode function of the key (the default one in Java isn't that great)
See if you can process the data in two steps by doing an initial scatter/gather step to force some parallelism and reduce the processing overhead for that one partition. This is probably the trickiest optimization to get right of those mentioned here. Basically, partition the data once using a random number generator to force some initial parallel combining of the data, then push it through again with the natural partitioner to get the final result. This requires that the operation you're applying be commutative and associative. This technique hits the network twice and is therefore very expensive unless the data really is that highly skewed.
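
The two-step scatter/gather approach is often called salting. Below is a minimal plain-Python model of it for a sum (not Spark code; the function name, the seed parameter and the choice of sum as the operation are assumptions). Stage 1 groups by (key, random salt), which spreads a hot key over num_salts groups; stage 2 groups by the natural key and combines the partial results.

```python
import random
from collections import defaultdict


def two_stage_sum(records, num_salts, seed=0):
    rng = random.Random(seed)

    # Stage 1: combine on (key, salt). A hot key is split into up to
    # num_salts groups that could be processed in parallel.
    partial_sums = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial_sums[(key, salt)] += value

    # Stage 2: drop the salt and combine the partial sums per natural key.
    totals = defaultdict(int)
    for (key, _salt), partial in partial_sums.items():
        totals[key] += partial
    return dict(totals)
```

This only works because addition gives the same answer regardless of how the values are grouped and ordered; an operation without that property cannot be split into partial results this way.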
