Spark dataframe caching by executing multiple queries in different threads - apache-spark

I want to know if dataframe caching in spark is thread-safe. In one of our use cases, I am creating a dataframe from a hive-table, and then running multiple SQL on same dataframe by different threads. Since our storage and compute are decoupled, and the reads are very slow for some reason, I was thinking of caching the dataframe in memory and utilising the cached dataframe for all the queries. Is dataframe caching thread-safe? Are there any other pitfalls in doing so?
I have sufficient memory (disk and RAM) in my compute cluster to cache the table, and I will be executing 10+ queries on the same dataframe.
Thanks,
Akash

Re:"I want to know if dataframe caching in spark is thread-safe."
Whenever you configure executor cores, you are in a way using multiple threads to process the data on each executor. This means in a normal SPARK SQL scenario as well DAG is processed using multiple threads.
Caching should not have any impact towards thread safety. Moreover DataFrames as well are immutable as RDD hence you are not changing data in existing dataframe but you are producing a new one.
Hence even after caching when you create multiple threads to run different SQLs on the same dataframe, every thread would start from the cached stage and compute a new one based on your SQL.

Related

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider there. You may get memory leaks if the data is not properly distributed. That means that you need to distribute your data evenly (if possible) on the Tasks so that you reduce shuffling as much as possible and make those Tasks to manage their own data. So if you need to perform a join, if data is distributed randomly, every Task (and therefore executor) will have to:
See what data they have
Send data to other executors (and tasks) to provide the same keys they need
Request the data that is needed by that task to the others
All that data exchange may cause network bottlenecks if you have a large dataset and also will make every Task to hold their data in memory plus whatever has been sent and temporary objects. All of those will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound, upperBound attributes. That way you will create a number of partitions on the dataframe that will set the data on different tasks based on the criteria you need. If you are going to use a join of two dataframes, try similar approach on them so that partitions are similar (for not to say same) and that will prevent shuffling over network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit on memory. Although there could be spill to disk, that would slow down performance
If you don't have a column that make the data evenly distributed, try to create one that would have n number of different values, depending on the n number of tasks that you have
If you are reading from a csv, that would make it harder to create partitions, but still it's possible. You can either split the data (csv) on multiple files and create multiple dataframes (performing a union after they are loaded) or you can read that big csv and apply a repartition on the column you need. That will create shuffling as well, but it will be done once if you cache the dataframe already repartitioned
Reading from parquet it's possible that you may have multiple files but if they are not evenly distributed (because the previous process that generated didn't do it well) you may end up on OOM errors. To prevent that situation, you can load and apply repartition on the dataframe too
Or another trick valid for csv, parquet files, orc, etc. is to create a Hive table on top of that and run a query from Spark running a distribute by clause on the data, so that you can make Hive to redistribute, instead of Spark
To your question about Hive and Spark, I think you are right up to some point. Depending on the execute engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can have different behaviours. With map/reduce, as they are mostly disk operations, the chance to have a OOM is much lower than on Spark. Actually from Memory point of view, map/reduce is not that affected because of a skewed data distribution. But (IMHO) your goal should be to find always the best data distribution for the Spark job you are running and that will prevent that problem
Another consideration is if you are testing in a dev environment that doesn't have same data as in a prod environment. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, when you assign Spark tuning parameters on the spark-submit command, they may be different in prod. So you need to invest some time on finding the best approach on dev and fine tune in prod
Huge majority of OOM in Spark are on the driver, not executors. This is usually a result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work, when using structured APIs (in contrast to RDDs) the chances of causing OOM on executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause lots of garbage collection to happen so you need to address it, however spark should be able to handle low memory without explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.

Spark dataframe creation through already distributed in-memory data sets

I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark Dataframe is just taking a fraction of time (in ms) in 'Sorting', but moving data is much expensive (> 14 sec).
Explanation:
I have a huge Arrow RecordBatches collection which is equally distributed on all of my worker node's memories (in plasma_store). Currently, I am collecting all those RecordBatches back in my master node, merging them, and converting them to a single Spark Dataframe. Then I apply sorting function on that dataframe.
Spark dataframe is a cluster distributed data collection.
So my question is:
Is it possible to create a Spark dataframe from all that already distributed Arrow RecordBatches data collections in the worker's nodes memories? So that the data should remain in the respective worker's nodes memories (instead of bringing it to master node, merging, and then creating distributed dataframe).
Thanks!
Yes you can store the data in a spark cache, whenever you try to get the data, it would get you from cache rather than the source.
Please utilize below kinks to understand more on cache,
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
where does df.cache() is stored
https://unraveldata.com/to-cache-or-not-to-cache/

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
From what I know so far that there is not data movement that happens when we used cache a dataframe and it is just saved on executor's memory. So it should be just a matter of setting/unsetting a flag.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger DataFrame persistence / caching. With the action you do add an extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to executors (memory, disk or their combinations). This moving data from where it lives to executors does add an extra overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on executors). They're peers to exchange blocks on demand (using torrent-like protocol).
Unpersisting a DataFrame is simply to send a request (sync or async) to BlockManagers to remove RDD blocks. If it happens in async manner, the overhead is none (minus the extra work executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD are just abstractions to describe distributed computations and do nothing at creation time, this persist / unpersist is just setting / unsetting a flag.
The change can be noticed at execution time.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.

How to force Spark Dataframe to be split across all the worker nodes?

I want to create a small dataframe with just 10 rows. And I want to force this dataframe to be distributed to two worker nodes. My cluster has only two worker nodes. How do I do that?
Currently, whenever I create such a small dataframe, it gets persisted in only one worker node.
I know, Spark is build for Big Data and this question does not make much sense. However, conceptually, I just wanted to know if at all it is feasible or possible to enforce the Spark dataframe to be split across all the worker nodes (given a very small dataframe with 10-50 rows only).
Or, it is completely impossible and we have to rely upon the Spark master for this dataframe distribution?

Spark master memory requirements related to data size

Are Spark master memory requirements related to the size of the processed data?
The Spark driver and Spark workers/executors deal with processed data directly (and execute application code), so their memory needs can be linked to the size of the processed data. But is the Spark master in any way affected by the data size? It seems to me that it isn't, because it just manages the Spark workers and doesn't work with the data itself directly.
Spark main data entities like DataFrames or DataSets are based on RDD, or Resilient Distributed Datasets. They are distributed meaning the processing generally takes place in the executors.
Some RDD actions will end with data on the driver process though. Most notably collect and other actions that use it (like show, take or toPandas if you are using python). collect, as the name implies, will collect some or all of the rows of the distributed datasets and materialize them in the driver process. At this point, yes, you will need to take into account the memory footprint of your data.
This is why you will generally want to reduce as much as possible the data you collect. You can groupBy, filter, and many other transformations so that if you need to process the data in the driver, it's the most refined possible.

Resources