How to force Spark Dataframe to be split across all the worker nodes? - apache-spark

I want to create a small dataframe with just 10 rows. And I want to force this dataframe to be distributed to two worker nodes. My cluster has only two worker nodes. How do I do that?
Currently, whenever I create such a small dataframe, it gets persisted on only one worker node.
I know Spark is built for Big Data and this question may not make much sense. However, conceptually, I just want to know whether it is feasible to force a Spark dataframe to be split across all the worker nodes (given a very small dataframe with only 10-50 rows).
Or is it completely impossible, so that we have to rely on the Spark master for this dataframe distribution?
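As a hedged illustration (not an answer from the thread), one way to ask Spark for two partitions on such a tiny dataframe is an explicit repartition; whether both partitions actually land on different workers still depends on the scheduler. The column names and row count below are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tinyDfSplit").getOrCreate()

# Made-up 10-row dataframe, just for illustration.
df = spark.createDataFrame([(i, "row_%d" % i) for i in range(10)], ["id", "value"])

# Ask for two partitions; with two executors the scheduler can place one on each,
# but Spark gives no hard guarantee about partition-to-node placement.
df = df.repartition(2)
print(df.rdd.getNumPartitions())  # 2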

Related

Spark dataframe creation through already distributed in-memory data sets

I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark Dataframe spends only a fraction of its time (milliseconds) in 'Sorting', but moving the data is much more expensive (> 14 sec).
Explanation:
I have a huge Arrow RecordBatches collection which is distributed evenly across all of my worker nodes' memories (in plasma_store). Currently, I am collecting all those RecordBatches back on my master node, merging them, and converting them to a single Spark Dataframe. Then I apply a sorting function on that dataframe.
Spark dataframe is a cluster distributed data collection.
So my question is:
Is it possible to create a Spark dataframe from all those already distributed Arrow RecordBatches in the worker nodes' memories, so that the data remains in the respective worker nodes' memories (instead of bringing it to the master node, merging it, and only then creating a distributed dataframe)?
Thanks!
Yes, you can store the data in the Spark cache; whenever you access the data again, it is served from the cache rather than recomputed from the source.
Please use the links below to understand more about caching (a small sketch follows them):
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Where is df.cache() stored?
https://unraveldata.com/to-cache-or-not-to-cache/
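A minimal sketch of the caching idea described in the links above, assuming a hypothetical Hive table name; the first action materializes the cache and later queries are served from it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cacheDemo").getOrCreate()

# Hypothetical source table, used only for illustration.
df = spark.read.table("some_hive_table")

df.cache()                         # mark the dataframe for caching (lazy)
df.count()                         # first action materializes the cache
df.filter("amount > 100").show()   # now served from the cache, not the source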

Spark dataframe caching by executing multiple queries in different threads

I want to know if dataframe caching in Spark is thread-safe. In one of our use cases, I am creating a dataframe from a Hive table and then running multiple SQL queries on the same dataframe from different threads. Since our storage and compute are decoupled, and reads are very slow for some reason, I was thinking of caching the dataframe in memory and using the cached dataframe for all the queries. Is dataframe caching thread-safe? Are there any other pitfalls in doing so?
I have sufficient memory (disk and RAM) in my compute cluster to cache the table, and I will be executing 10+ queries on the same dataframe.
Thanks,
Akash
Re:"I want to know if dataframe caching in spark is thread-safe."
Whenever you configure executor cores, you are in a way already using multiple threads to process the data on each executor. This means that even in a normal Spark SQL scenario the DAG is processed using multiple threads.
Caching should not have any impact on thread safety. Moreover, DataFrames, like RDDs, are immutable, so you are not changing data in the existing dataframe; you are producing a new one.
Hence, even after caching, when you create multiple threads to run different SQL queries on the same dataframe, every thread starts from the cached stage and computes a new dataframe based on its SQL.
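A hedged sketch of the pattern described in the question and answer: one cached dataframe queried from several Python threads. The table and column names are made up for the example.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("threadedQueries").getOrCreate()

# Hypothetical Hive table; cache it once, then query it from many threads.
df = spark.read.table("sales")
df.cache()
df.count()  # materialize the cache before the threads start

def run_query(region):
    # Each thread derives a new dataframe from the same cached one.
    return df.filter(df.region == region).agg({"amount": "sum"}).collect()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, ["EU", "US", "APAC", "LATAM"]))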

Does Spark distributes dataframe across nodes internally?

I am trying to use Spark for processing a csv file on a cluster. I want to understand if I need to explicitly read the file on each of the worker nodes to do the processing in parallel, or whether the driver node will read the file and distribute the data across the cluster for processing internally. (I am working with Spark 2.3.2 and Python.)
I know RDDs can be parallelized using SparkContext.parallelize(), but what about Spark DataFrames?
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date > '2010-12-01' AND date <= '2010-12-02' AND town == 'Madrid'")
So if I am running the above code on a cluster, will the entire operation be done by the driver node, or will it distribute df across the cluster with each worker processing its own data partition?
To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish transformations from actions. As your code only consists of one filter(...) transformation, nothing will happen in terms of reading or processing data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends -- in your code -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after you read the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways in which the processing of your single CSV file can be parallelized (a short sketch follows the list):
You can manually repartition your source DataFrame by calling df.repartition(n), with n the number of partitions you want to have. But -- and this is a significant but -- this means that all data is potentially sent over the network (aka shuffle)!
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
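A hedged sketch of the first option, checking the partition count before and after an explicit repartition; the file name and partition count are only illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('repartitionDemo').getOrCreate()
df = spark.read.csv('dataFile.csv', header=True)

print(df.rdd.getNumPartitions())   # likely 1 for a single small CSV file

# Explicitly repartition so several workers can process the data in parallel.
# Note: this shuffles the data over the network.
df = df.repartition(4)
print(df.rdd.getNumPartitions())   # 4

df.filter("town = 'Madrid'").count()  # an action that triggers the read and processing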

In spark streaming can I create RDD on worker

I want to know how I can create an RDD on a worker, say one containing a Map. This Map/RDD will be small, and I want it to reside completely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it in tasks running on that executor for lookups.
How can I do this?
No, you cannot create an RDD on a worker node. Only the driver can create RDDs.
A broadcast variable seems to be the solution in your situation. It sends the data to all workers, but if your map is small, that shouldn't be an issue.
You cannot control on which node a partition of your RDD will be placed, so you cannot just do repartition(1) - you don't know whether this RDD will end up on the same node ;) A broadcast variable will be on every node, so lookups will be very fast.
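A minimal sketch of the broadcast-variable approach mentioned above; the lookup map and data are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcastLookup").getOrCreate()
sc = spark.sparkContext

# Small lookup map, broadcast once to every executor.
lookup = sc.broadcast({"ES": "Spain", "FR": "France", "DE": "Germany"})

rdd = sc.parallelize(["ES", "DE", "ES", "FR"])
# Tasks on the executors read the broadcast value locally - no shuffle needed.
resolved = rdd.map(lambda code: lookup.value.get(code, "unknown"))
print(resolved.collect())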
You can create an RDD in your driver program using sc.parallelize(data). To store a Map, it can be split into two parts, keys and values, and then stored in an RDD/Dataframe as two separate columns.
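A hedged sketch of that second suggestion, turning a Python dict into a two-column dataframe; the names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapAsColumns").getOrCreate()

my_map = {"a": 1, "b": 2, "c": 3}
# Each (key, value) pair becomes one row with two columns.
df = spark.createDataFrame(list(my_map.items()), ["key", "value"])
df.show()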

Location of HadoopPartition

I have a dataset in a csv file that occupies two blocks in HDFS and is replicated on two nodes, A and B. Each node has a copy of the dataset.
When Spark starts processing the data, I have seen two ways in which Spark loads the dataset as input. It either loads the entire dataset into memory on one node and performs most of the tasks there, or it loads the dataset onto both nodes and spreads the tasks across them (based on what I observed on the history server). In both cases there is sufficient capacity to keep the whole dataset in memory.
I repeated the same experiment multiple times, and Spark seemed to alternate between these two ways. Supposedly Spark inherits the input split locations as in a MapReduce job. From my understanding, MapReduce should be able to take advantage of both nodes. I don't understand why Spark or MapReduce alternates between the two cases.
When only one node is used for processing, the performance is worse.
When you are loading the data in Spark, you can specify the minimum number of splits, and this will force Spark to load the data onto multiple machines (with the textFile API you would add minPartitions=2 to your call).
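A minimal sketch of that suggestion for the RDD API; the HDFS path is only illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("minPartitionsDemo").getOrCreate()
sc = spark.sparkContext

# Ask for at least 2 input splits so both nodes can take part in the read.
lines = sc.textFile("hdfs:///data/dataset.csv", minPartitions=2)
print(lines.getNumPartitions())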