Is there a way to control the distribution of spark partitions across nodes in a cluster? - apache-spark

I have an 8-node cluster and I load two dataframes from a JDBC source like this:
positionsDf = spark.read.jdbc(
    url=connStr,
    table=positionsSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
positionsDf.cache()
varDatesDf = spark.read.jdbc(
    url=connStr,
    table=datesSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
varDatesDf.cache()
res = varDatesDf.join(positionsDf, on='PositionDate').count()
I can see from the Storage tab of the application UI that the partitions are evenly distributed across the cluster nodes. However, what I can't tell is how they are distributed across the nodes. Ideally, both dataframes would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
In other words, will the positionsDf dataframe partition that contains records with PositionDate="01 Jan 2016" be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"? Will they be on the same node? Or is it just random?
Is there any way to see what partitions are on which node?
Does Spark distribute partitions created using a column key like this in a deterministic way across nodes? Will they always be node/executor local?

will the positionsDf dataframe partition that contains records with PositionDate="01 Jan 2016" be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"
In general it won't be. Even if the data is co-partitioned (which it is not here), co-partitioning doesn't imply co-location.
Is there any way to see what partitions are on which node?
This relation doesn't have to be fixed over time; a task can, for example, be rescheduled. You can use various RDD tricks (TaskContext) or the database log, but it is not reliable.
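To illustrate the TaskContext trick (a Scala sketch; the question's code is PySpark, and df here stands for whichever cached dataframe you want to inspect), each task reports its partition id and the hostname of the executor that processed it. It is only a snapshot of where the tasks happened to run for that one job:
import java.net.InetAddress
import org.apache.spark.TaskContext

// For every partition, record its id, the hostname of the executor that
// processed it, and the number of rows it held at that moment.
val placement = df.rdd
  .mapPartitions { rows =>
    val host = InetAddress.getLocalHost.getHostName
    Iterator((TaskContext.getPartitionId(), host, rows.size))
  }
  .collect()

placement.foreach { case (id, host, rows) =>
  println(s"partition=$id host=$host rows=$rows")
}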
would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
The scheduler has its own internal optimizations, and low-level APIs allow you to set node preferences, but this kind of thing is not controllable from Spark SQL.
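For what it's worth, those low-level node preferences look roughly like this at the RDD level; the hostnames are made-up placeholders, they are only hints to the scheduler, and nothing comparable is exposed for DataFrames:
// Each element is paired with a list of preferred hosts; the scheduler may
// honour these hints when it launches the corresponding tasks.
val rddWithPrefs = sc.makeRDD(Seq(
  ("PositionDate=2016-01-01", Seq("node1.example.com")),
  ("PositionDate=2016-01-02", Seq("node2.example.com"))
))

// Inspect what was recorded for each partition.
rddWithPrefs.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rddWithPrefs.preferredLocations(p)}")
}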

Related

Distribute by vs Cluster by in spark SQL

I have recently started working on Spark. We always use cluster by to optimize tables before joining, but I wanted to know: is there any scenario where we would prefer distribute by over the cluster by clause?
The only difference between cluster by and distribute by is that distribute by only repartitions the data based on the expression, while cluster by first repartitions the data and then sorts it by the key within each partition.
The equivalent representations of cluster by and distribute by in the DataFrame API are as follows:
distribute by
df.repartition(2, $"key")
cluster by
df.repartition(2, $"key").sortWithinPartitions($"key")
Both involve a shuffle; cluster by just adds an extra sort within each partition.
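If you want to confirm this yourself, comparing the physical plans is a simple check; the table and column names below are placeholders:
// distribute by: expect only an Exchange (hashpartitioning on key) in the plan
spark.sql("SELECT * FROM events DISTRIBUTE BY key").explain()

// cluster by: expect the same Exchange followed by a Sort within each partition
spark.sql("SELECT * FROM events CLUSTER BY key").explain()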

How to force Spark Dataframe to be split across all the worker nodes?

I want to create a small dataframe with just 10 rows. And I want to force this dataframe to be distributed to two worker nodes. My cluster has only two worker nodes. How do I do that?
Currently, whenever I create such a small dataframe, it gets persisted in only one worker node.
I know Spark is built for Big Data and this question does not make much sense. However, conceptually, I just wanted to know whether it is at all feasible to force a Spark dataframe to be split across all the worker nodes (given a very small dataframe with only 10-50 rows).
Or is it completely impossible, and do we have to rely on the Spark master for this dataframe distribution?

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We have a Hive table partitioned into 2000 partitions and stored in Parquet format. When this table is used in Spark, exactly 2000 tasks are executed among the executors. But we have a block size of 256 MB, so we were expecting (total size / 256 MB) partitions, which would certainly be much less than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Our table is actually very large, about 3 TB with 2000 partitions. 3 TB / 256 MB would come to roughly 11720, but we get exactly the same number of partitions as the table has physically. I just want to understand how the tasks are generated based on data volume.
In general, Hive partitions are not mapped 1:1 to Spark partitions. One Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.
The number of Spark partitions when you load a Hive table depends on these parameters:
spark.sql.files.maxPartitionBytes (default 128 MB)
spark.sql.files.openCostInBytes (default 4 MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical paths of your files.
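A rough sketch of that check, plus the setting that drives the split size (the table name is a placeholder):
// Number of Spark partitions (and therefore tasks in the scan stage).
println(spark.table("your_table").rdd.getNumPartitions)

// Files backing the table, to compare file count and sizes with the
// partition count you observe.
spark.table("your_table").inputFiles.take(10).foreach(println)

// The split size is tunable; a larger value yields fewer, bigger partitions
// (set it before the table is read).
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)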
That you got exactly 2000 Spark partitions from your 2000 Hive partitions seems like a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there, the number of Spark partitions resembled the number of files on the filesystem (one Spark partition per file, unless the file was very large).
I just want to understand how the tasks are generated based on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

How does Spark parallelize the processing of a 1TB file?

Imaginary problem
A gigantic CSV log file, let's say 1 TB in size; the file is located on a USB drive.
The log contains activity logs of users around the world; let's assume each line contains 50 columns, among them Country.
We want a line count per country, in descending order.
Let's assume the Spark cluster has enough nodes with RAM to process the entire 1 TB in memory (20 nodes, 4 CPU cores each, 64 GB RAM per node).
My Poorman's conceptual solution
Using SparkSQL & Databricks spark-csv
$ ./spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val dfBigLog = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/media/username/myUSBdrive/bogusBigLog1TB.log")

dfBigLog.select("Country")
  .groupBy("Country")
  .agg(count($"Country") as "CountryCount")
  .orderBy($"CountryCount".desc).show
Question 1: How does Spark parallelize the processing?
I suppose the majority of the execution time (99% ?) of the above solution is to read the 1TB file from the USB drive into the Spark cluster. Reading the file from the USB drive is not parallelizable. But after reading the entire file, what does Spark do under the hood to parallelize the processing?
How many nodes are used for creating the DataFrame? (maybe only one?)
How many nodes are used for groupBy & count? Let's assume there are 100+ countries (but Spark doesn't know that yet). How would Spark partition the data to distribute the 100+ country values across the 20 nodes?
Question 2: How to make the Spark application the fastest possible?
I suppose the area of improvement would be to parallelize the reading of the 1TB file.
Convert the CSV File into a Parquet file format + using Snappy compression. Let's assume this can be done in advance.
Copy the Parquet file to HDFS. Let's assume the Spark cluster is within the same Hadoop cluster and the datanodes are independent from the 20-node Spark cluster.
Change the Spark application to read from HDFS. I suppose Spark would now use several nodes to read the file as Parquet is splittable.
Let's assume the Parquet file compressed by Snappy is 10x smaller, size = 100GB, HDFS block = 128 MB in size. Total 782 HDFS blocks.
But then how does Spark manage to use all the 20 nodes for both creating the DataFrame and the processing (groupBy and count)? Does Spark use all the nodes each time?
Question 1: How does Spark parallelize the processing (of reading a file from a USB drive)?
This scenario is not possible.
Spark relies on a Hadoop-compliant filesystem to read a file. When you mount the USB drive, you can only access it from the local host. Attempting to execute
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
will fail in a cluster configuration, as executors in the cluster will not have access to that local path.
It would be possible to read the file with Spark in local mode (master=local[*]), in which case you will only have one host, and hence the rest of the questions would not apply.
Question 2: How to make the Spark application the fastest possible?
Divide and conquer.
The strategy outlined in the question is good. Using Parquet will allow Spark to do a projection on the data and read only the Country column, further reducing the amount of data that needs to be ingested and hence speeding things up.
The cornerstone of parallelism in Spark is partitions. Again, as we are reading from a file, Spark relies on the Hadoop filesystem. When reading from HDFS, the partitioning will be dictated by the splits of the file on HDFS. Those splits will be evenly distributed among the executors. That is how Spark will initially distribute the work across all available executors for the job.
I'm not deeply familiar with the Catalyst optimizations, but I think I can assume that .groupBy("Country").agg(count($"Country")) will become something similar to: rdd.map(country => (country, 1)).reduceByKey(_ + _)
The map operation will not affect partitioning, so it can be applied in place.
The reduceByKey will first be applied locally within each partition (a map-side combine); the partial sums are then merged across the cluster after a shuffle, and only the small final result (one row per country) comes back to the driver when show is called. So most of the counting happens distributed in the cluster, and only the final collection is centralized.
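To make that shape concrete, here is a tiny self-contained version of the RDD-level equivalent, using a small in-memory sample instead of the 1 TB file:
// Toy stand-in for the Country column of the big log.
val countries = sc.parallelize(Seq("FR", "US", "FR", "DE", "US", "FR"), 3)

// map tags each country with a 1 and leaves partitioning untouched;
// reduceByKey first sums within each partition (map-side combine), then
// merges the per-partition sums across the cluster after the shuffle.
val counts = countries
  .map(country => (country, 1))
  .reduceByKey(_ + _)
  .sortBy { case (_, n) => -n }

counts.collect().foreach { case (country, n) => println(s"$country: $n") }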
Reading the file from the USB drive is not parallelizable.
Whether it is a USB drive or any other data source, the same rules apply: either the source is accessible from the driver and all worker machines, and the data is accessed in parallel (up to the source's limits), or the data is not accessed at all and you get an exception.
How many nodes are used for creating the DataFrame? (maybe only one?)
Assuming that the file is accessible from all machines, it depends on the configuration. For starters, you should take a look at the split size.
How many nodes are used for the GroupBy & Count?
Once again, it depends on the configuration.
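Concretely, for the groupBy/count stage the relevant setting is spark.sql.shuffle.partitions; a rough sketch of inspecting and changing it in the setup from the question:
// Default is 200: the aggregation stage behind groupBy/count runs that many
// tasks, which the scheduler spreads over the available executors.
println(sqlContext.getConf("spark.sql.shuffle.partitions", "200"))

// Lowering (or raising) it changes how many tasks that stage gets.
sqlContext.setConf("spark.sql.shuffle.partitions", "80")

dfBigLog
  .groupBy("Country")
  .count()
  .explain() // the Exchange in the plan now targets 80 partitions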

Spark streaming RDD partitions

In Spark Streaming, is it possible to assign specific RDD partitions to specific nodes in the cluster (for data locality)?
For example, I get a stream of events [a,a,a,b,b,b] and have a 2 node Spark cluster.
I want all a's to always go to Node 1 and all b's to always go to Node 2.
Thanks!
This is possible by specifying a custom partitioner for your RDD. The built-in RangePartitioner will partition your RDD based on ranges of the key, but you can implement any partitioning logic you want with a custom partitioner. It is generally useful/important for partitions to be relatively balanced, and depending on your input data, doing something like this could cause problems (e.g. stragglers), so be careful.
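As a rough sketch of such a custom partitioner (the names are illustrative, not from the question), the following routes all "a" keys to partition 0 and all "b" keys to partition 1. Note it pins keys to partitions; where each partition actually runs is still up to the scheduler.
import org.apache.spark.Partitioner

// Send "a" keys to partition 0, "b" keys to partition 1, everything else to 0.
class AbPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key match {
    case "a" => 0
    case "b" => 1
    case _   => 0
  }
}

// Usage on a keyed RDD; inside a DStream you would apply the same thing
// via transform(rdd => rdd.partitionBy(new AbPartitioner)).
val events = sc.parallelize(Seq("a", "a", "b", "b", "a")).map(e => (e, 1))
val routed = events.partitionBy(new AbPartitioner)

routed.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i -> ${part.mkString(", ")}")
}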
