What is the exact difference between the Spark DataFrame writer's partitionBy and a Delta/Hive CREATE TABLE ... PARTITIONED BY? Which one will be faster, and why?
We have two Hive tables that are read in Spark and joined on a join key, let's call it user_id.
Then we write this joined dataset to S3 and register it in Hive as a third table, so that subsequent tasks can use the joined dataset.
One of the other columns in the joined dataset is called keychain_id.
We want to group all the user records belonging to the same keychain_id in the same partition, to avoid shuffles later.
So, can I do a repartition("keychain_id") before writing to S3 and registering it in Hive, and when I read the same data back from this third table, will it still have the same partition grouping (all users belonging to the same keychain_id in the same partition)? I am trying to avoid doing a repartition("keychain_id") every time I read from this third table.
Can you please clarify? If there is no guarantee that the same partition grouping is retained on read, is there another efficient way to do this, other than caching?
If there is no data skew in keychain_id (skew would lead to uneven partition file sizes), you can write with partitionBy:
df.write\
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
Update:
To "retain the grouping of user records having the same keychain_id in the same DataFrame partition",
you could repartition before writing, either on the column or into as many partitions as there are unique IDs:
from pyspark.sql import functions as F
n = df.select(F.col('keychain_id')).distinct().count()
df.repartition(n, F.col("keychain_id"))\
.write \
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
or
df.repartition(n)\
.write \
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
I am using repartition on columns before storing the data in Parquet, but I see that the number of Parquet part files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
When I repartition the RDD, write the data to Parquet partitions, and then read the data back from those Parquet partitions, is there any condition under which the number of RDD partitions will be the same during read and write?
How is bucketing a DataFrame using a column id different from repartitioning a DataFrame via the same column id?
When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
You're asking about a couple of things here: partitioning, bucketing, and balancing of data.
Partitioning:
Partitioning data is often used to distribute load horizontally; it has performance benefits and helps organize data in a logical fashion.
Partitioning a table changes how the persisted data is structured: subdirectories are created that reflect the partitioning scheme.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done by df.write.partitionBy(column*), which groups data by the partitioning columns into the same subdirectory.
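As an illustration of the subdirectory structure described above, here is a minimal pure-Python sketch (not Spark code). The function partition_layout, the table root "testing" and the sample rows are made up for this example; it only mimics the Hive-style column=value directory naming.

```python
from collections import defaultdict


def partition_layout(rows, column, table_root="testing"):
    """Group rows into Hive-style `column=value` subdirectory paths."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{table_root}/{column}={row[column]}"].append(row)
    return dict(layout)


# Hypothetical sample data.
rows = [
    {"user_id": 1, "keychain_id": "a"},
    {"user_id": 2, "keychain_id": "b"},
    {"user_id": 3, "keychain_id": "a"},
]

layout = partition_layout(rows, "keychain_id")
for path in sorted(layout):
    print(path, [r["user_id"] for r in layout[path]])
```

A filter on keychain_id only needs to touch the matching subdirectory, which is why the scheme helps only when it reflects common filters.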
Bucketing:
Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data set is hashed into a user-defined number of buckets (files).
Similar in spirit to Hive's Distribute By.
In Spark, this is done by df.write.bucketBy(n, column*), which groups data by the bucketing columns into the same bucket. The number of buckets is controlled by n; note that each write task can emit its own file per bucket, so the total number of files may exceed n.
Repartition:
It returns a new DataFrame partitioned by the given partitioning expressions into the given number of in-memory partitions. The resulting DataFrame is hash partitioned, so the partitions are not guaranteed to be equal in size.
Spark manages data in these partitions, which helps parallelize distributed data processing with minimal network traffic between executors.
In Spark, this is done by df.repartition(n, column*), which groups data by the partitioning columns into the same in-memory partition. Note that no data is persisted to storage; this is just an internal redistribution of data, based on constraints similar to those of bucketBy.
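The idea of hash partitioning can be sketched in plain Python. This is only a toy model: Spark uses its own hash function, and zlib.crc32 is a stand-in chosen here because it is deterministic. The helper hash_partition and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, n):
    """Assign each row to partition hash(key) % n (crc32 is a stand-in hash)."""
    parts = defaultdict(list)
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts


# Hypothetical sample data: 10 users spread over 5 keychains.
rows = [{"user_id": i, "keychain_id": f"k{i % 5}"} for i in range(10)]
parts = hash_partition(rows, "keychain_id", n=3)

for idx in sorted(parts):
    print(idx, sorted({r["keychain_id"] for r in parts[idx]}))
```

All rows with the same keychain_id always land in the same partition, but several keys can share one partition and some partitions can stay empty, so the result is grouped by key rather than evenly balanced.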
Tl;dr
1) I am using repartition on columns before storing the data in Parquet, but I see that the number of Parquet part files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
repartition correlates with bucketBy, not with partitionBy. With partitionBy, each write task emits one file per partition value it holds, so the number of files depends on how the rows are spread over the in-memory partitions, which in turn is affected by configs like spark.sql.shuffle.partitions and spark.default.parallelism.
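To see why the file count and the in-memory partition count can differ, here is a toy model in plain Python. It assumes that each write task emits one file per distinct partitionBy value it holds, and it ignores limits such as spark.sql.files.maxRecordsPerFile; count_output_files and the sample partitions are made up for the illustration.

```python
def count_output_files(partitions, column):
    """partitions: list of in-memory partitions, each a list of row dicts.

    Assumption: one output file per (in-memory partition, partition value) pair.
    """
    return sum(len({row[column] for row in part}) for part in partitions)


# Four in-memory partitions, each holding rows for all three keys.
scattered = [[{"keychain_id": k} for k in ("a", "b", "c")] for _ in range(4)]

# The same twelve rows grouped by key: one in-memory partition per key.
grouped = [[{"keychain_id": k}] * 4 for k in ("a", "b", "c")]

print(count_output_files(scattered, "keychain_id"))  # 12
print(count_output_files(grouped, "keychain_id"))    # 3
```

Under this model, repartitioning on the partitionBy column before the write reduces the number of files per subdirectory, because each key then lives in a single write task.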
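2) When I repartition the RDD, write the data to Parquet partitions, and then read the data back from those Parquet partitions, is there any condition under which the number of RDD partitions will be the same during read and write?
At read time, the number of partitions is not tied to the number used at write time. For file-based sources it is derived from the number and size of the files and from settings such as spark.sql.files.maxPartitionBytes and spark.default.parallelism.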
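For file-based DataFrame sources, the read-side partition count also depends on file sizes. The sketch below is a rough pure-Python approximation of how recent Spark versions appear to size file splits; it is not Spark code. The defaults mirror spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB); the function names and the min_partitions parameter (a stand-in for the default parallelism) are assumptions made for this example.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, min_partitions=8):
    # Every file is charged an extra "open cost" before averaging per core.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // min_partitions
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def estimate_read_partitions(file_sizes, max_partition_bytes=128 * MB,
                             open_cost=4 * MB, min_partitions=8):
    split = max_split_bytes(file_sizes, max_partition_bytes,
                            open_cost, min_partitions)
    # Cut each file into chunks of at most `split` bytes.
    chunks = []
    for size in file_sizes:
        for offset in range(0, size, split):
            chunks.append(min(split, size - offset))
    # Pack chunks, largest first; close a partition when the next chunk
    # would push it past `split`.
    chunks.sort(reverse=True)
    partitions, current = 0, 0
    for chunk in chunks:
        if current > 0 and current + chunk > split:
            partitions += 1
            current = 0
        current += chunk + open_cost
    return partitions + (1 if current > 0 else 0)


print(estimate_read_partitions([1024 * MB] * 10))  # 80
print(estimate_read_partitions([1 * MB] * 200))    # 8
```

In this model, ten 1 GB files are read as 80 partitions while two hundred 1 MB files are read as 8, so the same data rewritten with a different file layout can come back with a different partition count.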
3) How is bucketing a dataframe using a column id and repartitioning a dataframe via the same column id different?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) While considering the performance of joins in Spark should we be looking at bucketing or repartitioning (or maybe both)
repartition applies when both datasets are in memory; if one or both of the datasets are persisted, then look into bucketBy as well.
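Why bucketing can help joins may be easier to see with a toy model in plain Python: if both sides are bucketed on the join key with the same number of buckets, matching keys always share a bucket number, so each bucket pair can be joined on its own. zlib.crc32 stands in for Spark's hash function, and bucket_id, bucketize, bucketed_join and the sample tables are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_id(key, n):
    """Stand-in for the bucket hash: deterministic hash of the key modulo n."""
    return zlib.crc32(str(key).encode()) % n


def bucketize(rows, key, n):
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key], n)].append(row)
    return buckets


def bucketed_join(left, right, key, n):
    """Join bucket i of `left` only against bucket i of `right`."""
    lb, rb = bucketize(left, key, n), bucketize(right, key, n)
    out = []
    for b in range(n):
        for l_row in lb.get(b, []):
            for r_row in rb.get(b, []):
                if l_row[key] == r_row[key]:
                    out.append({**l_row, **r_row})
    return out


# Hypothetical sample tables sharing the join key user_id.
users = [{"user_id": i, "name": f"user{i}"} for i in range(6)]
orders = [{"user_id": i % 4, "amount": 10 * i} for i in range(8)]

joined = bucketed_join(users, orders, "user_id", n=4)
print(len(joined))
```

The result matches a plain nested-loop join over the full tables, but no row ever has to be compared with a row from a different bucket, which is the property that lets a persisted bucketed layout avoid a shuffle at join time.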
How can I calculate the trade-off between the number of partitions and the size of a DataFrame in Spark with spark.conf.set configuration?
After reading up on this answer, I know that the number of partitions when reading data from Hive is decided by the HDFS block size.
But I have run into a problem: I use Spark SQL to read a Hive table and save the data to a new Hive table, but the two Hive tables have different partition numbers when loaded by Spark SQL.
val data = spark.sql("select * from src_table")
val partitionsNum = data.rdd.getNumPartitions
println(partitionsNum)
val newData = data
newData.write.mode("overwrite").format("parquet").saveAsTable("new_table")
I don't understand why the same data has different partition numbers.
I have a Cassandra table XYX with columns (
id uuid,
insert timestamp,
header text)
where id and insert form a composite primary key.
I'm using DataFrames, and in my Spark shell I'm fetching the id and header columns.
I want to get distinct rows based on the id and header columns.
I'm seeing a lot of shuffles, which should not be the case since the Spark Cassandra connector ensures that all rows for a given Cassandra partition are in the same Spark partition.
After fetching, I'm using dropDuplicates to get distinct records.
The Spark DataFrame API does not support custom partitioners yet, so the connector cannot introduce the C* partitioner to the DataFrame engine.
The Spark RDD API, on the other hand, does support custom partitioners. Thus you could load your data into an RDD and then convert it to a DataFrame.
Here is a Connector doc about C* partitioner usage: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
The keyBy() function allows you to define the key columns to use for grouping.
Here is a working example. It is not short, so I expect someone could improve it:
import com.datastax.spark.connector._
import spark.implicits._ // already in scope in spark-shell; needed for toDF elsewhere

// load data into an RDD and define a group key
// (Tuple1[Int] matches an int key in this test table; adjust the type to your key column)
val rdd = sc.cassandraTable[(String, String)] ("test", "test")
.select("id" as "_1", "header" as "_2")
.keyBy[Tuple1[Int]]("id")
// check that the partitioner is a CassandraPartitioner
rdd.partitioner
// call distinct within each group, flatten, and get a two-column DataFrame
val df = rdd.groupByKey.flatMap {case (key,group) => group.toSeq.distinct}
.toDF("id", "header")