Spark join: grouping of records having same value for a particular column in the same partition - apache-spark

We have 2 Hive tables which are read in Spark and joined on a join key, let's call it user_id.
Then we write this joined dataset to S3 and register it in Hive as a 3rd table, so that subsequent tasks can use the joined dataset.
One of the other columns in the joined dataset is called keychain_id.
We want to group all the user records belonging to the same keychain_id in the same partition, in order to avoid shuffles later.
So, can I do a repartition("keychain_id") before writing to S3 and registering it in Hive, and when I read the same data back from this third table, will it still have the same partition grouping (all users belonging to the same keychain_id in the same partition)? We are trying to avoid doing a repartition("keychain_id") every time we read from this 3rd table.
Can you please clarify? If there is no guarantee that it will retain the same partition grouping while reading, is there another efficient way this can be done, other than caching?

If there is no data skew in keychain_id (skew would lead to differently sized partition files), you can write with partitionBy:
df.write\
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
Update:
In order to retain the grouping of user records having the same keychain_id in the same DataFrame partition, you could repartition beforehand, on the number of unique ids and/or the column:
from pyspark.sql import functions as F
n = df.select(F.col('keychain_id')).distinct().count()
df.repartition(n, F.col("keychain_id"))\
.write \
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
or
df.repartition(n)\
.write \
.partitionBy("keychain_id")\
.mode("overwrite")\
.format("parquet")\
.saveAsTable("testing")
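One way to verify whether the grouping survives the round trip is to tag each row with its partition id after reading the table back. A minimal sketch, assuming the testing table written above and an active spark session:
from pyspark.sql import functions as F

# Read the table written above and count how many DataFrame partitions
# each keychain_id ends up in after the scan; 1 means the grouping held.
readback = spark.read.table("testing")
(readback
    .withColumn("part_id", F.spark_partition_id())
    .groupBy("keychain_id")
    .agg(F.countDistinct("part_id").alias("partitions_per_key"))
    .orderBy(F.desc("partitions_per_key"))
    .show())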

Related

Does using multiple columns in partitioning a Spark DataFrame make reads slower?

I wonder whether using multiple partition columns when writing a Spark DataFrame makes future reads slower.
I know that partitioning by critical columns used in future filters improves read performance, but what would be the effect of having multiple partition columns, even ones not usable for filtering?
A sample would be:
(ordersDF
.write
.format("parquet")
.mode("overwrite")
.partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
.save("/storage/Orders_parquet"))
Yes, because Spark has to shuffle and sort the data to create that many partitions, since there will be many combinations of the partition keys.
For example:
suppose CustomerId has 10 unique values
suppose OrderDate has 10 unique values
suppose Order has 10 unique values
The number of partitions will be 10 * 10 * 10.
In this small scenario, 1000 partition directories need to be created.
So there is a lot of shuffling and sorting, which means more time.
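If in doubt, a cheap sanity check before writing is to count the distinct combinations of the intended partition columns. A minimal sketch, assuming the ordersDF and the two named columns from the sample above (any further partition columns would be added to the select):
# One output directory is created per distinct combination of the
# partition columns, so this approximates the number of directories.
n_dirs = (ordersDF
    .select("CustomerId", "OrderDate")
    .distinct()
    .count())
print(f"partitionBy would create about {n_dirs} directories")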

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed since the actual files reside in S3, and even if I drop a Hive table the partitions remain the same.
Is there any way to change the partitioning of an existing Delta table? Or is the only solution to drop the actual data and reload it with a newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partitioning, pass them all to partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")
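If several partition columns are needed, a hypothetical variant of the helper could simply forward them, since partitionBy accepts a variable number of column names:
# Hypothetical multi-column variant of the helper above.
def change_partitions_of(table_name, *columns):
    df = spark.read.table(table_name)
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .partitionBy(*columns)
        .saveAsTable(table_name))

change_partitions_of("i.love_python", "column_a", "column_b")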

Remove Duplicates without shuffle Spark

I have a Cassandra table XYX with columns (
id uuid,
insert timestamp,
header text)
where id and insert form the composite primary key.
I'm using DataFrames, and in my spark shell I'm fetching the id and header columns.
I want to have distinct rows based on the id and header columns.
I'm seeing a lot of shuffles, which should not be the case since the Spark Cassandra connector ensures that all rows for a given Cassandra partition are in the same Spark partition.
After fetching I'm using dropDuplicates to get distinct records.
The Spark DataFrame API does not support custom partitioners yet, so the connector cannot introduce the C* partitioner to the DataFrame engine.
The RDD API, on the other hand, does support custom partitioners, so you could load your data into an RDD and then convert it to a DataFrame.
Here is the connector doc about C* partitioner usage: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
The keyBy() function allows you to define the key columns to use for grouping.
Here is a working example. It is not short, so I expect someone could improve it:
// load data into an RDD and define a group key
val rdd = sc.cassandraTable[(String, String)]("test", "test")
  .select("id" as "_1", "header" as "_2")
  .keyBy[Tuple1[Int]]("id")
// check that the partitioner is CassandraPartitioner
rdd.partitioner
// call distinct for each group, flatten it, and get a two-column DF
val df = rdd.groupByKey.flatMap { case (key, group) => group.toSeq.distinct }
  .toDF("id", "header")

Join Spark dataframe with Cassandra table [duplicate]

Dataframe A (millions of records): among its columns are create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
Select a.*, b.* from a join b on a.create_date between b.start_date and b.end_date
The above job takes half an hour or more to run.
How can I improve the performance?
DataFrames currently don't have an approach for direct joins like that; Spark will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others suggested, one approach is to broadcast the smaller dataframe. This can also be done automatically by configuring the parameter below.
spark.sql.autoBroadcastJoinThreshold
If the dataframe size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe instead of shuffling both sides of the join.
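For reference, a minimal PySpark sketch of both options, assuming dataframes a and b named as in the question (the 50 MB threshold is just an illustrative value):
from pyspark.sql import functions as F

# Option 1: raise the automatic broadcast threshold (in bytes); Spark then
# broadcasts b on its own if its estimated size stays below this value.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Option 2: force the broadcast explicitly with a hint.
joined = a.join(
    F.broadcast(b),
    (a["create_date"] >= b["start_date"]) & (a["create_date"] <= b["end_date"]))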

Spark-Hive partitioning

The Hive table was created with 4 buckets:
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen into the Hive table, it ends up with 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4, as that leads to a very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables using spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the CREATE TABLE statement you specified. Assuming that it does exist and you know the clustering column, you could add
.bucketBy(numBuckets, colName)
while using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
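A rough sketch of what that could look like with the DataFrameWriter API. Note that bucketBy only works with saveAsTable (not insertInto), and the choice of "cells" as the bucketing column is just an assumption, since the question's CREATE TABLE shows no CLUSTERED BY clause:
# Sketch only: "cells" as the bucketing/sorting column is an assumption.
(hourlies.write
    .partitionBy("traffic_date_hour")
    .bucketBy(4, "cells")
    .sortBy("cells")
    .format("orc")
    .mode("append")
    .saveAsTable("hourly_suspect"))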
