databricks overwriting entire table instead of adding new partition - apache-spark

I have this table
CREATE TABLE `db`.`customer_history` (
`name` STRING,
`addrress` STRING,
`filename` STRING,
`dt` DATE)
USING delta
PARTITIONED BY (dt)
When I use this to load one partition's data into the table:
df
.write
.partitionBy("dt")
.mode("overwrite")
.format("delta")
.saveAsTable("db.customer_history")
For some reason, it overwrites the entire table. I thought the overwrite mode only overwrites the partition data (if it exists). Is my understanding correct?

By default, .mode("overwrite") replaces the entire table. Delta makes it easy to update only certain disk partitions with the replaceWhere option: you can selectively overwrite just the data that matches predicates over the partition columns, like this:
dataset.repartition(1).write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy('Year', 'Week') \
    .option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'") \
    .save("/curated/dataset")  # the predicate keeps Week 3 from being overwritten
Note: replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions.
For more details, see the Delta Lake documentation on replaceWhere.

In order to overwrite a single partition, use:
df
.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "dt >= '2021-01-01'")
.save("data_path")

Related

How to update a Cassandra table with the latest row, when a Spark DataFrame has multiple rows with the same primary key?

We have a Cassandra table, person:
CREATE TABLE test.person (
name text PRIMARY KEY,
score bigint
)
and the DataFrame is:
val caseClassDF = Seq(
  Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),
  Person("Andy1", 20), Person("Ron", 270), Person("Ron", 2700),
  Person("Mark1", 37), Person("Andy1", 200), Person("Andy1", 2000)
).toDF()
In Spark, we want to save the DataFrame to this table, where the DataFrame has multiple records for the same primary key.
Q1: How does the Cassandra Connector internally handle the ordering of the rows?
Q2: We are reading data from Kafka and saving to Cassandra, and our batch will always have multiple events like the above. We want to save the latest score to Cassandra. Any suggestion on how we can achieve this?
The connector version we used is spark-cassandra-connector_2.12:3.2.1.
Here are some observations from our side:
val spark = SparkSession.builder()
.master("local[1]")
.appName("CassandraConnector")
.config("spark.cassandra.connection.host", "")
.config("spark.cassandra.connection.port", "")
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.getOrCreate()
val caseClassDF = Seq(
  Person("Andy1", 32), Person("Mark1", 27), Person("Ron", 27),
  Person("Andy1", 20), Person("Ron", 270), Person("Ron", 2700),
  Person("Mark1", 37), Person("Andy1", 200), Person("Andy1", 2000)
).toDF()
caseClassDF.write
.format("org.apache.spark.sql.cassandra")
.option("keyspace", "test")
.option("table", "person")
.mode("APPEND")
.save()
When we have
.master("local[1]")
then in the Cassandra table we always see score 2000 for "Andy1" and 2700 for "Ron", which are the latest values in the Seq.
Now when we change to
.master("local[*]") OR .master("local[2]")
then we see some random score in the Cassandra table, either 200 or 32 for "Andy1".
Note: We did each run on a fresh table, so it is always inserts and updates in one batch.
Data in a DataFrame is by definition unordered, and writing into Cassandra will reflect this (inserts and updates are the same thing in Cassandra): rows are written in an arbitrary order, and the last write wins.
If you want to write only the latest value (the one with the max score?), you will need to perform an aggregation over your data and use the update output mode to write the data to Cassandra (to write intermediate results of your streaming aggregation). Something like this:
caseClassDF.groupBy("name").agg(max("score")).write....

How to process a large delta table with UDF?

I have a Delta table with about 300 billion rows. I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
def my_udf(data):
    return None  # placeholder transformation
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a long time, as Spark will process all 300 billion rows and only then write the output. Is there a way we can do some micro-batching and write the output of those batches regularly to the output Delta table?
The first rule is usually to avoid UDFs as much as possible: what kind of transformation do you need to perform that isn't available in Spark itself?
The second rule: if you can't avoid using a UDF, at least use Pandas UDFs, which process data in batches and don't have as much serialization/deserialization overhead; regular UDFs handle data row by row, encoding and decoding the data for every single row.
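As a rough illustration (the transformation body is a placeholder, not the questioner's logic), a scalar Pandas UDF replacing the plain UDF above could look like this:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    # receives a whole batch as a pandas Series instead of one row at a time
    return values.astype(str)  # placeholder transformation

data = data.withColumn('new_column', my_pandas_udf(data.value))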
If your table was built up over time and consists of many files, you can also try Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or 10.4), something like this:
maxNumFiles = 10 # max number of parquet files processed at once
df = spark.readStream \
.option("maxFilesPerTrigger", maxNumFiles) \
.table("large_table")
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
.option("checkpointLocation", "/some/path") \
.trigger(availableNow=True) \
.toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get the updated/inserted rows after an upsert using merge into a Delta table in a Spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")
def upsertToDelta(events: DataFrame, batchId: Long) {
deltaTable.as("table")
.merge(
events.as("event"),
"event.entityId == table.entityId")
.whenMatched()
.updateExpr(...)
.whenNotMatched()
.insertAll()
.execute()
}
df
.writeStream
.format("delta")
.foreachBatch(upsertToDelta _)
.outputMode("update")
.start()
I know I can create another job to read updates from the Delta table, but is it possible to do it within the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table and then have another stream or batch job fetch the changes, so you'll be able to receive information on which rows were changed/deleted/inserted. It can be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
If the table isn't registered, you can use the path instead of the table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading a stream from the table:
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("table_name")
It will add three columns describing the change (_change_type, _commit_version and _commit_timestamp); the most important is _change_type (note that there are two different change types for the update operation).
If you're worried about having another stream, it's not a problem: you can run multiple streams inside the same job. You just shouldn't use .awaitTermination on a single query, but rather something like spark.streams.awaitAnyTermination() to wait on multiple streams.
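For example, a rough PySpark-style sketch of running both streams in one job (checkpoint locations and the changes-sink path are placeholders, and upsertToDelta stands for a Python equivalent of the function in the question):
# main stream: upsert into the Delta table via foreachBatch, as in the question
upsertQuery = df.writeStream \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/checkpoints/upsert") \
    .start()

# second stream: consume the Change Data Feed of the same table
changesQuery = spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("table_name") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/cdf") \
    .start("/tmp/changes_sink")

# wait on both queries instead of calling awaitTermination on a single one
spark.streams.awaitAnyTermination()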
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

orderBy is not giving correct results in Spark SQL

I have a dataset of around 60 columns and 3000 rows.
I am using orderBy to sort the rows in the dataset and write them to a file, but it's not giving correct results as expected.
dataset.orderBy(new Column(col_name).desc())
.coalesce(4)
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
Please let me know what I am missing here.
I also found the solution below, but I don't think it is the correct one:
Row[] rows = dataset.take(3000);
for ( Row row : rows){
// here I am writing to a file row by row
System.out.println(row);
}
The problem is that coalesce will merge your existing partitions in an unsorted way (and no, coalesce will not cause a shuffle).
If you want 4 files and sorting within the files, you need to change spark.sql.shuffle.partitions before the orderBy; this will cause your shuffle to produce 4 partitions.
spark.sql("set spark.sql.shuffle.partitions=4")
dataset.orderBy(new Column(col_name).desc())
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
If you only care about the sorting within the files, you could also use sortWithinPartitions(new Column(col_name).desc()).
Because your .coalesce(4) disrupts the ordering of your DataFrame, coalesce first and then sort:
dataset
.coalesce(4)
.orderBy(new Column(col_name).desc())
.write()
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "false")
.mode(SaveMode.Overwrite)
.save("hdfs://" + filePath);
You should also set spark.sql.shuffle.partitions to 4 in your Spark context, because orderBy also causes a shuffle.
As per your clarification in the comments, you need your ordered output to be contained in a single file.
With Spark alone, that's possible only with spark.sql("set spark.sql.shuffle.partitions=1") followed by orderBy and write. The drawback is that it won't scale for big data, as it will not be parallelized.
A workaround is:
Make Spark do the orderBy with maximum parallelism (i.e. don't coalesce or "set spark.sql.shuffle.partitions=1") and produce n files.
Add some extra logic to your file-merging code:
List all files, fetch the value of col_name from each, and maintain a map of [(col_name value), filepath].
Sort the map by key (the value of col_name).
Then perform your merge (see the sketch after this answer).
This will maintain your ordering.
The idea is that the merging part will be mostly single-threaded, but at least the sorting is done in a distributed way :)
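A rough sketch of such a merge script (assuming the part files are header-less CSVs copied to a local path, col_name is the first field of each row, and every name/path here is a placeholder):
import csv
import glob

# 1. list all part files and fetch the value of col_name (the first field) from each
keyed_files = []
for path in glob.glob("/local/copy/of/output/part-*.csv"):
    with open(path, newline="") as f:
        first_row = next(csv.reader(f), None)
    if first_row:
        keyed_files.append((first_row[0], path))

# 2. sort by the col_name value, descending to match .desc() in the Spark job
#    (note: this compares values as strings; adjust the key if col_name is numeric)
keyed_files.sort(key=lambda kv: kv[0], reverse=True)

# 3. concatenate the files in that order; this preserves the global ordering
#    because each file produced by orderBy covers a disjoint, already-sorted range
with open("/local/copy/of/output/merged.csv", "w") as out:
    for _, path in keyed_files:
        with open(path) as f:
            out.write(f.read())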

How do I output bucketed parquet files in spark?

Background
I have 8k parquet files representing a table that I want to bucket by a particular column, creating a new set of 8k parquet files. I want to do this so that joins from other data sets on the bucketed column won't require re-shuffling. The documentation I'm working off of is here:
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#bucketing-sorting-and-partitioning
Question
What's the easiest way to output parquet files that are bucketed? I want to do something like this:
df.write()
.bucketBy(8000, "myBucketCol")
.sortBy("myBucketCol")
.format("parquet")
.save("path/to/outputDir");
But according to the documentation linked above:
Bucketing and sorting are applicable only to persistent tables
I'm guessing I need to use saveAsTable as opposed to save. However, saveAsTable doesn't take a path. Do I need to create a table prior to calling saveAsTable? Is it in that table creation statement that I declare where the parquet files should be written? If so, how do I do that?
spark.sql("drop table if exists myTable");
spark.sql("create table myTable ("
+ "myBucketCol string, otherCol string ) "
+ "using parquet location '" + outputPath + "' "
+ "clustered by (myBucketCol) sorted by (myBucketCol) into 8000 buckets"
);
enlDf.write()
.bucketBy(8000, "myBucketCol")
.sortBy("myBucketCol")
.format("parquet")
.mode(SaveMode.Append)
.saveAsTable("myTable");
You can use the path option:
df.write()
.bucketBy(8000, "myBucketCol")
.sortBy("myBucketCol")
.format("parquet")
.option("path", "path/to/outputDir")
.saveAsTable("whatever")
