delta live tables dump final gold table to cassandra - databricks

We have a Delta Live Tables pipeline that reads from a Kafka topic, cleans/filters/processes/aggregates the messages, and writes them to bronze/silver/gold tables. To build a REST service that retrieves the aggregated result, we need to copy the data from the gold table into a Cassandra table. I tried to update the script for the gold table: after the aggregated result is written to gold, I added one more step to push the updated result to the Cassandra table, but it didn't work:
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table
def test_live_gold():
    return (
        dlt.read("test_kafka_silver").groupBy("user_id", "event_type").count()
        # The extra step I tried after the aggregation (commented out since it didn't work):
        # df = (spark.read.format("delta")
        #       .table("customer.test_live_gold")
        #       .withColumnRenamed("user_id", "account_id")
        #       .withColumnRenamed("event_type", "event_name")
        #       .withColumn("last_updated_dt", current_timestamp()))
        # df.show(5, False)
        # write_to_cassandra_table('customer', 'test_keyspace', df)
    )
How can I copy the result from the Delta table to Cassandra in the same workflow as the Delta Live Tables pipeline?

By default, Delta Live Tables only stores data as Delta. If you need to write the data somewhere else, you need to add another task to your job (Databricks workflow) that uses a notebook to read the data from the gold table produced by test_live_gold and write it into Cassandra. Something like this:
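(A rough sketch of that follow-up notebook task, assuming the Spark Cassandra connector is installed on the cluster; the keyspace and target table names are guesses based on the question's write_to_cassandra_table call and may need adjusting.)
from pyspark.sql.functions import current_timestamp

# Read the gold table produced by the DLT pipeline (spark is the notebook's SparkSession)
df = (spark.read.table("customer.test_live_gold")
      .withColumnRenamed("user_id", "account_id")
      .withColumnRenamed("event_type", "event_name")
      .withColumn("last_updated_dt", current_timestamp()))

# Write it to Cassandra through the Spark Cassandra connector
(df.write
   .format("org.apache.spark.sql.cassandra")
   .mode("append")
   .option("keyspace", "test_keyspace")   # assumed keyspace
   .option("table", "customer")           # assumed target table
   .save())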

Related

spark streaming and delta tables: java.lang.UnsupportedOperationException: Detected a data update

The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by Spark Structured Streaming.
Updates on the target delta tables are done via foreachBatch using merge.
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for
example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to
'true'. If you would like the data update to be reflected, please
restart this query with a fresh checkpoint directory.
Basically, I'm not able to read the agg1 delta table via any kind of streaming. If I switch the last stream from delta to a memory sink, I get the same error message. With the first stream I don't have any problems.
Notes:
Between the aggregations I'm changing granularity: agg1 delta table (date truncated to minutes), agg2 delta table (date truncated to days).
If I turn off all the other streams, the last one still doesn't work.
The agg2 delta table is a fresh new table with no data.
How the streaming works on the source table:
It reads the files that belong to the source table. It's not able to handle changes to those files (updates, deletes); if anything like that happens you will get the error above. In other words, update and delete operations rewrite the underlying files. The only exception is inserts: new data arrives in new files, unless configured differently.
To fix that you would need to set the option ignoreChanges to true.
With this option you will receive all the records from a rewritten file, i.e. the same records as before plus the modified ones.
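For reference, a minimal sketch of enabling that option on a Delta streaming read (the table name agg1 is just a stand-in for your aggregated table):
stream_df = (spark.readStream
             .format("delta")
             .option("ignoreChanges", "true")   # re-delivers every record from a rewritten file
             .table("agg1"))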
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If we receive the same (unmodified) record again, we will treat it as new data and increase the aggregation for its grouping key once more.
Solution: we can't stream from the agg table to build further aggregations; we need to read from the raw table instead.
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so I'm using new shuffle merge by default.


What is the purpose of registering a dataframe as a temporary view?

I am trying to understand why I would register a dataframe as a temporary view in pyspark.
Here's a dummy example
# Create spark dataframe
spark_df = spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
# Pull data using the dataframe
spark_df.selectExpr("id + 1")
# Register spark_df as a temporary view to the catalog
spark_df.createOrReplaceTempView("temp")
# Pull data using the view
spark.sql("select id + 1 from temp")
Whether I register the dataframe as a temporary view or not:
The data can only be accessed in this live spark session
I can use sql statements to query the data in both cases
Pulling data takes pretty much the same time (10K simulations, but I don't have a spark cluster yet, only my local machine).
I am failing to see the benefits of storing a dataframe as a temporary view, but I see it in every introductory class on PySpark. What am I missing? Thanks!
SQL is quite a powerful language, and many consider it beneficial in some cases: registering a temporary view is what lets you query the DataFrame through spark.sql, which can be more convenient than the DataFrame API for complex expressions or for people who already think in SQL.
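To make that concrete, here is a small sketch building on the question's example; the second DataFrame and the join are made up, but they show where the temp view starts to pay off: once several DataFrames are registered, the whole transformation can be written as one SQL statement.
# Hypothetical second DataFrame, registered alongside the question's spark_df
labels_df = spark.createDataFrame([(1, 'alpha'), (2, 'beta')], ['id', 'label'])
labels_df.createOrReplaceTempView("labels")
spark_df.createOrReplaceTempView("temp")

# The join and projection expressed as a single SQL query over the two views
result = spark.sql("""
    SELECT t.id, t.txt, l.label
    FROM temp t
    JOIN labels l ON t.id = l.id
""")
result.show()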

In Pyspark Structured Streaming, how can I discard already generated output before writing to Kafka?

I am trying to do Structured Streaming (Spark 2.4.0) on Kafka source data, where I read the latest data and perform aggregations over a 10-minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in the last 10 minutes. Let's say prod is the dataframe read from Kafka; then windowed_df is defined as:
from pyspark.sql.functions import window, col
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), "cust_id").count()
windowed_df = windowed_df_1.filter(col("count") >= 3)
Then I am joining this with a master dataframe from hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, write this dataframe to Kafka sink (or console for simplicity)
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, I want the subsequent runs to discard all the customers who were already part of my output. I don't want them in my output even if they buy a product again.
Can I write the stream output temporarily somewhere (maybe a Hive table) and do an "anti-join" for each execution?
This way I can also have a history maintained in a Hive table.
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run? The in-memory table would be lost in that case, I suppose.
Please help as I am new to Structured Streaming.
Update:
I also tried the code below to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")

final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the first action, i.e. the write to the console.
If I put the foreachBatch write first, it saves to the Hive table but does not print to the console.
I want to write to two different sinks. Please help.
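(No answer was recorded here, but one common pattern for both problems, writing a micro-batch to several sinks and dropping customers that were already emitted, is to do everything inside a single foreachBatch. A rough sketch; the history table already_emitted is hypothetical and must exist before the stream starts.)
def write_batch(batch_df, epoch_id):
    batch_df.persist()

    # Drop customers that were already emitted in earlier micro-batches
    history = spark.table("already_emitted")
    new_rows = batch_df.join(history, "cust_id", "left_anti")

    # Sink 1: console (Kafka in the real job)
    new_rows.show(truncate=False)

    # Sink 2: record the newly emitted customers so future batches skip them
    new_rows.select("cust_id").write.format("hive").mode("append").saveAsTable("already_emitted")

    batch_df.unpersist()

(final_df.writeStream
    .outputMode("update")
    .foreachBatch(write_batch)
    .trigger(processingTime='2 minutes')
    .start())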

Spark: daily read from Cassandra and save to parquets, how to read only new rows?

I am trying to build an ETL process with Spark. My goal is to read from a Cassandra table and save into parquet files.
What I've managed to do so far is read an entire table from Cassandra, using the Cassandra connector (in PySpark):
df = app.sqlSparkContext.read.format("org.apache.spark.sql.cassandra") \
    .option("table", my_table) \
    .option("keyspace", my_keyspace) \
    .load()
The issue is that my data is growing rapidly, and I would like to repeat the ETL process every day, reading only the newly added rows from Cassandra and saving them into a new parquet file.
Since there is no ordering in my Cassandra table, I won't be able to read based on time. Is there any way to do it from the Spark side instead?
Effective filtering based on time is really only possible if you have a time-based first clustering column, something like this:
create table test.test (
  pk1 <type>,
  pk2 <type>,
  cl1 timestamp,
  cl2 ...,
  primary key ((pk1, pk2), cl1, cl2));
In this case, a condition on cl1, like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("cl1 >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
will be effectively pushed down to Cassandra, and the filtering will happen server-side, retrieving only the necessary data. This is easy to check with explain; it should generate something like this (the pushed filter is denoted with *):
// *Filter ((cl1#23 >= 1552228894373000))
// +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [pk1#21,pk2#22L,cl1#23,...]
PushedFilters: [*GreaterThanOrEqual(cl1,2019-03-10 14:41:34.373)],
ReadSchema: struct<pk1:int,pk2:int,cl1:timestamp,...
In all other cases, filtering will happen on the Spark side, retrieving all the data from Cassandra.
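For the PySpark setup from the question, the same server-side filtering can be expressed roughly like this (the keyspace/table names and the timestamp are just the example values from above); running df.explain() and looking for PushedFilters shows whether the predicate actually reached Cassandra:
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .option("table", "test")
      .option("keyspace", "test")
      .load()
      .filter("cl1 >= cast('2019-03-10T14:41:34.373+0000' as timestamp)"))

df.explain()   # expect PushedFilters: [*GreaterThanOrEqual(cl1, ...)]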
