We have a Delta Live Tables pipeline that reads from a Kafka topic, cleans/filters/processes/aggregates the messages, and writes them to bronze/silver/gold tables. To build a REST service that retrieves the aggregated result, we need to copy the data from the gold table to a Cassandra table. I tried to update the script for the gold table: after the aggregated result is written to gold, I added one more step to further write the updated result to the Cassandra table, but it didn't work:
@dlt.table
def test_live_gold():
    return (
        dlt.read("test_kafka_silver").groupBy("user_id", "event_type").count()
        # df = spark.read.format("delta") \
        #     .table("customer.test_live_gold") \
        #     .withColumnRenamed("user_id", "account_id") \
        #     .withColumnRenamed("event_type", "event_name") \
        #     .withColumn("last_updated_dt", current_timestamp())
        # df.show(5, False)
        # write_to_cassandra_table('customer', 'test_keyspace', df)
    )
How can I copy the result from the Delta table to Cassandra in a single workflow, together with the Delta Live Tables pipeline?
By default, Delta Live Tables only stores data as Delta. If you need to write data somewhere else, you need to add another step to your job (Databricks workflow) that uses a notebook to read data from the gold table produced by test_live_gold and write it into Cassandra. Something like this:
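A minimal sketch of such a notebook task, assuming the Spark Cassandra connector is installed on the cluster; the Cassandra table name test_table and the connection setup are illustrative, not taken from the question:

from pyspark.sql.functions import current_timestamp

# read the gold table produced by the DLT pipeline
df = (spark.read.table("customer.test_live_gold")
      .withColumnRenamed("user_id", "account_id")
      .withColumnRenamed("event_type", "event_name")
      .withColumn("last_updated_dt", current_timestamp()))

# write it to Cassandra via the Spark Cassandra connector
(df.write
   .format("org.apache.spark.sql.cassandra")
   .mode("append")
   .option("keyspace", "test_keyspace")
   .option("table", "test_table")
   .save())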
The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
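For context, a minimal sketch of that foreachBatch + merge pattern (the table names, join key, and checkpoint path are placeholders, not from the original post):

from delta.tables import DeltaTable

def merge_into_target(batch_df, epoch_id):
    # merge each micro-batch into the target delta table
    target = DeltaTable.forName(spark, "agg1")          # placeholder target table
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.key = s.key")    # placeholder join key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").table("raw")          # placeholder source table
    .writeStream
    .foreachBatch(merge_into_target)
    .option("checkpointLocation", "/checkpoints/agg1")  # placeholder path
    .start())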
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'.
If you would like the data update to be reflected, please restart this query
with a fresh checkpoint directory.
Basically, I'm not able to read the agg1 delta table via any kind of streaming. If I switch the sink of the last stream from delta to memory, I get the same error message. With the first stream I don't have any problems.
Notes.
Between aggregations I'm changing the granularity: the agg1 delta table truncates the date to minutes, the agg2 delta table to days.
If I turn off all the other streams, the last one still doesn't work.
The agg2 delta table is a brand-new table with no data.
How the streaming works on the source table:
It reads the files that belong to the source table. It cannot handle changes to these files (updates, deletes); if anything like that happens, you will get the error above. In other words, DML operations modify the underlying files. The only exception is INSERTs: new data arrives in new files, unless configured differently.
To fix that you need to set the option ignoreChanges to true.
With this option you will get all the records from the modified file, i.e. you will receive the same records as before plus the modified one.
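A minimal sketch of the option, assuming agg1 is read back as a Delta stream (the path is a placeholder):

stream = (spark.readStream
          .format("delta")
          .option("ignoreChanges", "true")   # re-emits whole rewritten files
          .load("/mnt/delta/agg1"))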
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If we receive the same (unmodified) record again, we treat it as new input and increase the aggregation for its grouping key.
Solution: we can't read the agg table to build further aggregations on top of it. We need to read the raw table.
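A rough sketch of that approach, with placeholder table and column names (raw_events, event_time), so both aggregation levels derive from the raw stream instead of from each other:

from pyspark.sql.functions import date_trunc, col

raw = spark.readStream.format("delta").table("raw_events")

# minute-level aggregation (agg1) and day-level aggregation (agg2)
agg1 = raw.groupBy(date_trunc("minute", col("event_time")).alias("minute_ts")).count()
agg2 = raw.groupBy(date_trunc("day", col("event_time")).alias("day_ts")).count()
# each stream is then written to its own delta table via foreachBatch + merge, as before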
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so I'm using low shuffle merge by default.
We have ingested CDC data into an S3 raw layer. This CDC data is in JSON files and contains DML records (deletes, updates, etc.). We used Spark streaming with Delta Lake to de-duplicate the S3 raw-layer data and move it to a standard layer, using table partitioning on a certain column.
I have 2 questions:
Can we also use indexing in a Delta table (if supported) to index on the primary key in addition to partitioning (i.e. data within a partition indexed on the primary key)?
What visualization tool and supporting infrastructure (Spark or Presto as compute) can we use to analyze the Delta table in S3? What would be the best approach? The data volume is very high. Should we move the Delta table to an RDBMS and run visualization on top of that (although this will incur cost)?
I am trying to do Structured Streaming (Spark 2.4.0) on a Kafka source, where I read the latest data and perform aggregations over a 10 minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in the last 10 minutes. Let's say prod is the dataframe read from Kafka; then windowed_df is defined as:
# assumes: from pyspark.sql.functions import window, col
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), "cust_id").count()
windowed_df = windowed_df_1.filter(col("count") >= 3)
Then I am joining this with a master dataframe from the Hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, I write this dataframe to a Kafka sink (or to the console for simplicity):
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, I want the subsequent runs to discard all the customers who were already part of my output. I don't want them in my output even if they buy a product again.
Can I write the stream output temporarily somewhere (maybe a Hive table) and do an "anti-join" on each execution?
This way I can also have a history maintained in a Hive table.
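A minimal sketch of that anti-join idea, assuming a pre-created Hive table customer_history (a hypothetical name) that stores the already-emitted cust_ids:

def dedupe_and_emit(batch_df, epoch_id):
    history = spark.table("customer_history")
    # drop customers that were already emitted in earlier batches
    new_customers = batch_df.join(history.select("cust_id"), "cust_id", "left_anti")
    new_customers.write.format("hive").mode("append").saveAsTable("customer_history")
    new_customers.show(truncate=False)   # or write to the Kafka sink instead

(final_df.writeStream
    .outputMode("update")
    .foreachBatch(dedupe_and_emit)
    .trigger(processingTime="2 minutes")
    .start())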
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run? The in-memory table will be lost in that case, I suppose.
Please help as I am new to Structured Streaming.
Update:
I also tried the code below to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    pass
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the first action, i.e. the write to the console.
If I put the "foreachBatch" query first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.
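One common pattern (not from the original post) is to have a single foreachBatch write to both sinks within the same micro-batch:

def write_to_both_sinks(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    df.show(truncate=False)   # console-style output; replace with a Kafka write if needed
    df.unpersist()

(final_df.writeStream
    .outputMode("update")
    .foreachBatch(write_to_both_sinks)
    .trigger(processingTime="2 minutes")
    .start()
    .awaitTermination())

Alternatively, if two separate queries are started, the driver has to stay alive for both of them, e.g. via spark.streams.awaitAnyTermination() instead of awaiting only the first query.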
I am trying to build an ETL process with Spark. My goal is to read from a Cassandra table and save into Parquet files.
What I have managed to do so far is read an entire table from Cassandra, using the Cassandra connector (in pyspark):
df = app.sqlSparkContext.read.format("org.apache.spark.sql.cassandra")\
    .option("table", my_table)\
    .option("keyspace", my_keyspace)\
    .load()
The issue is that my data is growing rapidly, and I would like to repeat the ETL process every day, reading only the newly added rows from Cassandra and saving them into a new Parquet file.
Since there is no ordering in my Cassandra table, I will not be able to read based on time. Is there any way to do it from the Spark side instead?
Effective filtering based on time is only really possible if you have a time-based first clustering column, something like this:
create table test.test (
    pk1 <type>,
    pk2 <type>,
    cl1 timestamp,
    cl2 ...,
    primary key ((pk1, pk2), cl1, cl2));
In this case, a condition on cl1 like this:
import org.apache.spark.sql.cassandra._
val data = { spark.read.cassandraFormat("test", "test").load()}
val filtered = data.filter("cl1 >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
will be effectively pushed down to Cassandra, and filtering will happen server-side, retrieving only the necessary data. This is easy to check with explain: it should generate something like the output below (the pushed filter is marked with *):
// *Filter ((cl1#23 >= 1552228894373000))
// +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [pk1#21,pk2#22L,cl1#23,...]
PushedFilters: [*GreaterThanOrEqual(cl1,2019-03-10 14:41:34.373)],
ReadSchema: struct<pk1:int,pk2:int,cl1:timestamp,...
In all other cases, filtering will happen on Spark side, retrieving all data from Cassandra.
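Since the question is in pyspark, the same pushdown check can be sketched there as well (the keyspace and table names match the example above):

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .option("table", "test")
      .option("keyspace", "test")
      .load())
filtered = df.filter("cl1 >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
filtered.explain()   # look for PushedFilters: [*GreaterThanOrEqual(cl1, ...)]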