Databricks Delta Live Tables: Difference between STREAMING and INCREMENTAL

Is there a difference between CREATE STREAMING LIVE TABLE and CREATE INCREMENTAL LIVE TABLE? The documentation is inconsistent: for instance, STREAMING is used here, while INCREMENTAL is used here. I have tested both, and so far I have not noticed any difference.

There are two aspects here:
Conceptual - incremental means that only the minimal data changes are applied to the destination table; we don't recompute the full data set when new data arrives. This is how it is explained in the Getting Started book.
Syntax - CREATE INCREMENTAL LIVE TABLE was the original syntax for pipelines that process streaming data. It was deprecated in favor of CREATE STREAMING LIVE TABLE, but the old syntax is still supported for compatibility reasons.
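As an illustration only (the table names and path below are made up; cloud_files is the Auto Loader source in DLT SQL), the two forms are interchangeable:

-- Current keyword:
CREATE STREAMING LIVE TABLE events_bronze
AS SELECT * FROM cloud_files("/mnt/raw/events", "json");

-- Older, still-accepted spelling of the same thing:
CREATE INCREMENTAL LIVE TABLE events_bronze_legacy
AS SELECT * FROM cloud_files("/mnt/raw/events", "json");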

Related

Apply Changes from a delta live streaming table to another delta live streaming table

I get the below error when I try to do APPLY CHANGES from one Delta Live streaming table to another Delta Live streaming table. Is this scenario not supported?
pyspark.sql.utils.AnalysisException: rajib_db.employee_address_stream is a permanent view, which is not supported by streaming reading API such as `DataStreamReader.table` yet.
If you mean to perform APPLY CHANGES into one table and then stream from that table into another, that is not supported yet. Please reach out to someone from Databricks (a solution architect or customer success engineer) to provide feedback to the Databricks product team.
Also, if you run DESCRIBE EXTENDED on the table into which you do APPLY CHANGES, you will see that it is not a real table but a view over another table that filters out some entries.
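For context, a typical APPLY CHANGES flow looks roughly like this sketch (the table, key, and sequence column names are hypothetical):

CREATE STREAMING LIVE TABLE employee_address_silver;

APPLY CHANGES INTO live.employee_address_silver
FROM stream(live.employee_address_bronze)
KEYS (employee_id)
SEQUENCE BY updated_at;

The target produced this way is the filtered view described above, which is why it cannot currently be consumed again as a streaming source by another APPLY CHANGES or streaming table.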

Delta live tables data quality checks

I'm using Delta Live Tables in Databricks and I was trying to implement a complex data quality check (a so-called expectation) by following this guide. After I tested my implementation, I realized that even though the expectation is failing, the tables downstream of the source table are still loaded.
To illustrate what I mean, here is an image describing the situation.
Image of the pipeline lineage and the incorrect behaviour
I would assume that if the report_table fails due to the expectation not being met (in my case, it was validating for correct primary keys), then the Customer_s table would not be loaded. However, as can be seen in the photo, this is not quite what happened.
Do you have any idea how to achieve the desired result? How can I define a complex validation in SQL that would cause the downstream nodes not to be loaded (or make the pipeline fail)?
The default behavior when an expectation violation occurs in Delta Live Tables is to load the data anyway and only track the violation in the data quality metrics (retain invalid records). The other options are ON VIOLATION DROP ROW, which drops the offending rows, and ON VIOLATION FAIL UPDATE, which fails the update so that downstream tables are not loaded. Choose the clause that matches the behavior you want in your pipeline.
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html#drop-invalid-records
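A minimal sketch in DLT SQL (the constraint, column, and table names are assumptions, not the asker's actual schema):

CREATE STREAMING LIVE TABLE report_table (
  -- Default: invalid rows are kept and the violation is only tracked in the metrics.
  -- Use ON VIOLATION DROP ROW to drop them instead, or
  -- ON VIOLATION FAIL UPDATE to fail the update so downstream tables are not loaded.
  CONSTRAINT valid_pk EXPECT (primary_key IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM stream(live.source_table);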

Synchronize data lake with the deleted record

I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning I selected HDFS as the data lake storage. But I have a requirement for updates and deletes in the data sources, which I have to synchronize with the data lake.
To respect the immutable nature of the data lake, I will use LastModifiedDate from the data source to detect that a record was updated, and insert that record into the data lake with the current date. The idea is to select the record with max(date).
However, I am not able to understand how I will detect deleted records from the sources, and what I should do with them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid it would lose the immutable property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. Ideally, deletion is only done logically, e.g. with a change flag. Some databases can also track deleted rows (see, for example, SQL Server). Some ETL solutions, such as Informatica, also offer CDC (Change Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and then building a current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you can do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is generally a constraint when creating a data lake in Hadoop: you can't simply update or delete records in it. One approach you can try is:
When you add lastModifiedDate, also add one more column named status. If a record is deleted, mark the status as Deleted. The next time you want to query the latest active records, you will be able to filter them out.
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be your ideal choice for creating a data lake in Hadoop.
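A rough sketch of the read side of that approach in Spark SQL (record_id, lastModifiedDate, status, and raw_records are assumed names): take the newest version of each record and drop the ones whose latest status is Deleted.

SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY lastModifiedDate DESC) AS rn
  FROM raw_records
) latest
WHERE rn = 1
  AND status <> 'Deleted';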

Workaround for joining two streams in structured streaming in Spark 2.x

I have a stream of configurations (it does not change often, but when there is an update it arrives as a message), and another stream of raw data points.
As I understand it, Spark does not currently support joining two streaming Datasets or DataFrames. Is there a good way to work around this?
Is it possible to "snapshot" one of the streaming datasets into a static dataset (probably the configuration one, since it has fewer updates), and then join it with the other streaming dataset?
Open to suggestions!
"Workaround" is to use current master branch ;)
It's not released yet, but current master branch already has stream-stream inner join and there is outer join in progress. See this Jira ticket for reference, in sub-task you see possible joins to use.
There's no other easy workaround. Streaming joins requires saving state of streams and then correct updates of state. You can see code in pull requests, it's quite complex to implement stream-stream join.
So here is what I ended up doing.
Put the stream with fewer updates into a memory sink, then select from that table. At that point it is a static instance and can be joined with the other stream. No trigger is needed. Of course, you need to keep that table up to date yourself.
This is not very robust, but it's the best I could come up with before official support arrives.

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done simply with an ALTER TABLE command, because there are changes to the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in cqlsh is not an option here because the dump file can be really huge.
Can I solve this problem without creating a custom application?
As Guillaume suggested in the comment, you can't do this directly in Cassandra; schema-altering operations are very limited there. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally, and save it back to the database. Keep in mind that such a migration requires reading the entire table's contents, so it might take a while. It may be the most performant solution, but it needs more upfront preparation: a Spark cluster setup.
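As a sketch only, assuming the Spark Cassandra Connector 3.x with Cassandra registered as a Spark SQL catalog named cass, and a target table with the new primary key already created, the copy can be expressed as a plain INSERT ... SELECT; any reshaping for the new key (renames, deduplication, filtering) goes into the SELECT:

-- Catalog registration happens in the Spark config, e.g.:
-- spark.sql.catalog.cass = com.datastax.spark.connector.datasource.CassandraCatalog
-- spark.sql.catalog.cass.spark.cassandra.connection.host = <contact points>
INSERT INTO cass.my_keyspace.events_by_user
SELECT user_id, event_time, event_id, payload
FROM cass.my_keyspace.events_by_id;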
