I'm using Delta Live Tables from Databricks and I was trying to implement a complex data quality check (a so-called expectation) by following this guide. After I tested my implementation, I realized that even though the expectation is failing, the tables downstream of the source table are still loaded.
To illustrate what I mean, here is an image describing the situation.
Image of the pipeline lineage and the incorrect behaviour
I would assume that if the report_table fails because the expectation is not met (in my case, it was validating primary keys), then the Customer_s table would not be loaded. However, as can be seen in the image, this is not quite what happened.
Do you have any idea how to achieve the desired result? How can I define a complex validation with SQL that would cause the downstream nodes not to be loaded (or that would make the pipeline fail)?
The default behavior when an expectation is violated in Delta Live Tables is to load the data anyway and track the violations in the data quality metrics (retain invalid records). The other options are ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE. Choose ON VIOLATION FAIL UPDATE if you want the update to fail so that downstream tables are not loaded; ON VIOLATION DROP ROW drops the invalid records but lets the pipeline continue.
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html#drop-invalid-records
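For reference, a minimal sketch of how that can look with the Python DLT API; the table and column names (report_table, source_table, customer_id) and the expectation condition are placeholders to adapt. In SQL, the equivalent is adding CONSTRAINT valid_primary_key EXPECT (condition) ON VIOLATION FAIL UPDATE to the table definition.

import dlt

# Sketch only: report_table, source_table and customer_id are placeholders.
@dlt.table(name="report_table")
@dlt.expect_or_fail("valid_primary_key", "customer_id IS NOT NULL")
def report_table():
    # With expect_or_fail, a single violating row fails the update,
    # so downstream tables (e.g. Customer_s) are not refreshed.
    return dlt.read("source_table")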
I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with v as (
    select hashbytes('SHA2_256', field1) as [Key1],  -- key-hash computed inside the view
           hashbytes('SHA2_256', field2) as [Key2]   -- (hash algorithm shown only as an example)
    from [Table]
)
select *
from v
and so do the tables queried by the views. On top of this, both views read the same number of partitions, with rows distributed roughly evenly across them.
The unexpected behavior (most likely due to my lack of experience) is that the copy operation lasts much longer for the view that queries fewer rows. In fact, the copy of ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15 MB/s (!). Lastly, the write to blob storage is pretty fast in both cases, as opposed to the read from the source.
I don't expect anyone to provide an actual solution, as the information provided is limited, but I'd appreciate some hints on what could be affecting the copy performance so badly in the case where the view queries roughly an order of magnitude fewer rows, taking into account that the tables under the views have a comparable number of fields and the same data types: both contain int, datetime, and varchar columns.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue: I found out, rather empirically, that the bottleneck was caused by the several key-hash computations in the views on the SQL database. Once I removed these from the views (they are now calculated later, in Azure Synapse Analytics, the data warehouse), I observed a massive performance boost in the copy operation.
When there's a Copy activity performance issue in ADF and the root cause is not obvious (as opposed to, say, a fast source with a throttled sink where we know why), here's how I would go about it:
Start with the Integration Runtime (IR) (doc.). This might be a job concurrency issue, a network throughput issue, or just an undersized VM (in the case of a self-hosted IR). Well over 80% of all issues in my production ETL are caused by the IRs, in one way or another.
Replicate the copy activity behavior on both source and sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc.; see the sketch after this list. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of the copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but it can point you in a direction to dig further.
Try searching the documentation (the doc provided by @Leon is a good start). This should have been step #1; however, I find the ADF documentation somewhat lacking.
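As a rough sketch of the "replicate on the source side" step above (everything here is a placeholder: the ODBC connection string, the view dbo.MyView, and the output file are assumptions, and pyodbc is just one convenient client):

import csv
import time
import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Placeholder connection details -- replace with your own.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;UID=myuser;PWD=mypassword"
)

start = time.perf_counter()
cursor = conn.cursor()
cursor.execute("SELECT * FROM dbo.MyView")  # the view behind the copy activity

# Stream the rows to a local flat file to approximate what the copy activity reads.
with open("myview_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])
    rows = cursor.fetchmany(10000)
    while rows:
        writer.writerows(rows)
        rows = cursor.fetchmany(10000)

print("read + write took {:.1f} s".format(time.perf_counter() - start))

Comparing that wall-clock time across the two views tells you whether the slowness sits in the source query itself or in the copy activity.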
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.
I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs by key. I need to write all the logs grouped under the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
CREATE_IF_NEEDED gives an error because it would have to create a partitioned table. I can work around this by making sure the tables exist in a previous step and creating them if they don't.
If I load older data, I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts. Is there any way to do batch inserts?
Maybe I'm approaching this wrong and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam 2.13.0.
That error message can occur when one mixes an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link: it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and still write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
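As a rough illustration of that approach, here is a sketch with WriteToBigQuery. Everything below is a placeholder to adapt (project/dataset/table names, the routing logic, the schema, the event_date column), and both additional_bq_parameters and FILE_LOADS with a callable table may require a newer Beam release than 2.13.0:

import apache_beam as beam

def route_to_table(row):
    # Hypothetical routing: pick the destination table from a value in the log record.
    # Note there is no "$YYYYMMDD" partition decorator -- each table is column-partitioned.
    return "my-project:my_dataset.logs_{}".format(row["log_type"])

with beam.Pipeline() as p:
    (
        p
        | "FakeLogs" >> beam.Create(
            [{"log_type": "access", "event_date": "2019-03-22", "msg": "..."}]
        )
        | "WriteLogs" >> beam.io.WriteToBigQuery(
            table=route_to_table,
            schema="log_type:STRING,event_date:DATE,msg:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Ask BigQuery to create each table partitioned on the DATE column:
            additional_bq_parameters={"timePartitioning": {"type": "DAY", "field": "event_date"}},
            # Batch (load-job) inserts avoid the 31-days-in-the-past streaming window:
            method="FILE_LOADS",
        )
    )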
I am writing huge datasets to a database (DB2) using the dataset.write.jdbc method. I see that if one of the records fails to insert into the database, the entire dataset fails. This is turning out to be expensive, since the dataset is prepared by running a huge pipeline. Re-running the entire pipeline for the sake of failed persistence does not make sense.
The problem seems to be much bigger than exception handling. Ideally, a data pipeline should be designed so that it validates the data before it is processed/transformed. In a general context this is called data validation and cleansing. In this phase you may want to identify NULL/empty values and treat them accordingly, especially in the attributes that participate in joins or lookups. You need to transform them so that they don't create issues in subsequent steps of your pipeline; in fact, this applies to pretty much every transformation. Hope this helps.
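As a rough sketch of that validation step in PySpark (the column names, constraints, paths and JDBC connection details are all placeholders, not your actual pipeline):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("validate-before-write").getOrCreate()
df = spark.read.parquet("/path/to/prepared_dataset")  # placeholder input

# Split the data into rows that satisfy the target-table constraints and rows that don't,
# e.g. non-null keys and values that fit the target column length.
valid = df.filter(F.col("customer_id").isNotNull() & (F.length("name") <= 100))
rejected = df.subtract(valid)

# Persist the rejects for inspection instead of letting them fail the whole write.
rejected.write.mode("overwrite").parquet("/path/to/rejected_records")

valid.write.jdbc(
    url="jdbc:db2://dbhost:50000/MYDB",  # placeholder connection details
    table="MYSCHEMA.TARGET_TABLE",
    mode="append",
    properties={"user": "db2user", "password": "secret", "driver": "com.ibm.db2.jcc.DB2Driver"},
)

This way the expensive pipeline output is persisted once, and only the records that would violate the database constraints need attention.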
I have 2 Tableau (v10.1.1) worksheets: 5 tables, 5 left outer joins in the data source, and everything is the same in both worksheets, except that one runs on SQL Server (2012) and the other runs on Spark (v1.6).
The one on SQL Server runs ONLY those joins that are referenced in the worksheet visualization. However, the Spark worksheet is executing all 5 joins?
I'm a bit surprised: same tables, same model, same worksheet, but a different data source generates a different query.
Best Regards
Dev
There is a setting called Assume Referential Integrity on each data source that tells Tableau whether or not you want it to avoid joining in unreferenced tables - an optimization known as join culling. Tableau gives you the option because it is possible the optimization would change the results in some cases.
I believe those problem edge cases only arise if your data violates normal referential integrity constraints, say by having a destination table that does not have a matching primary key for a foreign key in the source table. If your database does not have that problem, either because it is enforced by constraints, ETL processes, application logic etc, then checking the box allows Tableau to safely generate more efficient SQL.
Check whether the setting is the same in both data sources, under the data menu. I believe the default is false - which is the conservative but slower choice.
If both data sources have the same setting, then you may have found a bug in one of the drivers, probably the newer Spark SQL driver.
I have 2 data sources (db1, db2) and 2 datasets. The 2 datasets are stored procedures, one from each data source.
Dataset1 must run first to create a table for dataset2 to update and show (dataset1 will show results too).
Because the data of that table must be based on some tables in db1, the stored procedure creates a table in db2 by using a linked server.
I have searched online and tried "single transaction" on the data source, but it shows an error in dataset1 with no detail.
Is there any way to do it? I want to generate an Excel file with 2 sheets for this result.
Check out this post.
The default behavior of SSRS is to run the datasets at the same time. They are run in the order in which they appear in your RDL (top down when looking at the Report Data pane). Changing the behavior of a single data source with multiple datasets is as simple as clicking a checkbox in the data source dialog.
With multiple data sources it is a little bit trickier!
Here is the explanation from the MSDN Blog posted above:
Serializing dataset executions when using multiple data sources:
Note that datasets using different data sources will still be executed in parallel; only datasets of the same data source are serialized when using the single transaction setting. If you need to chain dataset executions across different data sources, there are still other options to consider.
For example, if the source databases of your data sources all reside on the same SQL Server instance, you could use just one data source to connect (with single transaction turned on) and then use the three-part name (catalog.schema.object_name) to execute queries or invoke stored procedures in different databases.
Another option to consider is the linked server feature of SQL Server, and then use the four-part name (linked_server_name.catalog.schema.object_name) to execute queries. However, make sure to carefully read the documentation on linked servers to understand its performance and connection credential implications.
This is an interesting question, and while I think there might be another way of doing it, it would take a bit of time playing around with your datasets, plus more information on how your data sources are set up.
Hope this helps though.