Tableau executing all datasource joins with Spark - apache-spark

I have a 2 tableau (v10.1.1) worksheets - 5 tables - 5 left outer joins in datasource - everything is same in both worksheets - Just that, one runs on SQL Sever (2012) and the other runs on Spark (v1.6).
The one on SQl Server runs ONLY those joins which are being referenced in the worksheet visualization. However, the Spark worksheet is executing all the 5 joins?
Bit surprised I am - Same tables, same model, same worksheet but different data-source generating different query.
Best Regards
Dev

There is a setting called Assume Referential Integrity on each data source that tells Tableau whether or not you want it to avoid joining in unreferenced tables - an optimization known as join culling. Tableau gives you the option because it is possible the optimization would change the results in some cases.
I believe those problem edge cases only arise if your data violates normal referential integrity constraints, say by having a destination table that does not have a matching primary key for a foreign key in the source table. If your database does not have that problem, either because it is enforced by constraints, ETL processes, application logic etc, then checking the box allows Tableau to safely generate more efficient SQL.
Check whether the setting is the same in both data sources, under the data menu. I believe the default is false - which is the conservative but slower choice.
If both data sources have the same setting, then you may have found a bug in one of the drivers, probably the newer Spark SQL driver.

Related

Delta live tables data quality checks

I'm using delta live tables from Databricks and I was trying to implement a complex data quality check (so-called expectations) by following this guide. After I tested my implementation, I realized that even though the expectation is failing, the tables dependent downstream on the source table are still loaded.
To illustrate what I mean, here is an image describing the situation.
Image of the pipeline lineage and the incorrect behaviour
I would assume that if the report_table fails due to the expectation not being met (in my case, it was validating for correct primary keys), then the Customer_s table would not be loaded. However, as can be seen in the photo, this is not quite what happened.
Do you have any idea on how to achieve the desired result? How can I define a complex validation with SQL that would cause the future nodes to not be loaded (or it would make the pipeline fail)?
The default behavior when expectation violation occurs in Delta Live Tables is to load the data but track the data quality metrics (retain invalid records). The other options are : ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE. Choose "ON VIOLATION DROP ROW" if that is the behavior you want in your pipeline.
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html#drop-invalid-records

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with
v as (
select hashbyte(field1) [Key1], hashbyte(field2) [Key2]
from Table
)
select *
from v
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior - most likely due to the lack of experience from my side - of the copy operation is that it lasts much longer for the view that query fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15MB/s (!). Lastly, the writing operation to the blob storage is pretty fast for both cases, as opposite to the reading-from-source operation.
I don't expect anyone to provide an actual solution - as the information provided is limited -, but I'd rather like some hints on what could be affecting the copy performance so badly for the case where the view queries much (roughly an order of magnitude) fewer rows, taking into account that the tables under the views have a comparable number of fields, and also the same data types: both the tables that the views query contain int, datetime, and varchar data types.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue, I managed to find out, rather empirically, that the bottleneck was being caused by the presence of several key-hash computations in the view on SQL DB. In fact, once I removed these - calculated later on Azure Synapse Analytics (data warehouse) - I observed a massive performance boost of the copy operation.
When there's a copy activity performance issue in ADF and the root cause is not obvious (e.g. if source is fast, but sink is throttled, and we know why) -- here's how I would go about it :
Start with the Integration Runtime (IR) (doc.). This might be a jobs' concurrency issue, a network throughput issue, or just an undersized VM (in case of self-hosted). Like, >80% of all issues in my prod ETL are caused by IR-s, in one way or another.
Replicate copy activity behavior both on source & sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but can point a direction for you to dig in further.
Try searching the documentation (this doc., provided by #Leon is a good start). This should have been a step #1, however, I find ADF documentation somewhat lacking.
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

In Azure Synapse, is it true that creating a temporary table is discouraged?

I had a good discussion with one of my colleagues and he mentioned creating a temporary table degrades the performance in Azure Synapse because Synapse creates the temporary table first in the master node then distribute them to child node. Is it true? He recommended me to create create permanent table instead of temporary table.
That’s not correct. Temp tables don’t necessarily funnel through the control node. Let’s say you are selecting from a table distributed on ProductKey and loading it into a #temp table distributed on ProductKey. The data will never leave each compute node since it’s a distribution compatible insert.
On the other hand, if you run a query that uses a ROW_NUMBER function, for example, that would have to be calculated on the control node and then the data would be sent back to the compute nodes to be stored in the distributed temp table. But that only happens in the presence of some types of functions and some types of queries. It is not the norm. If you are worried about a particular query then add the word EXPLAIN to the front of it and paste the explain plan XML into your question so we can help you interpret it.
If you load a #temp table with a SELECT INTO statement you can’t specify the table geometry so it will be a round robin distributed columnstore. Usually this isn’t ideal since it takes extra time and memory to compress a columnstore and because round robin distribution isn’t ideal unless there is no good distribution key. Usually the next query which uses the round robin distributed temp table will just reshuffle it so it’s best to properly hash distribute a temp table initially. To do this do a CTAS statement as described here.

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

SSRS: How can I run dataset1 first in datasource1 then run dataset2 in datasource2

I have 2 data source(db1, db2) and 2 dataset. 2 dataset are store procedure from each data source.
Dataset1 must run first to create a table for dataset 2 to update and show (dataset 1 will show result too).
Cause the data of the table must base on some table in DB1, the store procedure will create a table to db2 by using link server.
I have search online and tried "single transaction" in data source, but it show error in data set 1 with no detail.
Is there anyway to do it? cause I want to generate an excel with 2 sheet for this result.
Check out this this post.
The default behavior of SSRS is to run the dataset at the same time. They are run in the order in which they are presented in your rdl (top down when looking at it in the report data area). Changing the behavior of a single data source with multiple datasets is as simple as clicking on a checkbox in data source dialog.
With multiple datsources it is a little bit more tricky!
Here is the explanation from the MSDN Blog posted above:
Serializing dataset executions when using multiple data source:
Note that datasets using different data sources will still be executed in parallel; only datasets of the same data source are serialized when using the single transaction setting. If you need to chain dataset executions across different data sources, there are still other options to consider.
For example, if the source databases of your data sources all reside on the same SQL Server instance, you could use just one data source to connect (with single transaction turned on) and then use the three-part name (catalog.schema.object_name) to execute queries or invoke stored procedures in different databases.
Another option to consider is the linked server feature of SQL Server, and then use the four-part name (linked_server_name.catalog.schema.object_name) to execute queries. However, make sure to carefully read the documentation on linked servers to understand its performance and connection credential implications.
This is an interesting question and while I think there might be another way of doing it, it would take a bit of time and playing around with your datasets and more information on your setup of the datasources.
Hope this helps though.

Resources