Implementing SCD Type 2 as a generalized stored procedure in Azure Data Warehouse for all dimensions?

I am new to Azure and I am working on Azure Data Warehouse. I have loaded a few dimension and staging tables, and I want to implement SCD Type 2 as a generalized procedure for all updates, using HASHBYTES. As we know, ADW doesn't support MERGE, so I am trying to implement this with plain INSERT and UPDATE statements and StartDate/EndDate columns. But the schemas of the dimension tables are not exactly the same as those of the staging tables; there are a few columns that are not carried over.
Initially I thought I would pass the staging and dimension tables in as parameters, fetch their schemas from the sys objects, create a temp table, load the necessary columns from staging, compute a hash with HASHBYTES, and compare the hash in the temp table against the one in the dimension. But is this a good approach?
PS: One more problem is that sometimes the column names are mapped differently, e.g. branchid versus branch_id. How do I match up the columns in those cases? Note that this is just one case; it could happen in many tables as well.
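For reference, below is a minimal sketch of the pair of statements such a procedure would have to generate per dimension. All object and column names (stg.Customer, dim.Customer, BusinessKey, Name, Branch_Id, RowHash, StartDate, EndDate) are placeholders rather than anything from the question, the hash is computed inline instead of via a temp table, and the EXISTS form is used because dedicated SQL pools have historically restricted ANSI joins in UPDATE statements:

-- 1. Expire the current dimension row when the incoming hash differs.
UPDATE dim.Customer
SET    EndDate = GETDATE()
WHERE  EndDate IS NULL
  AND  EXISTS (SELECT 1
               FROM   stg.Customer AS s
               WHERE  s.BusinessKey = dim.Customer.BusinessKey
                 AND  HASHBYTES('SHA2_256', CONCAT(s.Name, '|', s.Branch_Id)) <> dim.Customer.RowHash);

-- 2. Insert a new current row for business keys that are new or were just expired.
INSERT INTO dim.Customer (BusinessKey, Name, Branch_Id, RowHash, StartDate, EndDate)
SELECT s.BusinessKey, s.Name, s.Branch_Id,
       HASHBYTES('SHA2_256', CONCAT(s.Name, '|', s.Branch_Id)),
       GETDATE(), NULL
FROM   stg.Customer AS s
LEFT JOIN dim.Customer AS d
       ON d.BusinessKey = s.BusinessKey
      AND d.EndDate IS NULL
WHERE  d.BusinessKey IS NULL;

A generalized version would assemble both statements as dynamic SQL, deriving the shared column list by intersecting sys.columns (or INFORMATION_SCHEMA.COLUMNS) for the two tables by name; branchid/branch_id style mismatches would then have to come from a small, manually maintained mapping table rather than from the catalog.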

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, and each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this -->
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS "Before", t2.Col12 AS "After"
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of column names that changed across all records. It doesn't even have to be per row, but the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change, so that I can focus on them specifically for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case. https://docs.databricks.com/delta/delta-change-data-feed.html
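If CDF is not an option, one way to get the list of changed columns from the existing self-join is to count mismatches per column and keep the columns with a non-zero count. A rough Spark SQL sketch, where Col4/Col5/Col12 stand in for the real non-key columns of table1:

SELECT
  SUM(CASE WHEN NOT (t1.Col4  <=> t2.Col4)  THEN 1 ELSE 0 END) AS Col4_changes,
  SUM(CASE WHEN NOT (t1.Col5  <=> t2.Col5)  THEN 1 ELSE 0 END) AS Col5_changes,
  SUM(CASE WHEN NOT (t1.Col12 <=> t2.Col12) THEN 1 ELSE 0 END) AS Col12_changes
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3

The <=> operator is Spark's null-safe equality, so NULL-to-value changes are counted too, and the CASE list can be generated programmatically from the table's column list minus the three keys rather than typed by hand.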

Azure Data Factory - Exists transformation in Data Flow with generic dataset

I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from a staging table "sourceStg", one from a DWH table "sourceDwh") and want to check whether the UniqueIdentifier column in the staging table exists in the UniqueIdentifier column of the DWH table. For that I have a generic dataset which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any Column from the source in the conditions since the source is generic and has no Projection until I run the data flow. However, I have a parameter which I get from the parent pipeline which provides me the name of the Column containing the UniqueIdentifier (both column names in staging / DWH are the same).
I tried to add the statement "byName($UniqueIdentifier)" in both the left and right column fields, but the engine resolves both as the sourceStg column, since the prefix of the source transformation is missing and it defaults to the first one. What I am basically trying to achieve is a statement like the following, specifying the correct source transformation and the column containing the unique identifier with a parameter.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed or the result does not retrieve the actual UniqueIdentifier value from the column but writes the statement (e.g. sourceStg#$UniqueIdentifier) as column value.
The only workaround I have found so far is to add two Derived Columns which add a suffix to the UniqueIdentifier column in one source, plus a new parameter $UniqueIdentiferDwh which is populated with the parameter $UniqueIdentifier and the same suffix as used in the Derived Column.
Any Azure Data Factory experts out there to help?
Thanks in advance!

Azure Data Flow - Can we have dynamic columns or a change in projections for Unpivot functionality?

The Excel file consists of 62 columns; 7 columns are fixed and the rest are weeks of the year (week1 to week52).
I have used a data flow task to unpivot the 53 columns into rows with 2 extra columns, year and value.
The problem is that the 52 week column names keep changing with every weekly data load, and I don't know how to handle this change in column names in the data flow. For a single run it gives the exact output.
What you'll want to do here is to implement late-binding of your schema, or what ADF refers to as "schema drift". Instead of setting a hardened "early binding" schema in your Source projection, leave the dataset schema and projection empty.
Next, add a Derived Column after your source and call it "Projection". This is where you'll build your projection using rules to account for your evolving schema.
Build out your canonical model with the column names for your entire year using byName('columnname'). That will tell ADF to look for the existence of the column in single quotes from your source data while also providing a schema that you can use to build out your pivot table.
If you need to cast the values, wrap byName() inside of a casting function, i.e. toString(), toDate(), etc.

Excel Data Model - Creating a many-to-one summary table by mapping table and column names

I am trying to create a summary calculation on a set of tables which I have added to an Excel data model. I have 5 tables with the following columns:
Datetime (UTC)
Measured Data 1
Simulated Data 1
Measured Data 2
Simulated Data 2
etc.
I have created a master Date-time table which links these 5 tables together on their UTC date-time column.
My question is how to optimally create calculated fields on each of these tables without needing to explicitly specify the calculations on each individual table, as is the case with PivotTables (where I need to select the specific measured and simulated data columns from one individual table in the data model). Instead, I would like to be able to map all measured fields to one column and all simulated fields to another, and then use filters to select the fields (or groups of fields) I want to compare.
I started by creating a summary table which lists all my tables in my data model by name along with the names of measured and simulated columns within each. However, I'm unsure what the next step should be. This seems like a pretty straightforward problem, but it has me stumped this morning. I'd appreciate any help. Let me know if I haven't fully explained anything.

How to define dynamic columns in a column family in Cassandra?

We don't want to fix the column definitions when creating a column family, as we might have to insert new columns into the column family later. Is it possible to achieve this? I am wondering whether it is possible not to define the column metadata when creating a column family, but to specify the columns when the client updates data, for example:
CREATE COLUMN FAMILY products WITH default_validation_class = UTF8Type AND key_validation_class = UTF8Type AND comparator = UTF8Type;
set products['1001']['brand'] = 'Sony';
Thanks,
Fan
Yes... it is possible to achieve this, without even taking any special effort. Per the DataStax documentation of the Cassandra data model (a good read, by the way, along with the CQL spec):
The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.
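For comparison, in CQL 3 the same idea is usually modeled with a clustering column, so each dynamic "column" becomes a row and clients can add new attributes at write time without any schema change. A minimal sketch with assumed table and column names:

-- Each dynamic "column" is stored as a (product_id, attribute) row,
-- so new attributes need no ALTER TABLE.
CREATE TABLE products (
    product_id text,
    attribute  text,
    value      text,
    PRIMARY KEY (product_id, attribute)
);

INSERT INTO products (product_id, attribute, value) VALUES ('1001', 'brand', 'Sony');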
