Merge two datasets without a common column in Azure Data Factory

I have two datasets that I need to join/merge in Azure Data Factory, but they have no common identity column. This might be an oversight on my side, as it should be a very trivial task, but I cannot seem to do it via a join or a union.
One dataset only has a couple of rows with a "name" column, let's say rows A, B, C, whereas the other has thousands (1-N).
For each row in the large dataset I want the A, B, C rows, so it effectively becomes:
1A
1B
1C
2A
2B
2C
...
Any help is appreciated,
Thank you.

You can use the Custom (cross) join type in the Join transformation to get this result.
Follow the demonstration below:
Sample large dataset (Numbers) with numbers up to 15.
Small dataset (Letters).
Now, add a Join with the large dataset as the left stream and the small dataset as the right stream, choose the Custom (cross) join type, and set the condition to true().
In the Optimize tab of the Join, set Broadcast to Off to get the data in the format shown above.
You can see the merge of two datasets below.
If you want the above in a single column with values like 1A, 1B, 1C..., first use a Derived Column to concatenate the values, and then keep only that column using a Select transformation.
Derived Column
Now use a Select transformation to keep only the column you need.
Output
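If it helps to see the same flow outside of Data Factory, here is a rough PySpark sketch of the cross join plus concatenation described above; the column names number and name are assumptions, not taken from the original datasets:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two datasets; real column names may differ.
numbers = spark.createDataFrame([(i,) for i in range(1, 16)], ["number"])  # large dataset (1-N)
letters = spark.createDataFrame([("A",), ("B",), ("C",)], ["name"])        # small dataset

# Cross join: every number is paired with every letter (what the true() condition achieves).
paired = numbers.crossJoin(letters)

# Equivalent of the derived column step: concatenate into a single value such as "1A".
result = paired.select(F.concat(F.col("number").cast("string"), F.col("name")).alias("merged"))
result.show()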

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and see how much it differs from the previous solution.
First I tried:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This tells me how much the tables differ (t1 is my solution, t2 is the original data). If there is a lot of differing data, I want to check where the differences are.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
That approach doesn't work for me, because it runs for about 1-2 hours (both tables have 30 columns and 30 million rows of data).
Do you guys have an idea how to calculate this quickly?
EXCEPT is essentially a join on all columns at the same time. Does your data have a primary key? It could even be composite, comprising multiple columns, but that is still much better than taking all 30 columns into account.
Once you figure out the primary key you can do a FULL OUTER JOIN and (as sketched after this list):
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
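As a rough illustration of that approach, here is a hedged PySpark sketch, assuming a single primary-key column named id (adapt the condition for a composite key):
from pyspark.sql import functions as F

# Assumption: t1 and t2 are the question's two tables and "id" is the primary key.
t1 = spark.table("t1")
t2 = spark.table("t2")
a = t1.alias("a")
b = t2.alias("b")
joined = a.join(b, F.col("a.id") == F.col("b.id"), "full_outer")

# NULLs on the right: keys that exist only in t1.
only_in_t1 = joined.filter(F.col("b.id").isNull())
# NULLs on the left: keys that exist only in t2.
only_in_t2 = joined.filter(F.col("a.id").isNull())
# Matching rows, whose remaining columns can now be compared far more cheaply.
matched = joined.filter(F.col("a.id").isNotNull() & F.col("b.id").isNotNull())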
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your datasets have a key / index; otherwise there is no ordering in either dataframe and you can't meaningfully check the differences with except. Unless you have very limited resources, just do a single join to line the two dataframes up instead of running multiple excepts.
Check your data partitioning: whichever of point 1 or your current method you use, make sure the data is evenly distributed across an optimal number of partitions. Most of the time, data skew is one of the main things that hurts performance. If your key is a string, use repartition; if it is a sequence number, use repartitionByRange.
Use a when-otherwise pair to check the differences: once you have joined the two dataframes, you can use a when-otherwise condition to compare each column, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). That way you can calculate all the differences within one action instead of many (see the sketch below).
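A hedged sketch of point 3, counting mismatches for every column in a single action over the joined dataframe (the key column id is an assumption about your schema):
from pyspark.sql import functions as func

# Assumption: t1 and t2 share a key column "id" and otherwise have the same columns.
t1 = spark.table("t1")
t2 = spark.table("t2")
joined = t1.alias("df1").join(t2.alias("df2"), func.col("df1.id") == func.col("df2.id"))

value_cols = [c for c in t1.columns if c != "id"]
diff_counts = joined.agg(*[
    func.sum(
        func.when(func.col(f"df1.{c}") != func.col(f"df2.{c}"), func.lit(1))
            .otherwise(func.lit(0))
    ).alias(f"diff_in_{c}_count")
    for c in value_cols
])
diff_counts.show()  # one row: a mismatch count per column, computed in a single job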

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I want to find cases where a specific column value changed, I can easily achieve that with something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS Before, t2.Col12 AS After
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of the column names that changed across all records. It doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases: https://docs.databricks.com/delta/delta-change-data-feed.html
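For reference, a minimal sketch of enabling and reading the change data feed; the table name is a placeholder, and CDF only captures changes made after it is enabled (see the linked documentation for the exact options):
# Placeholder table name; enable CDF on the Delta table first.
spark.sql("""
    ALTER TABLE my_transactions
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read row-level changes (inserts, deletes, and update pre/post images) from a starting version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("my_transactions"))

changes.select("Key1", "Key2", "Key3", "_change_type", "_commit_version").show()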

What condition should I supply for a custom cross join in an Azure Data Factory dataflow?

In a dataflow, I have two datasets with one column each. Let's say dataset a with column a and dataset b with column b.
I want to cross join them, but when I select the custom cross join option it asks me to specify a condition. I don't understand what I need to supply here; I just want all the records from column a to be cross joined with all the records from column b. What should I put? I tried checking the official Microsoft documentation, but there were no examples there.
The cross join in a Join transformation of an Azure Data Factory dataflow requires a condition on which the join is applied. I have done the following to demonstrate how to do a cross join on the example that you have given.
I have two datasets (one column each). Dataset A has one column a with the following values.
Dataset B has column b with the following values.
I have used a Join transformation to join both sources. The dataflow join transformation prompts you to specify a cross join condition. If you don't have any condition and just want to apply the cross join to all the data from both datasets, give the cross join condition the value true() (as you want to do in this case).
This applies the cross join to all the records of column a against all the records of column b.
This is how you can achieve your requirement. If you do have a condition, you can supply it instead of true() to apply the cross join based on it. Refer to the official Microsoft documentation to learn more about joins.
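For comparison only, a hedged PySpark sketch of what an always-true join condition does; df_a and df_b are hypothetical one-column dataframes standing in for datasets A and B:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame([(1,), (2,), (3,)], ["a"])
df_b = spark.createDataFrame([("x",), ("y",)], ["b"])

# A condition that is always true pairs every row of a with every row of b,
# which is what supplying true() as the custom cross join condition does.
result = df_a.join(df_b, F.lit(True), "inner")  # df_a.crossJoin(df_b) is equivalent and safer on older Spark versions
result.show()  # 3 x 2 = 6 rows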

Power BI - Anti Left join

Data:
I have two datasets, design-wise set up in Excel as a matrix: an ID column first, lots of rows, and the rest of the columns in the dataset having 1-to-1 headers that are ID numbers, so roughly 500 rows and around 45 columns.
Like ID, ColumnB, ColumnC.
The other matrix has the same headers, but in a different order; that does not seem to matter.
Challenge:
I need to find the differences between the two. I made a left anti join on ID, which gives me the IDs that are in one dataset and not the other, right? I do one for each direction, so I get the IDs that are missing from each respective dataset (/matrix).
I need to do the same trick even when both IDs are present, and then get only the rows with a difference in any column. So if, for a row ID, there is an "X" in ColumnB in dataset1 but no "X" in ColumnB in dataset2, I want to include it in my new table. In other words, if the two rows being compared differ in just one of the columns, I need to know, and I want only the data with a difference in my new table.
Tried:
I tried selecting not only the ID columns but all the columns in the left anti join setup, but it does not seem to work at all.

combining rows/columns from spark data frames by mathematical operation

I have two spark data frames (A and B) with respective sizes a x m and b x m, containing floating point values.
Additionally, each data frame has a column 'ID', which is a string identifier. A and B have exactly the same set of 'ID's (i.e. they contain information about the same group of customers).
I'd like to combine a column of A with a column of B by some function.
More specifically, I'd like to build the scalar product of a column of A with a column of B, with the entries ordered according to the ID.
Even more specifically I'd like to calculate the correlation between columns of A and B.
Performing this operation on all pairs of columns would be the same as a matrix multiplication: A_transposed x B.
However, for now I'm only interested in correlations of a small subset of pairs.
I have two approaches in mind, but I struggle to implement them. (And don't know whether either is possible or advisable, at all.)
(1) Take the column of interest from each data frame and combine each entry into a key-value pair, where the key is the ID. Then do something like reduceByKey() on the two columns of key-value pairs and a subsequent summation.
(2) Take the column of interest from each data frame, sort it by its ID, cast it to an RDD (I haven't figured out how to do this) and simply apply
Statistics.corr(rdd1, rdd2) from pyspark.mllib.stat.
Also I wonder: is it generally computationally preferable to operate on columns rather than rows (since Spark data frames are column-oriented), or does that make no difference?
Starting from Spark 1.4, if all you need is the Pearson correlation, you can do it like this:
cor = dfA.join(dfB, dfA.id == dfB.id, how='inner').select(dfA.value.alias('aval'), dfB.value.alias('bval')).corr('aval', 'bval')
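A small self-contained usage sketch of the same one-liner; the id and value column names follow the snippet above and are assumptions about the real schema:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for A and B: a string ID plus one numeric column of interest each.
dfA = spark.createDataFrame([("c1", 1.0), ("c2", 2.0), ("c3", 3.0)], ["id", "value"])
dfB = spark.createDataFrame([("c1", 2.0), ("c2", 4.5), ("c3", 6.0)], ["id", "value"])

cor = (dfA.join(dfB, dfA.id == dfB.id, how='inner')
          .select(dfA.value.alias('aval'), dfB.value.alias('bval'))
          .corr('aval', 'bval'))
print(cor)  # Pearson correlation as a plain float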
