How to identify all columns that have different values in a Spark self-join - apache-spark

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys. So given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS Before, t2.Col12 AS After
FROM table1 t1 INNER JOIN table1 t2 ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2
AND t1.Key3 = t2.Key3 WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially, a list of all columns that changed. I don't care about the actual values that changed, just a list of column names that changed across all records. It doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially, I'm trying to find any columns that are susceptible to change, so that I can focus on them for some other purpose.
Any suggestions would be really appreciated.
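One way to get such a list with a plain self-join is to aggregate a per-column "did it ever differ" flag over the joined rows. Below is a minimal PySpark sketch of that idea, assuming the table is named table1, the keys are Key1/Key2/Key3 as in the question, and spark is the active SparkSession:
from pyspark.sql import functions as F

df = spark.table("table1")                      # assumed table name from the question
keys = ["Key1", "Key2", "Key3"]                 # the 3 identifying keys
value_cols = [c for c in df.columns if c not in keys]

t1 = df.alias("t1")
t2 = df.alias("t2")
joined = t1.join(t2, on=keys, how="inner")

# For every non-key column, flag whether any joined pair of rows differs in that column.
diff_flags = joined.select([
    F.max((~F.col("t1." + c).eqNullSafe(F.col("t2." + c))).cast("int")).alias(c)
    for c in value_cols
])

row = diff_flags.collect()[0]
changed_columns = [c for c in value_cols if row[c] == 1]
print(changed_columns)
The null-safe comparison makes sure columns that changed to or from NULL are also picked up.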

Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
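For example, once CDF is enabled on the table (it only captures changes made after it is enabled), reading the feed from PySpark might look like the following minimal sketch, assuming the table name table1 and a starting version of 0:
spark.sql("ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)   # assumed starting point; adjust to your table history
    .table("table1")
)

# Updates appear as pairs of rows with _change_type = 'update_preimage' / 'update_postimage',
# plus _commit_version and _commit_timestamp, so before/after values can be compared per column.
changes.filter("_change_type IN ('update_preimage', 'update_postimage')").show()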

Related

Spark - partitioning/bucketing of n-tables with overlapping but not identical ids

I'm currently trying to optimize a query of 2 rather large tables, which are characterized like this:
Table 1: id column - alphanumerical, about 300 million unique ids, more than 1 billion rows overall
Table 2: id column - identical semantics, about 200 million unique ids, more than 1 billion rows overall
Let's say on a given day, 17.03., I want to join those two tables on id.
Table 1 is left, table 2 is right; I get about 90% matches, meaning table 2 has about 90% of the ids present in table 1.
One week later, table 1 did not change (it could, but to keep the explanation simple, consider it didn't), while table 2 was updated and now contains more records. I do the join again, and now some of the formerly missing ids show up, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and this might change on a day-to-day basis.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table1 but not in table2. ABC123 gets sorted into a certain bucket, e.g. "1".
2nd join (a week later): id "ABC123" has now come up in table2; how can it be ensured that it goes into the bucket in table 2 that is co-located with the one in table 1?
Or do I have a general problem of understanding how it works?
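For what it's worth, Spark assigns a row to a bucket by hashing the bucketing column, so which bucket an id ends up in depends only on the id value and the number of buckets, not on when the id first appears. A minimal PySpark sketch, with hypothetical table names and an assumed bucket count:
n_buckets = 512  # assumed value; tune to your data volume

(spark.table("table1_raw")          # hypothetical source table
    .write
    .bucketBy(n_buckets, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("table1_bucketed"))

(spark.table("table2_raw")          # hypothetical source table, rewritten after each weekly update
    .write
    .bucketBy(n_buckets, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("table2_bucketed"))

# Both sides are bucketed by the same hash of id into the same number of buckets,
# so "ABC123" always lands in the same bucket number in either table and the join
# can avoid a full shuffle.
joined = spark.table("table1_bucketed").join(spark.table("table2_bucketed"), "id", "left")
Note that table 2 has to be re-bucketed after each weekly update for the co-location to keep holding.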

Power Query - Alternative for Join to filter Records

I have two tables:
Table: One Row per Order with the Status (Online / Offline)
Table: Multiple Rows per Order
Now I would like to reduce the number of records/rows in the second table based on the status (Offline) from Table 1.
Is there any alternative to a right join? The first table is filtered on Status 'Offline'
We are talking about several millions of rows which takes some time to Join.
Any thoughts on this from your side?
Some thoughts:
Create a relationship between these two tables and filter to "Offline".
You could create a join (Merge queries) in Power Query and only select the On/Off State column to append. The import then needs more time, but you get a flat dataset in PowerBI.
Create a new column in PowerBI with DAX and use LOOKUPVALUE.
Without seeing the data, I think I would try the first one. If it's too slow, then I think the only way is the second option, even if it takes some more time for importing.
The third one might be the slowest.

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed differ, when I take a subset of the dataframe columns to display, via show?
Here is the original dataframe:
Here dates are in the given order, as you can see, via show.
Now the order of rows displayed via show changes when I select a subset of the columns of predict_df into a new dataframe.
That's because a Spark dataframe itself is unordered. This is due to the parallel processing principles which Spark uses: different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then within that date subset the records will still be unordered. So in that case, in order to obtain strongly ordered data, we have to perform orderBy on a set of columns, and the combination of values in that set of columns must be unique for every row. E.g.:
from pyspark.sql.functions import col
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are a normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server in general return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same result for a query, it isn't guaranteed by the DBMS that another execution will return the same result, especially if the read is performed over a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the start. Moreover, I tried this a few times and got both the same order and a different order than the questioner observed. Processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. And maybe the bottom line is: always do ordering explicitly, if it matters.
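As a small illustration of the caching point, with hypothetical dataframe and column names:
subset_df = predict_df.select("date")   # assumed subset of columns, as in the question
subset_df.cache()
subset_df.show()   # first action computes the rows and populates the cache
subset_df.show()   # second action reads the cached rows, so the displayed order is stable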
Like @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by its nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is using monotonically_increasing_id https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html to create an id, and that will keep your dataframe ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
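A minimal sketch of that idea, assuming predict_df is the dataframe from the question:
from pyspark.sql import functions as F

with_id = predict_df.withColumn("row_id", F.monotonically_increasing_id())
with_id.orderBy("row_id").show()
# The generated ids are increasing but not consecutive; they only provide a stable key
# to sort by, they are not row numbers.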

Excel: Order by date within multiple IDs

I have a huge epidemiological dataset containing registry data with pathology reports and clinical information. I have merged several files into one masterfile in order to get all information in one file. Every patient is assigned a unique ID number. Each patient can have several reports, and hence the same ID number can be repeated several times in the ID column. For each ID entry = new row (= pathology or clinical report) there is a date of that sample/information reported.
My goal is to be able to read all pathology/clinical info for a particular ID within one row.
By sorting the IDs, I get a clear picture of the number of entries for each ID. The problem arises when there are several reports, i.e. multiple rows with an identical ID, because the dates within this one patient's rows do not match. The dates come from pathology (sample date, answer date, clinical info date etc.). The dates from pathology and clinical information within one patient do not have to match exactly to the day, but they are still within a reasonable timeframe, e.g. within 1-2 months. This is best illustrated with an example.
I want to sort the columns so that dates from a particular row match together. I am sure there is a way to do that, but I cannot figure it out.
Thanks in advance
The issue of mismatching records seems to arise once the two separate tables are merged into one. In order to fix this, there are several options you can take:
Re-do the merge but strengthen the way in which the tables are joined on.
Instead of only merging based on ID, see if there is another field that could easily connect the records, perhaps a medical record #, case #, or event #, and merge the tables based on this new field AND ID. This would be the strongest solution, however it will only work if you can find said field to strengthen the link.
A separate solution would be to first sort the original tables based on the dates so that they match up, and then re-merge them together.
In theory this should solve your problem, as I assume that currently, when matching up the two separate tables, it is grabbing the first instance of patient X01 from both tables and matching them together. This can be confirmed by checking the merged query and looking to see if the mismatched records are in the same order as presented in the original tables. This is not perfect, as it relies on no clinical dates occurring between pathology dates for the record, so I would proceed with caution.
And to address your concern about losing track of IDs with multiple rows: this should not matter, as in the end result after the merge you can then sort by ID. You can add multiple levels of sort by selecting the data and going to Data -> Sort -> Add Level, and you can change the order in which the data is sorted (first by ID and then by Date).

Avoid DISTINCTCOUNT in PowerPivot

Due to performance issues I need to remove a few distinct counts on my DAX. However, I have a particular scenario and I can't figure out how to do it.
As example, let's say one or more restaurants can be hired at one or more feasts and prepare one or more menus (see data below).
I want a PowerPivot table that shows in how many feasts each restaurant was present (see table below). I achieved this by using distinctcount.
Why not precalculate this in Power Query? The real data I have is a bit more complex (more ID columns), and in order to be able to pivot the data I would have to calculate thousands of possible combinations.
I tried adding to my model a Feast dimensional table (on the example this would only be 1 column of 2 rows). I was hoping to use that relationship to be able to make a straight count, but I haven't been able to come up with the right DAX to do so.
You could use COUNTROWS() combined with VALUES().
Specifically, COUNTROWS() will give you the count of rows in a table. That means COUNTROWS is expecting a table as input. Here's the magic part: VALUES() returns a table as its result, and the table it returns contains the distinct values of the table/column that you provide as the argument to VALUES().
I'm not sure if I'm explaining it well, so for the sample data you provided, the measure would look like this (assuming the table is named Table1):
Unique Feasts:=COUNTROWS(VALUES('Table1'[Feast Id]))
You can then create a pivot table from Powerpivot, and drag Restaurant Id into Rows, and drag the measure above into Values. Same result as DISTINCTCOUNT, but with less performance overhead (I think).
