Python - Join/Merge Based on multiple column match - python-3.x

I would like to join two data frames based on multiple columns because there are duplicate IDs in the data sets.
I have tried a few ways, one of which is listed below.
However, I cannot get it right. The option below gives me all rows from both data frames. I figure this should be easy but for some reason, it is not working.
I checked the results: there are matching rows, but instead of joining on the match, I just get both rows in the final data frame.
I am comparing two different data sets to ensure the same data exists in both. There can be more than one transaction with the same ID, but I need to make sure that everything that exists in one data frame also exists in the other.
new_df = Enterprise.merge(Tableau,
                          left_on=['ID', 'AID', 'Amount', 'Tax', 'CC'],
                          right_on=['ID', 'AID', 'Amount', 'Tax', 'CC'],
                          how='left')
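One way to see which rows actually match is an outer merge with indicator=True (a sketch, assuming the five key columns really do have identical dtypes in both frames; since the key names are the same on both sides, on= can replace left_on/right_on):

keys = ['ID', 'AID', 'Amount', 'Tax', 'CC']
check = Enterprise.merge(Tableau, on=keys, how='outer', indicator=True)
print(check['_merge'].value_counts())   # 'both' = matched, 'left_only'/'right_only' = unmatched
missing_from_Tableau = check[check['_merge'] == 'left_only']
missing_from_Enterprise = check[check['_merge'] == 'right_only']

If rows that look identical still come out as left_only/right_only, check the key columns for dtype or whitespace mismatches (e.g. Amount stored as a string in one frame and as a float in the other).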

Related

Excel data tables: Multiple outputs with only one input column

I am trying to create a data table with multiple outputs across periods, but for the same scenarios.
Is it possible to create that without inserting an extra input column between each output column (i.e. the input column holding the index values 50-110)?
Is this in any way possible? See the picture of what I would usually mark to create the data table (this only covers one period/output, though). But if I were to make the scenarios for FY23, I would need to insert a column between FY22 and FY23 where I copy the index 50-110 again, and I would like to avoid that.

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so a given record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I want to find cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS "Before", t2.Col12 AS "After"
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all column names that changed across all records. I don't care about the actual values that changed, and it doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
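If CDF is not an option, one way to get that list directly from a self-join (a sketch in PySpark, assuming the table is registered as table1 and the keys are Key1/Key2/Key3 as in the SQL above) is to count mismatches per column and keep the columns with a non-zero count:

from pyspark.sql import functions as F

keys = ["Key1", "Key2", "Key3"]
t1 = spark.table("table1").alias("t1")
t2 = spark.table("table1").alias("t2")

joined = t1.join(t2, keys)
compare_cols = [c for c in t1.columns if c not in keys]

# Per column, count the rows where the two sides of the self-join disagree.
# (Like the SQL above, a plain != does not catch NULL-vs-value changes.)
diff_counts = joined.select([
    F.count(F.when(F.col(f"t1.{c}") != F.col(f"t2.{c}"), 1)).alias(c)
    for c in compare_cols
]).first().asDict()

changed_columns = [c for c, n in diff_counts.items() if n > 0]
print(changed_columns)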

Power BI - Anti Left join

Data:
I have two datasets, set up in Excel as a matrix: the first column is an ID with lots of rows, and the rest of the columns each have an ID number as their header, so roughly 500 rows and around 45 columns.
Like ID, ColumnB, ColumnC
The other matrix has the same headers, but in a different order; that does not seem to matter.
Challenge:
So I need to find the differences between the two. I made an anti-left join on ID, and that gives me the IDs that are in one data set but not in the other, right? I do that both ways, so I get the IDs that are missing from each dataset (matrix).
I need to do the same trick even when both IDs are present, so that I only get the rows with a difference in any column. For example, if a row has an "X" in ColumnB in dataset1 but no "X" in ColumnB in dataset2, I want to include it in my new table. So if the two compared rows differ in even just one of the columns, I need to know, and I want only the differing rows in my new data.
Tried:
I tried selecting not only the ID column but all of the columns in the anti-left join setup, but it does not seem to work at all.
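For reference, the "anti-join on every column" idea itself is sound; sketched outside Power BI in pandas (with hypothetical stand-in data), a left anti-join on all columns keeps exactly the rows that have no identical counterpart in the other table, i.e. the ID is missing or at least one column differs:

import pandas as pd

# Tiny stand-ins for the two Excel matrices (column names hypothetical).
df1 = pd.DataFrame({'ID': [1, 2, 3], 'ColumnB': ['X', '',  'X'], 'ColumnC': ['', 'X', '']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'ColumnB': ['X', 'X', 'X'], 'ColumnC': ['', 'X', 'X']})

# Merge on every column; rows of df1 with no identical row in df2 stay 'left_only'.
all_cols = list(df1.columns)
merged = df1.merge(df2, on=all_cols, how='left', indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)   # IDs 2 and 3 differ in at least one column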

combining two dataframes sorted by common columns in pandas

I've been stuck on this for a while - I have tried merging, concatenating, joins, but can't get what I want.
Given the two dataframes:
I want to combine them to get:
First I want them aligned by Start Location. If the names can be aligned, that is a bonus, but not critical. Although this example looks like the 2nd table is aligning to the 1st, it could be either way, depending on the Start Location. The two dataframes are appropriately sorted beforehand. I can't even figure this out with two dataframes, but ultimately I want to be able to combine several together.
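Without the pictured tables it is hard to be exact, but a sketch of "align by Start Location" is an outer merge on that column, re-sorted afterwards (the frames below are tiny stand-ins, not the real data):

import pandas as pd

# Hypothetical stand-ins for the two sorted dataframes from the question.
df1 = pd.DataFrame({'Name': ['A', 'B'], 'Start Location': [10, 30]})
df2 = pd.DataFrame({'Name': ['C', 'D'], 'Start Location': [20, 30]})

# Align on Start Location; rows present in only one frame get NaN in the
# other frame's columns, and the result stays ordered by Start Location.
combined = pd.merge(df1, df2, on='Start Location', how='outer',
                    suffixes=('_1', '_2')).sort_values('Start Location')
print(combined)

Several frames can be combined the same way by repeating the merge one frame at a time (e.g. with functools.reduce), though the column suffixes need care.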

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the user_id column (you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant column).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
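Another option worth noting: DataFrame.intersect already returns only distinct rows, so the count of common user_ids can also be written without spelling out the join:

# intersect() keeps only the rows present in both frames, deduplicated
common_count = dfA.select("user_id").intersect(dfB.select("user_id")).count()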
