I have two DataFrames:
import pandas as pd

data = [['B100', 30], ['C200', 33], ['C201', 11]]
data2 = [['B99/B100/B105', 'Yes'], ['C150/C200/C201', 'Yes'], ['D56/D500/D501', 'Yes']]
df_1 = pd.DataFrame(data, columns=['code', 'value'])
df_2 = pd.DataFrame(data2, columns=['code_agg', 'rating'])
I need to pull the rating from df_2 into df_1 using a partial match between df_1's 'code' column and df_2's 'code_agg' column (df_1 only has a partial key/code). The result should look like this:
   code  value rating
0  B100     30    Yes
1  C200     33    Yes
2  C201     11    Yes
I have tried several methods; the most common error I get is "TypeError: 'Series' objects are mutable, thus they cannot be hashed".
I would greatly appreciate any help on this. Thank you!
Split code_agg on '/', explode the result so each code gets its own row, then merge on the now-shared code column:
df_1.merge(df_2.assign(code=df_2.code_agg.str.split('/')).explode('code'))
Out[]:
code value code_agg rating
0 B100 30 B99/B100/B105 Yes
1 C200 33 C150/C200/C201 Yes
2 C201 11 C150/C200/C201 Yes
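If explode is unavailable (it was added in pandas 0.25) or you only need the rating column, a plain dictionary lookup works too. A minimal sketch, assuming each code appears in at most one code_agg string:
# Build a code -> rating lookup from df_2, then map it onto df_1
lookup = {code: rating
          for codes, rating in zip(df_2['code_agg'], df_2['rating'])
          for code in codes.split('/')}
df_1['rating'] = df_1['code'].map(lookup)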
I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new DataFrame, and I am having a bit of trouble.
import pandas as pd

df = pd.DataFrame({'TotalInvoicedPrice': [123],
'TotalProductCost': [18],
'ShippingCost': [5]})
I tried using
df = df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this DataFrame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is a DataFrame that looks like:
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
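An equivalent spelling, if set_axis is less familiar, is to assign the labels directly (out is just a throwaway name):
out = df.T.reset_index()
out.columns = ['Metrics', 'Values']
print(out)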
Maybe you can avoid the transpose operation altogether (it carries a small performance overhead):
# Your DataFrame
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

# Form lists from the column names and from the first row's values
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()

# Create a DataFrame by pairing the two lists
df2 = pd.DataFrame(list(zip(l1, l2)), columns=['Metrics', 'Values'])
print(df2)
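For a one-row frame like this, pandas' melt can also do the whole reshape in a single call; a minimal sketch using the same df:
df2 = df.melt(var_name='Metrics', value_name='Values')
print(df2)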
I've created a dataframe of random values using the following code:
import pandas as pd
from numpy.random import random

values = random(5)
values_1 = random(5)
col1 = list(values / values.sum())
col2 = list(values_1)
df = pd.DataFrame({'col1': col1, 'col2': col2})
df.sort_values(by=['col2','col1'],ascending=[False,False]).reset_index(inplace=True)
The dataframe this produces is not sorted in descending order by 'col2'. What I want is for it to sort first by 'col2' and, where two rows share the same 'col2' value, to sort those by 'col1' as well. Any suggestions? Any help would be appreciated.
Your solution almost works, but because reset_index is called with inplace=True, it mutates only the temporary copy returned by sort_values (and returns None), so df itself never changes.
A possible solution is to add ignore_index=True to sort_values, which makes reset_index unnecessary:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame({'col1': np.random.random(5), 'col2': np.random.random(5)})
df = df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Or, if you want to use inplace, add it only to sort_values, together with ignore_index=True:
df.sort_values(by=['col2','col1'],ascending=False, ignore_index=True,inplace=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Your logic is correct, but you've missed an inplace=True inside sort_values. Because of this, the sort never actually takes effect on your dataframe. Replace your code with this:
df.sort_values(by=['col2','col1'],ascending=[False,False],inplace=True)
df.reset_index(inplace=True,drop=True)
You want to do the sort with inplace=True as well, not only the reset_index().
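To see why the original one-liner is a no-op, here is the same logic unrolled (a sketch; sorted_copy is a hypothetical name):
# sort_values without inplace returns a new, sorted copy of df
sorted_copy = df.sort_values(by=['col2', 'col1'], ascending=False)
# reset_index(inplace=True) mutates only that temporary copy and returns None,
# so in the original one-liner the sorted result was thrown away
sorted_copy.reset_index(drop=True, inplace=True)
df = sorted_copy  # the missing reassignment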
I have a sample dataframe as given below.
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to unwrap those lists while preserving the datatypes of the items inside them, for all columns.
The final output should look something like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex (passing a list of columns to explode requires pandas >= 1.3), then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return (for the first three rows; see the EDIT below for handling row D's np.nan values):
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists, with possible missing values stored as np.nan, you can collect the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If every list in column c contains a single item, unwrap it
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join list items; leave non-list values such as np.nan untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
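Run against the sample df above (row D included), this loop should yield something like:
  ID  Age     Sex         Interest
0  A   20    Male     Dance, Music
1  B   21    Male    Dance, Sports
2  C   19  Female  Hiking, Surfing
3  D   24     NaN              NaN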
I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
Importantly, I want to match on the combination of both columns, not on each column individually.
My approach was this:
# 1. Get all combinations
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)

# 2. Define the UDF
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Flag each row and filter
df_tmp = df_2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know of a better solution here? Thank you very much!
You can do a left semi join on the two columns. This keeps exactly the rows of df2 whose combination of values also occurs in df1, and lets Spark run an ordinary join instead of a slow Python UDF. Because the column names differ between the two frames, spell the join condition out:
df_result = df2.join(
    df1,
    (df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"]),
    "left_semi",
)
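If df1 is small enough to fit on each executor, a broadcast hint can avoid the shuffle entirely; a sketch under that assumption:
from pyspark.sql.functions import broadcast

df_result = df2.join(
    broadcast(df1),
    (df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"]),
    "left_semi",
)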
This is my first time asking a question. I have a dataframe that looks like below:
import pandas as pd
data = [['AK', 'Co',2957],
['AK', 'Ot', 15],
['AK','Petr', 86848],
['AL', 'Co',167],
['AL', 'Ot', 10592],
['AL', 'Petr',1667]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
I need to find, for each value in the first column, the rows holding the maximum and minimum values of the third column. I browsed through a few Stack Overflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK','Ot', 15],
['AK','Petr',86848],
['AL','Co',167],
['AL','Ot', 10592]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
Note: please let me know where I fell short before downvoting the question.
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter:
new_df = (
    my_df.loc[
        my_df.groupby(["State"])
        .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
        .stack()
    ]
    .reset_index(drop=True)
)
print(new_df)
State Energy Elec
0 AK Ot 15
1 AK Petr 86848
2 AL Co 167
3 AL Ot 10592
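If the named-aggregation spelling feels heavy, an alternative sketch pulls the per-state min and max rows separately and concatenates them (same result, plain indexing only):
import pandas as pd

new_df = pd.concat([
    my_df.loc[my_df.groupby('State')['Elec'].idxmin()],
    my_df.loc[my_df.groupby('State')['Elec'].idxmax()],
]).sort_values(['State', 'Elec']).reset_index(drop=True)
print(new_df)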