pyspark left join only with the first record - apache-spark

I have 2 pyspark dataframes.
I am looking for a way to join df1 with df2: a left join that keeps only the first matching row from df2.
df1:
ID string
1 sfafsda
2 trwe
3 gfdgsd
df2
ID address state
1 Montreal Quebec
1 Quebec Quebec
2 Trichy TN
2 Madurai TN
3 Bangalore KN
3 Mysore KN
3 Hosur KN
Expected output from join:
ID string address state
1 sfafsda Montreal Quebec
2 trwe Trichy TN
3 gfdgsd Bangalore KN
As I am working on Databricks, please let me know whether it's easier to implement this "left join with only the first row" in pyspark, or whether a SQL join can achieve the expected output. Thanks.

Yes, it's possible using pyspark, but you need to add an index column to df2 so that "the first row" is well defined. See the code below:
from pyspark.sql import functions as F, Window

df2 = df2.withColumn('index', F.monotonically_increasing_id())

df1.join(df2, 'ID', 'left') \
    .select('*', F.first(F.array('address', 'state')).over(Window.partitionBy('ID').orderBy('index')).alias('array')) \
    .select('ID', 'string', F.col('array')[0].alias('address'), F.col('array')[1].alias('state')) \
    .groupBy('ID', 'string') \
    .agg(F.first('address').alias('address'), F.first('state').alias('state')) \
    .orderBy('ID')

Related

Replace values of several columns with values mapping in other dataframe PySpark

I need to replace values of several columns (many more than those in the example, so I would like to avoid doing multiple left joins) of a dataframe with values from another dataframe (mapping).
Example:
df1 EXAM
id
question1
question2
question3
1
12
12
5
2
12
13
6
3
3
7
5
df2 VOTE MAPPING :
id
description
3
bad
5
insufficient
6
sufficient
12
very good
13
excellent
Output
id
question1
question2
question3
1
very good
very good
insufficient
2
very good
excellent
sufficient
3
bad
null
insufficient
Edit 1: Corrected id for excellent in vote map
First of all, you can create a reference dataframe (note the imports, and that the column in df2 is called description):
from pyspark.sql import functions as func
from pyspark.sql.types import MapType, StringType

df3 = df2.select(
    func.create_map(func.col('id'), func.col('description')).alias('ref')
).groupBy().agg(
    func.collect_list('ref').alias('ref')
).withColumn(
    'ref', func.udf(lambda lst: {k: v for element in lst for k, v in element.items()}, returnType=MapType(StringType(), StringType()))(func.col('ref'))
)
+--------------------------------------------------------------------------------+
|ref                                                                             |
+--------------------------------------------------------------------------------+
|{3 -> bad, 12 -> very good, 5 -> insufficient, 13 -> excellent, 6 -> sufficient}|
+--------------------------------------------------------------------------------+
Then you can replace the value in question columns by getting the value in reference with 1 crossJoin:
df4 = df1.crossJoin(df3)\
.select(
'id',
*[func.col('ref').getItem(func.col(col)).alias(col) for col in df1.columns[1:]]
)
df4.show(10, False)
+---+---------+---------+------------+
|id |question1|question2|question3   |
+---+---------+---------+------------+
|1  |very good|very good|insufficient|
|2  |very good|excellent|sufficient  |
|3  |bad      |null     |insufficient|
+---+---------+---------+------------+

Update dataframe cells according to match cells within another dataframe in pandas [duplicate]

I have two dataframes in python. I want to update rows in first dataframe using matching values from another dataframe. Second dataframe serves as an override.
Here is an example with same data and code:
DataFrame 1 :
DataFrame 2:
I want to update dataframe 1 based on matching Code and Name. In this example, Dataframe 1 should be updated as below:
Note : Row with Code =2 and Name= Company2 is updated with value 1000 (coming from Dataframe 2)
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can use concat + drop_duplicates, which updates the common rows and adds the new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update due to the comments below:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(inplace=True)  # drop=True would discard the Code and Name columns
You can merge the data first and then use numpy.where to keep the new value wherever one exists:
import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is an update function available
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
if you want the result to be an integer.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe).
You want the ones that came from the right. So just drop any columns ending in '_x' and rename the '_y' ones (note col[:-2] rather than str.strip('_y'), which strips characters, not a suffix, and would mangle names like 'money_y'):
for col in df_merged.columns:
    if col.endswith('_x'):
        df_merged.drop(columns=col, inplace=True)
    elif col.endswith('_y'):
        df_merged.rename(columns={col: col[:-2]}, inplace=True)
Concatenate the datasets
Drop the duplicates by Code
Sort the values
combined_df = pd.concat([df1, df2]).drop_duplicates(['Code'], keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
df1.loc[indexes, 'Value'] = df2['Value'].values  # .loc, since .at only accepts a single label

Convert pandas columns into rows (melt doesn't work)

How can I achieve this in pandas? I have a way where I take each column out as a new dataframe and then do an insert in SQL, but that way, if I had 10 columns, I would need to make 10 dataframes, so I want to know how to achieve it dynamically.
I have a data set where I have the following data
Output I have
Id col1 col2 col3
1 Ab BC CD
2 har Adi tony
Output I want
Id col1
1 AB
1 BC
1 CD
2 har
2 ADI
2 Tony
melt does work, you just need a few extra steps for the exact output.
Assuming "Id" is a column (if not, reset_index).
(df.melt(id_vars='Id', value_name='col1')
.sort_values(by='Id')
.drop('variable', axis=1)
)
Output:
Id col1
0 1 Ab
2 1 BC
4 1 CD
1 2 har
3 2 Adi
5 2 tony
Used input:
df = pd.DataFrame({'Id': [1, 2],
'col1': ['Ab', 'har'],
'col2': ['BC', 'Adi'],
'col3': ['CD', 'tony']})

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the names of the common columns are the same in both dataframes), based on a key column. df1 can have multiple entries for that key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age, col2 in df1 based on the value present in df2. And key column here is party_id.
I tried mapping df2 into dict with their key (column wise, one column at time). Here key_name = party_id and column_name = age
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue here is that it only maps one column at a time, not all columns for that key value. In this example, party_id == 3 has multiple entries in df1.
Keys which are not in df2 should keep their value for that column unchanged.
Can anyone help me with an efficient solution, as my df1 is big (more than 500k rows)? Ideally all columns would be updated at the same time.
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then collect the columns that are the same in both DataFrames into cols and replace missing values with the original values via DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
   party_id  account_id         product_type   age         dob   status    col2
0         1           1              Current  12.0  28-01-1994   active     abc
1         2           2              Savings  35.0  14-07-1988  pending     sfd
2         3           3                Loans  65.0  22-07-1954   frozen     shd
3         3           4  Over Draft Facility  65.0  29-01-1927   active     shd
4         4           5             Mortgage  93.0  01-03-1926  pending  sdggsd
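The dictionary mapping the asker tried can also be made to work for every shared column, by looping over the shared columns with Series.map and falling back to the original value with fillna. A sketch, assuming (as stated) that party_id is unique in df2; a trimmed version of the example data is used for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    'party_id': [1, 2, 3, 3, 4],
    'age': [25, 31, 65, 93, 93],
    'col2': ['sdag', 'asdg', 'sgsdf', 'dsfhgd', 'sdggsd'],
})
df2 = pd.DataFrame({
    'party_id': [1, 2, 3, 5],
    'age': [12, 35, 65, 34],
    'col2': ['abc', 'sfd', 'shd', 'qfwjk'],
})

# Columns to pull from df2; party_id stays the key.
cols = ['age', 'col2']
lookup = df2.set_index('party_id')  # one row per key in df2

for c in cols:
    # map covers every df1 row, including duplicate keys like party_id == 3;
    # fillna keeps the original value for keys absent from df2 (party_id == 4).
    df1[c] = df1['party_id'].map(lookup[c]).fillna(df1[c])
```

Since df2 has only ~3k rows, building the lookup is cheap, and each column update is a single vectorized pass over df1.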

How to merge 2 pivot tables (DataFrame objects) in 1 [duplicate]

I have the following dataframes:
> df1
id begin conditional confidence discoveryTechnique
0 278 56 false 0.0 1
1 421 18 false 0.0 1
> df2
concept
0 A
1 B
How do I merge on the indices to get:
id begin conditional confidence discoveryTechnique concept
0 278 56 false 0.0 1 A
1 421 18 false 0.0 1 B
I ask because it is my understanding that merge() i.e. df1.merge(df2) uses columns to do the matching. In fact, doing this I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
self._validate_specification()
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on
Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called "index"?
Use merge, which is an inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join, which is a left join by default:
df1.join(df2)
Or concat, which is an outer join by default:
pd.concat([df1, df2], axis=1)
Samples:
df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
You can use concat([df1, df2, ...], axis=1) in order to concatenate two or more DFs aligned by indexes:
pd.concat([df1, df2, df3, ...], axis=1)
Or merge for concatenating by custom fields / indexes:
# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])
# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1', right_index=True)
or join for joining by index:
df1.join(df2)
By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis
Join and pd.merge:
can take DataFrame arguments
This question has been resolved for a while and all the available options are already out there. However, in this answer I'll attempt to shed a bit more light on these options to help you understand when to use what.
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
DataFrame.join (joins on index)
pd.concat (joins on index)
merge
PROS: supports inner/left/right/full joins; supports column-column, index-column, and index-index joins
CONS: can only join two frames at a time
join
PROS: supports inner/left (default)/right/full joins; can join multiple DataFrames at a time
CONS: only supports index-index joins
concat
PROS: specializes in joining multiple DataFrames at a time; very fast (concatenation is linear time)
CONS: only supports inner/full (default) joins; only supports index-index joins
Index to index joins
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
Other types of joins (left, right, outer) follow similar syntax (and can be controlled using how=...).
Notable Alternatives
DataFrame.join defaults to a left outer join on the index.
left.join(right, how='inner')
If you happen to get ValueError: columns overlap but no suffix specified, you will need to specify the lsuffix and rsuffix arguments to resolve this. Since the column names are the same, a differentiating suffix is required.
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default.
pd.concat([left, right], axis=1, sort=False)
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
left.merge(right, left_index=True, right_on='key')
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple levels/columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
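To illustrate that pre-step: to join a column of one frame against the index of another using join, promote the column to the index first (frame and column names here are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})
right = pd.DataFrame({'extra': [10, 20]}, index=['a', 'b'])

# join only aligns index to index, so set 'key' as the index first,
# then restore it as a column afterwards.
out = left.set_index('key').join(right).reset_index()
```

The unmatched key 'c' survives with NaN in 'extra', as expected from join's default left join.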
This post is an abridged version of my work in Pandas Merging 101. Please follow this link for more examples and other topics on merging.
A silly bug that got me: the joins failed because index dtypes differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index, the indices looked identical in Jupyter. It only came to light when saving to Excel...
I fixed it with: df1[['key']] = df1[['key']].apply(pd.to_numeric)
Hopefully this saves somebody an hour!
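To make that failure mode concrete: when one index holds strings and the other integers, an index join typically matches nothing and silently fills NaN, even though both indexes print identically. A small sketch of the symptom and the to_numeric fix (frame names are illustrative):

```python
import pandas as pd

# Two frames whose indexes print identically but differ in dtype.
df_a = pd.DataFrame({'v1': [10, 20]}, index=pd.Index(['1', '2'], name='key'))  # string keys
df_b = pd.DataFrame({'v2': [100, 200]}, index=pd.Index([1, 2], name='key'))    # int keys

broken = df_a.join(df_b)   # nothing matches: '1' != 1, so v2 is all NaN

df_a.index = pd.to_numeric(df_a.index)  # align the dtypes
fixed = df_a.join(df_b)    # now every row matches
```

Checking df.index.dtype on both sides before joining catches this up front.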
If you want to join two dataframes in pandas, you can simply use the available functions like merge or concat.
For example, if I have two dataframes df1 and df2, I can join them by:
newdataframe = pd.merge(df1, df2, left_index=True, right_index=True)
You can try these few ways to merge/join your dataframe.
merge (inner join by default)
df = pd.merge(df1, df2, left_index=True, right_index=True)
join (left join by default)
df = df1.join(df2)
concat (outer join by default)
df = pd.concat([df1, df2], axis=1)
