drop multiple columns pySpark - apache-spark

I'm using pyspark version 2.4.
I have a weird issue with dropping columns after joining.
I get the correct result if I drop one column, but I get an error if I drop two columns.
I want to drop the 'address' and 'role' columns from the workers1 data frame on the joined data frame (joined_workers).
from pyspark.sql import functions as f

workers1 = spark.createDataFrame(
    [("barmen", "Paris", "25"),
     ("waitress", None, "22")],
    ["role", "address", "age"])
workers1.toPandas()
>>>
role address age
0 barmen Paris 25
1 waitress None 22
workers2 = spark.createDataFrame(
    [("barmen", "Paris"),
     (None, "Berlin")],
    ["role", "address"])
workers2.toPandas()
>>>
role address
0 barmen Paris
1 None Berlin
columns_to_join_on = ["role", "address"]

joined_workers = workers1.alias("workers1").join(
    workers2.alias("workers2"),
    [
        *[getattr(workers1, col).eqNullSafe(getattr(workers2, col))
          for col in columns_to_join_on]
    ],
    how="right",
)
joined_workers.toPandas()
>>>
role address age role address
0 None None None None Berlin
1 barmen Paris 25 barmen Paris
# expected result
joined_workers.drop(*[f.col("workers1.role")]).toPandas()
>>>
address age role address
0 None None None Berlin
1 Paris 25 barmen Paris
# Works as expected
joined_workers.drop(*[f.col("workers1.address")]).toPandas()
>>>
role age role address
0 None None None Berlin
1 barmen 25 barmen Paris
# Works as expected
joined_workers.drop(*[f.col("workers1.role"), f.col("workers1.address")]).toPandas()
>>>
TypeError: each col in the param list should be a string

Just select the columns you want to retain, or select all columns except the ones to drop.
df.select([col for col in df.columns if col not in ['workers1.role', 'workers1.address']])
Update: In case of join with common column names:
joined_workers.select(
    ["workers1." + col for col in workers1.columns if col not in ['role', 'address']]
    + ["workers2." + col for col in workers2.columns if col not in [<if_any_from_2>]]
).show()
Remove the second if condition on workers2.columns if all columns from the workers2 DataFrame are to be retained.
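Applied to the question's joined_workers, a minimal sketch of this workaround (assuming the workers1/workers2 aliases from the join above are still in scope) could look like this:
# Sketch only: build the lists of columns to keep from each aliased DataFrame,
# so that 'role' and 'address' from workers1 are effectively dropped.
keep_from_workers1 = [
    f.col("workers1." + c) for c in workers1.columns
    if c not in ("role", "address")
]
keep_from_workers2 = [f.col("workers2." + c) for c in workers2.columns]

joined_workers.select(*keep_from_workers1, *keep_from_workers2).toPandas()
Selecting sidesteps drop() entirely: drop() accepts either a single Column object or multiple column names passed as strings, so passing two Column objects is what triggers the TypeError above.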

Related

How to aggregate Python Pandas dataframe such that value of a variable corresponds to the row a variable is selected in aggfunc?

I have the following data
ID DATE AGE COUNT
1 NaT 16 1
1 2021-06-06 19 2
1 2020-01-05 20 3
2 NaT 23 3
2 NaT 16 3
2 2019-02-04 36 12
I want to aggregate this so that DATE will be the earliest valid date (in time), while AGE will be extracted from the row from which that earliest date was selected. The output should be
ID DATE AGE COUNT
1 2021-06-06 19 1
2 2019-02-04 36 3
My code, below, gives this error: TypeError: Must provide 'func' or named aggregation **kwargs.
import numpy as np
import pandas as pd

df_agg = pd.pivot_table(df, index=['ID'],
                        values=['DATE', 'AGE'],
                        aggfunc={'DATE': np.min, 'AGE': None, 'COUNT': np.min})
I don't want to use 'AGE': np.min since for ID=1, AGE=16 will be extracted which is not what I want.
///////////// Edits ///////////////
Edits made to provide a more generic example.
You can try .first_valid_index():
x = df.loc[df.groupby("ID").apply(lambda x: x["DATE"].first_valid_index())]
print(x)
Prints:
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
EDIT: Using .pivot_table(). You can extract "DATE"/"AGE" together as a list; for "COUNT" you can use np.min or "min". The second step is to explode the "DATE"/"AGE" list into separate columns:
df_agg = pd.pivot_table(
    df,
    index=["ID"],
    values=["DATE", "AGE", "COUNT"],
    aggfunc={
        "DATE": lambda x: df.loc[x.first_valid_index()][
            ["DATE", "AGE"]
        ].tolist(),
        "COUNT": "min",
    },
)
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].apply(pd.Series))
print(df_agg)
Prints:
COUNT DATE AGE
ID
1 1 2021-06-06 19
2 3 2019-02-04 36
You can sort values and drop the duplicates (sort_index is optional)
df.sort_values(['DATE']).drop_duplicates('ID').sort_index()
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
With groupby and transform:
df[df['DATE'] == df.groupby("ID")['DATE'].transform('min')]
Assuming you have an index, a simple solution would be:
def min_val(group):
    group = group.loc[group.DATE.idxmin]
    return group

df.groupby(['ID']).apply(min_val)
If you do not have an index you can use:
df.reset_index().groupby(['ID']).apply(min_val).drop(columns=['ID'])

merging varying number of rows and columns by multiple conditions in python

Updated problem: Why does it not merge a_date, a_par, a_cons, a_ment and a_le? These are appended as columns without values, but in the original dataset they have values.
Here is how the dataset looks like
connector type q_text a_text var1 var2
1 1111 1 aa None xx ps
2 9999 2 None tt jjjj pppp
3 1111 2 None uu None oo
4 9999 1 bb None yy Rt
5 9999 1 cc None zz tR
Goal: how the dataset should look
connector q_text a_text var1 var1.1 var2 var2.1
1 1111 aa uu xx None ps oo
2 9999 bb tt yy jjjj Rt pppp
3 9999 cc tt zz jjjj tR pppp
Logic: Column type has either value 1 or 2; multiple rows have value 1, but only one row (with the same value in connector) has value 2.
Here are the main merging rules:
Merge every row of type=1 with its corresponding (connector) type=2 row.
Since multiple rows of type=1 have the same connector value, I don't want to merge only one row of type=1 but all of them, each with the single type=2 row.
Since some columns (e.g. a_text) follow left-join logic, values can be overridden without adding an extra column.
Since var2 values cannot be merged by left-join, because they are non-exclusionary with regard to the row's connector value, I want extra columns (var1.1, var2.1) for those values (pppp, jjjj).
In summary (and keeping in mind that I only speak of rows with the same connector value): if q_text is None, I first want to replace the values in a_text with the a_text value (see tt and uu in the table above) of the corresponding row (same connector value), and secondly, I want to append some other values (var1 and var2) of that same corresponding row as new columns.
Also, there are rows with a unique connector value that will not be matched. I want to keep those rows.
I only want to "drop" the type=2 rows that get merged with their corresponding type=1 row(s). In other words: I don't want to keep the rows of type=2 that have a match and get merged into their corresponding (connector) type=1 rows. I want to keep all other rows, though.
The solution by @victor__von__doom here,
merging varying number of rows by multiple conditions in python,
was given when I originally wanted to keep all of the type=2 columns (values).
Code I used (it merges Perso, q_text and a_text):
df.loc[df['type'] == 2, 'a_date'] = df['q_date']
df.loc[df['type'] == 2, 'a_par'] = df['par']
df.loc[df['type'] == 2, 'a_cons'] = df['cons']
df.loc[df['type'] == 2, 'a_ment'] = df['pret']
df.loc[df['type'] == 2, 'a_le'] = df['q_le']
my_cols = ['Perso', 'q_text','a_text', 'a_le', 'q_le', 'q_date', 'par', 'cons', 'pret', 'q_le', 'a_date','a_par', 'a_cons', 'a_ment', 'a_le']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['a_text', 'Perso'],inplace=True)
df.reset_index(drop=True,inplace=True)
Data: This is a representation of the core dataset. Unfortunately I cannot share the actual data due to privacy laws.
Perso | ID | per | q_le | a_le | pret | par | form | q_date | name | IO_ID | part | area | q_text | a_text | country | cons | dig | connector | type
J Ws | 1-1/4/2001-11-12/1 | 1999-2009 | None | 4325 | 'Mi, h', 'd' | Cew | Thre | 2001-11-12 | None | 345 | rede | s — H | None | wr ede | Terd e | e r | 2001-11-12.1.g9 | 999999999 | 2
S ts | 9-3/6/2003-10-14/1 | 1994-2004 | None | 23 | 'sd, h' | d-g | Thre | 2003-10-14 | None | 34555 | The | l? I | None | Tre | Thr ede | re | 2001-04-16.1.a9 | 333333333 | 2
On d | 6-1/6/2005-09-03/1 | 1992-2006 | None | 434 | 'uu h' | d-g | Thre | 2005-09-03 | None | 7313 | Thde | l? I | None | T e | Th rede | dre | 2001-08-07.1.e4 | 111111111 | 2
None | 3-4/4/2000-07-07/1 | 1992-2006 | 1223 | None | 'uu h' | dfs | Thre | 2000-07-07 | Th r | 7413 | Thde | Tddde | Thd de | None | Thre de | | 2001-07-06.1.j3 | 111111111 | 1
None | 2-1/6/2001-11-12/1 | 1999-2009 | 1444 | None | 'Mi, h', 'd' | d-g | Thre | 2001-11-12 | T rj | 7431 | Thde | l? I | Th dde | None | Thr ede | | 2001-11-12.1.s7 | 999999999 | 1
None | 1-6/4/2007-11-01/1 | 1993-2010 | 2353 | None | None | d-g | Thre | 2007-11-01 | Thrj | 444 | Thed | l. I | Tgg gg | None | Thre de | we e | 2001-06-11.1.g9 | 654982984 | 1
EDIT v2 with additional columns
This version ensures the values in the additional columns are not impacted.
c = ['connector','type','q_text','a_text','var1','var2','cumsum','country','others']
d = [[1111, 1, 'aa', None, 'xx', 'ps', 0, 'US', 'other values'],
     [9999, 2, None, 'tt', 'jjjj', 'pppp', 0, 'UK', 'no values'],
     [1111, 2, None, 'uu', None, 'oo', 1, 'US', 'some values'],
     [9999, 1, 'bb', None, 'yy', 'Rt', 1, 'UK', 'more values'],
     [9999, 1, 'cc', None, 'zz', 'tR', 2, 'UK', 'less values']]

import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.DataFrame(d, columns=c)
print(df)
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
df.loc[df['type'] == 2, 'var2.1'] = df['var2']
my_cols = ['q_text','a_text','var1','var2','var1.1','var2.1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'],inplace=True)
df.reset_index(drop=True,inplace=True)
print (df)
Original DataFrame:
connector type q_text a_text var1 var2 cumsum country others
0 1111 1 aa None xx ps 0 US other values
1 9999 2 None tt jjjj pppp 0 UK no values
2 1111 2 None uu None oo 1 US some values
3 9999 1 bb None yy Rt 1 UK more values
4 9999 1 cc None zz tR 2 UK less values
Updated DataFrame
connector type q_text a_text var1 var2 cumsum country others var1.1 var2.1
0 1111 1 aa uu xx ps 0 US other values None oo
1 9999 1 bb tt yy Rt 1 UK more values jjjj pppp
2 9999 1 cc tt zz tR 2 UK less values jjjj pppp

Adding new columns to a dataframe

I have a list containing the column names of a dataframe. I want to add these empty columns to a dataframe that already exists.
col_names = ["a", "b", "e"]
df = pd.DataFrame()
df = # stores some content
I understand a single new column can be added in the following manner and I could do the same for other columns
df['e'] = None
However, I'd like to know how to add these new columns at once.
You can use the same syntax as adding a single new column:
df[col_names] = None
When you create the data frame, you can pass the col_names to the columns parameter:
import pandas as pd
col_names = ["a", "b", "e"]
df = pd.DataFrame(columns=col_names)
print(df)
# Empty DataFrame
# Columns: [a, b, e]
# Index: []
print(df.columns)
# Index(['a', 'b', 'e'], dtype='object')
You can simply give df[col_list] = None. Here's an example of how you can do it.
import pandas as pd

df = pd.DataFrame({'col1': ['river','sea','lake','pond','ocean'],
                   'year': [2000,2001,2002,2003,2004],
                   'col2': ['apple','peach','banana','grape','cherry']})
print (df)
Created a dataframe with 3 columns and 5 rows:
Output of df is:
col1 year col2
0 river 2000 apple
1 sea 2001 peach
2 lake 2002 banana
3 pond 2003 grape
4 ocean 2004 cherry
Now I want to add columns ['a','b','c','d','e'] to the df. I can do it by just assigning None to the column list.
temp_cols = ['a','b','c','d','e']
df[temp_cols] = None
print (df)
The updated dataframe will have:
col1 year col2 a b c d e
0 river 2000 apple None None None None None
1 sea 2001 peach None None None None None
2 lake 2002 banana None None None None None
3 pond 2003 grape None None None None None
4 ocean 2004 cherry None None None None None
assign with dict.fromkeys also works
In [219]: df = pd.DataFrame()
In [220]: df.assign(**dict.fromkeys(col_names))
Out[220]:
Empty DataFrame
Columns: [a, b, e]
Index: []
This also works for adding empty (None-valued) columns to an existing dataframe.
sample df
import numpy as np

np.random.seed(20)
df = pd.DataFrame(np.random.randint(0, 4, 3*2).reshape(3,2), columns=['col1','col2'])
Out[240]:
col1 col2
0 3 2
1 3 3
2 0 2
df = df.assign(**dict.fromkeys(col_names))
Out[242]:
col1 col2 a b e
0 3 2 None None None
1 3 3 None None None
2 0 2 None None None
Please try reindex
#To add to an existing dataframe
df=df.reindex(list(df.columns)+col_names, axis='columns', fill_value='None')
Name Weight a b e
0 John Average None None None
1 Paul Below Average None None None
2 Darren Above Average None None None
3 John Average None None None
4 Darren Above Average None None None

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info that describes what those strings and digits mean. I want to map user input (item) against info, match it, print and count how many matches are present in the last column of data, and prioritize the dataframe data based on the number of matches.
import pandas as pd

# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)

# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
user input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First get all the arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and join them with Series.str.join, and finally use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the names of the common columns are the same in both dataframes), based on a key column. df1 can have multiple entries for that key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict with its key (column-wise, one column at a time). Here key_name = party_id and column_name = age:
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue here is that it only maps one entry for that key value, not all of them. In this example party_id == 3 has multiple entries in df1.
Keys that are not in df2 should have their respective values for that column left unchanged.
Can anyone help me with an efficient solution, so that all columns can be updated at the same time? My df1 is quite large, more than 500k rows.
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then collect the columns that are the same in both DataFrames into cols and replace missing values with the original values using DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
party_id age person_name col2
0 1 25.0 abdjc sdag
1 2 31.0 fAgBS asdg
2 3 65.0 Afdc sgsdf
3 5 34.0 Afazbf qfwjk
4 6 78.0 asgsdb fdgd
5 7 35.0 sdgsd dsfbds
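As an alternative design, if only a couple of columns of df1 need refreshing in place, a hedged sketch using DataFrame.update (which aligns on the index and only overwrites with non-missing values) could look like this; the key and column names follow the question's example, and df1 may repeat party_id while df2 must not:
cols_to_update = ['age', 'col2']  # assumed subset of columns to refresh

# Key both frames on party_id; update() aligns rows on the index,
# so the repeated party_id rows in df1 all receive the df2 values.
df1_keyed = df1.set_index('party_id')
df1_keyed.update(df2.drop_duplicates('party_id').set_index('party_id')[cols_to_update])
df1 = df1_keyed.reset_index()
Note that update() will typically upcast the refreshed integer columns (such as age) to float.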
