Split corresponding column values in pyspark - apache-spark

The table below would be the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe (values of col2 and col3 should be split on ; and kept aligned with each other):
col1  col2  col3
1     12    Aus
1     34    SL
1     56    NZ
2     31    Ind
2     54    US
2     81    UK
3     null  Ban
4     Ned   null

You can use the PySpark function split() to convert the column with multiple values into an array, and then the function explode() to make a separate row out of each of the values.
It may look like this:
df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want to keep NULL values you can use explode_outer().
If you want the values of multiple exploded arrays to stay matched up across rows, you can work with posexplode() and then filter() down to the rows where the positions correspond.
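For example, the NULL-preserving variant of the line above could look like this (a minimal sketch using the column names from the example table; not code from the original answer):
from pyspark.sql.functions import explode_outer, split

# explode_outer keeps a row (with NULL) when the split array is NULL or empty
df = df.withColumn("col2", explode_outer(split(df.col2, ";")))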

The code below works fine:
from pyspark.sql.functions import posexplode_outer, split

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)
#df.show()
# Explode Countries together with the element positions (keeping NULL rows)
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
#df2.show()
# Explode Score the same way
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))
#df3.show()
# Keep only the rows where the two positions match (or either position is NULL)
df4 = df3.filter((df3.pos1 == df3.pos2) | (df3.pos1.isNull() | df3.pos2.isNull()))
df4 = df4.select("Id", "value2", "value1")
df4.show()  # Final output
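If you want the final column names to match the source columns, you could also alias them in the last select (a small optional addition, not part of the original answer):
from pyspark.sql.functions import col

df4.select(col("Id"),
           col("value2").alias("Score"),
           col("value1").alias("Countries")).show()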

Related

Is there a Pandas way to group dates that are less than 2 minutes apart in a dataframe?

I would like to do a groupby on my data to put together dates that are close (less than 2 minutes apart).
Here is an example of what I have:
> datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
>          ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
>          ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
>          ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
>          ['C', 34, 'id5', '2020-05-23 17:31:10.341']]
>
> df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
The first 2 rows have close dates, the same goes for the 3rd and 4th rows, and the 5th row is alone.
I would like to get something like this:
> datas = [['A', 51, 'id1 id2', 'date_1'],
>          ['B', 45, 'id3 id4', 'date_2'],
>          ['C', 34, 'id5', 'date_3']]
>
> df = pd.DataFrame(datas, columns=['col1', 'col2', 'col_id', 'dates'])
Doing it in plain Python is not that hard, but I have to do it on a big dataframe, so a pandas approach using groupby would be much more efficient.
After applying a datetime conversion to the dates column I tried:
> df.groupby([df['dates'].dt.date]).agg(','.join)
but the .dt.date accessor groups by calendar day, not by 2-minute windows.
Do you have a solution?
Thank you.
Looking at the expected output, it seems we are trying to group the dates with a 2-minute frequency together with col1 and col2.
Code
df['dates'] = pd.to_datetime(df.dates)
(df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1', 'col2'])
   .agg(','.join)
   .reset_index()
   .sort_values('col1')
   .reset_index(drop=True))
Output
dates col1 col2 cold_id
0 2020-05-27 05:50:00 A 51 id1,id2
1 2020-05-24 17:22:00 B 45 id3,id4
2 2020-05-23 17:30:00 C 34 id5
Using only the dates, this is what I did to classify your rows.
First of all, convert the dates to timestamps to compare them easily:
from datetime import datetime
import time
df["dates"] = df["dates"].apply(lambda x : int(time.mktime(datetime.strptime(x,"%Y-%m-%d %H:%M:%S.%f").timetuple())))
Then, sort them by date:
df = df.sort_values("dates")
Finally, using this answer, I create a group column in order to identify close dates. The first line puts a 1 in the group column whenever a row is more than 120 seconds after the previous one, i.e. whenever a new group starts. The second line fills in the group column (cumsum, then forward-fill) to remove the NaN values and assign a group id to every row:
df.loc[(df.dates.shift() < df.dates - 120),"group"] = 1
df['group'] = df['group'].cumsum().ffill().fillna(0)
This gives me:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4 1590333810 1.00
2 B 45 id3 1590333835 1.00
0 A 51 id1 1590551443 2.00
1 A 51 id2 1590551468 2.00
Now, to concatenate the cold_id values, group by the group column and join the cold_id of each group using transform:
df["cold_id"] = df.groupby(["group"],as_index=False)["cold_id"].transform(lambda x: ','.join(x))
df = df.drop_duplicates(subset=["cold_id"])
This finally gives you this dataframe:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4,id3 1590333810 1.00
0 A 51 id1,id2 1590551443 2.00
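Since the dates column now holds Unix timestamps, you could convert it back to readable datetimes at the end if needed (a small optional step, not part of the original answer):
# renders the epoch seconds as datetimes; time.mktime assumed local time, so these come back as UTC
df["dates"] = pd.to_datetime(df["dates"], unit="s")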

groupby column in pandas

I am trying to group by a column's values in pandas but I am not getting the result I want.
Example:
Col1 Col2 Col3
A 1 2
B 5 6
A 3 4
C 7 8
A 11 12
B 9 10
-----
Result needed, grouping by Col1:
Col1 Col2 Col3
A 1,3,11 2,4,12
B 5,9 6,10
C 7 8
but I am getting this output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025BEB4D6E50>
I can get this in Excel Power Query with a Group By and a Count All Rows step, but I can't get the same result with Python and pandas. Any help?
Try this
(
df
.groupby('Col1')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index()
)
It outputs:
Col1 Col2 Col3
0 A 1,3,11 2,4,12
1 B 5,9 6,10
2 C 7 8
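If you would rather keep the grouped values as Python lists instead of comma-joined strings, a small variant (not what the question asked for) would be:
df.groupby('Col1').agg(list).reset_index()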
Very good. Along the same lines, I created a solution for grouping values between 0 and 0:
df[df['A'] != 0].groupby((df['A'] == 0).cumsum()).sum()
It groups the values of the column that lie between two 0 rows and sums each group.
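As an illustration of that pattern on hypothetical data (the snippet above refers to a column 'A' that is not part of the example in the question), written with the mask applied to both the frame and the grouper:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 0, 3, 4, 0, 5]})
mask = df['A'] != 0                  # drop the 0 separator rows
blocks = (df['A'] == 0).cumsum()     # label each block between the 0s
groups = df[mask].groupby(blocks[mask])['A'].sum()
# groups -> block 0: 3 (1+2), block 1: 7 (3+4), block 2: 5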

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict with the key (column-wise, one column at a time). Here key_name = 'party_id' and column_name = 'age':
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue here is that this only updates one entry, not all the entries for that key value. In this example party_id == 3 has multiple entries in df1.
For keys which are not in df2, the respective values in those columns should stay unchanged.
Can anyone help me with an efficient solution, so that all columns can be updated at the same time? My df1 is large (more than 500k rows).
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then get the columns that are present in both DataFrames into cols, and replace the missing values with the original values using DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
party_id age person_name col2
0 1 25.0 abdjc sdag
1 2 31.0 fAgBS asdg
2 3 65.0 Afdc sgsdf
3 5 34.0 Afazbf qfwjk
4 6 78.0 asgsdb fdgd
5 7 35.0 sdgsd dsfbds
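For completeness, the per-column map() idea from the question also works for several columns if you loop over them (a sketch, assuming the list of columns to update is known; not part of the original answer):
for column_name in ['age', 'col2']:
    mapping = dict(zip(df2['party_id'], df2[column_name]))
    # map every df1 row (including duplicate party_id rows); keep the old value where the key is missing
    df1[column_name] = df1['party_id'].map(mapping).fillna(df1[column_name])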

Grouping corresponding Rows based on One column

I have a dataframe read from an Excel sheet, with no fixed number of rows and columns.
e.g.
Col1 Col2 Col3
A 1 -
A - 2
B 3 -
B - 4
C 5 -
I would like to group the rows of Col1 which have the same content, like the following:
Col1 Col2 Col3
A 1 2
B 3 4
C 5 -
I am using pandas groupby, but I am not getting what I want.
Try using groupby:
import numpy as np  # pd.np was deprecated and later removed, so use numpy's nan directly

print(df.replace('-', np.nan).groupby('Col1', as_index=False).first().fillna('-'))
Output:
Col1 Col2 Col3
0 A 1 2
1 B 3 4
2 C 5 -

Filter a dataframe with NOT and AND condition

I know this question has been asked multiple times, but for some reason it is not working for my case.
So I want to filter the dataframe using the NOT and AND condition.
For example, my dataframe df looks like:
col1 col2
a 1
a 2
b 3
b 4
b 5
c 6
Now, I want to use a condition to remove the rows where col1 has "a" AND col2 has 2.
My resulting dataframe should look like:
col1 col2
a 1
b 3
b 4
b 5
c 6
I tried the following, but even though I used &, it removes all the rows which have "a" in col1:
df = df[(df['col1'] != "a") & (df['col2'] != "2")]
To remove rows where col1 is "a" AND col2 is 2 means to keep rows where col1 isn't "a" OR col2 isn't 2 (the negation of A AND B is NOT(A) OR NOT(B)):
df = df[(df['col1'] != "a") | (df['col2'] != 2)] # or "2", depending on whether the `2` is an int or a str
