Is there a Pandas way to group dates that are less than 2 minutes apart in a dataframe? - python-3.x

I would like to group my data so that dates that are close together (less than 2 minutes apart) end up in the same group.
Here is an example of what I have:
> datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
>          ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
>          ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
>          ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
>          ['C', 34, 'id5', '2020-05-23 17:31:10.341']]
>
> df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
The first two rows have close dates, the same holds for the 3rd and 4th rows, and the 5th row is on its own.
I would like to get something like this :
> datas = [['A', 51, 'id1 id2', 'date_1'],
>          ['B', 45, 'id3 id4', 'date_2'],
>          ['C', 34, 'id5', 'date_3']]
>
> df = pd.DataFrame(datas, columns=['col1', 'col2', 'col_id', 'dates'])
Doing it in plain Python is not that hard, but I have to run it on a big dataframe, so a pandas approach using groupby would be much more efficient.
After converting the dates column to datetime, I tried:
> df.groupby([df['dates'].dt.date]).agg(','.join)
but .dt.date groups by calendar day, not by 2-minute windows.
Do you have a solution?
Thank you

Looking at the expected output, it seems like we want to group the dates by a 2-minute frequency together with col1 and col2.
Code
df['dates'] = pd.to_datetime(df.dates)
(df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1', 'col2'])
   .agg(','.join)
   .reset_index()
   .sort_values('col1')
   .reset_index(drop=True))
Output
dates col1 col2 cold_id
0 2020-05-27 05:50:00 A 51 id1,id2
1 2020-05-24 17:22:00 B 45 id3,id4
2 2020-05-23 17:30:00 C 34 id5
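One caveat worth noting: pd.Grouper(freq='2 min') buckets the timestamps into fixed 2-minute windows, so two rows only a few seconds apart can still land in different groups when they straddle a window boundary. If the requirement is strictly "consecutive rows less than 2 minutes apart", a gap-based key is an option. A minimal sketch of that idea, reusing the column names from the question:
import pandas as pd

df['dates'] = pd.to_datetime(df['dates'])
df = df.sort_values('dates')

# start a new group whenever the gap to the previous row is 2 minutes or more
df['group'] = (df['dates'].diff() >= pd.Timedelta(minutes=2)).cumsum()

out = (df.groupby(['group', 'col1', 'col2'], as_index=False)
         .agg({'cold_id': ','.join, 'dates': 'first'}))
Here 'first' just keeps a representative timestamp per group; pick whatever aggregation of the dates suits your use case.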

Using only the dates, this is what I did to classify your rows.
First of all, convert the dates to timestamps so that they are easy to compare:
from datetime import datetime
import time

# parse each date string and convert it to a Unix timestamp (whole seconds)
df["dates"] = df["dates"].apply(lambda x: int(time.mktime(datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f").timetuple())))
Then, sort them by date:
df = df.sort_values("dates")
Finally, using this answer, I create a group column in order to identify close dates. The first line puts a 1 in the group column whenever a row's date is more than 120 seconds after the previous one, i.e. whenever a new group starts. The second line fills in the group column: cumsum turns those 1s into group numbers, ffill propagates each group number to the rows that follow, and fillna(0) assigns group 0 to the rows before the first break:
df.loc[(df.dates.shift() < df.dates - 120),"group"] = 1
df['group'] = df['group'].cumsum().ffill().fillna(0)
This gives me:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4 1590333810 1.00
2 B 45 id3 1590333835 1.00
0 A 51 id1 1590551443 2.00
1 A 51 id2 1590551468 2.00
Now, to concatenate your cold_id values, group by group and join the cold_id values of each group using transform:
df["cold_id"] = df.groupby(["group"],as_index=False)["cold_id"].transform(lambda x: ','.join(x))
df = df.drop_duplicates(subset=["cold_id"])
This finally gives you this dataframe:
col1 col2 cold_id dates group
4 C 34 id5 1590247870 0.00
3 B 45 id4,id3 1590333810 1.00
0 A 51 id1,id2 1590551443 2.00

Related

Split corresponding column values in pyspark

The table below is the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe (the values of col2 and col3 should be split on ; and kept aligned with each other):
col1  col2  col3
1     12    Aus
1     34    SL
1     56    NZ
2     31    Ind
2     54    US
2     81    UK
3     null  Ban
4     Ned   null
You can use the pyspark function split() to convert the column with multiple values into an array and then the function explode() to make multiple rows out of the different values.
It may look like this:
df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want to keep NULL values you can use explode_outer().
If you want the values of multiple exploded arrays to stay aligned across rows, you can use posexplode() and then filter() to keep only the rows where the positions match.
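Exploding two columns independently would otherwise create a cross product per row, which is why the positions need to be matched afterwards. As a concrete hedged sketch of the single-column case, assuming the input dataframe df has the columns from the question:
from pyspark.sql.functions import split, explode_outer

# split col2 on ';' into an array and explode it into one row per element;
# explode_outer (rather than explode) keeps the rows where col2 is null
df_col2 = df.withColumn("col2", explode_outer(split(df.col2, ";")))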
The code below works fine:
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode_outer, split

spark = SparkSession.builder.getOrCreate()

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)
#df.show()

# explode Countries with its position; the *_outer variant keeps rows where the column is null
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
#df2.show()
# explode Score with its position as well
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))
#df3.show()
# keep only the rows where the two positions match (or one side is null)
df4 = df3.filter((df3.pos1 == df3.pos2) | (df3.pos1.isNull() | df3.pos2.isNull()))
df4 = df4.select("Id", "value2", "value1")
df4.show()  # Final Output

Convert pandas columns into rows (melt doesn't work)

How can I achieve this in pandas? I have a way where I take each column out as a new dataframe and then do an insert into SQL, but that way, if I had 10 columns, I would have to build 10 dataframes, so I want to know how to do this dynamically.
I have a data set with the following data.
Output I have
Id col1 col2 col3
1 Ab BC CD
2 har Adi tony
Output I want
Id col1
1 AB
1 BC
1 CD
2 har
2 ADI
2 Tony
melt does work, you just need a few extra steps for the exact output.
Assuming "Id" is a column (if not, reset_index).
(df.melt(id_vars='Id', value_name='col1')
.sort_values(by='Id')
.drop('variable', axis=1)
)
Output:
Id col1
0 1 Ab
2 1 BC
4 1 CD
1 2 har
3 2 Adi
5 2 tony
Used input:
df = pd.DataFrame({'Id': [1, 2],
'col1': ['Ab', 'har'],
'col2': ['BC', 'Adi'],
'col3': ['CD', 'tony']})
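For comparison, a hedged sketch of the same reshape using stack() instead of melt(), assuming the same used input as above:
out = (df.set_index('Id')                    # keep Id while stacking
         .stack()                            # one row per (Id, original column) pair
         .reset_index(level=1, drop=True)    # drop the col1/col2/col3 level
         .rename('col1')
         .reset_index())
print(out)
Because stack() walks the frame row by row, the result is already ordered by Id without an explicit sort.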

Sort values in a dataframe with duplicate values

I have a dataframe with a format like this:
d = {'col1': ['PC', 'PO', 'PC', 'XY', 'XY', 'AB', 'AB', 'PC', 'PO'],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
df.sort_values(by = 'col1')
This gives me the rows sorted alphabetically by col1.
I want to sort the values based on col1 in a desired custom order, keeping the duplicates.
Any idea?
Thanks in advance!
You can create an order beforehand and then sort values as below.
order = ['PO','XY','AB','PC']
df['col1'] = pd.CategoricalIndex(df['col1'], ordered=True, categories=order)
df = df.sort_values(by = 'col1')
df
col1 col2
1 PO 2
8 PO 9
3 XY 4
4 XY 5
5 AB 6
6 AB 7
0 PC 1
2 PC 3
7 PC 8
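If you would rather not change the dtype of col1, a hedged alternative (pandas 1.1+) is sort_values with a key function that maps each value to its position in the desired order:
order = ['PO', 'XY', 'AB', 'PC']
rank = {v: i for i, v in enumerate(order)}

# sort by the position of each value in `order`; col1 stays a plain object column
df = df.sort_values(by='col1', key=lambda s: s.map(rank))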

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 with the values from the corresponding columns of df2 (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a given key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict with its key (column-wise, one column at a time). Here key_name = party_id and column_name = age:
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue is that this only maps one entry, not all rows for that key value. In this example party_id == 3 has multiple entries in df1.
For keys that are not in df2, the respective values of those columns should stay unchanged.
Can anyone help me with an efficient solution that updates all the columns at once? My df1 is quite big, more than 500k rows.
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then collect the columns that appear in both DataFrames into cols and replace the missing values with the original ones via DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
party_id age person_name col2
0 1 25.0 abdjc sdag
1 2 31.0 fAgBS asdg
2 3 65.0 Afdc sgsdf
3 5 34.0 Afazbf qfwjk
4 6 78.0 asgsdb fdgd
5 7 35.0 sdgsd dsfbds
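A hedged alternative that stays close to the mapping idea from the question: build one lookup from df2 and map each shared column onto df1's key. Each map is vectorized, handles multiple df1 rows per party_id, and keys missing from df2 keep their original values (column names as in the question):
lookup = df2.set_index('party_id')   # one row per key in df2
cols_to_update = df1.columns.intersection(df2.columns).difference(['party_id'])

for col in cols_to_update:
    mapped = df1['party_id'].map(lookup[col])   # value from df2, NaN if the key is absent
    df1[col] = mapped.fillna(df1[col])          # keep the original where df2 has no entry
This loops over the shared columns, but each mapping is a single vectorized operation, so it stays fast even with 500k rows in df1.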

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
a b
0 500 $
1 200 €
2 13 .586
3 47 .02
I want to merge those two columns without affecting the data.
Expected output:
df
Out:
a
0 500$
1 200€
2 13.586
3 47.02
Please help me with this...
I tried the solution below, but it did not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution works by converting both columns to strings and joining them with +; at the end the Series is converted to a one-column DataFrame. It only works if the numbers in column b are less than 1:
df1 = df.astype(str)
# regex=True is needed so '^0' is treated as a pattern (newer pandas defaults to regex=False)
df = (df1.a + df1.b.str.replace(r'^0', '', regex=True)).to_frame('a')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
Or, if you want mixed values, numeric for the last 2 rows and strings for the first 2 rows, use:
m = df.b.apply(lambda x: isinstance(x, str))
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b
df.loc[~m, 'a'] += df.pop('b')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
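A compact hedged variant of the same idea, reusing the question's df and with the same caveat that it only behaves as intended when the numeric values in b are less than 1 (so the leading 0 can be dropped):
# strip the leading zero from floats, keep strings as they are
b_str = df['b'].map(lambda v: str(v).lstrip('0') if isinstance(v, float) else str(v))
df = (df['a'].astype(str) + b_str).to_frame('a')
Like the first solution, this produces string values throughout rather than mixed types.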
