Calculating dynamic difference between two columns in pandas - python-3.x

I have a situation in which I need to calculate the difference between multiple columns and store the results in separate columns under separate headers. My dataset looks like the one below:
cat_1 cat_2 cat_3 cat_4 date_1 date_2 date_3 date_4
a b b c 2020-01-01 2020-01-01 2020-01-25 2020-01-10
b c d 2019-01-11 2020-01-01 2020-01-15 2020-01-10
a b d 2018-11-01 2019-01-01 2020-01-15 2020-01-10
a b c d 2015-01-01 2016-01-29 2018-01-25 2019-01-10
.. and so on
The order will always follow a->b->c->d; the reverse is not true.
I want to store the following combinations in new columns, expressed as a number of days. There will be 4 combinations in total. Essentially, for each combination I want to calculate the difference between its two dates and store the result in days.
An example output for the first row of my dataset:
days_a-b days_a-c days_a-d days_b-c days_b-d days_c-d
0 9
355 364 -5
How to solve this one?
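A rough sketch of one possible approach, assuming the column layout above and that each category letter appears at most once per row: map each row's categories to their dates, then take the pairwise differences in days.
import itertools
import pandas as pd

# parse the date columns (assumed names date_1..date_4, as shown above)
date_cols = [f"date_{i}" for i in range(1, 5)]
df[date_cols] = df[date_cols].apply(pd.to_datetime)

def pair_days(row):
    # map each category present in the row to its date
    dates = {row[f"cat_{i}"]: row[f"date_{i}"]
             for i in range(1, 5) if pd.notna(row[f"cat_{i}"])}
    out = {}
    for first, second in itertools.combinations("abcd", 2):
        if first in dates and second in dates:
            out[f"days_{first}-{second}"] = (dates[second] - dates[first]).days
    return pd.Series(out)

result = df.join(df.apply(pair_days, axis=1))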

Related

Select top n columns based on another column

I have a dataframe of dates, countries and populations, and I would like to obtain a pandas DataFrame filtered to the 2 rows per date with the highest population.
I know that pandas offers a method called nlargest:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
I have mimicked your dataframe below and provided a way forward to get the desired result; I hope this is helpful.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
Your Desired Solution:
You can use the nlargest method along with the set_index and groupby methods.
This is what you will get:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Now, as you want the DataFrame back in its original shape, reset the index, which will give you the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another way:
With groupby and an apply function, use reset_index with drop=True and the level parameter:
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
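As a side note (a minor variation, not part of the answer above), passing group_keys=False to groupby keeps the original row index, so the reset_index step is not needed:
>>> df.groupby('Date', group_keys=False).apply(lambda p: p.nlargest(2, columns='population'))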

Adjust the overlapping dates in group by with priority from another columns

As the title suggests, I am working on a problem to find overlapping dates based on ID and adjust the overlapping dates based on priority (weight). The following piece of code helped find the overlapping dates:
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce the priority (weight) and adjust Start_date by it. In the image below, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the strings to numeric weight values, and then what? I'm stuck on setting up the logic.
Dataframe:
op_d = {'ID': [1,1,1,2,2,3,3,3],'Start_date':['9/1/2020','10/10/2020','11/18/2020','4/1/2015','5/12/2016','4/1/2015','5/15/2016','8/1/2018'],\
'End_date':['10/9/2020','11/25/2020','12/31/2020','5/31/2016','12/31/2016','5/29/2016','9/25/2018','10/15/2020'],\
'Weight':['A','B','C','A','B','A','B','C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then try adding a day to End_date, shifting it, and assigning the result to Start_date where the overlap column is True:
import numpy as np
import pandas as pd

arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
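Note that Start_date and End_date are plain strings in op_d, so this sketch assumes they were parsed to datetime before computing the overlap flag and the date arithmetic, for example:
df['Start_date'] = pd.to_datetime(df['Start_date'])
df['End_date'] = pd.to_datetime(df['End_date'])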

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
A 2019-01-01 11.18 TX 234567 3
B 2019-01-02 12.19 WA 456789 4
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
B 2019-01-02 12.50 DC 157890 7
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
A 2019-01-04 09:40 CA 234567 11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
User should be same
Time difference should be less than 15 mins
Cookie should be different
So if I apply the filter I should get the following data:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
So, in the above, the first two rows (index 1 and 2) satisfy all the conditions. The next two (index 2 and 3) have the same cookie, index 3 and 4 have different users, 5 and 6 are selected and displayed, and 6 and 7 have a time difference of more than 15 mins. Rows 8, 9 and 10 fit the criteria, but 11 doesn't, as the date is 24 hours apart.
How can I solve this using a pandas DataFrame? All help is appreciated.
What I have tried:
I tried creating flags using
shift()
cookiediff=pd.DataFrame(df.Cookie==df.Cookie.shift())
cookiediff.columns=['Cookiediffs']
timediff=pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns=['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['Cookiediffs'][mask] = np.nan
This will do the trick:
import numpy as np
import pandas as pd

# you have an inconsistent time delimiter - just to correct it per your sample data
df["time"] = df["time"].str.replace(":", ".")
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

# a row qualifies if it matches the previous row OR the next row on all three conditions
cond_ = np.logical_or(
    df["time"].sub(df["time"].shift()).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    df["time"].sub(df["time"].shift(-1)).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to ensure your time column is datetime in order to make the 15-minute condition verifiable.
Then, the final filter (cond_) is obtained by comparing each row to the previous one, checking all 3 conditions, OR by doing the same but checking against the next one (otherwise you would just get all the consecutive matching rows except the first one).
Outputs:
user time city cookie index
0 A 2019-01-01 11:00:00 NYC 123456 1
1 A 2019-01-01 11:12:00 CA 234567 2
4 B 2019-01-02 12:21:00 FL 456789 5
5 B 2019-01-02 12:31:00 VT 987654 6
7 A 2019-01-03 09:12:00 CA 123456 8
8 A 2019-01-03 09:27:00 NYC 345678 9
9 A 2019-01-03 09:34:00 TX 123456 10
You could use regular expressions to isolate the fields, using named groups and the groupdict() function to store the value of each field in a dictionary, and then compare the values from the last dictionary to the current one. So iterate through each line of the dataset with two dictionaries, the current dictionary and the last dictionary, perform a re.search() on each line with the regex pattern string to separate the line into named fields, and then compare the values of the two dictionaries.
So, something like:
import re
c_dict=re.search('(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}\.\d{2}) +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)',s).groupdict()
for each line of your dataset. For the first line of your dataset, this would create the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you could easily compare the values of the fields to previous lines if you stored those in another dictionary.
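A rough, runnable sketch of that loop, assuming the dataset rows are available as strings in a list called raw_lines (a hypothetical name):
import re
from datetime import datetime, timedelta

# pattern from above, accepting either '.' or ':' as the time delimiter
pattern = re.compile(r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2}) +'
                     r'(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)')

def parse_time(raw):
    return datetime.strptime(raw.replace(':', '.'), '%Y-%m-%d %H.%M')

last = None
matches = []
for line in raw_lines:  # raw_lines holds the dataset lines as strings (assumed)
    m = pattern.search(line)
    if not m:
        continue
    cur = m.groupdict()
    if (last is not None
            and cur['user'] == last['user']
            and cur['cookie'] != last['cookie']
            and parse_time(cur['time']) - parse_time(last['time']) < timedelta(minutes=15)):
        matches.append((last, cur))
    last = cur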

Select rows of every specific number of col where values is negative and convert values 0 in another column in python3

I have the below dataframe:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0.151357 -0.103219 0.410599
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0.205158 0.313068 0.854096
Suppose an even column in a row contains a negative value (there may be multiple conditions, e.g. the value is negative or greater than 10); then I want to set the next odd column's value in that row to 0.
Expected output:
date B C D E
2019-07-01 00:00 0.400157 0.978738 2.240893 1.867558
2019-07-01 00:10 -0.950088 0 -0.103219 0
2019-07-01 00:20 1.454274 0.761038 0.121675 0.443863
2019-07-01 00:30 -1.494079 0 0.313068 0.854096
A one-liner solution would be best, or can we write a function for this?
This solution requires the date column to be set as the index:
df.set_index('date', inplace=True)
df[df.shift(axis=1) < 0] = 0
df.reset_index(inplace=True)
df.shift(axis=1) returns a new dataframe with all the columns shifted one position to the right (the shift amount and direction can be changed using the periods parameter). This enables you to compare each cell with the one to its left.
Source: DataFrame.shift
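If the "multiple condition" case mentioned in the question (negative or greater than 10) is needed, the same pattern presumably extends by combining the boolean masks, for example:
df.set_index('date', inplace=True)
shifted = df.shift(axis=1)
df[(shifted < 0) | (shifted > 10)] = 0
df.reset_index(inplace=True)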

Python: Merge Pandas Data Frame with Pivot Table

I have the following two dataframes that I want to merge.
df1:
id time station
0 a 22.08.2017 12:00:00 A1
1 b 22.08.2017 12:00:00 A3
2 a 22.08.2017 13:00:00 A2
...
pivot:
station A1 A2 A3
time
22.08.2017 12:00:00 10 12 11
22.08.2017 13:00:00 9 7 3
22.08.2017 14:00:00 2 3 4
22.08.2017 15:00:00 3 2 7
...
it should look like:
merge:
id time station value
0 a 22.08.2017 12:00:00 A1 10
1 b 22.08.2017 12:00:00 A3 11
2 a 22.08.2017 13:00:00 A2 7
...
Now I want to add a column to the data frame with the right value from the pivot table. I failed at including the column labels in the merge.
I constructed something like this, but it does not work:
merge = pd.merge(df1, pivot, how="left", left_on=["time", "station"], right_on=["station", pivot.columns])
Any help?
EDIT:
As advised, instead of the pivot table I tried to use the following data:
df2:
time station value
22.08.2017 12:00:00 A1 10
22.08.2017 12:00:00 A2 12
22.08.2017 12:00:00 A3 11
...
22.08.2017 13:00:00 A1 9
22.08.2017 13:00:00 A2 7
22.08.2017 13:00:00 A3 3
The table contains about 1,300 different stations for every timestamp. All in all I have more than 115,000,000 rows. My df1 has 5,000,000 rows.
Now I tried to merge df1.head(100) and df2, but in the result all the values are NaN. For this I used:
merge = pd.merge(df1.head(100), df2, how="left", on=["time", "station"])
Another problem is that the merge takes a few minutes, so I expect the whole df1 will take several days.
I guess you got the dataframe pivot using either pivot or pivot_table in pandas; if you can perform the merge using the dataframe you had before the pivot, it should work just fine.
Otherwise you will have to reverse the pivot using melt before merging:
melt = pd.concat([pivot[['time']],pivot[['A1']].melt()],axis = 1)
melt = pd.concat([melt,pd.concat([pivot[['time']],pivot[['A2']].melt()],axis = 1)])
melt = pd.concat([melt,pd.concat([pivot[['time']],pivot[['A3']].melt()],axis = 1)])
melt.columns = ['time','station','value']
Then just perform the merge as you expected:
my_df.merge(melt,on = ['time','station'])
id time station value
0 a time1 A1 10
1 b time1 A3 11
2 a time2 A2 7
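As a side note, the same unpivot can be written in one call with DataFrame.melt, assuming time is a regular column of pivot:
melt = pivot.melt(id_vars='time', var_name='station', value_name='value')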
EDIT:
If your dataframes are as big as in your edit, you indeed have to perform the merges on chunks of them. You could try chunking both your dataframes.
First, sort your df1 in order to have only close values of time:
df1.sort_values('time',inplace = True)
Then chunk it, chunk the second dataframe in a way that ensures you have all the rows you might need, and then merge those chunks:
chunk1 = df1.head(100)
chunk2 = df2.loc[df2.time.between(chunk1.time.min(),chunk1.time.max())]
merge = chunk1.merge(chunk2,on = ['time','station'],how = 'left')
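A rough sketch of repeating this over all of df1 (the answer above only shows a single chunk; chunk_size is a hypothetical value to tune to the available memory):
chunk_size = 100000
parts = []
for start in range(0, len(df1), chunk_size):
    chunk1 = df1.iloc[start:start + chunk_size]
    chunk2 = df2.loc[df2.time.between(chunk1.time.min(), chunk1.time.max())]
    parts.append(chunk1.merge(chunk2, on=['time', 'station'], how='left'))
merge = pd.concat(parts, ignore_index=True)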
