Last Visited Interval for different people - python-3.x

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-4', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-04 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to calculate the last visited interval in days for each person; in this example, the outcome should be
   last_visited_interval  name
0                      1  John
1                      2  Mary
Since '2015-3-5' and '2015-3-6' have an interval of 1 day, while '2016-3-6' and '2016-3-8' have an interval of 2 days.
I tried
df.groupby('name').agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.last()))
but got the exception of
last() missing 1 required positional argument: 'offset'
How should I do it?

If you check Series.last, it works differently - it returns the final periods of the data based on a date offset (it needs a DatetimeIndex). It is also not GroupBy.last, because the lambda function operates on a Series. So you can use Series.iloc or Series.iat instead:
df.groupby('name').agg(last_visited_interval=('visited',lambda x:x.diff().dt.days.iat[-1]))
last_visited_interval
name
John 1.0
Mary 2.0
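If you also want the result shaped exactly like the desired output (a plain last_visited_interval column next to name), here is a small follow-up sketch building on the code above; it assumes every person has at least two visits, so the last diff is never NaN:
out = (df.groupby('name')
         .agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.iat[-1]))
         .reset_index()[['last_visited_interval', 'name']])
# the int cast is safe only because no group here has a single visit
out['last_visited_interval'] = out['last_visited_interval'].astype(int)
print(out)
#    last_visited_interval  name
# 0                      1  John
# 1                      2  Mary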

Related

Map Pandas Series containing key/value pairs to new columns with data

I have a dataframe containing a pandas series (column 2) as below:
column 1  column 2                                                                                                  column 3
1123      Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request        INC29192
1251      NaN                                                                                                       INC18217
1918      Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request  INC19281
I'm struggling to extract, split and map the column 2 data to a set of new columns with the appropriate data for each record (where possible, i.e. where data is available, since I have NaNs).
The desired output is something like this (where I've dropped the column 2 data for legibility):
column 1  column 3  Requested By  Requested On      Comments
1123      INC29192  John Doe 1    12 October 2021   This is a generic request
1251      INC18217  NaN           NaN               NaN
1918      INC19281  John Doe 2    2 September 2021  This is another generic request
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that produces the desired output.
First I would convert the column 2 values to dictionaries, then build a DataFrame from them and join it to your df:
df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {})
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request
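If the three keys always appear in this fixed order, a regex-based sketch is another option (an assumption-laden alternative, not the answer above). It is applied to the original frame, before column 2 is dropped, and assumes the separator really is the literal two characters \n plus a space, as shown in the sample. Regex group names cannot contain spaces, so the illustrative names below would need a rename afterwards:
pattern = (r'Requested By = (?P<requested_by>.*?)\\n '
           r'Requested On = (?P<requested_on>.*?)\\n '
           r'Comments = (?P<comments>.*)')
# non-matching and NaN rows come back as all-NaN, which covers the 1251 record
extracted = df['column 2'].str.extract(pattern)
result = df.drop(columns='column 2').join(extracted)
print(result)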

How to update a dataframe column from a second dataframe where values in two specific columns (which can repeat in the first) match in both dataframes?

I have two dataframes with different information about a person; in the first dataframe, a person's name may repeat in different rows. I want to add/update the first dataframe with data from the second dataframe where the two columns containing the person's data match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
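One common way to do this is a left merge on both key columns; a minimal sketch, assuming each (name, surname) pair appears at most once in df2:
# a left join keeps every row of df1; unmatched rows (paul / another) get NaN in account_id
df1 = df1.merge(df2, on=['name', 'surname'], how='left')
print(df1)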

List of visited intervals

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals for each person; in this example, the outcome should be
avg_visited_interval name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for John there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print(df)
name intervals
0 John [4, 1]
1 Mary [2, 2]
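An equivalent sketch (run against the original df, before it is reassigned above) that avoids the Python-level apply by computing the group-wise gaps first and then collecting them with agg(list); the intervals column name follows the answer above:
tmp = df.assign(gap=df.groupby('name')['visited'].diff().dt.days)
out = (tmp.dropna(subset=['gap'])          # drop each person's first visit (no previous gap)
          .astype({'gap': int})
          .groupby('name')['gap'].agg(list)
          .reset_index(name='intervals'))
print(out)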

Pandas Window Average Based On Conditions

I am trying to perform a window operation on the following pandas data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'visitor_id': ['a','a','a','a','a','a','b','b','b','b','c','c','c','c','c'],
                   'time_on_site': [3,5,6,4,5,3,7,6,7,8,1,2,2,1,2],
                   'site_visit': [1,2,3,4,5,6,1,2,3,4,1,2,3,4,5],
                   'feature_visit': [np.nan,np.nan,1,np.nan,2,3,1,2,3,4,np.nan,1,2,3,np.nan]})
"For each distinct user, calculate the average time they spent on the website and their total number of visits before they interacted with a feature."
The data consists of four columns with the following definitions:
visitor_id is a string that identifies a unique visitor
time_on_site is the time they spent on the website
site_visit is an incrementing counter of the times they visited the website
feature_visit is an incrementing counter of the times they used a specific feature on the site. If a customer visited the site before they ever interacted with the feature, a NaN is produced; likewise, a visit with no interaction with the new feature produces a NaN. Each time they visited the site and interacted with the feature, the counter is incremented by one.
visitor_id time_on_site site_visit feature_visit
a 3 1 NaN
a 5 2 NaN
a 6 3 1
a 4 4 NaN
a 5 5 2
a 3 6 3
b 7 1 1
b 6 2 2
b 7 3 3
b 8 4 4
c 1 1 NaN
c 2 2 1
c 2 3 2
c 1 4 3
c 2 5 NaN
The expected output should look like this:
id mean count
a 4 2
b NaN 0
c 1 1
Which was created based on the following logic:
For user a, the expected output is 4, which is the average time_on_site for site_visit 1 and 2, which occurred before the first feature interaction on site_visit 3.
For user b the average time should be NaN because they had no prior visits before their first interaction with the feature.
For user c, their average time is just 1, since they only had one visit before interacting with the new feature.
If a user never used the new feature, their mean and count should be NaN.
Thanks in advance for the help.
Try this:
def summarize(x):
    index = x[x['feature_visit'].notnull()].index[0]
    return pd.Series({
        'mean': x[x.index < index]['time_on_site'].mean(),
        'count': x[x.index < index]['site_visit'].count()
    })

df.groupby('visitor_id').apply(summarize)
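A vectorized sketch of the same idea (an alternative, not the answer above): a group-wise forward fill leaves the rows before each visitor's first feature interaction as NaN, so those rows can be selected and aggregated directly.
# rows before a visitor's first feature interaction are still NaN after the ffill
before_first = df.groupby('visitor_id')['feature_visit'].ffill().isna()
out = (df[before_first]
       .groupby('visitor_id')
       .agg(mean=('time_on_site', 'mean'), count=('site_visit', 'count')))
# visitors with no pre-feature visits (like 'b') drop out of the groupby;
# re-add them so they show a NaN mean and a count of 0
out = out.reindex(df['visitor_id'].unique())
out['count'] = out['count'].fillna(0).astype(int)
print(out)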

How to select and subset rows based on string in pandas dataframe?

My dataset looks like the following. I am trying to subset my pandas dataframe so that only the responses given by all 3 people are selected. For example, in the dataframe below the responses answered by all 3 people are "I like to eat" and "You have nice day", so only those should be kept. I am not sure how to achieve this in a pandas dataframe.
Note: I am new to Python, please provide an explanation with your code.
DataFrame example
import pandas as pd
data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ',
                     'I like to eat', 'You have nice day', 'My name is',
                     'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)
print(df)
Output:
Person Response
0 1 I like to eat
1 1 You have nice day
2 1 My name is
3 2 I like to eat
4 2 You have nice day
5 2 My name is
6 2 This is it
7 3 I like to eat
8 3 You have nice day
IIUC, use transform with nunique - keep the rows whose Response is answered by as many distinct Persons as exist in the whole frame:
yourdf=df[df.groupby('Response').Person.transform('nunique')==df.Person.nunique()]
yourdf
Out[463]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
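To see what Method 1 compares, here is a small illustrative sketch run on the df above: transform('nunique') broadcasts each response's distinct-person count back onto every row, so it can be checked against df.Person.nunique() row by row.
counts = df.groupby('Response').Person.transform('nunique')
# rows where counts equals df.Person.nunique() (here 3) are the ones Method 1 keeps
print(pd.concat([df, counts.rename('n_persons')], axis=1))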
Method 2: use GroupBy.filter and keep a Response group only if every unique Person appears in it
df.groupby('Response').filter(lambda x : pd.Series(df['Person'].unique()).isin(x['Person']).all())
Out[467]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
