Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be
visited_intervals name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for the first group there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4,1])
Use a custom lambda function with Series.diff, drop the first value (a NaT, since there is nothing to diff against) by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print(df)
name intervals
0 John [4, 1]
1 Mary [2, 2]
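The iloc[1:] step only drops the NaT that diff produces at the start of each group. Starting from the original frame, a roughly equivalent sketch (out and intervals are just illustrative names) removes that NaT with Series.dropna instead:
out = (df.groupby('name')['visited']
         .apply(lambda x: x.diff().dropna().dt.days.astype(int).tolist())
         .reset_index(name='intervals'))
print(out)
name intervals
0 John [4, 1]
1 Mary [2, 2]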
Suppose I have the following df:
df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo',
                            'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df
looks like:
name visits
0 Sara 0
1 John 0
2 Christine 1
3 Paul 2
4 Jo 3
5 Zack 9
6 Chris 6
7 Mathew 10
8 Suzan 3
I wrote a few lines of code to get each name's percentage of total visits and sort them in descending order:
import numpy as np

df['percent'] = df['visits'] / np.sum(df['visits'])
df = df.sort_values(by='percent', ascending=False).reset_index(drop=True)
Now I have the percentage of total visits for each name:
name visits percent
0 Mathew 10 0.294118
1 Zack 9 0.264706
2 Chris 6 0.176471
3 Jo 3 0.088235
4 Suzan 3 0.088235
5 Paul 2 0.058824
6 Christine 1 0.029412
7 Sara 0 0.000000
8 John 0 0.000000
What I need is the group of rows that holds the largest share of the total. For example, the first 3 rows represent ~73% of the total visits, and the rest could be neglected compared to the sum of the percentages of those first 3 rows.
I know I can select the top 3 by using nlargest:
df.nlargest(3, 'percent')
But there is high variability in the data, so the dominant group could be the first 2 or 3 rows, or even more.
EDIT:
How can I automatically find the N rows that hold the largest share of the total?
You have to define outliers in some way. One way is to use scipy.stats.zscore like in this answer:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo',
                            'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df['percent'] = (df['visits'] / np.sum(df['visits']))
df[stats.zscore(df['percent']) > 0.6]  # rows whose percent z-score exceeds 0.6
which prints
name visits percent
5 Zack 9 0.264706
6 Chris 6 0.176471
7 Mathew 10 0.294118
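If you would rather define the dominant group by a coverage threshold than by outlier detection, a minimal sketch along these lines keeps the top rows until their cumulative share of visits reaches a cutoff; the 0.7 here is an arbitrary illustrative value, not something from the question:
top = df.sort_values('percent', ascending=False)
covered = top['percent'].cumsum()
# keep every row needed to reach the cutoff, including the row that crosses it
dominant = top[covered.shift(fill_value=0) < 0.7]
print(dominant)
name visits percent
7 Mathew 10 0.294118
5 Zack 9 0.264706
6 Chris 6 0.176471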
Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6',
                               '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to calculate the last visit interval in days for the two people; in this example, the outcome should be
last_visited_interval name
0 1 John
1 2 Mary
since '2015-3-5' and '2015-3-6' are 1 day apart, and '2016-3-6' and '2016-3-8' are 2 days apart.
I tried
df.groupby('name').agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.last()))
but got the exception
last() missing 1 required positional argument: 'offset'
How should I do it?
Series.last works differently here: it returns the trailing rows of a Series selected by a date offset from a DatetimeIndex, which is why it requires an offset argument. It is also not GroupBy.last, because the lambda works on a plain Series. Use Series.iloc or Series.iat instead:
df.groupby('name').agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.iat[-1]))
last_visited_interval
name
John 1.0
Mary 2.0
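To see why the original call failed, here is a small sketch of what Series.last actually does: it selects trailing data by a date offset from a DatetimeIndex (and is deprecated in recent pandas versions):
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2016-3-4', '2016-3-6', '2016-3-8']))
print(s.last('3D'))  # rows within the final 3 days of the index
2016-03-06    2
2016-03-08    3
dtype: int64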
I am trying to sum all columns based on the value of the first, but groupby.sum is unexpectedly not working.
Here is a minimal example:
import pandas as pd
data = [['Alex',10, 11],['Bob',12, 10],['Clarke',13, 9], ['Clarke',1, 1]]
df = pd.DataFrame(data,columns=['Name','points1', 'points2'])
print(df)
df.groupby('Name').sum()
print(df)
I get this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 13 9
3 Clarke 1 1
And not this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
From what I understand, the dataframe is not in the right format for pandas to perform the groupby. I would like to understand what is wrong with it, because this is just a toy example, but I have the same problem with a real dataset.
The real data I'm trying to read is the Johns Hopkins University Covid-19 dataset:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
You forgot to assign the output of the aggregation to a variable; the aggregation does not work in place. That is why print(df) before and after the groupby returns the same original DataFrame.
df1 = df.groupby('Name', as_index=False).sum()
print(df1)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
Or assign back to the same variable df:
df = df.groupby('Name', as_index=False).sum()
print(df)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
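Applied to the linked Covid-19 time series, the same fix looks roughly like the sketch below; the file name and the 'Country/Region' column are assumptions based on that repository's usual layout:
# Assumption: the raw CSV keeps the repository's usual layout, with
# 'Province/State', 'Country/Region', 'Lat', 'Long' and one column per date.
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
covid = pd.read_csv(url)

# assign the result -- groupby does not modify covid in place
by_country = covid.groupby('Country/Region').sum(numeric_only=True)
print(by_country.head())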
I have a dataframe:
import pandas as pd
d = {'user': ['bob', 'bob', 'peter', 'peter'],
     'item': ['s1', 's1', 's2', 's2'],
     'value': [1, 2, 5, 4]}
df = pd.DataFrame(data=d)
which is
user item value
0 bob s1 1
1 bob s1 2
2 peter s2 5
3 peter s2 4
I want to aggregate value based on [user, item]. The new dataframe should be
user item value
0 bob s1 [1,2]
1 peter s2 [5,4]
where value is a list; how can I do that?
print(df.groupby(['user', 'item']).agg(list).reset_index())
user item value
0 bob s1 [1, 2]
1 peter s2 [5, 4]
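A roughly equivalent sketch uses named aggregation, in case you want to keep the keys as columns and control the output column name (values here is just an illustrative name):
# as_index=False keeps 'user' and 'item' as regular columns;
# the named aggregation collects each group's 'value' entries into a list
out = df.groupby(['user', 'item'], as_index=False).agg(values=('value', list))
print(out)
user item values
0 bob s1 [1, 2]
1 peter s2 [5, 4]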
How can I select all rows of a DataFrame where a condition on a column depends on the relationship between every two consecutive entries of that column? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
                       'B': ['spam', 'ham', 'egg', 'foo'],
                       'C': [4, 5, 3, 4]})
>>> df
A B C
0 1 spam 4
1 2 ham 5
2 3 egg 3
3 4 foo 4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]
>>> df2
A B C
1 2 ham 5
3 4 foo 4
There is plenty of great information about slicing and indexing in the pandas docs, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for are the rows where the value stored in C is no longer monotonically declining.
Any help is appreciated!
Use boolean indexing, comparing the column with its shifted values:
print(df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail:
print(df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
Equivalently, to get the rows where C stops declining, compare the diff of the column:
print(df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4
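Detail, mirroring the mask above:
print(df['C'].diff())
0    NaN
1    1.0
2   -2.0
3    1.0
Name: C, dtype: float64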