Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be
visited_intervals name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., for the first group there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4,1])
Use a custom lambda function with Series.diff, drop the first value (a NaT, since there is nothing to diff against) by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print(df)
name intervals
0 John [4, 1]
1 Mary [2, 2]
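The iloc[1:] step only drops the NaT that diff produces at the start of each group. Starting from the original frame, a roughly equivalent sketch (out and intervals are just illustrative names) removes that NaT with Series.dropna instead:
out = (df.groupby('name')['visited']
         .apply(lambda x: x.diff().dropna().dt.days.astype(int).tolist())
         .reset_index(name='intervals'))
print(out)
name intervals
0 John [4, 1]
1 Mary [2, 2]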
Suppose I have the following df:
df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo',
                            'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df
looks like:
name visits
0 Sara 0
1 John 0
2 Christine 1
3 Paul 2
4 Jo 3
5 Zack 9
6 Chris 6
7 Mathew 10
8 Suzan 3
I wrote a few lines of code to get each name's percentage of total visits and sort them in descending order:
import numpy as np

df['percent'] = df['visits'] / np.sum(df['visits'])
df = df.sort_values(by='percent', ascending=False).reset_index(drop=True)
Now I have the percentage of total visits for each name:
name visits percent
0 Mathew 10 0.294118
1 Zack 9 0.264706
2 Chris 6 0.176471
3 Jo 3 0.088235
4 Suzan 3 0.088235
5 Paul 2 0.058824
6 Christine 1 0.029412
7 Sara 0 0.000000
8 John 0 0.000000
What I need is the group of rows that holds the largest share of the total. For example, the first 3 rows represent ~73% of the total visits, and the rest could be neglected compared to the sum of the percentages of those first 3 rows.
I know I can select the top 3 by using nlargest:
df.nlargest(3, 'percent')
But there is high variability in the data, so the dominant group could be the first 2 or 3 rows, or even more.
EDIT:
How can I automatically find the N rows that hold the largest share of the total?
You have to define outliers in some way. One way is to use scipy.stats.zscore like in this answer:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo',
                            'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df['percent'] = (df['visits'] / np.sum(df['visits']))
df[stats.zscore(df['percent']) > 0.6]  # rows whose percent z-score exceeds 0.6
which prints
name visits percent
5 Zack 9 0.264706
6 Chris 6 0.176471
7 Mathew 10 0.294118
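If you would rather define the dominant group by a coverage threshold than by outlier detection, a minimal sketch along these lines keeps the top rows until their cumulative share of visits reaches a cutoff; the 0.7 here is an arbitrary illustrative value, not something from the question:
top = df.sort_values('percent', ascending=False)
covered = top['percent'].cumsum()
# keep every row needed to reach the cutoff, including the row that crosses it
dominant = top[covered.shift(fill_value=0) < 0.7]
print(dominant)
name visits percent
7 Mathew 10 0.294118
5 Zack 9 0.264706
6 Chris 6 0.176471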
Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5', '2015-3-6',
                               '2016-3-4', '2016-3-6', '2016-3-8'],
                   'name': ['John', 'John', 'John', 'Mary', 'Mary', 'Mary']})
df['visited'] = pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to calculate the last visit interval in days for the two people; in this example, the outcome should be
last_visited_interval name
0 1 John
1 2 Mary
since '2015-3-5' and '2015-3-6' are 1 day apart, and '2016-3-6' and '2016-3-8' are 2 days apart.
I tried
df.groupby('name').agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.last()))
but got the exception
last() missing 1 required positional argument: 'offset'
How should I do it?
Series.last works differently here: it returns the trailing rows of a Series selected by a date offset from a DatetimeIndex, which is why it requires an offset argument. It is also not GroupBy.last, because the lambda works on a plain Series. Use Series.iloc or Series.iat instead:
df.groupby('name').agg(last_visited_interval=('visited', lambda x: x.diff().dt.days.iat[-1]))
last_visited_interval
name
John 1.0
Mary 2.0
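To see why the original call failed, here is a small sketch of what Series.last actually does: it selects trailing data by a date offset from a DatetimeIndex (and is deprecated in recent pandas versions):
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2016-3-4', '2016-3-6', '2016-3-8']))
print(s.last('3D'))  # rows within the final 3 days of the index
2016-03-06    2
2016-03-08    3
dtype: int64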
I am trying to sum all columns based on the value of the first, but groupby.sum is unexpectedly not working.
Here is a minimal example:
import pandas as pd
data = [['Alex',10, 11],['Bob',12, 10],['Clarke',13, 9], ['Clarke',1, 1]]
df = pd.DataFrame(data,columns=['Name','points1', 'points2'])
print(df)
df.groupby('Name').sum()
print(df)
I get this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 13 9
3 Clarke 1 1
And not this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
From what I understand, the dataframe is not in the right format for pandas to perform the groupby. I would like to understand what is wrong with it, because this is just a toy example, but I have the same problem with a real dataset.
The real data I'm trying to read is the Johns Hopkins University Covid-19 dataset:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
You forgot to assign the output of the aggregation to a variable; the aggregation does not work in place. That is why print(df) before and after the groupby returns the same original DataFrame.
df1 = df.groupby('Name', as_index=False).sum()
print(df1)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
Or assign back to the same variable df:
df = df.groupby('Name', as_index=False).sum()
print(df)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
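Applied to the linked Covid-19 time series, the same fix looks roughly like the sketch below; the file name and the 'Country/Region' column are assumptions based on that repository's usual layout:
# Assumption: the raw CSV keeps the repository's usual layout, with
# 'Province/State', 'Country/Region', 'Lat', 'Long' and one column per date.
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
covid = pd.read_csv(url)

# assign the result -- groupby does not modify covid in place
by_country = covid.groupby('Country/Region').sum(numeric_only=True)
print(by_country.head())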
I have a dataframe:
import pandas as pd
d = {'user': ['bob', 'bob', 'peter', 'peter'],
     'item': ['s1', 's1', 's2', 's2'],
     'value': [1, 2, 5, 4]}
df = pd.DataFrame(data=d)
which is
user item value
0 bob s1 1
1 bob s1 2
2 peter s2 5
3 peter s2 4
I want to aggregate value based on [user, item]. The new dataframe should be
user item value
0 bob s1 [1,2]
1 peter s2 [5,4]
where value is a list; how can I do that?
print(df.groupby(['user', 'item']).agg(list).reset_index())
user item value
0 bob s1 [1, 2]
1 peter s2 [5, 4]
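A roughly equivalent sketch uses named aggregation, in case you want to keep the keys as columns and control the output column name (values here is just an illustrative name):
# as_index=False keeps 'user' and 'item' as regular columns;
# the named aggregation collects each group's 'value' entries into a list
out = df.groupby(['user', 'item'], as_index=False).agg(values=('value', list))
print(out)
user item values
0 bob s1 [1, 2]
1 peter s2 [5, 4]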
How can I select all rows of a DataFrame where a condition on a column depends on the relationship between every two consecutive entries of that column? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
                       'B': ['spam', 'ham', 'egg', 'foo'],
                       'C': [4, 5, 3, 4]})
>>> df
A B C
0 1 spam 4
1 2 ham 5
2 3 egg 3
3 4 foo 4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]
>>> df2
A B C
1 2 ham 5
3 4 foo 4
There is plenty of great information about slicing and indexing in the pandas docs, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for are the rows where the value stored in C is no longer monotonically declining.
Any help is appreciated!
Use boolean indexing, comparing the column with its shifted values:
print(df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail:
print(df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
Equivalently, to get the rows where C stops declining, compare the diff of the column:
print(df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4
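Detail, mirroring the mask above:
print(df['C'].diff())
0    NaN
1    1.0
2   -2.0
3    1.0
Name: C, dtype: float64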