I have a timeseries dataframe like below
ts_ms          a   b   c  x  y  z
1614772770705  10  10  4  1  2  3
1614772770800  10  10  2  1  2  4
1614772770750  10   5  4  1  2  3
I need to create 5-minute buckets and then apply the DataFrame equivalent of the SQL below:
select sum(x), sum(y), sum(z)
group by a, b, c
What I have so far is
import pandas as pd

# convert to datetimes
df['ts_date'] = pd.to_datetime(df['ts_ms'])
# create 5-minute buckets
df.set_index('ts_date').groupby(pd.Grouper(freq='5Min'))
But I am not sure how to apply the SQL equivalent to this dataframe after this point.
Please suggest.
If you need to group by 5-minute buckets together with the a, b, c columns, use a single DataFrame.groupby:
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df1 = df.groupby(['a','b','c',pd.Grouper(freq='5Min',key='ts_date')])[["x", "y", "z"]].sum()
print (df1)
x y z
a b c ts_date
10.0 5.0 4.0 1970-01-01 00:25:00 7 8 9
1970-01-01 00:45:00 7 8 9
10.0 2.0 1970-01-01 00:25:00 8 10 12
4.0 1970-01-01 00:25:00 2 4 6
Alternatively, you can use DataFrame.groupby with DataFrame.resample by 5Min:
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df2 = df.set_index('ts_date').groupby(['a','b','c'])[["x", "y", "z"]].resample('5Min').sum()
print (df2)
x y z
a b c ts_date
10.0 5.0 4.0 1970-01-01 00:25:00 7 8 9
1970-01-01 00:30:00 0 0 0
1970-01-01 00:35:00 0 0 0
1970-01-01 00:40:00 0 0 0
1970-01-01 00:45:00 7 8 9
10.0 2.0 1970-01-01 00:25:00 8 10 12
4.0 1970-01-01 00:25:00 2 4 6
Setup:
# data.csv
ts_ms,a,b,c,x,y,z
1614772770705.,10.,10.,4.,1,2,3
1614772770800.,10.,10.,2.,4,5,6
1614772770750.,10.,5.,4.,7,8,9
1614772770805.,10.,10.,4.,1,2,3
1614772770900.,10.,10.,2.,4,5,6
2714772770850.,10.,5.,4.,7,8,9
Code:
import pandas as pd
def func(grp):
    return grp.groupby(pd.Grouper(freq='5Min'))[["x", "y", "z"]].sum()
df = pd.read_csv("data.csv")
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df.set_index('ts_date', inplace=True)
df.groupby(["a", "b", "c"]).apply(func)
Outputs:
x y z
a b c ts_date
10.0 5.0 4.0 1970-01-01 00:25:00 7 8 9
1970-01-01 00:30:00 0 0 0
1970-01-01 00:35:00 0 0 0
1970-01-01 00:40:00 0 0 0
1970-01-01 00:45:00 7 8 9
10.0 2.0 1970-01-01 00:25:00 8 10 12
4.0 1970-01-01 00:25:00 2 4 6
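A side note: the ts_ms values look like Unix epoch milliseconds. If that is really what they hold (an assumption, not stated in the question), passing unit='ms' to pd.to_datetime gives the actual calendar timestamps instead of the 1970 dates shown above; the grouping itself stays the same:

import pandas as pd

df = pd.read_csv("data.csv")
# assumption: ts_ms holds milliseconds since the Unix epoch
df['ts_date'] = pd.to_datetime(df['ts_ms'], unit='ms')
df1 = df.groupby(['a', 'b', 'c', pd.Grouper(freq='5Min', key='ts_date')])[["x", "y", "z"]].sum()
print(df1)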
I have the following dataframe:
data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1: the sum of the order column per ID over the rows where sort is 1; the rows where sort is 0 get NaN.
2- sum_order_total_0: the sum of the order column per ID over the rows where sort is 0; the rows where sort is 1 get NaN.
3- count_order_date_1: the sum of the order column per ID and per Date_Time (per day) over the rows where sort is 1; the rows where sort is 0 get NaN.
4- count_order_date_0: the sum of the order column per ID and per Date_Time (per day) over the rows where sort is 0; the rows where sort is 1 get NaN.
The expected results should look like the attached photo.
The problem with groupby (and pd.pivot_table) is that they only do half of the job: they give you the numbers, but not in the format that you want. To finalize the format you can use apply.
For the total counts I used:
import numpy as np
import pandas as pd

# Retrieve the totals, similar to the groupby query you provided.
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup ('ID' plus 0 or 1 for sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
    """Select the count that belongs to the correct ID and sort combination."""
    if row['sort'] == sort_value:
        return data[data['ID'] == row['ID']][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains hours, minutes and seconds, which can be stripped out using:
# Since we are interested in a per-day basis, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
Writing a second function to fill in the correct value based on 'ID' and 'date':
def filter_date(data, row, sort_value):
    if row['sort'] == sort_value:
        return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
    return np.NaN
# Applying the above function for both sort values 0 and 1.
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
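As a side note, the same four columns can also be built without the pivot-plus-apply round trip, using groupby().transform('sum') and Series.where. This is a sketch of an alternative, not the approach above, and it assumes data is the frame defined in the question:

import pandas as pd

# per-ID totals of `order`, split by sort value; NaN where the row's sort doesn't match
totals = data.groupby(['ID', 'sort'])['order'].transform('sum')
data['sum_order_total_1'] = totals.where(data['sort'] == 1)
data['sum_order_total_0'] = totals.where(data['sort'] == 0)

# per-ID, per-day totals, again split by sort value
day = pd.to_datetime(data['Date_Time']).dt.normalize()
day_totals = data.groupby(['ID', day, 'sort'])['order'].transform('sum')
data['count_order_date_1'] = day_totals.where(data['sort'] == 1)
data['count_order_date_0'] = day_totals.where(data['sort'] == 0)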
I'm working with a dataset that has monthly information about several users, and each user has a different time range. There is also missing "time" data for each user. What I would like to do is fill in the missing months for each user, based on that user's time range (from min time to max time, in months).
I've read approaches to similar situations using resample and reindex from here, but I'm not getting the desired output / there is a row mismatch after filling the missing months.
Any help/pointers would be much appreciated.
-Luc
I tried using resample and reindex, but I'm not getting the desired output.
x = pd.DataFrame({'user': ['a','a','b','b','c','a','a','b','a','c','c','b'],
                  'dt': ['2015-01-01','2015-02-01','2016-01-01','2016-02-01','2017-01-01','2015-05-01',
                         '2015-07-01','2016-05-01','2015-08-01','2017-03-01','2017-08-01','2016-09-01'],
                  'val': [1,33,2,1,5,4,2,5,66,7,5,1]})
dt user val
0 2015-01-01 a 1
1 2015-02-01 a 33
2 2016-01-01 b 2
3 2016-02-01 b 1
4 2017-01-01 c 5
5 2015-05-01 a 4
6 2015-07-01 a 2
7 2016-05-01 b 5
8 2015-08-01 a 66
9 2017-03-01 c 7
10 2017-08-01 c 5
11 2016-09-01 b 1
What I would like to see is - for each 'id' generate missing months based on min.date and max.date for that id and fill 'val' for those months with 0.
Create a DatetimeIndex, then you can use groupby with a custom lambda function and Series.asfreq:
x['dt'] = pd.to_datetime(x['dt'])
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.asfreq('MS', fill_value=0))
.reset_index())
print (x)
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 0
3 a 2015-04-01 0
4 a 2015-05-01 4
5 a 2015-06-01 0
6 a 2015-07-01 2
7 a 2015-08-01 66
8 b 2016-01-01 2
9 b 2016-02-01 1
10 b 2016-03-01 0
11 b 2016-04-01 0
12 b 2016-05-01 5
13 b 2016-06-01 0
14 b 2016-07-01 0
15 b 2016-08-01 0
16 b 2016-09-01 1
17 c 2017-01-01 5
18 c 2017-02-01 0
19 c 2017-03-01 7
20 c 2017-04-01 0
21 c 2017-05-01 0
22 c 2017-06-01 0
23 c 2017-07-01 0
24 c 2017-08-01 5
Or use Series.reindex with the min and max datetimes per group:
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), freq='MS'), fill_value=0))
.rename_axis(('user','dt'))
.reset_index())
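Another option, assuming at most one row per user and month as in the sample (an assumption worth checking), is groupby together with resample; sum() returns 0 for the empty monthly bins, so no explicit fill is needed:

x['dt'] = pd.to_datetime(x['dt'])
x2 = (x.set_index('dt')
       .groupby('user')['val']
       .resample('MS')
       .sum()
       .reset_index())
print(x2)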
I have data in the following form:
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into
A B C
2 3
1 5 6
2 8 9
3
One way would be to add a blank row to the dataframe and then use shift:
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None
df['A'] = df.A.shift(1)
print (df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
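An equivalent sketch that builds the extra row with reindex instead of assigning through df.loc, so the original frame is left untouched:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# add one extra empty row, then shift only column A down by one
out = df.reindex(range(len(df) + 1))
out['A'] = out['A'].shift(1)
print(out)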
I have the following data frame:
date my_count
--------------------------
2017-01-01 6
2017-01-04 5
2017-01-05 3
2017-01-08 8
I would like to pad the skipped date with my_count = 0, so the padded data frame will look like:
date my_count
--------------------------
2017-01-01 6
2017-01-02 0
2017-01-03 0
2017-01-04 5
2017-01-05 3
2017-01-06 0
2017-01-07 0
2017-01-08 8
Except checking the data frame line by line, is there a more elegant way to do this? Thanks!
1st option: resample.
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.resample('D').sum().fillna(0).reset_index())
date my_count
0 2017-01-01 6.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
2nd option: reindex by date_range.
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.reindex(pd.date_range('2017-01-01', '2017-01-08')).fillna(0))
my_count
2017-01-01 6.0
2017-01-02 0.0
2017-01-03 0.0
2017-01-04 5.0
2017-01-05 3.0
2017-01-06 0.0
2017-01-07 0.0
2017-01-08 8.0
If the values of the DatetimeIndex are unique, you can use asfreq, or reindex by the min and max values of the index (or by its first and last values, if the DatetimeIndex is sorted):
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.asfreq('D', fill_value=0).reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
rng = pd.date_range(df.index.min(), df.index.max())
#alternative
#rng = pd.date_range(df.index[0], df.index[-1])
print(df.reindex(rng, fill_value=0).rename_axis('date').reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
If the DatetimeIndex is not unique, you get:
ValueError: cannot reindex from a duplicate axis
Then you need resample with some aggregate function like mean, or groupby with Grouper, and finally replace NaNs with fillna:
print (df)
date my_count
0 2017-01-01 4 <-duplicate date
1 2017-01-01 6 <-duplicate date
2 2017-01-04 5
3 2017-01-05 3
4 2017-01-08 8
df['date'] = pd.to_datetime(df['date'])
print(df.resample('D', on='date')['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
df = df.set_index('date')
print(df.groupby(pd.Grouper(freq='D'))['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
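If the duplicate dates should be summed instead of averaged, the same pattern works with sum, which also returns 0 for the empty daily bins (a sketch on the date-indexed frame from above):

print(df.resample('D')['my_count'].sum().reset_index())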
I have a dataframe that looks like this:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
I need to get a dataframe which has unique userId, number of ratings by the user and the average rating by the user as shown below:
userId count mean
0 1 3 2.83
1 2 2 4.5
2 3 3 3.5
3 4 2 4.5
4 5 3 4.0
Can someone help?
df1 = df.groupby('userId')['rating'].agg(['count','mean']).reset_index()
print(df1)
userId count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000
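Another way to get flat column names directly is named aggregation (pandas 0.25 and later), which is just a different spelling of the same groupby:

df1 = df.groupby('userId', as_index=False).agg(count=('rating', 'count'),
                                               mean=('rating', 'mean'))
print(df1)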
Drop movieId since we're not using it, groupby userId, and then apply the aggregation methods:
import pandas as pd
df = pd.DataFrame({'userId': [1,1,1,2,2,3,3,3,4,4,5,5,5],
'movieId':[31,1029,3671,10,17,60,110,247,10,112,3,39,104],
'rating':[2.5,3.0,3.0,4.0,5.0,3.0,4.0,3.5,4.0,5.0,4.0,4.0,4.0]})
df = df.drop('movieId', axis=1).groupby('userId').agg(['count','mean'])
print(df)
Which produces:
rating
count mean
userId
1 3 2.833333
2 2 4.500000
3 3 3.500000
4 2 4.500000
5 3 4.000000
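Because this version aggregates the whole remaining frame, the result carries a ('rating', ...) column MultiIndex. If you prefer the flat layout shown in the question, a small follow-up sketch is to drop that level and reset the index:

df.columns = df.columns.droplevel(0)  # keep only 'count' and 'mean'
df = df.reset_index()                 # turn userId back into a column
print(df)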
Here's a NumPy-based approach, using the fact that the userId column appears to be sorted:
import numpy as np
import pandas as pd

unq, tags, count = np.unique(df.userId.values, return_inverse=True, return_counts=True)
mean_vals = np.bincount(tags, df.rating.values) / count
df_out = pd.DataFrame(np.c_[unq, count], columns=['userID', 'count'])
df_out['mean'] = mean_vals
Sample run -
In [103]: df
Out[103]:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
In [104]: df_out
Out[104]:
userID count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000