Is there a way to increment a date field in a pandas DataFrame by the number of working days specified in another column?
import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({'Date': pd.date_range(start=dt.datetime(2020, 7, 1), end=dt.datetime(2020, 7, 10))})
df['Offset'] = np.random.randint(0, 10, len(df))
Date Offset
0 2020-07-01 9
1 2020-07-02 4
2 2020-07-03 0
3 2020-07-04 1
4 2020-07-05 9
5 2020-07-06 0
6 2020-07-07 1
7 2020-07-08 8
8 2020-07-09 9
9 2020-07-10 0
I would expect this to work; however, it throws an error:
df['Date'] + pd.tseries.offsets.BusinessDay(n = df['Offset'])
TypeError: n argument must be an integer, got <class 'pandas.core.series.Series'>
pd.to_timedelta does not support working days.
Like I mentioned in my comment, you are trying to pass an entire Series as an integer. Instead, you want to apply the function row-wise:
df['your_answer'] = df.apply(lambda x:x['Date'] + pd.tseries.offsets.BusinessDay(n= x['Offset']), axis=1)
df
Date Offset your_answer
0 2020-07-01 9 2020-07-14
1 2020-07-02 7 2020-07-13
2 2020-07-03 3 2020-07-08
3 2020-07-04 2 2020-07-07
4 2020-07-05 7 2020-07-14
5 2020-07-06 7 2020-07-15
6 2020-07-07 7 2020-07-16
7 2020-07-08 2 2020-07-10
8 2020-07-09 1 2020-07-10
9 2020-07-10 0 2020-07-10
Line of code broken down:
# notice how this returns every value of that column
df.apply(lambda x:x['Date'], axis=1)
0 2020-07-01
1 2020-07-02
2 2020-07-03
3 2020-07-04
4 2020-07-05
5 2020-07-06
6 2020-07-07
7 2020-07-08
8 2020-07-09
9 2020-07-10
# same thing with `Offset`
df.apply(lambda x:x['Offset'], axis=1)
0 9
1 7
2 3
3 2
4 7
5 7
6 7
7 2
8 1
9 0
Since pd.tseries.offsets.BusinessDay(n=foo_bar) takes an integer and not a Series, we use the two columns together in the apply(). It's as if you were looping each number in the Offset column into the offsets.BusinessDay() function.
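If the row-wise apply() is slow on a large frame, NumPy's business-day arithmetic is vectorized. A sketch of that alternative (not part of the original answer; the your_answer_vec column name is illustrative, and the weekend-start rolling behavior can differ slightly from pandas' BusinessDay):
import numpy as np

# Vectorized business-day shift: roll='forward' first moves weekend start
# dates to the next business day, then applies the offset, so those rows
# can land one day later than pandas' BusinessDay would put them.
dates = df['Date'].to_numpy().astype('datetime64[D]')
df['your_answer_vec'] = np.busday_offset(dates, df['Offset'].to_numpy(),
                                         roll='forward')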
Related
I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in the interview.
This was the exact question asked:
Create a dataset with 3 columns: store_id, date, sales. Create 3 store_ids. Each store_id has 3 consecutive dates. Sales are recorded for 9 rows. We are considering the same 9 dates across all stores. Sales can be any random number.
Write a function that fetches the previous day's sales as output once we give store_id and date as input.
The question can be handled in multiple ways.
If you just want to get the previous row per group, assuming the values are consecutive and sorted by increasing date, use groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(
    store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
    on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note: this makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
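To answer the interview prompt literally (a function that takes store_id and date), here is a minimal lookup sketch; previous_day_sales is a hypothetical helper, and it assumes store_df['date'] has already been parsed with pd.to_datetime as above:
def previous_day_sales(df, store_id, date):
    # Look up the sales recorded for this store on the previous calendar day.
    prev_day = pd.to_datetime(date) - pd.Timedelta(days=1)
    match = df.loc[(df['store_id'] == store_id) & (df['date'] == prev_day), 'sales']
    return match.iloc[0] if not match.empty else None

# With the input shown above, previous_day_sales(store_df, 2, '2023-01-03')
# would return 22.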
Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05',
                            '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
                   'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
                   'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I want to create a new column Group that labels groups with integers starting from 1; each group should have 3 rows, except for the last group, which may have fewer than 3 rows.
The final result will look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How could I achieve that with Pandas or Numpy? Thanks.
My trial code:
n = 3
for g, chunk in df.groupby(np.arange(len(df)) // n):  # avoid shadowing df
    print(chunk.shape)
You are close; assign the grouping array directly to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print(df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
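As a quick sanity check (a small sketch using the same df and n), each group holds at most n rows:
print(df.groupby('Group').size())
# Group
# 1    3
# 2    3
# 3    2
# dtype: int64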
I'm working with a dataset that has monthly information about several users, and each user has a different time range. There is also missing "time" data for each user. What I would like to do is fill in the missing month data for each user, based on that user's time range (from min time to max time, in months).
I've read approaches to similar situations using resample and reindex from here, but I'm not getting the desired output; there is a row mismatch after filling the missing months.
Any help/pointers would be much appreciated.
-Luc
x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'c', 'b'],
    'dt': ['2015-01-01', '2015-02-01', '2016-01-01', '2016-02-01', '2017-01-01',
           '2015-05-01', '2015-07-01', '2016-05-01', '2015-08-01', '2017-03-01',
           '2017-08-01', '2016-09-01'],
    'val': [1, 33, 2, 1, 5, 4, 2, 5, 66, 7, 5, 1]})
   user          dt  val
0     a  2015-01-01    1
1     a  2015-02-01   33
2     b  2016-01-01    2
3     b  2016-02-01    1
4     c  2017-01-01    5
5     a  2015-05-01    4
6     a  2015-07-01    2
7     b  2016-05-01    5
8     a  2015-08-01   66
9     c  2017-03-01    7
10    c  2017-08-01    5
11    b  2016-09-01    1
What I would like to see is: for each user, generate the missing months between that user's min and max dates, and fill 'val' for those months with 0.
Create a DatetimeIndex, so it is possible to use groupby with a custom lambda function and Series.asfreq:
x['dt'] = pd.to_datetime(x['dt'])
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.asfreq('MS', fill_value=0))
.reset_index())
print(x)
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 0
3 a 2015-04-01 0
4 a 2015-05-01 4
5 a 2015-06-01 0
6 a 2015-07-01 2
7 a 2015-08-01 66
8 b 2016-01-01 2
9 b 2016-02-01 1
10 b 2016-03-01 0
11 b 2016-04-01 0
12 b 2016-05-01 5
13 b 2016-06-01 0
14 b 2016-07-01 0
15 b 2016-08-01 0
16 b 2016-09-01 1
17 c 2017-01-01 5
18 c 2017-02-01 0
19 c 2017-03-01 7
20 c 2017-04-01 0
21 c 2017-05-01 0
22 c 2017-06-01 0
23 c 2017-07-01 0
24 c 2017-08-01 5
Or use Series.reindex with the min and max datetimes per group:
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), freq='MS'), fill_value=0))
.rename_axis(('user','dt'))
.reset_index())
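The same logic can also be written as an explicit loop, a sketch for readers who prefer avoiding groupby.apply; it assumes x['dt'] has already been converted with pd.to_datetime:
out = pd.concat(
    {u: g.set_index('dt')['val']
          .reindex(pd.date_range(g['dt'].min(), g['dt'].max(), freq='MS'),
                   fill_value=0)
     for u, g in x.groupby('user')},   # one reindexed Series per user
    names=['user', 'dt']
).reset_index(name='val')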
I have a dataframe like the one below. I would like to sum rows 0 to 4 (every 5 rows) and create another column with the summed value ("new column"). My real dataframe has 263 rows, so the last three rows of each repeating index cycle (rows 10 to 12) are a sum of only three rows. How can I do this using pandas/Python? I have started to learn Python recently. Thanks for any advice in advance!
My data pattern is more complex, as I am using the index as one of my column values, and it repeats like this:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
Use this:
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
.transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
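A sketch of a more explicit variant that marks the last row of each block directly, instead of relying on duplicate detection (which can misfire when two different blocks happen to share the same sum); it assumes the same repeating Row column as above:
cycle = df.groupby(df['Row']).cumcount()   # which repetition of the 0-12 index
block = df['Row'] // 5                     # block of 5 within the cycle
sums = df.groupby([cycle, block])['Data'].transform('sum')
is_last = (df['Row'] % 5 == 4) | (df['Row'] == df['Row'].max())
df['new col'] = sums.where(is_last)        # keep sums only on last block rows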
I have a large DataFrame which is indexed by datetime, in particular by days. I am looking for an efficient function which, for each column, finds the most common non-null value within each week, and outputs a DataFrame, indexed by weeks, consisting of these within-week most common values.
Here is an example. The following DataFrame consists of two weeks of daily data:
0 1
2015-11-12 00:00:00 8 nan
2015-11-13 00:00:00 7 nan
2015-11-14 00:00:00 nan 5
2015-11-15 00:00:00 7 nan
2015-11-16 00:00:00 8 nan
2015-11-17 00:00:00 7 nan
2015-11-18 00:00:00 5 nan
2015-11-19 00:00:00 9 nan
2015-11-20 00:00:00 8 nan
2015-11-21 00:00:00 6 nan
2015-11-22 00:00:00 6 nan
2015-11-23 00:00:00 6 nan
2015-11-24 00:00:00 6 nan
2015-11-25 00:00:00 2 nan
and should be transformed into:
0 1
2015-11-12 00:00:00 7 5
2015-11-19 00:00:00 6 nan
My DataFrame is very large so efficiency is important. Thanks.
EDIT: If possible, can someone suggest a method that would be applicable if the entries are tuples (instead of floats as in my example)?
You can use resample to group your data by the weekly interval. Then count the number of occurrences via pd.value_counts and select the most common with idxmax:
df.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
Edit
Here is another numpy version which is faster than the above solution:
def numpy_mode(series):
    values = series.values
    dropped = values[~np.isnan(values)]
    # check for empty array and return NaN
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(dropped, return_counts=True)
    return uniques[np.argmax(counts)]
df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
And here are the timings based on the dummy data (for further improvements, have a look here):
%%timeit
df2.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 100 loops, best of 3: 18.6 ms per loop
%%timeit
df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 100 loops, best of 3: 3.72 ms per loop
I also tried scipy.stats.mode; however, it was also slower than the numpy solution:
size = 1000
index = pd.date_range(start="2012-12-12", periods=size, freq="D")
dummy = pd.DataFrame(np.random.randint(0, 20, size=(size, 50)), index=index)
print(dummy.head())
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
2012-12-12 18 2 7 1 7 9 16 2 19 19 ... 10 2 18 16 15 10 7 19 9 6
2012-12-13 7 4 11 19 17 10 18 0 10 7 ... 19 11 5 5 11 4 0 16 12 19
2012-12-14 14 0 14 5 1 11 2 19 5 9 ... 2 9 4 2 9 5 19 2 16 2
2012-12-15 12 2 7 2 12 12 11 11 19 5 ... 16 0 4 9 13 5 10 2 14 4
2012-12-16 8 15 2 18 3 16 15 0 14 14 ... 18 2 6 13 19 10 3 16 11 4
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 1 loop, best of 3: 926 ms per loop
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 1 loop, best of 3: 5.84 s per loop
%%timeit
dummy.resample("7D").apply(lambda x: stats.mode(x).mode)
>>> 1 loop, best of 3: 1.32 s per loop
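Regarding the EDIT about tuple entries: np.unique and np.isnan do not work on object arrays of tuples, but Series.mode handles any hashable values. A sketch of a hypothetical generic_mode helper along those lines:
def generic_mode(series):
    # Series.mode works for any hashable values, tuples included.
    s = series.dropna()
    return s.mode().iloc[0] if not s.empty else np.nan

df2.resample("7D").apply(lambda x: x.apply(generic_mode))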