Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I want to create a new column Group that labels groups with integers starting from 1. Each group should have 3 rows, except for the last group, which may have fewer than 3 rows.
The final result will look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How can I achieve that with pandas or NumPy? Thanks.
My trial code:
n = 3
for g, df in df.groupby(np.arange(len(df)) // n):
    print(df.shape)
You are close: instead of taking the groupby output, assign the integer array you passed to groupby to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print (df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
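For clarity, here is the intermediate step (a small sketch assuming the 8-row df above): the integer division produces a 0-based block index, and adding 1 makes the labels start at 1.
n = 3
blocks = np.arange(len(df)) // n   # array([0, 0, 0, 1, 1, 1, 2, 2])
df['Group'] = blocks + 1           # Group labels: 1, 1, 1, 2, 2, 2, 3, 3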
I have two DataFrames, df and ref_df. All of the date values in df are present in ref_date of ref_df, but not vice versa. For each date in df, I need to pick a ref_date from ref_df based on the following logic:
If a date is repeated more than once and the previous or next ref_date(s) are missing from df, then, working from the edges of the repeated block, allocate rows to the nearest missing previous or next ref_date(s).
If a date is repeated more than once but there are no missing previous/next ref_date(s), then ref_date is the same as date.
There can be missing ref_date(s) that never get used. This happens when the date(s) around a given ref_date are not repeated enough times to fill it.
Example:
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
>>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
>>> df
date qty
0 2020-01-20 0
1 2020-01-20 1
2 2020-01-20 2
3 2020-02-25 3
4 2020-03-18 4
5 2020-03-18 5
6 2020-04-09 6
7 2020-04-12 7
8 2020-04-12 8
9 2020-04-12 9
10 2020-04-12 10
11 2020-04-12 11
12 2020-05-28 12
13 2020-06-01 13
14 2020-06-01 14
15 2020-06-01 15
16 2020-06-28 16
17 2020-06-28 17
>>> ref_df
ref_date
0 2019-12-08
1 2020-01-20
2 2020-02-25
3 2020-03-18
4 2020-04-09
5 2020-04-10
6 2020-04-12
7 2020-04-13
8 2020-04-14
9 2020-05-28
10 2020-05-29
11 2020-05-30
12 2020-06-01
13 2020-06-02
14 2020-06-03
15 2020-06-28
16 2020-06-29
17 2020-07-07
Expected_output:
>>> df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20 # Note: repeated as no gap
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18 # Note: repeated as no gap
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10 # Note: Filling from the edges
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12 # Note: repeated as not enough gap
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30 # Filling nearest previous
14 2020-06-01 14 2020-06-01 # First filling previous
15 2020-06-01 15 2020-06-02 # Filling nearest next
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I am able to get the answer, but it doesn't look like the most efficient way to do it. Could someone suggest a more optimal approach? My current attempt:
ref_df['date'] = ref_df['ref_date']
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
df = df.rename(columns={'ref_date':'nearest_ref_date'})
nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date':'nrd_count'})
nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)
nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)
df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
# TODO: Review this. Capping the loop at 100 iterations to avoid an infinite loop (in case of edge cases)
for _ in range(100):
    df2 = df.copy()
    df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
    df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
    reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
    df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
    df2 = df2.rename(columns={'ref_date': 'new_nrd'})
    df2.loc[(df2['new_nrd'] <= df2['lc']) | (df2['new_nrd'] >= df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
    df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
    df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
    if df.equals(df2):
        break
    df = df2
df = df[['date', 'qty', 'nearest_ref_date']]
df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
# Repeated nearest_ref_date set to nearest ref_date
df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
# Sorting nearest_ref_date within the ref_date group (without changing order of rest of cols).
df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index',axis=1)['nearest_ref_date']
df = df[['date', 'qty', 'nearest_ref_date']]
df
date qty nearest_ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30
14 2020-06-01 14 2020-06-01
15 2020-06-01 15 2020-06-02
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
Is there a way to increment a date field in a pandas DataFrame by the number of working days specified in another column?
import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({'Date': pd.date_range(start=dt.datetime(2020, 7, 1), end=dt.datetime(2020, 7, 10))})
df['Offset'] = np.random.randint(0, 10, len(df))
Date Offset
0 2020-07-01 9
1 2020-07-02 4
2 2020-07-03 0
3 2020-07-04 1
4 2020-07-05 9
5 2020-07-06 0
6 2020-07-07 1
7 2020-07-08 8
8 2020-07-09 9
9 2020-07-10 0
I would expect this to work; however, it throws an error:
df['Date'] + pd.tseries.offsets.BusinessDay(n = df['Offset'])
TypeError: n argument must be an integer, got <class 'pandas.core.series.Series'>
pd.to_timedelta does not support working days.
Like I mentioned in my comment, you are trying to pass an entire Series where an integer is expected. Instead, you want to apply the function row-wise:
df['your_answer'] = df.apply(lambda x:x['Date'] + pd.tseries.offsets.BusinessDay(n= x['Offset']), axis=1)
df
Date Offset your_answer
0 2020-07-01 9 2020-07-14
1 2020-07-02 7 2020-07-13
2 2020-07-03 3 2020-07-08
3 2020-07-04 2 2020-07-07
4 2020-07-05 7 2020-07-14
5 2020-07-06 7 2020-07-15
6 2020-07-07 7 2020-07-16
7 2020-07-08 2 2020-07-10
8 2020-07-09 1 2020-07-10
9 2020-07-10 0 2020-07-10
Line of code broken down:
# notice how this returns every value of that column
df.apply(lambda x:x['Date'], axis=1)
0 2020-07-01
1 2020-07-02
2 2020-07-03
3 2020-07-04
4 2020-07-05
5 2020-07-06
6 2020-07-07
7 2020-07-08
8 2020-07-09
9 2020-07-10
# same thing with `Offset`
df.apply(lambda x:x['Offset'], axis=1)
0 9
1 7
2 3
3 2
4 7
5 7
6 7
7 2
8 1
9 0
Since pd.tseries.offsets.BusinessDay(n=foo_bar) takes an integer and not a Series, we use the two columns together inside apply(). It's as if you are feeding each number in the Offset column into the offsets.BusinessDay() function, one row at a time.
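If apply() is too slow on a large frame, a vectorized alternative you could try is NumPy's np.busday_offset (this is my addition, not part of the original answer). Note that its roll argument handles start dates falling on weekends differently from BusinessDay in some edge cases, so verify the output against your data:
# Dates must be converted to datetime64[D] for np.busday_offset.
# roll='backward' shifts weekend start dates to the preceding business day
# before adding the offset, which approximates BusinessDay's behaviour.
df['your_answer'] = pd.to_datetime(
    np.busday_offset(df['Date'].values.astype('datetime64[D]'),
                     df['Offset'].to_numpy(),
                     roll='backward'))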
I have 4 DataFrames with value counts of the number of occurrences per month.
I want to compare all 4 value counts in one graph, so I can see the visual difference between each month across these four years.
I would like the output to look like the referenced image: a grouped bar chart by month, with one bar per year.
newdf2018.Month.value_counts()
Output:
1 3451
2 3895
3 3408
4 3365
5 3833
6 3543
7 3333
8 3219
9 3447
10 2943
11 3296
12 2909
newdf2017.Month.value_counts()
1 2801
2 3048
3 3620
4 3014
5 3226
6 3962
7 3500
8 3707
9 3601
10 3349
11 3743
12 2002
newdf2016.Month.value_counts()
1 3201
2 2034
3 2405
4 3805
5 3308
6 3212
7 3049
8 3777
9 3275
10 3099
11 3775
12 2115
newdf2015.Month.value_counts()
1 2817
2 2604
3 2711
4 2817
5 2670
6 2507
7 3256
8 2195
9 3304
10 3238
11 2005
12 2008
Create a dictionary of DataFrames, concat the value counts together, then use plot:
dfs = {2015:newdf2015, 2016:newdf2016, 2017:newdf2017, 2018:newdf2018}
df = pd.concat({k:v['Month'].value_counts() for k, v in dfs.items()}, axis=1)
df.plot.bar()
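One optional follow-up (my addition, assuming you want the months in calendar order on the x-axis): value_counts does not return the months sorted, so sort the index before plotting:
df = df.sort_index()   # months 1..12 along the x-axis
df.plot.bar()          # four bars (one per year) for each month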
I have a large DataFrame which is indexed by datetime, specifically by days. I am looking for an efficient function which, for each column, finds the most common non-null value within each week, and outputs a DataFrame, indexed by week, consisting of these within-week most common values.
Here is an example. The following DataFrame consists of two weeks of daily data:
0 1
2015-11-12 00:00:00 8 nan
2015-11-13 00:00:00 7 nan
2015-11-14 00:00:00 nan 5
2015-11-15 00:00:00 7 nan
2015-11-16 00:00:00 8 nan
2015-11-17 00:00:00 7 nan
2015-11-18 00:00:00 5 nan
2015-11-19 00:00:00 9 nan
2015-11-20 00:00:00 8 nan
2015-11-21 00:00:00 6 nan
2015-11-22 00:00:00 6 nan
2015-11-23 00:00:00 6 nan
2015-11-24 00:00:00 6 nan
2015-11-25 00:00:00 2 nan
and should be transformed into:
0 1
2015-11-12 00:00:00 7 5
2015-11-19 00:00:00 6 nan
My DataFrame is very large so efficiency is important. Thanks.
EDIT: If possible, can someone suggest a method that would be applicable if the entries are tuples (instead of floats as in my example)?
You can use resample to group your data by the weekly interval. Then, count the number of occurrences via pd.value_counts and select the most common with idxmax:
df.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
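A side note, not from the original answer: resample("7D") builds 7-day bins anchored at the first timestamp in the index (2015-11-12 here), which is why the output keeps that start date. If you want calendar weeks instead, "W" anchors the bins on week ends (Sunday by default):
df.resample("W").apply(lambda x: x.apply(pd.value_counts).idxmax())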
Edit
Here is another numpy version which is faster than the above solution:
def numpy_mode(series):
    values = series.values
    dropped = values[~np.isnan(values)]
    # check for empty array and return NaN
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(dropped, return_counts=True)
    return uniques[np.argmax(counts)]

df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
And here are the timings based on the dummy data (for further improvements, have a look here):
%%timeit
df2.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 100 loops, best of 3: 18.6 ms per loop
%%timeit
df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 100 loops, best of 3: 3.72 ms per loop
I also tried scipy.stats.mode; however, it was also slower than the numpy solution:
size = 1000
index = pd.date_range(start="2012-12-12", periods=size, freq="D")
dummy = pd.DataFrame(np.random.randint(0, 20, size=(size, 50)), index=index)
print(dummy.head())
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
2012-12-12 18 2 7 1 7 9 16 2 19 19 ... 10 2 18 16 15 10 7 19 9 6
2012-12-13 7 4 11 19 17 10 18 0 10 7 ... 19 11 5 5 11 4 0 16 12 19
2012-12-14 14 0 14 5 1 11 2 19 5 9 ... 2 9 4 2 9 5 19 2 16 2
2012-12-15 12 2 7 2 12 12 11 11 19 5 ... 16 0 4 9 13 5 10 2 14 4
2012-12-16 8 15 2 18 3 16 15 0 14 14 ... 18 2 6 13 19 10 3 16 11 4
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 1 loop, best of 3: 926 ms per loop
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 1 loop, best of 3: 5.84 s per loop
%%timeit
dummy.resample("7D").apply(lambda x: stats.mode(x).mode)
>>> 1 loop, best of 3: 1.32 s per loop
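Regarding the EDIT about tuple entries: the numpy version above relies on np.isnan, which fails on tuples. A minimal sketch of a helper that only needs hashable values (safe_mode is my own name, not part of the original answer), so it also works when the cells hold tuples:
def safe_mode(series):
    # value_counts only requires hashable values, so tuples are fine;
    # dropna removes missing entries before counting
    counts = series.dropna().value_counts()
    return counts.index[0] if len(counts) else np.nan

df2.resample("7D").apply(lambda x: x.apply(safe_mode))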