I would like to estimate a rolling slope on a grouped dataframe.
Let's say that I have the following df:
Date tags weight
22 2004-05-12 a 0.000081
23 2004-05-13 a 0.000073
24 2004-05-14 a 0.000085
25 2004-05-17 a 0.000089
26 2004-05-18 b 0.000034
27 2004-05-19 b 0.000048
......
1000 2004-05-20 b 0.000034
1001 2004-05-21 b 0.000037
1002 2004-05-24 c 0.000043
1003 2004-05-25 c 0.000038
1004 2004-05-26 c 0.000029
How could I calculate a rolling slope over 10 dates and for each group?
I tried:
from scipy.stats import linregress
df['rolling_slope'] = df.groupby('tags').rolling(
    window=10, min_periods=2).apply(lambda v: linregress(v.Date, v.weight))
but it seems that I can't apply the function to a Series
Try:
import numpy as np

df['rolling_slope'] = (df.groupby('tags')['weight']
                         .rolling(window=10, min_periods=2)
                         .apply(lambda v: linregress(np.arange(len(v)), v).slope)
                         .reset_index(level=0, drop=True)
                      )
But this rolls over a fixed number of rows only, not a true 10-day look-back. There is also the option rolling('10D'), but you would need to set the date as the index.
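A minimal sketch of that time-based variant, assuming Date can be parsed by pd.to_datetime and is sorted within each tag (np.arange uses row positions as the x values, as in the snippet above):

import numpy as np
import pandas as pd
from scipy.stats import linregress

# Time-based window: look back 10 calendar days instead of 10 rows.
df['Date'] = pd.to_datetime(df['Date'])
slopes = (df.set_index('Date')
            .groupby('tags')['weight']
            .rolling('10D', min_periods=2)
            .apply(lambda v: linregress(np.arange(len(v)), v).slope))
# slopes is indexed by (tags, Date); align it back onto df as needed,
# e.g. df['rolling_slope'] = slopes.values if df is sorted by tags and Date.
print(slopes)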
I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I put together a snippet with custom data to illustrate the problem below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True)
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del(df['date'])
Now, to create a column 'max' that holds the rolling max() over the last 1 calendar month, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
spendings
date
2021-04-25 40
2021-05-05 3
2021-05-15 2
(I could have followed the list-comprehension approach here: https://stackoverflow.com/a/47199274/235415, but I would like to use pandas' vectorized methods. I have many DataFrames and each is very large; list comprehension is very slow here.)
Q: How can I get a vectorized way of performing a rolling 1-calendar-month max()?
The expected output, i.e. primarily the 'max' column (holding the max value of 'spendings' for the last 1 calendar month), will be something like this:
>>> df
spendings max
date
2021-03-25 15 15
2021-04-05 20 20
2021-04-15 10 20
2021-04-25 40 40
2021-05-05 3 40
2021-05-15 2 40
2021-05-25 2 40
2021-05-27 1 3
The answer will be:
[df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
Out[53]: [15, 20, 20, 40, 40, 40, 40, 3]
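To put those values into the expected 'max' column, a minimal follow-up sketch of the same idea, assuming date is the index as in the question:

# Assign the trailing-1-calendar-month max of 'spendings' to a new 'max' column.
df['max'] = [
    df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max()
    for x in df.index
]
print(df)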
I have the following Pandas DataFrame:
POS Price Cost (...)
10122 100 20
10123 500 5
(...)
I would like to pivot rows and columns, obtaining a single row, with a suffix added to each column name:
Price_POS10122 Cost_POS10122 Price_POS10123 Cost_POS10123 (...)
100 20 500 5
(...)
How can I achieve that?
Let's unstack:
df = df.set_index('POS').unstack().to_frame().T
df.columns = [f"{x}_POS{y}" for x, y in df.columns]
output of df:
Price_POS10122 Price_POS10123 Cost_POS10122 Cost_POS10123
0 100 500 20 5
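A small self-contained reproduction of the above, assuming the two sample rows shown in the question:

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'POS': [10122, 10123], 'Price': [100, 500], 'Cost': [20, 5]})

# Stack everything into one row and flatten the resulting MultiIndex columns.
df = df.set_index('POS').unstack().to_frame().T
df.columns = [f"{x}_POS{y}" for x, y in df.columns]
print(df)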
I have a dataframe called Data:
Date Value Frequency
06/01/2020 256 A
07/01/2020 235 A
14/01/2020 85 Q
16/01/2020 625 Q
22/01/2020 125 Q
Here it can be observed that 06/01/2020 and 07/01/2020 fall in the same week (Monday and Tuesday).
Therefore I want to keep only the maximum date from each week.
My final dataframe should look like this:
Date Value Frequency
07/01/2020 235 A
16/01/2020 625 Q
22/01/2020 125 Q
I want the maximum date from each week, as shown in my final dataframe example.
I am new to Python and have been searching for an answer without finding one so far. Please help.
First convert the column to datetimes with to_datetime, group by the week label from Series.dt.strftime, use DataFrameGroupBy.idxmax to get the row with the maximum datetime per week, and finally select those rows with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df['Date'].dt.strftime('%Y-%U'))
0 2020-01
1 2020-01
2 2020-02
3 2020-02
4 2020-03
Name: Date, dtype: object
df = df.loc[df.groupby(df['Date'].dt.strftime('%Y-%U'))['Date'].idxmax()]
print (df)
Date Value Frequency
1 2020-01-07 235 A
3 2020-01-16 625 Q
4 2020-01-22 125 Q
If the format of the datetimes cannot be changed:
d = pd.to_datetime(df['Date'], dayfirst=True)
df = df.loc[d.groupby(d.dt.strftime('%Y-%U')).idxmax()]
print (df)
Date Value Frequency
1 07/01/2020 235 A
3 16/01/2020 625 Q
4 22/01/2020 125 Q
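For completeness, a self-contained reproduction of the first variant, assuming the sample data from the question:

import pandas as pd
from io import StringIO

data = StringIO("""Date Value Frequency
06/01/2020 256 A
07/01/2020 235 A
14/01/2020 85 Q
16/01/2020 625 Q
22/01/2020 125 Q
""")
df = pd.read_csv(data, sep=r"\s+")
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Keep the row with the latest date in each '%Y-%U' week.
df = df.loc[df.groupby(df['Date'].dt.strftime('%Y-%U'))['Date'].idxmax()]
print(df)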
In a dataframe with 6 columns (A B C D E F), one of the columns E and F is a linear combination of the first 4 columns with varying coefficients, while the other is a polynomial function of the same inputs.
Find which column is the linear function and which is the polynomial function.
Here are 30 samples from the dataframe (512 rows in total):
A B C D E F
0 28400 28482 28025 28060 738.0 117.570740
1 28136 28382 28135 28184 -146.0 295.430176
2 28145 28255 28097 28119 30.0 132.123714
3 28125 28192 27947 27981 357.0 101.298064
4 28060 28146 27981 28007 124.0 112.153318
5 27995 28100 27945 28022 149.0 182.427089
6 28088 28195 27985 28019 167.0 141.255137
7 28049 28157 27996 28008 22.0 120.069010
8 28025 28159 28025 28109 34.0 218.401641
9 28170 28638 28170 28614 420.0 919.376358
10 28666 28980 28551 28710 234.0 475.389093
11 28660 28779 28531 28634 345.0 222.895307
12 28590 28799 28568 28783 265.0 425.738484
13 28804 28930 28740 28808 138.0 194.449548
14 28770 28770 28650 28719 378.0 69.289005
15 28769 28770 28600 28638 413.0 39.225874
16 28694 28866 28674 28847 214.0 346.158401
17 28843 28928 28807 28874 121.0 152.281425
18 28921 28960 28680 28704 491.0 63.234310
19 28683 28950 28628 28905 397.0 547.115621
20 28877 28877 28712 28749 404.0 37.212629
21 28685 29011 28680 28949 222.0 598.104568
22 29045 29180 29045 29111 -3.0 201.306765
23 29220 29499 29216 29481 259.0 546.566915
24 29439 29485 29310 29376 344.0 112.394063
25 29319 29345 28951 29049 906.0 125.333702
26 29001 29009 28836 28938 526.0 110.611943
27 28905 28971 28851 28917 174.0 132.274514
28 28907 28916 28711 28862 685.0 161.078158
29 28890 29025 28802 28946 329.0 280.114923
I performed linear regression on all 512 rows.
Columns A, B, C, D as input, column E as target values.
Output:
Intercept [-2.67164069e-12]
coefficients[[ 2. 3. -1. -4.]]
Columns A, B, C, D as input, column F as target values.
Output:
Intercept [0.32815962]
coefficients[[ 1.01293825 -1.0003835 1.00503772 -1.01765453]]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
For column E
x = df.iloc[:, :4].values
y = df.iloc[:, [4]].values
regressor = LinearRegression()
regressor.fit(x, y)
print(regressor.intercept_)
print(regressor.coef_)
output
[-2.67164069e-12]
[[ 2. 3. -1. -4.]]
For column F
x_new = df.iloc[:, :4].values
y_new = df.iloc[:, [5]].values
regressor_new = LinearRegression()
regressor_new.fit(x_new, y_new)
print(regressor_new.intercept_)
print(regressor_new.coef_)
output
[0.32815962]
[[ 1.01293825 -1.0003835 1.00503772 -1.01765453]]
One of the two columns is a linear combination of the first 4 columns with varying coefficients, while the other is a polynomial function of the same inputs.
State which column is the linear function and which is the polynomial one.
I think the column that is a linear combination can be found by checking the multicollinearity between the columns. The column(s) that are linear combinations of the remaining columns will have a high VIF.
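A minimal sketch of that VIF check, assuming the dataframe is named df and statsmodels is installed; the column that is an exact linear combination of the others should show an extremely large VIF:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute a VIF for every column against all the others.
X = df[['A', 'B', 'C', 'D', 'E', 'F']].astype(float)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)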
Try plotting graphs (histograms) of the two columns and see whether you can identify the function as linear or polynomial from the plots.
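A minimal plotting sketch for that suggestion, assuming matplotlib is available and the dataframe is named df:

import matplotlib.pyplot as plt

# Side-by-side histograms of the two candidate columns.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['E'].plot(kind='hist', bins=30, ax=axes[0], title='E')
df['F'].plot(kind='hist', bins=30, ax=axes[1], title='F')
plt.tight_layout()
plt.show()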
I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give each date in df1 an event_id from df2, but only if the date falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So, there are quite a few things you can do to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? That is, can't df1 and df2 just be lists of tuples or lists of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(). What you get is an array of tuples with basically the same structure as the dataframe.
Now run your algorithm, but instead of operating on pandas dataframes, operate on the arrays of tuples d1 and d2.
Use a third list of tuples d3 in which you store the newly created data (each tuple is a row).
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This will speed up your code significantly, I'd assume by more than 100-1000%. It does increase memory usage though, so if you are low on memory, try to avoid the pandas dataframes altogether, or dereference the unused frames df1 and df2 once you have used them to create the records (and if you run into problems, call gc manually).
EDIT: Here is a version of your code using the procedure above:
# Assumes the records were built with index=False so that Date is field 0,
# e.g. d1 = df1.to_records(index=False) and d2 = df2.to_records(index=False).
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
You can try the df.apply() method to achieve this. Refer to pandas.DataFrame.apply. I think my code will work faster than yours.
My approach:
Merge the two dataframes df1 and df2 into a new one, df3:
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse.
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() method to apply to each row in df3; it carries the most recently seen event_id forward.
new_event_id = np.nan

def set_event_date(row):
    global new_event_id
    # pd.notna is a robust NaN check for the merged event_id column
    if pd.notna(row.event_id):
        new_event_id = row.event_id
    return new_event_id
Apply set_event_date() to each row in df3.
df3['new_event_id'] = df3.apply(set_event_date, axis=1)
The final output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
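Putting the steps above together, a minimal end-to-end sketch of this answer, assuming df1 and df2 are built as in the question (with the event column named event_id):

import numpy as np
import pandas as pd

# Outer-merge, sort by date, then carry the most recently seen event_id forward row by row.
df3 = pd.merge(df1, df2, on='Date', how='outer')
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')

new_event_id = np.nan

def set_event_date(row):
    global new_event_id
    if pd.notna(row.event_id):
        new_event_id = row.event_id
    return new_event_id

df3['new_event_id'] = df3.apply(set_event_date, axis=1)
print(df3)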
Let me know once you have tried my solution and whether it runs faster than yours. Thanks.