pandas: how to get the date n days before month-end - python-3.x

Suppose I have a dataframe in which the first column is the stock trading date. I represent the dates with numbers here for convenience.
data = pd.DataFrame({'date': [1, 2, 3, 1, 2, 3, 4, 1, 2, 3],
                     'value': range(1, 11)})
I have another dataframe which contains the month-end dates, so first I can extract the month-end rows from data like this:
   date  value
2   3.0    3.0
6   4.0    7.0
9   3.0   10.0
I want to get the rows n days before each month-end, for example, 1 day before:
   date  value
1   2.0    2.0
5   3.0    6.0
8   2.0    9.0
How can I code this?

I am using cumsum with groupby: date resets to 1 at the start of each month, so the cumulative sum of date.eq(1) labels each month, and tail(1) takes the last row of each.
df1 = data.groupby(data.date.eq(1).cumsum()).tail(1)
df1
Out[208]:
date value
2 3 3
6 4 7
9 3 10
df2 = data.loc[df1.index - 1]  # step back one row from each month-end
df2
Out[213]:
date value
1 2 2
5 3 6
8 2 9
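To generalize to n days before the month-end, step back n positions instead of 1. A minimal sketch, assuming every month in the data has more than n trading days (otherwise the shifted index would cross into the previous month):

n = 2  # trading days before month-end
df_n = data.loc[df1.index - n]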

Related

pandas - rolling sum last seven days over different rows

Starting from this data frame:
id  date    value
1   01.01.  2
2   01.01.  3
1   01.03.  5
2   01.03.  3
1   01.09.  5
2   01.09.  2
1   01.10.  5
2   01.10.  2
I would like to get a rolling sum of value over the last seven days:
id  date    value
1   01.01.  2
2   01.01.  3
1   01.03.  7
2   01.03.  6
1   01.09.  10
2   01.09.  5
1   01.10.  15
2   01.10.  8
I tried this command, but it does not work:
df['value'] = df.groupby('id')['value'].rolling(7).sum()
Any ideas?
You can do groupby and apply. rolling(7) counts rows, not days; for a time-based window, convert date to datetime and roll over a '7D' offset:
df['date'] = pd.to_datetime(df['date'], format='%m.%d.')
df['value'] = (df.groupby('id', as_index=False, group_keys=False)
                 .apply(lambda g: g.rolling('7D', on='date')['value'].sum()))
Note that for 1900-01-10 the rolling window covers 1900-01-04 through 1900-01-10, so the 1900-01-03 rows fall outside it; that is why the last two rows below differ from the expected output in the question.
print(df)
id date value
0 1 1900-01-01 2.0
1 2 1900-01-01 3.0
2 1 1900-01-03 7.0
3 2 1900-01-03 6.0
4 1 1900-01-09 10.0
5 2 1900-01-09 5.0
6 1 1900-01-10 10.0
7 2 1900-01-10 4.0
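For reference, a minimal self-contained sketch of the same approach, assuming the frame is built from the table in the question:

import pandas as pd

df = pd.DataFrame({'id':    [1, 2, 1, 2, 1, 2, 1, 2],
                   'date':  ['01.01.', '01.01.', '01.03.', '01.03.',
                             '01.09.', '01.09.', '01.10.', '01.10.'],
                   'value': [2, 3, 5, 3, 5, 2, 5, 2]})
df['date'] = pd.to_datetime(df['date'], format='%m.%d.')  # year defaults to 1900
df['value'] = (df.groupby('id', as_index=False, group_keys=False)
                 .apply(lambda g: g.rolling('7D', on='date')['value'].sum()))
print(df)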

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains timeseries data. What I want to do is efficiently fill all the missing values in different columns with the median of the values within a timedelta of, say, N minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indices which have NaN values
    for each index with a NaN value:
        extract all values between index - timedelta and index + timedelta using between_time
        find the median of the extracted values
        set the value at that index to the median
This runs two nested loops and is not very efficient. Is there an efficient way to do it?
Thanks
IIUC you can resample on your time column, then fillna with a centered rolling window:
import numpy as np
import pandas as pd

# dummy data setup
np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
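Note the question asks for the median rather than the mean; swapping .mean() for .median() in the rolling call gives that. A minimal sketch under the same assumptions (a regular 1-minute grid and a window of n minutes on either side):

s = df.set_index("time").resample("60S").asfreq()
filled = s.fillna(s.rolling(n * 2 + 1, min_periods=1, center=True).median())

Because fillna aligns on both index and columns, the same line works unchanged when the frame has several value columns.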

Looping through a pandas DataFrame accessing previous elements

I have a DataFrame with two columns, FirstColumn & SecondColumn.
How do I create a new column containing the rolling correlation coefficient of the two columns over the previous 5 periods, row by row?
For example, the 5th row would hold the correlation of the two columns over the 5 periods up to that row, the 6th row the correlation over rows 2-6, and so on.
Additionally, what is the most efficient way to loop through a DataFrame when you need to access previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0
You can do:
df["corr"] = df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
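An equivalent, arguably more readable form uses Series.rolling().corr() with the other column as its argument, avoiding the MultiIndex slicing; a sketch of the same computation:

# pairwise rolling correlation between the two columns
df["corr"] = (df["FirstColumn"]
              .rolling(5, min_periods=1)
              .corr(df["SecondColumn"]))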
You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:
for i in range(1, 6):
    df['FirstCol_lag' + str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g. (foo being a placeholder for whatever formula you need):
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach is to avoid an explicit loop and do it this way, but if the data is very large that may not be feasible. The iterrows() function is reasonably fast too; you can benchmark which is quicker if you really care, though with it you would have to offset the row index manually, which takes more code.
Still, you'll have to be careful about handling NaNs, because the shifted columns will be null for the first n rows of your dataframe.
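As a concrete, hypothetical illustration of the lag-column pattern (names invented here), this computes a 5-period rolling mean of FirstColumn from shifted copies with no explicit loop over rows:

import pandas as pd

df = pd.DataFrame({'FirstColumn': [2, 3, 4, 5, 6, 7, 2, 3]})

# lag0..lag4: the row itself plus the four rows above it
lags = pd.concat({f'lag{i}': df['FirstColumn'].shift(i) for i in range(5)}, axis=1)

# the row-by-row formula becomes one vectorized operation over the lag columns;
# mean(axis=1) skips NaN, so early rows average over a shorter window
df['rolling_mean_5'] = lags.mean(axis=1)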

python-3: how to create a new pandas column as subtraction of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How can I do that with python-3? Many thanks.
Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
Which is the column 'x' shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
Which gives the subtracted values, but the first row is NaN again. To clear this, we use .fillna(), telling it to take the value from df['x'] wherever a null value is encountered.
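Equivalently, .diff() is shorthand for subtracting a shifted copy, so the whole operation collapses to one line:

# .diff() computes df['x'] - df['x'].shift(1)
df['y'] = df['x'].diff().fillna(df['x'])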

Replacing values in specific columns in a Pandas Dataframe, when number of columns are unknown

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 and x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, so the code has to be general enough to take this into account. I do know, however, that the first two columns are ids and names respectively. I have searched Google and Stack Exchange but haven't found a solution to this specific problem.
My question is: how would one replace values found in the third column and onwards?
My dataframe looks like this:
[image: Data]
First attempt: I tried this line:
Data[Data > 12.0] = np.nan
but this replaced values in the first two columns with NaN as well.
Second attempt: I tried this line:
Data[(Data.iloc[(range(2, Columns))] >= 12) & (Data.iloc[(range(2, Columns))] <= -3)] = np.nan
where
Columns = len(Data.columns)
This is clearly wrong, replacing all values in rows 2 to 6 (Columns = 7).
Any thoughts would be greatly appreciated.
Python 3.6.1 64bits, Qt 5.6.2, PyQt5 5.6 on Darwin
You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply mask to those columns
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)  # strict: -3 itself is valid
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
You can set values in specific columns of a dataframe by using iloc to slice the columns that you need. Then we can set the values using where.
A short example using some random data
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we select the region we want to update using iloc, slicing from column index 2 to the last column:
df.iloc[:,2:] = df.iloc[:,2:].where((df < 7) & (df > 2))
Values where the condition is False are set to NaN:
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
For your data the code would be this (the condition is built from the sliced columns too, since the first two columns are non-numeric):
Data.iloc[:, 2:] = Data.iloc[:, 2:].where((Data.iloc[:, 2:] <= 12) & (Data.iloc[:, 2:] >= -3))
Operator clarification
The setup shown directly above keeps everything satisfying
-3 <= Data <= 12
The two conditions are joined with & because a valid value must satisfy both at once. If we instead describe the invalid values, & no longer works: a number cannot be less than -3 and greater than 12 at the same time. So we join them with the or operator | and use mask, the complement of where, to drop them:
Data.iloc[:, 2:] = Data.iloc[:, 2:].mask((Data.iloc[:, 2:] > 12) | (Data.iloc[:, 2:] < -3))
So the data is replaced wherever
Data < -3 or Data > 12
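A minimal self-contained sketch tying the two forms together (column names invented for illustration):

import numpy as np
import pandas as pd

data = pd.DataFrame({'id':   [1, 2, 3],
                     'name': ['a', 'b', 'c'],
                     'v1':   [-5.0, 0.0, 13.0],
                     'v2':   [4.0, -3.0, 20.0]})

vals = data.iloc[:, 2:]
kept = vals.where((vals <= 12) & (vals >= -3))   # keep valid values
masked = vals.mask((vals > 12) | (vals < -3))    # drop invalid values
assert kept.equals(masked)                       # the two forms agree
data.iloc[:, 2:] = kept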
