Looping through a pandas DataFrame accessing previous elements - python-3.x

I have a DataFrame with two columns, FirstColumn & SecondColumn.
How do I create a new column containing the correlation coefficient of the two columns, row by row, over the previous 5 periods?
For example, the 5th row would hold the correlation coefficient of the two columns over rows 1-5, the 6th row the coefficient over rows 2-6, and so on.
Additionally, what method is the most efficient when looping through a DataFrame, having to access previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0

You can do:
df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
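
For context, calling .corr() on a DataFrame's rolling window computes pairwise correlations and returns a MultiIndexed result, which is what the .loc[(slice(None), "SecondColumn")] slice is unpacking. A quick sketch to inspect the intermediate, reusing df from the question:
rc = df.rolling(5, min_periods=1).corr()
print(rc.head(4))
# The outer index level is the row position, the inner level the column name;
# ["FirstColumn"].loc[(slice(None), "SecondColumn")] then pulls out the
# FirstColumn-vs-SecondColumn series, one value per original row.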

You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:
for i in range(1, 6):
    df['FirstCol_lag' + str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g.
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach is to avoid an explicit loop and do it this way, but if the data is very large the extra lag columns may not be feasible. iterrows() is another option; you can test which is faster if you really care, although with it you'd have to offset the row index manually and it would take more code.
Either way, you'll have to be careful about handling NaNs, because the shifted values will be null for the first n rows of your dataframe.
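
If you only need this one pairwise correlation, Series.rolling(...).corr(other) takes a second series directly and avoids both the MultiIndex slicing and the lag columns. A minimal sketch, assuming the column names from the question:
# pairwise rolling Pearson correlation over a 5-row window;
# min_periods=1 mirrors the behaviour of the first answer
df["corr"] = df["FirstColumn"].rolling(5, min_periods=1).corr(df["SecondColumn"])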

Related

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time-series data. What I want to do is efficiently fill all the missing values in different columns by substituting a median value computed over a timedelta of, say, N minutes. E.g. if for a column I have data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in dataframe:
    find indices which have a NaN value
    for each index with a NaN value:
        extract all values using between_time with index - timedelta and index + timedelta
        find the median of the extracted values
        set the value at that index to the extracted median
This is two nested for loops and not very efficient. Is there an efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with rolling window set to center:
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
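
Note the question asked for the median while the snippet above fills with the rolling mean; Rolling also supports .median() with the same arguments, so the swap is a one-line change (a sketch, reusing s and n from above):
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))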

How to replace the missing values with average of ffill() and bfill() in pandas?

This is a sample dataframe and it contains NA:
x y z datetime
0 2 3 4 02-02-2019
1 NA NA NA 03-02-2019
2 3 5 7 04-02-2019
3 NA NA NA 05-02-2019
4 4 7 9 06-02-2019
Now, I want to fill these NA values, and I can do this by using either ffill() or bfill(). But what if I want to apply the average of ffill() & bfill()? How can I do this?
The direct average df = (df.ffill() + df.bfill()) / 2 didn't work because of the datetime column.
The end dataframe should look like this:
x y z datetime
0 2 3 4 02-02-2019
1 2.5 4 5.5 03-02-2019
2 3 5 7 04-02-2019
3 3.5 6 8 05-02-2019
4 4 7 9 06-02-2019
Check with df.interpolate:
df.interpolate()
x y z datetime
0 2.0 3.0 4.0 02-02-2019
1 2.5 4.0 5.5 03-02-2019
2 3.0 5.0 7.0 04-02-2019
3 3.5 6.0 8.0 05-02-2019
4 4.0 7.0 9.0 06-02-2019
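
interpolate() matches the desired output here because a linear interpolation between two known neighbours is exactly their average. If you specifically want the ffill/bfill average from the question, restricting it to the numeric columns sidesteps the datetime problem - a sketch, with column names assumed from the question:
num = df.select_dtypes('number').columns
# average of forward- and backward-fill, leaving the datetime column untouched
df[num] = (df[num].ffill() + df[num].bfill()) / 2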

pandas how to get the date of n-day before monthend

Suppose I have a dataframe in which the first column is the stock trading date. I represent the date with a number here for convenience.
data = pd.DataFrame({'date': [1,2,3,1,2,3,4,1,2,3],
                     'value': range(1, 11)})
I have another dataframe which contains the month-end dates, so first I can get the month-end rows from data, like this:
date value
2 3.0 3.0
6 4.0 7.0
9 3.0 10.0
I want to get the data from n days before each month-end, for example, 1 day before:
date value
1 2.0 2.0
5 3.0 6.0
8 2.0 9.0
How can I code this?
I am using cumsum with groupby:
df1=data.groupby(data.date.eq(1).cumsum()).tail(1)
df1
Out[208]:
date value
2 3 3
6 4 7
9 3 10
df2=data.loc[df1.index-1]
df2
Out[213]:
date value
1 2 2
5 3 6
8 2 9
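
Since df1.index holds the positional labels of the month-end rows, stepping back n days is just an index offset. A sketch for general n, assuming every month has more than n trading days:
n = 1  # days before each month-end
dfn = data.loc[df1.index - n]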

Pandas DataFrame Apply Efficiency

I have a dataframe to which I want to add a column with a kind of status, indicating whether there is a matching value in another dataframe. I have the following code, which works:
df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))
I know the line is ugly, but I get the impression that it's inefficient. Can you suggest a better way to make this comparison?
You can use np.where, isin, and isnull:
Create some dummy data:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn': np.random.randint(10, 20, 20)})
df.iloc[4] = np.nan  # create missing data
df2 = pd.DataFrame({'ComparisonColumn': np.random.randint(15, 30, 20)})
Do matching with np.where:
df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched',
                           np.where(df.ComparisonColumn.isnull(), 'Missing', ''))
Output:
ComparisonColumn NewColumn
0 12.0
1 12.0
2 16.0 Matched
3 11.0
4 NaN Missing
5 19.0 Matched
6 16.0 Matched
7 11.0
8 10.0
9 11.0
10 19.0 Matched
11 10.0
12 10.0
13 19.0 Matched
14 13.0
15 14.0
16 10.0
17 10.0
18 14.0
19 11.0
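
The speed-up comes from isin performing one vectorized membership test, whereas the apply/any version re-scans df2 once per row. If you want to verify on your own data, a rough timing sketch (numbers will vary by machine and data size):
import timeit
slow = lambda: df.ComparisonColumn.apply(lambda x: any(df2.ComparisonColumn == x))
fast = lambda: df.ComparisonColumn.isin(df2.ComparisonColumn)
print (timeit.timeit(slow, number=100))
print (timeit.timeit(fast, number=100))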

Get number of rows for all combinations of attribute levels in Pandas

I have a dataframe with a bunch of categorical variables; each row corresponds to a product. I wanted to find the number of rows for every combination of attribute levels and decided to run the following:
att1=list(frame_base.columns.values)
f1=att.groupby(att1,as_index=False).size().rename('counts').to_frame()
att1 is the list of all attributes. f1 does not seem to give the correct result, as f1.counts.sum() is not equal to the length of the dataframe before the groupby. Why doesn't this work?
One possible problem is a NaN row, which groupby drops by default, but maybe there is also a typo - you need att instead of frame_base:
import numpy as np
import pandas as pd

att = pd.DataFrame({'A':[1,1,3,np.nan],
                    'B':[1,1,6,np.nan],
                    'C':[2,2,9,np.nan],
                    'D':[1,1,5,np.nan],
                    'E':[1,1,6,np.nan],
                    'F':[1,1,3,np.nan]})
print (att)
A B C D E F
0 1.0 1.0 2.0 1.0 1.0 1.0
1 1.0 1.0 2.0 1.0 1.0 1.0
2 3.0 6.0 9.0 5.0 6.0 3.0
3 NaN NaN NaN NaN NaN NaN
att1=list(att.columns.values)
f1=att.groupby(att1).size().reset_index(name='counts')
print (f1)
A B C D E F counts
0 1.0 1.0 2.0 1.0 1.0 1.0 2
1 3.0 6.0 9.0 5.0 6.0 3.0 1
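
The NaN row is missing from f1 above, which is exactly why the counts don't add up to the original length. On pandas 1.1+ you can keep it as its own group via groupby's dropna parameter (a sketch; check your pandas version):
f1 = att.groupby(att1, dropna=False).size().reset_index(name='counts')
print (f1)
# f1.counts.sum() now equals len(att)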
