Fill missing value in different columns of dataframe using mean or median of last n values - python-3.x

I have a dataframe containing time-series data. What I want to do is efficiently fill all the missing values in the different columns by substituting the median over a timedelta of, say, N minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indexes which have a NaN value
    for each index with a NaN value:
        extract all values using between_time with index - timedelta and index + timedelta
        find the median of the extracted values
        set the value at that index to that median
This is two nested for loops and not very efficient. Is there a more efficient way to do it?
Thanks

IIUC you can resample your time column, then fillna with a rolling window set to center:
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i:02d}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print(df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60s").asfreq()
print(s.fillna(s.rolling(n * 2 + 1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
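The question asked for the median rather than the mean; the same pattern works by swapping in .median() on the rolling window. A minimal runnable sketch (here n counts rows on each side of a gap rather than a timedelta, which is equivalent when the data is at a fixed 1-minute frequency):

```python
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2  # rows on each side of a gap to draw the median from
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i:02d}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)  # punch two holes

# reinstate the missing timestamps, then fill each hole with the
# median of a centered window of 2*n + 1 rows
s = df.set_index("time").resample("60s").asfreq()
filled = s.fillna(s.rolling(n * 2 + 1, min_periods=1, center=True).median())
```

With min_periods=1, a hole is filled as long as at least one neighbor falls inside the window, so no NaN survives here.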

Related

how to update rows based on previous row of dataframe python

I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My data is high-dimensional; I have included a simplified version with just two columns, {price, amount}. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product based on the time index. If a previous date does not exist for a given product, I add NaN.
Can you please tell me whether there is a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
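The accepted one-liner, packaged as a self-contained snippet against the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["11/01/2019", "11/02/2019", "11/03/2019",
                            "11/04/2019", "11/05/2019"]),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})
# first difference within each product; the first row of each group becomes NaN
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
```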

Looping through a pandas DataFrame accessing previous elements

I have a DataFrame with two columns, FirstColumn and SecondColumn.
How do I create a new column containing the correlation coefficient of the two columns over the previous 5 periods, row by row?
For example, the 5th row would hold the correlation of the two columns over the previous 5 periods, the 6th row the correlation over rows 2-6, and so on.
Additionally, what is the most efficient way of looping through a DataFrame when you need to access previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0
You can do:
df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:
for i in range(5):
    df['FirstCol_lag' + str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g.
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach is to avoid an explicit loop and do it this way, though if the data is very large this may not be feasible. iterrows() is an alternative, but it is generally much slower than vectorized operations, and you would have to offset the row index manually, which takes more code.
Still, you'll have to be careful about handling NaNs, because the shift leaves nulls in the first n rows of your dataframe.
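For the rolling correlation itself, pandas also exposes Rolling.corr with another series as its argument, which avoids the MultiIndex slicing in the answer above. A sketch on the sample data (min_periods=2 because a correlation needs at least two points; output matches the table above from row 1 onward):

```python
import pandas as pd

df = pd.DataFrame({"FirstColumn":  [2, 3, 4, 5, 6, 7, 2, 3, 5, 3, 2, 4, 2, 4, 2, 5, 3],
                   "SecondColumn": [1, 3, 4, 5, 2, 6, 2, 3, 9, 2, 3, 2, 2, 2, 4, 3, 1]})
# pairwise rolling correlation over a 5-row window, one value per row
df["corr"] = df["FirstColumn"].rolling(5, min_periods=2).corr(df["SecondColumn"])
```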

Python - Create copies of rows based on column value and increase date by number of iterations

I have a dataframe in Python:
md
Out[94]:
Key_ID ronDt multidays
0 Actuals-788-8AA-0001 2017-01-01 1.0
11 Actuals-788-8AA-0012 2017-01-09 1.0
20 Actuals-788-8AA-0021 2017-01-16 1.0
33 Actuals-788-8AA-0034 2017-01-25 1.0
36 Actuals-788-8AA-0037 2017-01-28 1.0
... ... ...
55239 Actuals-789-8LY-0504 2020-02-12 1.0
55255 Actuals-788-T11-0001 2018-08-23 8.0
55257 Actuals-788-T11-0003 2018-09-01 543.0
55258 Actuals-788-T15-0001 2019-02-20 368.0
55259 Actuals-788-T15-0002 2020-02-24 2.0
I want to create an additional record for every multiday and increase the date (ronDt) by the number of times the record has been duplicated.
For example:
row[0] would repeat one time with the new date reading 2017-01-02.
row[55255] would be repeated 8 times with the corresponding dates ranging from 2018-08-24 - 2018-08-31.
When I did this in VBA, I used loops, and in Alteryx I used multirow functions. What is the best way to achieve this in Python? Thanks.
Here's a way to do it in pandas:
# get the list of dates for each row
df['datecol'] = df.apply(lambda x: pd.date_range(start=x['ronDt'], periods=int(x['multidays']), freq='D'), axis=1)
# expand the lists into new rows
df = df.explode('datecol').drop('ronDt', axis=1)
# rename the columns
df.rename(columns={'datecol': 'ronDt'}, inplace=True)
print(df)
Key_ID multidays ronDt
0 Actuals-788-8AA-0001 1.0 2017-01-01
1 Actuals-788-8AA-0012 1.0 2017-01-09
2 Actuals-788-8AA-0021 1.0 2017-01-16
3 Actuals-788-8AA-0034 1.0 2017-01-25
4 Actuals-788-8AA-0037 1.0 2017-01-28
.. ... ... ...
8 Actuals-788-T15-0001 368.0 2020-02-20
8 Actuals-788-T15-0001 368.0 2020-02-21
8 Actuals-788-T15-0001 368.0 2020-02-22
9 Actuals-788-T15-0002 2.0 2020-02-24
9 Actuals-788-T15-0002 2.0 2020-02-25
# Get the count of duplicates for each row, which corresponds to the multidays col
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'multidays'})
# Assuming the ronDt dtype is str, convert it to a datetime object,
# then sum the ronDt and multidays columns
df['ronDt_new'] = pd.to_datetime(df['ronDt']) + pd.to_timedelta(df['multidays'], unit='d')
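A self-contained sketch of the date_range + explode approach on two of the sample rows; note that multidays must be cast to int before being passed as periods:

```python
import pandas as pd

df = pd.DataFrame({"Key_ID": ["Actuals-788-8AA-0001", "Actuals-788-T11-0001"],
                   "ronDt": pd.to_datetime(["2017-01-01", "2018-08-23"]),
                   "multidays": [1.0, 8.0]})
# build the full run of dates for each row, then expand to one row per date
df["ronDt"] = df.apply(lambda r: pd.date_range(start=r["ronDt"],
                                               periods=int(r["multidays"]),
                                               freq="D").tolist(), axis=1)
df = df.explode("ronDt").reset_index(drop=True)
```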

pandas how to get the date of n-day before monthend

Suppose I have a dataframe in which the first column is the stock trading date. I represent the date with numbers for convenience here.
data = pd.DataFrame({'date': [1,2,3,1,2,3,4,1,2,3],
'value': range(1, 11)})
I have another dataframe which contains the month-end dates, so first I can get the month-end rows from data, like this.
date value
2 3.0 3.0
6 4.0 7.0
9 3.0 10.0
I want to get the data from n days before month end, for example 1 day before:
date value
1 2.0 2.0
5 3.0 6.0
8 2.0 9.0
How can I code this?
You can use cumsum with groupby:
df1=data.groupby(data.date.eq(1).cumsum()).tail(1)
df1
Out[208]:
date value
2 3 3
6 4 7
9 3 10
df2=data.loc[df1.index-1]
df2
Out[213]:
date value
1 2 2
5 3 6
8 2 9
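The same trick generalizes to any n by stepping back n positions from the month-end indexes (assuming, as in this sample, that every month has at least n + 1 rows):

```python
import pandas as pd

data = pd.DataFrame({"date": [1, 2, 3, 1, 2, 3, 4, 1, 2, 3],
                     "value": range(1, 11)})
n = 1  # days before month end
# a new "month" starts whenever date == 1; take the last row of each month
month_end = data.groupby(data["date"].eq(1).cumsum()).tail(1)
# step back n positions from each month-end row
before = data.loc[month_end.index - n]
```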

Replacing values in specific columns in a Pandas Dataframe, when number of columns are unknown

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 or x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, and thus will have to write general code that takes this into account. I do, however, know that the first two columns are ids and names respectively. I have searched Google and Stack Exchange for a solution but haven't been able to find one that solves my specific objective.
My question is: how would one replace values found in the third column and onwards?
My dataframe looks like this:
(image of the dataframe)
1st attempt: I tried this line:
Data[Data > 12.0] = np.nan
This replaced the first two columns with NaN.
2nd attempt: I tried this line:
Data[(Data.iloc[(range(2, Columns))] >= 12) & (Data.iloc[(range(2, Columns))] <= -3)] = np.nan
where
Columns = len(Data.columns)
This is clearly wrong, replacing all values in rows 2 to 6 (Columns = 7).
Any thoughts would be greatly appreciated.
You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply the replacement to those columns: NaN out values above 12 or below -3
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
You can set values to specific columns of a dataframe by using iloc and slicing the columns that you need. Then we can set the values using where
A short example using some random data
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we update the region we want using iloc, slicing from the column indexed 2 to the last column, and keep only the values that satisfy the condition:
df.iloc[:, 2:] = df.iloc[:, 2:].where((df < 7) & (df > 2))
Everything outside the condition is set to NaN:
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
For your data the code would be this
Data.iloc[:,2:] = Data.iloc[:,2:].where((Data <= 12) & (Data >= -3))
Operator clarification
where keeps the values for which the condition is True and replaces everything else with NaN. The setup shown directly above therefore keeps everything between those numbers:
-3 <= Data <= 12
If you instead want to write the condition in terms of the invalid values, you cannot simply flip the comparisons and keep the & operator: a number cannot be both less than -3 and greater than 12 at the same time, so that condition is never True. You need the or operator | instead, together with mask (the inverse of where), which replaces the values where the condition is True:
Data.iloc[:,2:] = Data.iloc[:,2:].mask((Data > 12) | (Data < -3))
So the data is checked on a conditional basis:
Data < -3 or Data > 12
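Putting it together, a self-contained sketch on random data with two leading non-numeric columns (the column names here are made up for illustration); mask, the inverse of where, NaNs out the values where the condition is True:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(-10, 20, (4, 5)).astype(float))
df.insert(0, "name", list("abcd"))  # placeholder id/name columns
df.insert(0, "id", range(4))
# NaN out invalid values (x < -3 or x > 12) in every column after the first two
df.iloc[:, 2:] = df.iloc[:, 2:].mask(lambda x: (x < -3) | (x > 12))
```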
