Excel data extraction using Python

I have an excel file with the following dataset:
date y
1/1/12 0:00 86,580
1/1/12 0:30 86,580
1/1/12 1:00 nan
1/1/12 1:30 86,910
1/1/12 2:00 87,240
1/1/12 2:30 87,130
1/1/12 3:00 nan
1/1/12 3:30 nan
1/1/12 4:00 87,570
1/1/12 4:30 91,400
1/1/12 5:00 91,880
1/1/12 5:30 92,600
1/1/12 6:00 nan
1/1/12 6:30 nan
1/1/12 7:00 nan
1/1/12 7:30 94,160
1/1/12 8:00 94,280
1/1/12 8:30 94,640
The data contains some NaN values. I need to extract the start and end date of each NaN group. Here is what I tried:
import pandas as pd
import numpy as np
from datetime import date, datetime, time, timedelta
import re

df = pd.read_excel(r'test_nan.xlsx', sheet_name='Sheet1', header=0)
nan_index = df.y.index[df.y.apply(np.isnan)]
start = df.y.apply(str).str.findall(r'\d\nnan')
end = begin = df.y.apply(str).str.findall(r'nan\n\d')
Here is what I want to extract:
start end
1/1/12 0:30 1/1/12 1:30
1/1/12 2:30 1/1/12 4:00
1/1/12 5:30 1/1/12 7:30
Please find the attached Excel file: test_nan.xlsx

Based on @jezrael's answer here, we can continue to get the task done.
m = df['y'].isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)]]).sort_index()
m.shift() selects the rows immediately after a NaN (plus some NaNs where NaNs occur in groups), and m.shift(-1) selects the rows immediately before a NaN (again with some extra NaNs inside groups). Then we drop all remaining NaNs with:
df = df.dropna(subset=['y'])
Now we have rows with alternating values, start and end.
res = pd.DataFrame({
    'start': df['date'][::2].values,
    'end':   df['date'][1::2].values
})
Output res:
start end
0 2012-01-01 00:30:00 2012-01-01 01:30:00
1 2012-01-01 02:30:00 2012-01-01 04:00:00
2 2012-01-01 05:30:00 2012-01-01 07:30:00
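For reference, a minimal end-to-end sketch of the same idea, assuming the sample file and column names from the question (date and y) and that the series neither starts nor ends inside a NaN run:

import pandas as pd

df = pd.read_excel('test_nan.xlsx', sheet_name='Sheet1', header=0)
df['date'] = pd.to_datetime(df['date'])

m = df['y'].isna()                           # True on NaN rows
before = m.shift(-1, fill_value=False) & ~m  # last valid row before each NaN run
after = m.shift(fill_value=False) & ~m       # first valid row after each NaN run

edges = df.loc[before | after, 'date']       # alternates: start, end, start, end, ...
res = pd.DataFrame({'start': edges[::2].values,
                    'end': edges[1::2].values})
print(res)

The & ~m terms drop the extra NaNs inside each run up front, so the dropna step is not needed here.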

Related

Replace timeseries missing values with previous years value

As the title suggests, I have an hourly df that looks like this:
date_time traffic_volume
date_time
2012-10-02 09:00:00 2012-10-02 09:00:00 5545.0
2012-10-02 10:00:00 2012-10-02 10:00:00 4516.0
2012-10-02 11:00:00 2012-10-02 11:00:00 NaN
2012-10-02 12:00:00 2012-10-02 12:00:00 NaN
2012-10-02 13:00:00 2012-10-02 13:00:00 NaN
2012-10-02 14:00:00 2012-10-02 14:00:00 NaN
2012-10-02 15:00:00 2012-10-02 15:00:00 5584.0
2012-10-02 16:00:00 2012-10-02 16:00:00 6015.0
I imputed the majority of the NaNs using
df['traffic_volume'] = df['traffic_volume'].interpolate(method='time')
The problem now is that for a certain subset of the time series (the remaining NaNs), I want to impute using the value from the same day and time one year earlier.
I used
df['traffic_volume'] = df.apply(
    lambda x: df.loc[x['date_time'] + pd.offsets.DateOffset(years=-1)]['traffic_volume']
              if x['traffic_volume'] == np.NaN
              else x['traffic_volume'],
    axis=1)
The code ran, but my NaNs weren't imputed. My question is why? And if there is a better way, what is it?
Thank you.
P.S. The reason I don't want to use bfill, ffill, or interpolate is that the runs of NaNs are too long and the data would lose granularity.
The fix is to use pd.isna(x['traffic_volume']) instead of x['traffic_volume'] == np.NaN for the if condition in the lambda. The original line ran but imputed nothing because NaN compares unequal to everything, including itself: np.nan == np.nan evaluates to False, so the condition was never True and apply returned every row unchanged.
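A minimal sketch of the corrected call, under the same assumption as the original code (a row exists exactly one year earlier for every remaining NaN):

df['traffic_volume'] = df.apply(
    lambda x: df.loc[x['date_time'] + pd.offsets.DateOffset(years=-1)]['traffic_volume']
              if pd.isna(x['traffic_volume'])  # a NaN test that actually works
              else x['traffic_volume'],
    axis=1)

If date_time is a DatetimeIndex, a vectorized alternative that avoids the row-wise apply is df['traffic_volume'].fillna(df['traffic_volume'].shift(freq=pd.DateOffset(years=1))).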

Pandas Merge with interpolation

I have two dataframes df1 and df2
df1
Date/Time S
1/1/2012 0:00 7.51
1/1/2012 1:00 7.28
1/1/2012 2:00 6.75
1/1/2012 3:00 15.00
1/1/2012 4:00 8.18
1/1/2012 5:00 0.00
1/1/2012 6:00 5.00
df2
S Val
3.00 30
4.00 186
5.00 406
6.00 723
7.00 1169
8.00 1704
9.00 2230
10.00 2520
11.00 2620
12.00 2700
I would like to merge the two dataframes with an interpolated Val.
pd.merge(df1, df2, left_on=['S'], right_on=['S'])
For example:
df1's S column is the lookup value, column S in df2 is the lookup range, and the output range is column Val.
Values of S below 3 or above 12 should map to 0.
The output should be as shown below. How can I achieve this in pandas? Any alternative solution in Python other than looping is much appreciated.
Output
Date/Time S Val
1/1/2012 0:00 7.51 1441.9
1/1/2012 1:00 7.28 1318.8
1/1/2012 2:00 6.75 1057.5
1/1/2012 3:00 15.00 0.0
1/1/2012 4:00 8.18 1798.7
1/1/2012 5:00 0.00 0.0
1/1/2012 6:00 5.00 406.00
Assuming df2 is sorted by column S, you can do:
import numpy as np

# floor(S) is the lookup key into df2; tmp2 is the Val step between consecutive
# rows of df2 (the slope, since S increases by 1)
tmp = (df1.assign(tmp=df1.S.apply(np.floor))
          .merge(df2.assign(tmp2=df2.Val.shift(-1) - df2.Val),
                 how='outer', left_on='tmp', right_on='S'))
# rows with no match in df2 (S below 3 or above 12) end up with Val = 0
tmp.loc[tmp.Val.isna(), 'S_x'] = 0
tmp['Val'] = (tmp['S_x'] - tmp['S_y'].fillna(0)) * tmp['tmp2'].fillna(1) + tmp['Val'].fillna(0)
print(tmp[['Date/Time', 'S_x', 'Val']].dropna()
         .sort_values(by='Date/Time')
         .rename(columns={'S_x': 'S'}))
Prints:
Date/Time S Val
0 1/1/2012 0:00 7.51 1441.85
1 1/1/2012 1:00 7.28 1318.80
2 1/1/2012 2:00 6.75 1057.50
3 1/1/2012 3:00 15.00 0.00
4 1/1/2012 4:00 8.18 1798.68
5 1/1/2012 5:00 0.00 0.00
6 1/1/2012 6:00 5.00 406.00
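Alternatively, numpy's np.interp does this lookup-with-interpolation in one call, and its left/right arguments cover the out-of-range rule (it likewise assumes df2.S is sorted ascending):

import numpy as np

# linear interpolation of Val at each S; anything below 3 or above 12 maps to 0
df1['Val'] = np.interp(df1['S'], df2['S'], df2['Val'], left=0, right=0)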

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find the daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate by Grouper with daily frequency, using idxmax and
idxmin to get the timestamps of each day's max and min temperature:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
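If you prefer readable column names, newer pandas (0.25+) supports named aggregation; a sketch of the same computation:

out = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(
    tmax_time='idxmax',  # timestamp of the daily maximum
    tmin_time='idxmin',  # timestamp of the daily minimum
    tmax='max',
    tmin='min',
)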

Transform CSV structure with pandas dataframe

My CSV contains rows such as:
entryTime entryPrice exitTime exitPrice
06/01/2009 04:00 93.565 06/01/2009 06:00 93.825
I want to load them into a DataFrame that has two rows per CSV row, in the following format:
datetime signal price
06/01/2009 04:00 entry 93.565
06/01/2009 06:00 exit 93.825
indexed by the datetime column. What would be a fast way to do it?
Use numpy.tile with numpy.ravel:
print (df)
entryTime entryPrice exitTime exitPrice
0 01/01/2009 04:00 90.565 02/01/2009 06:00 91.825
1 03/01/2009 04:00 92.565 04/01/2009 06:00 93.825
2 05/01/2009 04:00 94.565 06/01/2009 06:00 95.825
3 07/01/2009 04:00 96.565 08/01/2009 07:00 97.825
4 09/01/2009 04:00 98.565 10/01/2009 06:00 99.825
a = np.tile(['entry','exit'], len(df))             # alternating signal labels
b = df[['entryTime','exitTime']].values.ravel()    # timestamps, interleaved row by row
c = df[['entryPrice','exitPrice']].values.ravel()  # prices, interleaved the same way
df = pd.DataFrame({'price':c, 'signal':a},
                  index=pd.to_datetime(b),
                  columns=['signal','price'])
print (df)
signal price
2009-01-01 04:00:00 entry 90.565
2009-02-01 06:00:00 exit 91.825
2009-03-01 04:00:00 entry 92.565
2009-04-01 06:00:00 exit 93.825
2009-05-01 04:00:00 entry 94.565
2009-06-01 06:00:00 exit 95.825
2009-07-01 04:00:00 entry 96.565
2009-08-01 07:00:00 exit 97.825
2009-09-01 04:00:00 entry 98.565
2009-10-01 06:00:00 exit 99.825
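A more explicit, if more verbose, alternative is to split the frame into entry and exit halves, rename, and concatenate; a sketch assuming the same column names:

entries = (df[['entryTime', 'entryPrice']]
           .rename(columns={'entryTime': 'datetime', 'entryPrice': 'price'})
           .assign(signal='entry'))
exits = (df[['exitTime', 'exitPrice']]
         .rename(columns={'exitTime': 'datetime', 'exitPrice': 'price'})
         .assign(signal='exit'))
out = (pd.concat([entries, exits])
         .assign(datetime=lambda d: pd.to_datetime(d['datetime']))
         .set_index('datetime')
         .sort_index()[['signal', 'price']])

sort_index restores the interleaved chronological order, as long as each entry precedes its exit.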

Setting start time from previous night without dates from CSV using pandas

I would like to run time series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can the time be set so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: the date in the last line of code is added automatically (it appears when df['Time'] is executed), so I used the same date-plus-time format for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should treat your timestamps as pd.Timedelta values and add a day to the samples that fall before your start time.
Create some example data:
import numpy as np
import pandas as pd

d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])  # pd.np was removed in pandas 2.0
df.to_csv('data.csv')
df
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read the file back in, make the index a timedelta, add a day to the timedeltas that fall before the start time, then assign the result back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
df2.index = pd.to_datetime(s)
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day (pd.to_datetime anchors timedeltas at the Unix epoch, 1970-01-01, so adding the offset between your target date and the epoch shifts every timestamp):
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
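Equivalently, you can skip the epoch arithmetic and add the timedeltas to the chosen start date directly; a sketch under the same assumptions (the example file data.csv and a 22:00:00 start):

import pandas as pd

start = pd.Timestamp('2015-06-06')
df2 = pd.read_csv('data.csv', index_col=0)
td = pd.to_timedelta(df2.index)
# times before 22:00 belong to the next calendar day
td = td.where(td >= pd.Timedelta('22:00:00'), td + pd.Timedelta('1d'))
df2.index = start + td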
