Pandas Merge with interpolation - python-3.x

I have two dataframes df1 and df2
df1
Date/Time S
1/1/2012 0:00 7.51
1/1/2012 1:00 7.28
1/1/2012 2:00 6.75
1/1/2012 3:00 15.00
1/1/2012 4:00 8.18
1/1/2012 5:00 0.00
1/1/2012 6:00 5.00
df2
S Val
3.00 30
4.00 186
5.00 406
6.00 723
7.00 1169
8.00 1704
9.00 2230
10.00 2520
11.00 2620
12.00 2700
I would like to merge the two dataframes with an interpolated Val.
pd.merge(df1, df2, left_on=['S'], right_on=['S'])
For example:
The 'S' column of df1 is the lookup value, column 'S' in df2 is the lookup range, and column 'Val' is the output range.
Values of S below 3 or above 12 should map to 0.
The output should be as shown below. How can I achieve this in pandas, or with any alternative solution in Python other than looping?
Output
Date/Time S Val
1/1/2012 0:00 7.51 1441.9
1/1/2012 1:00 7.28 1318.8
1/1/2012 2:00 6.75 1057.5
1/1/2012 3:00 15.00 0.0
1/1/2012 4:00 8.18 1798.7
1/1/2012 5:00 0.00 0.0
1/1/2012 6:00 5.00 406.00

Assuming df2 is sorted by column S, you can do:
import numpy as np

# Bucket each S in df1 by its floor and join it to df2, carrying the step to
# the next Val (tmp2) so the fractional part of S can be interpolated linearly:
tmp = df1.assign(tmp=df1.S.apply(np.floor)).merge(
    df2.assign(tmp2=df2.Val.shift(-1) - df2.Val),
    how='outer', left_on='tmp', right_on='S')
# Rows with no match in df2 (S below 3 or above 12) should yield 0:
tmp.loc[tmp.Val.isna(), 'S_x'] = 0
tmp['Val'] = (tmp['S_x'] - tmp['S_y'].fillna(0)) * tmp['tmp2'].fillna(1) + tmp['Val'].fillna(0)
print(tmp[['Date/Time', 'S_x', 'Val']].dropna().sort_values(by='Date/Time').rename(columns={'S_x': 'S'}))
Prints:
Date/Time S Val
0 1/1/2012 0:00 7.51 1441.85
1 1/1/2012 1:00 7.28 1318.80
2 1/1/2012 2:00 6.75 1057.50
3 1/1/2012 3:00 15.00 0.00
4 1/1/2012 4:00 8.18 1798.68
5 1/1/2012 5:00 0.00 0.00
6 1/1/2012 6:00 5.00 406.00
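
As a shorter alternative sketch, numpy's np.interp does this kind of piecewise-linear table lookup directly, and its left/right arguments cover the "0 outside [3, 12]" requirement (assuming df2 is sorted by S, as above):
import numpy as np

# left/right: value returned for S below df2.S.min() / above df2.S.max()
df1['Val'] = np.interp(df1['S'], df2['S'], df2['Val'], left=0.0, right=0.0)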

Related

Excel data extraction using python

I have an excel file with the following dataset:
date y
1/1/12 0:00 86,580
1/1/12 0:30 86,580
1/1/12 1:00 nan
1/1/12 1:30 86,910
1/1/12 2:00 87,240
1/1/12 2:30 87,130
1/1/12 3:00 nan
1/1/12 3:30 nan
1/1/12 4:00 87,570
1/1/12 4:30 91,400
1/1/12 5:00 91,880
1/1/12 5:30 92,600
1/1/12 6:00 nan
1/1/12 6:30 nan
1/1/12 7:00 nan
1/1/12 7:30 94,160
1/1/12 8:00 94,280
1/1/12 8:30 94,640
The data contains some NaN values. I need to extract the start and end date of each NaN group. Here is what I tried:
import pandas as pd
import numpy as np
from datetime import date, datetime, time, timedelta
import re
df = pd.read_excel(r'test_nan.xlsx', sheet_name='Sheet1', header=0)
nan_index = df.y.index[df.y.apply(np.isnan)]
start = df.y.apply(str).str.findall(r'\d\nnan')
end = begin = df.y.apply(str).str.findall(r'nan\n\d')
Here is what I want to extract:
start end
1/1/12 0:30 1/1/12 1:30
1/1/12 2:30 1/1/12 4:00
1/1/12 5:30 1/1/12 7:30
Please find the attached excel file: test_nan.xlsx
Based on @jezrael's answer here, we can build on it to get your task done:
m = df['y'].isna()
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)]]).sort_index()
m.shift() selects the rows right after a NaN (plus some NaN rows inside each NaN run) and m.shift(-1) selects the rows right before a NaN (again with some extra NaN rows inside each run). Then we get rid of all remaining NaNs with:
df = df.dropna(subset=['y'])
Now we have rows with alternating values, start and end.
res = pd.DataFrame({
    'start': df['date'][::2].values,
    'end': df['date'][1::2].values
})
Output res:
start end
0 2012-01-01 00:30:00 2012-01-01 01:30:00
1 2012-01-01 02:30:00 2012-01-01 04:00:00
2 2012-01-01 05:30:00 2012-01-01 07:30:00
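
If you prefer a single pass without the concat/sort trick, here is a hedged alternative sketch (assuming the same df with columns 'date' and 'y'): label each NaN run with an id, then take the timestamp just before and just after each run:
m = df['y'].isna()
gid = (m != m.shift()).cumsum()[m]                   # one id per NaN run
starts = df['date'].shift()[m].groupby(gid).first()  # row before each run
ends = df['date'].shift(-1)[m].groupby(gid).last()   # row after each run
res = pd.DataFrame({'start': starts.values, 'end': ends.values})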

Creating a datetime range for each category in a dataframe: python/SQL

I have a dataframe with columns ID, Tech, Price and Factor (see below). I want to assign a datetime range to each 'ID' in this dataframe, so I created another datetime dataframe to match my requirements. I have worked with merging dataframes via the pandas.merge function, which requires a common key between the dataframes, but my datetime dataframe does not share any column with the parent dataframe, so there is nothing to merge on. How can I solve this problem?
df:
ID Tech Price Factor
100-10A A 688.3 0.36
100-10B A 123 0.36
200-11A A 543 0.34
450-11B A 688.3 0.34
570-1 B 675 0.31
430-2 B 952 0.28
698-5A C 52.8 0
129-1 D 177.6 0.08
I have created a datetime dataframe (times) that varies hourly.
times:
import pandas as pd
a = pd.date_range(start='2010-01-01 00:00:00', end='2010-01-01 6:00:00', freq='H')
times = pd.DataFrame(a)
times:
0
0 2010-01-01 00:00:00
1 2010-01-01 01:00:00
2 2010-01-01 02:00:00
3 2010-01-01 03:00:00
4 2010-01-01 04:00:00
5 2010-01-01 05:00:00
6 2010-01-01 06:00:00
How can I achieve the date-to-dataframe mapping in such a case? I want my dataframe to look like the one below:
Datetime ID Tech Price Factor
1/1/2010 0:00 100-10A A 688.3 0.36
1/1/2010 1:00 100-10A A 688.3 0.36
1/1/2010 2:00 100-10A A 688.3 0.36
1/1/2010 3:00 100-10A A 688.3 0.36
1/1/2010 4:00 100-10A A 688.3 0.36
1/1/2010 5:00 100-10A A 688.3 0.36
1/1/2010 6:00 100-10A A 688.3 0.36
1/1/2010 0:00 100-10B A 123 0.36
1/1/2010 1:00 100-10B A 123 0.36
1/1/2010 2:00 100-10B A 123 0.36
1/1/2010 3:00 100-10B A 123 0.36
1/1/2010 4:00 100-10B A 123 0.36
1/1/2010 5:00 100-10B A 123 0.36
1/1/2010 6:00 100-10B A 123 0.36
1/1/2010 0:00 200-11A A 543 0.34
1/1/2010 1:00 200-11A A 543 0.34
1/1/2010 2:00 200-11A A 543 0.34
1/1/2010 3:00 200-11A A 543 0.34
1/1/2010 4:00 200-11A A 543 0.34
1/1/2010 5:00 200-11A A 543 0.34
1/1/2010 6:00 200-11A A 543 0.34
1/1/2010 0:00 450-11B A 688.3 0.34
1/1/2010 1:00 450-11B A 688.3 0.34
1/1/2010 2:00 450-11B A 688.3 0.34
1/1/2010 3:00 450-11B A 688.3 0.34
1/1/2010 4:00 450-11B A 688.3 0.34
1/1/2010 5:00 450-11B A 688.3 0.34
1/1/2010 6:00 450-11B A 688.3 0.34
1/1/2010 0:00 570-1 B 675 0.31
1/1/2010 1:00 570-1 B 675 0.31
1/1/2010 2:00 570-1 B 675 0.31
1/1/2010 3:00 570-1 B 675 0.31
1/1/2010 4:00 570-1 B 675 0.31
1/1/2010 5:00 570-1 B 675 0.31
1/1/2010 6:00 570-1 B 675 0.31
Found the solution to my problem. This can be achieved through a cross join. Since there is no key to match on, we temporarily assign a constant key to each of the dataframes and merge on it to perform the cross join:
def cartesian_product_basic(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')

result = cartesian_product_basic(df, times)
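
Since pandas 1.2, the temporary key is unnecessary because merge supports cross joins directly; with the same df and times as above:
result = df.merge(times, how='cross')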

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate with a daily Grouper, using idxmax and idxmin to get the timestamps of the min and max temperature for each day:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
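
Equivalently, once DT is a DatetimeIndex, resample can stand in for the Grouper (a sketch over the same raw df):
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT').resample('D')['T-C'].agg(['idxmax', 'idxmin', 'max', 'min'])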

changing hourly to daily data while there are some missing values in my hourly data

How can I convert my hourly data to daily data when there are some missing values in the hourly data? My Excel sheet looks like this:
date hour ENERGY(MJ)
1/01/2002 0:00 0
1/01/2002 1:00 0
1/01/2002 2:00 0
1/01/2002 3:00 0
1/01/2002 4:00 0
1/01/2002 5:00 0
1/01/2002 6:00 0.15
1/01/2002 7:00 0.74
1/01/2002 8:00 1.46
1/01/2002 9:00 2.23
1/01/2002 10:00 2.89
Thanks
If your data is in A1:C12, you have 5 non-zero hourly readings (totalling 7.47 MJ). They can be summed with SUM(C2:C12), divided by the count of non-zero readings given by COUNTIF(C2:C12,">"&0) to get an average hourly rate from the data available, and then scaled up to a full day by multiplying by 24:
=SUM(C2:C12)*24/COUNTIF(C2:C12,">"&0)
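
If you would rather do this in pandas, here is a rough equivalent sketch, assuming the sheet is read into df with columns 'date', 'hour' and 'ENERGY(MJ)' as in the sample: treat zero readings as missing, average what remains per day, and scale back up to 24 hours.
import pandas as pd

df = pd.read_excel('energy.xlsx')  # hypothetical file name
# The hour column isn't needed for a daily mean; a date index is enough.
s = df.set_index(pd.to_datetime(df['date'], dayfirst=True))['ENERGY(MJ)']
daily = s[s > 0].resample('D').mean() * 24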

Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python

I have a pandas DataFrame with date_time/voltage data like this (df1):
Date_Time Chan
0 20130401 9:00 AAT
1 20130401 10:00 AAT
2 20130401 11:00 AAT
3 20130401 12:00 AAT
4 20130401 13:00 AAT
5 20130401 14:00 AAT
6 20130401 15:00 AAT
I am using this as a prototype to load in data from a much bigger data file and create one DataFrame. The other DataFrame looks like this (df2):
Chan date_time Sens1 Sens2
AAC 01-Apr-2013 09:00 5.17 1281
AAC 01-Apr-2013 10:00 5.01 500
AAC 01-Apr-2013 12:00 5.17 100
AAC 01-Apr-2013 13:00 5.19 41997
AAC 01-Apr-2013 16:00 5.21 2123
AAT 01-Apr-2013 09:00 28.82 300
AAT 01-Apr-2013 10:00 28.35 4900
AAT 01-Apr-2013 12:00 28.04 250
AAE 01-Apr-2013 11:00 3.36 400
AAE 01-Apr-2013 12:00 3.41 200
AAE 01-Apr-2013 13:00 3.40 2388
AAE 01-Apr-2013 14:00 3.37 300
AAE 01-Apr-2013 15:00 3.35 500
AXN 01-Apr-2013 09:00 23.96 6643
AXN 01-Apr-2013 10:00 24.03 1000
AXW 01-Apr-2013 11:00 46.44 2343
So what I want to do is search df2 for all rows that match both columns of df1 (noting the different date formats) and insert the matching data from df2 into df1, like this (df1):
Date_Time Chan Sens1 Sens2
0 20130401 9:00 AAT 28.82 300
1 20130401 10:00 AAT 28.35 4900
2 20130401 11:00 AAT NaN NaN
3 20130401 12:00 AAT 28.04 250
4 20130401 13:00 AAT NaN NaN
5 20130401 14:00 AAT NaN NaN
6 20130401 15:00 AAT NaN NaN
Could you give me some suggestions for python/pandas code matching this pseudocode:
if (df1['date_time'] == df2['date_time']) & (df1['Chan'] == df2['Chan']):
    df1['Sens1'] = df2['Sens1']
    df1['Sens2'] = df2['Sens2']
If it affects the answer: my intention is to bfill and ffill the NaNs, then add this DataFrame to a Panel, and repeat with another channel name in place of AAT.
You can use a plain ol' merge to do this. But first, you should do a little cleanup of your DataFrames to make sure the datetime columns are actually datetimes rather than strings (note: it may be better to do this when reading the data in, e.g. with parse_dates in read_csv):
df1['Date_Time'] = pd.to_datetime(df1['Date_Time'], format='%Y%m%d %H:%M')
df2['date_time'] = pd.to_datetime(df2['date_time'])
Let's also give the two datetime columns the same name:
df1.rename(columns={'Date_Time': 'Datetime'}, inplace=True)
df2.rename(columns={'date_time': 'Datetime'}, inplace=True)
Now a simple merge will give you what you're after:
In [11]: df1.merge(df2)
Out[11]:
Datetime Chan Sens1 Sens2
0 2013-04-01 09:00:00 AAT 28.82 300
1 2013-04-01 10:00:00 AAT 28.35 4900
2 2013-04-01 12:00:00 AAT 28.04 250
In [12]: df1.merge(df2, how='left')
Out[12]:
Datetime Chan Sens1 Sens2
0 2013-04-01 09:00:00 AAT 28.82 300
1 2013-04-01 10:00:00 AAT 28.35 4900
2 2013-04-01 11:00:00 AAT NaN NaN
3 2013-04-01 12:00:00 AAT 28.04 250
4 2013-04-01 13:00:00 AAT NaN NaN
5 2013-04-01 14:00:00 AAT NaN NaN
6 2013-04-01 15:00:00 AAT NaN NaN
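
For the fill step mentioned in the question, a short hedged follow-up: back- and forward-fill the sensor columns of the left-merged result.
merged = df1.merge(df2, how='left')
merged[['Sens1', 'Sens2']] = merged[['Sens1', 'Sens2']].bfill().ffill()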
