Comparing just the time component of two datetime64 columns - python-3.x

I am trying to subtract or compare only the time component of two datetime64 columns but have been unsuccessful. I have tried using strftime with an exception block to catch NaTs, but no luck. Any help is much appreciated. I have attached the Python code below.
Column A Column B
1/1/1900 10:00 NaT
1/1/1900 10:30 NaT
1/1/1900 11:00 NaT
1/1/1900 9:00 2/6/2021 23:59
1/1/1900 11:00 2/6/2021 8:59
1/1/1900 9:30 2/6/2021 16:00
def convert(x):
    try:
        return x.strftime("%H:%M:%S")
    except ValueError:
        return x

df['B'].apply(convert)-df['A'].apply(convert)
I get the error TypeError: unsupported operand type(s) for -: 'NaTType' and 'str'

Convert both columns to pandas datetime using pd.to_datetime. Then subtract them and split the resulting timedeltas into their parts using Series.dt.components:
df['Column A'] = pd.to_datetime(df['Column A'])
df['Column B'] = pd.to_datetime(df['Column B'])
In [213]: (df['Column A'] - df['Column B']).dt.components
Out[213]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 -44232.0 9.0 1.0 0.0 0.0 0.0 0.0
4 -44231.0 2.0 1.0 0.0 0.0 0.0 0.0
5 -44232.0 17.0 30.0 0.0 0.0 0.0 0.0
From the above, you can extract hours, minutes, etc. separately:
In [215]: (df['Column A'] - df['Column B']).dt.components.hours
Out[215]:
0 NaN
1 NaN
2 NaN
3 9.0
4 2.0
5 17.0
Name: hours, dtype: float64
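Note that the components above still carry the date difference (the days column), so the hours value is not a signed time-of-day difference. A minimal sketch, assuming the goal is to compare only the clock time: subtracting each column's midnight (Series.dt.normalize) leaves the time elapsed since midnight as a timedelta, and those timedeltas can be subtracted or compared directly. The frame is rebuilt from part of the question's sample data in ISO format, and time_a, time_b, diff and b_is_later are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Column A": ["1900-01-01 10:00:00", "1900-01-01 09:00:00",
                 "1900-01-01 11:00:00", "1900-01-01 09:30:00"],
    "Column B": [None, "2021-02-06 23:59:00",
                 "2021-02-06 08:59:00", "2021-02-06 16:00:00"],
})
df["Column A"] = pd.to_datetime(df["Column A"])
df["Column B"] = pd.to_datetime(df["Column B"])

# Time since midnight for each value; NaT stays NaT.
time_a = df["Column A"] - df["Column A"].dt.normalize()
time_b = df["Column B"] - df["Column B"].dt.normalize()

# Signed difference of the clock times only (the dates no longer matter).
diff = time_b - time_a

# Element-wise comparison; rows with a missing value compare as False.
b_is_later = time_b > time_a

print(diff)
print(b_is_later)
```

Rows where either column is NaT produce NaT in diff, so they can be filtered afterwards with diff.notna() if needed.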

Related

Keep column when resampling hourly to daily data in pandas

I have a dataset of hourly weather observations in this format:
df = pd.DataFrame({ 'date': ['2019-01-01 09:30:00', '2019-01-01 10:00', '2019-01-02 04:30:00','2019-01-02 05:00:00','2019-07-04 02:00:00'],
'windSpeedHigh': [155,90,35,45,15],
'windSpeedHigh_Dir':['NE','NNW','SW','W','S']})
My goal is to find the highest wind speed each day and the wind direction associated with that maximum daily wind speed.
Using resample, I have successfully found the maximum wind speed for each day, but not its associated direction:
df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily
Results in:
windSpeedHigh_Dir windSpeedHigh
date
2019-01-01 NNW 155.0
2019-01-02 W 45.0
2019-01-03 NaN NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
... ... ...
2019-06-30 NaN NaN
2019-07-01 NaN NaN
2019-07-02 NaN NaN
2019-07-03 NaN NaN
2019-07-04 S 15.0
This is incorrect as this resample is also grabbing the max() for 'windSpeedHigh_Dir'. For 2019-01-01 the direction for the associated windspeed should be 'NE' not 'NNW', because the wind direction df['windSpeedHigh_Dir'] == 'NE' when the maximum wind speed occurred.
So my question is, is it possible for me to resample this dataset from half-hourly to daily maximum wind speed while keeping the wind direction associated with that speed?
First use GroupBy.idxmax to get, for each date, the index label of the row with the maximum wind speed, and select those rows:
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00 155 NE
3 2019-01-02 05:00:00 45 W
4 2019-07-04 02:00:00 15 S
Then, to add a DatetimeIndex with every day present, use DataFrame.set_index with Series.dt.normalize, followed by DataFrame.asfreq:
df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('d')
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
day
2019-01-01 2019-01-01 09:30:00 155.0 NE
2019-01-02 2019-01-02 05:00:00 45.0 W
2019-01-03 NaT NaN NaN
2019-01-04 NaT NaN NaN
2019-01-05 NaT NaN NaN
... ... ...
2019-06-30 NaT NaN NaN
2019-07-01 NaT NaN NaN
2019-07-02 NaT NaN NaN
2019-07-03 NaT NaN NaN
2019-07-04 2019-07-04 02:00:00 15.0 S
[185 rows x 3 columns]
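Your resample solution should also work with a custom function, because idxmax fails for empty groups (days without observations); the resulting index labels are then joined back to the original rows with DataFrame.join: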
f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')
print (df_daily)
idx date windSpeedHigh windSpeedHigh_Dir
date
2019-01-01 0.0 2019-01-01 09:30:00 155.0 NE
2019-01-02 3.0 2019-01-02 05:00:00 45.0 W
2019-01-03 NaN NaT NaN NaN
2019-01-04 NaN NaT NaN NaN
2019-01-05 NaN NaT NaN NaN
... ... ... ...
2019-06-30 NaN NaT NaN NaN
2019-07-01 NaN NaT NaN NaN
2019-07-02 NaN NaT NaN NaN
2019-07-03 NaN NaT NaN NaN
2019-07-04 4.0 2019-07-04 02:00:00 15.0 S
[185 rows x 4 columns]
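To make the difference between the two approaches concrete, here is a small self-contained sketch, assuming the same sample data as the question. It contrasts a column-wise max (each column is maximised independently, so the direction is just the alphabetically largest string) with row selection through idxmax. The names per_column, rows and daily are chosen here for illustration, and groupby on the normalized date stands in for resample to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01 09:30:00", "2019-01-01 10:00:00",
                            "2019-01-02 04:30:00", "2019-01-02 05:00:00",
                            "2019-07-04 02:00:00"]),
    "windSpeedHigh": [155, 90, 35, 45, 15],
    "windSpeedHigh_Dir": ["NE", "NNW", "SW", "W", "S"],
})
day = df["date"].dt.normalize()

# Column-wise max: speed and direction are maximised separately.
per_column = df.groupby(day)[["windSpeedHigh_Dir", "windSpeedHigh"]].max()

# Row-wise selection: idxmax gives the label of the row holding each day's maximum.
rows = df.loc[df.groupby(day)["windSpeedHigh"].idxmax()]
daily = rows.set_index(rows["date"].dt.normalize().rename("day"))

first_day = pd.Timestamp("2019-01-01")
print(per_column.loc[first_day, "windSpeedHigh_Dir"])  # direction unrelated to the max speed
print(daily.loc[first_day, "windSpeedHigh_Dir"])       # direction recorded at the max speed
```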

Error while performing operation on DatetimeIndexResampler type

I have a time-series data frame and want to find the difference between the date in each record and the last (maximum) date within that data frame, but I get the error TypeError: unsupported operand type(s) for -: 'DatetimeIndex' and 'SeriesGroupBy'. The error suggests that the data frame is not of the right type for this operation. How can I avoid this, or cast the data to some other format so that the operation works? Below is sample code which reproduces the error:
import pandas as pd
df = pd.DataFrame([[54.7,36.3,'2010-07-20'],[54.7,36.3,'2010-07-21'],[52.3,38.7,'2010-07-26'],[52.3,38.7,'2010-07-30']],
columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df.index = df.date
df = df.resample('D')
print(type(df))
diff = (df.date.max() - df.date).values
I think you need to create a DatetimeIndex first with DataFrame.set_index and then aggregate after resampling, for example by max, which returns a DataFrame with consecutive daily dates:
df = pd.DataFrame([[54.7,36.3,'2010-07-20'],
[54.7,36.3,'2010-07-21'],
[52.3,38.7,'2010-07-26'],
[52.3,38.7,'2010-07-30']],
columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df1 = df.set_index('date').resample('D').max()
#alternative if not duplicated datetimes
#df1 = df.set_index('date').asfreq('D')
print (df1)
col1 col2
date
2010-07-20 54.7 36.3
2010-07-21 54.7 36.3
2010-07-22 NaN NaN
2010-07-23 NaN NaN
2010-07-24 NaN NaN
2010-07-25 NaN NaN
2010-07-26 52.3 38.7
2010-07-27 NaN NaN
2010-07-28 NaN NaN
2010-07-29 NaN NaN
2010-07-30 52.3 38.7
Then subtract the index from its maximum value and convert the timedeltas to days with TimedeltaIndex.days:
df1['diff'] = (df1.index.max() - df1.index).days
print (df1)
col1 col2 diff
date
2010-07-20 54.7 36.3 10
2010-07-21 54.7 36.3 9
2010-07-22 NaN NaN 8
2010-07-23 NaN NaN 7
2010-07-24 NaN NaN 6
2010-07-25 NaN NaN 5
2010-07-26 52.3 38.7 4
2010-07-27 NaN NaN 3
2010-07-28 NaN NaN 2
2010-07-29 NaN NaN 1
2010-07-30 52.3 38.7 0
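The original error comes from the fact that resample on its own does not return a DataFrame. A minimal sketch of that point, assuming a reduced version of the question's data (only col1 is kept, and the names resampler, daily and days_to_last are chosen here for illustration): the Resampler object is a lazy grouping, and only an aggregation such as max turns it back into a DataFrame whose index supports date arithmetic.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [54.7, 54.7, 52.3, 52.3]},
    index=pd.to_datetime(["2010-07-20", "2010-07-21", "2010-07-26", "2010-07-30"]),
)

# resample alone only describes the grouping; it holds no aggregated data yet.
resampler = df.resample("D")
print(type(resampler).__name__)

# Aggregating produces a regular DataFrame with one row per calendar day.
daily = resampler.max()

# Date arithmetic now works on the DatetimeIndex of the aggregated frame.
days_to_last = (daily.index.max() - daily.index).days
print(list(days_to_last))
```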

How sum function works with NaN element?

I have a DataFrame with some NaN values. In this DataFrame there are some rows with all NaN values. When I apply sum function on these rows, it is returning zero instead of NaN. Code is as follows:
df = pd.DataFrame(np.random.randint(10,60,size=(5,3)),
index = ['a','c','e','f','h'],
columns = ['One','Two','Three'])
df = df.reindex(index=['a','b','c','d','e','f','g','h'])
print(df.loc['b'].sum())
Any suggestions?
By default, the sum function skips NaN values, so a row that contains only NaN sums to 0.
If you want the sum of all-NaN values to be NaN, pass min_count=1:
df.loc['b'].sum(min_count=1)
Output:
nan
If you apply it to all rows (after using reindex), you will get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
If you now replace one NaN value in a row with a number:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
As you can see, the result for row b is now 0, because the row contains one valid value.
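The behaviour can be reproduced without random data. A minimal sketch on two small Series (the names all_nan and partly are chosen here for illustration), showing the default, min_count=1, and skipna=False side by side:

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan, np.nan])

# Default: NaN values are skipped and an empty sum is 0.
default_sum = all_nan.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
strict_sum = all_nan.sum(min_count=1)

# skipna=False lets any NaN propagate into the result.
no_skip_sum = all_nan.sum(skipna=False)

# One valid value (even 0) is enough to satisfy min_count=1.
partly = pd.Series([0.0, np.nan, np.nan])
partly_sum = partly.sum(min_count=1)

print(default_sum, strict_sum, no_skip_sum, partly_sum)
```

The choice between min_count=1 and skipna=False depends on the intent: the former returns NaN only for rows with no valid values at all, while the latter returns NaN as soon as a single value is missing.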

How To create Multiple Columns From Values of The Same Column?

I have a DataFrame, and I want to create new columns based on the values of one of its columns. In each of these new columns, the values should be the number of times a Plate is repeated over time.
So I have this DataFrame:
Val_Tra.Head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
The EURO column looks like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is a count of how often a Plate value repeats for a specific EURO type over a specific time.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem
s=df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df=pd.concat([df,s],axis=1,sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-01 00:00:00 NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
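A variant of the same idea, as a minimal sketch: the top-level pd.get_dummies function accepts a prefix, which produces column names in the EURO_n style of the desired output. The three-row frame below is rebuilt from the question's sample, and the names euro, dummies and out are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Plate": [np.nan,
                  "dcc2f657e897ffef752003469c688381",
                  "a5ac0c2f48ea80707621e530780139ad"],
        "EURO": [np.nan, 0.0, 6.0],
    },
    index=pd.to_datetime(["2013-11-01 00:00:00",
                          "2013-11-01 01:00:00",
                          "2013-11-01 02:00:00"]),
)
df.index.name = "Timestamp"

# Drop missing categories and use integers so the labels read 0, 6 rather than 0.0, 6.0.
euro = df["EURO"].dropna().astype(int)

# One indicator column per EURO value, named EURO_<value>.
dummies = pd.get_dummies(euro, prefix="EURO", dtype=float)

# Left join on the index: rows without a EURO value get NaN in every indicator column.
out = df.join(dummies)
print(out)
```

If zeros should appear as NaN, as in the desired output, dummies.replace(0, np.nan) before the join is one option. Only the EURO values present in the data get a column, so this sample yields EURO_0 and EURO_6.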

merging multiple columns into one columns in pandas

I have a dataframe called ref (the first dataframe) with columns c1, c2, c3 and c4.
ref= pd.DataFrame([[1,3,.3,7],[0,4,.5,4.5],[2,5,.6,3]], columns=['c1','c2','c3','c4'])
print(ref)
c1 c2 c3 c4
0 1 3 0.3 7.0
1 0 4 0.5 4.5
2 2 5 0.6 3.0
I want to create a new column, c5 (in a second dataframe), that holds all the values from columns c1, c2, c3 and c4.
I tried concat and merging columns, but I cannot get it to work.
Please let me know if you have a solution.
You can use unstack to create a Series from the DataFrame and then concat it to the original:
print (pd.concat([ref, ref.unstack().reset_index(drop=True).rename('c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
An alternative way to create the Series is to convert the DataFrame to a NumPy array with values and then flatten it with ravel:
print (pd.concat([ref, pd.Series(ref.values.ravel('F'), name='c5')], axis=1))
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
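The 'F' argument is what makes ravel walk the array column by column, which is the order wanted for c5. A minimal NumPy sketch on a reduced array (the first two columns of ref; the variable names are chosen here for illustration):

```python
import numpy as np

a = np.array([[1, 3],
              [0, 4],
              [2, 5]])

# Default order 'C' (row-major): values are read one row after another.
row_major = a.ravel()

# Order 'F' (column-major): values are read one column after another.
col_major = a.ravel("F")

# Transposing first and flattening row by row gives the same sequence.
via_transpose = a.T.ravel()

print(row_major)
print(col_major)
print(via_transpose)
```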
using join + ravel('F')
ref.join(pd.Series(ref.values.ravel('F')).to_frame('c5'), how='right')
using join + T.ravel()
ref.join(pd.Series(ref.values.T.ravel()).to_frame('c5'), how='right')
pd.concat + T.stack() + rename
pd.concat([ref, ref.T.stack().reset_index(drop=True).rename('c5')], axis=1)
way too many transposes + append (DataFrame.append was removed in pandas 2.0, so this variant needs an older pandas)
ref.T.append(ref.T.stack().reset_index(drop=True).rename('c5')).T
combine_first + ravel('F') <--- my favorite
ref.combine_first(pd.Series(ref.values.ravel('F')).to_frame('c5'))
All yield
c1 c2 c3 c4 c5
0 1.0 3.0 0.3 7.0 1.0
1 0.0 4.0 0.5 4.5 0.0
2 2.0 5.0 0.6 3.0 2.0
3 NaN NaN NaN NaN 3.0
4 NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN 5.0
6 NaN NaN NaN NaN 0.3
7 NaN NaN NaN NaN 0.5
8 NaN NaN NaN NaN 0.6
9 NaN NaN NaN NaN 7.0
10 NaN NaN NaN NaN 4.5
11 NaN NaN NaN NaN 3.0
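For completeness, one more way to get the same result, sketched here with DataFrame.melt, which is not used in the answers above: melt stacks the columns one after another into a single value column, and pd.concat then aligns it with ref on the index. It also avoids DataFrame.append, which pandas 2.0 removed. The names c5 and out are chosen here for illustration.

```python
import pandas as pd

ref = pd.DataFrame([[1, 3, .3, 7], [0, 4, .5, 4.5], [2, 5, .6, 3]],
                   columns=["c1", "c2", "c3", "c4"])

# melt stacks c1, c2, c3, c4 in column order and resets the index to 0..11.
c5 = ref.melt(value_name="c5")["c5"]

# Aligning on the index pads c1..c4 with NaN for rows 3..11.
out = pd.concat([ref, c5], axis=1)
print(out)
```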
Use list(zip()) as follows:
d=list(zip(df1.c1,df1.c2,df1.c3,df1.c4))
df2['c5']=pd.Series(d)
Try this one; it works as you expected:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
print(df)
r = len(df['c1'])
c = len(list(df))
ndata = list(df.c1) + list(df.c2) + list(df.c3) + list(df.c4)
r = len(ndata) - r
t = r*c
dfnan = pd.DataFrame(np.reshape([np.nan]*t, (r,c)), columns=list(df))
df = df.append(dfnan)
df['c5'] = ndata
print(df)
This could be a fast option and maybe you can use it inside a loop.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[2,3,4,5],[3,4,5,6]], columns=['c1','c2','c3','c4'])
df['c5'] = df.iloc[:,0].astype(str) + df.iloc[:,1].astype(str) + df.iloc[:,2].astype(str) + df.iloc[:,3].astype(str)
Greetings
