I have a dataset of half-hourly weather observations in this format:
df = pd.DataFrame({
    'date': ['2019-01-01 09:30:00', '2019-01-01 10:00:00', '2019-01-02 04:30:00',
             '2019-01-02 05:00:00', '2019-07-04 02:00:00'],
    'windSpeedHigh': [155, 90, 35, 45, 15],
    'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S']})
My goal is to find the highest wind speed each day and the wind direction associated with that maximum daily wind speed.
Using resample, I have successfully found the maximum wind speed for each day, but not its associated direction:
df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily
Results in:
windSpeedHigh_Dir windSpeedHigh
date
2019-01-01 NNW 155.0
2019-01-02 W 45.0
2019-01-03 NaN NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
... ... ...
2019-06-30 NaN NaN
2019-07-01 NaN NaN
2019-07-02 NaN NaN
2019-07-03 NaN NaN
2019-07-04 S 15.0
This is incorrect, because the resample takes max() of 'windSpeedHigh_Dir' independently as well. For 2019-01-01 the direction associated with the maximum wind speed should be 'NE', not 'NNW', since 'NE' was the direction recorded when the day's maximum speed (155) occurred.
So my question is: is it possible to resample this dataset from half-hourly to daily maximum wind speed while keeping the wind direction associated with that speed?
First use DataFrameGroupBy.idxmax to get, for each date, the index of the row holding the maximum wind speed:
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00 155 NE
3 2019-01-02 05:00:00 45 W
4 2019-07-04 02:00:00 15 S
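For reference, the intermediate idxmax result is just a Series of row labels, one per day that has data (shown here with the sample frame above); DataFrame.loc then selects those full rows, so the direction column comes along for free:
df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()
# date
# 2019-01-01    0
# 2019-01-02    3
# 2019-07-04    4
# Name: windSpeedHigh, dtype: int64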
Then, to add a daily DatetimeIndex, use DataFrame.set_index with Series.dt.normalize and DataFrame.asfreq:
df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('d')
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
day
2019-01-01 2019-01-01 09:30:00 155.0 NE
2019-01-02 2019-01-02 05:00:00 45.0 W
2019-01-03 NaT NaN NaN
2019-01-04 NaT NaN NaN
2019-01-05 NaT NaN NaN
... ... ...
2019-06-30 NaT NaN NaN
2019-07-01 NaT NaN NaN
2019-07-02 NaT NaN NaN
2019-07-03 NaT NaN NaN
2019-07-04 2019-07-04 02:00:00 15.0 S
[185 rows x 3 columns]
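Here asfreq('d') reindexes the result to a continuous daily range, which is why the calendar days with no observations come back as NaT/NaN rows.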
Your resample-based solution also works with a custom aggregation function, because a bare idxmax fails on empty groups (days with no rows); join the matched rows back with DataFrame.join:
import numpy as np

f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')
print (df_daily)
idx date windSpeedHigh windSpeedHigh_Dir
date
2019-01-01 0.0 2019-01-01 09:30:00 155.0 NE
2019-01-02 3.0 2019-01-02 05:00:00 45.0 W
2019-01-03 NaN NaT NaN NaN
2019-01-04 NaN NaT NaN NaN
2019-01-05 NaN NaT NaN NaN
... ... ... ...
2019-06-30 NaN NaT NaN NaN
2019-07-01 NaN NaT NaN NaN
2019-07-02 NaN NaT NaN NaN
2019-07-03 NaN NaT NaN NaN
2019-07-04 4.0 2019-07-04 02:00:00 15.0 S
[185 rows x 4 columns]
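If you prefer to avoid idxmax altogether, here is a minimal sketch of an equivalent sort-and-deduplicate approach (using the same df as above; ties keep an arbitrary winner):
# Sort so the fastest observation of each day comes last,
# then keep one row per calendar day.
df_daily = (df.sort_values('windSpeedHigh')
              .assign(day=df['date'].dt.normalize())
              .drop_duplicates('day', keep='last')
              .set_index('day')
              .sort_index())
Chaining .asfreq('d') afterwards reintroduces the empty calendar days, as in the outputs above.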
Related
I am trying to subtract, or at least compare, only the time component of two datetime64 columns, but have been unsuccessful. I have tried using strftime with an exception block to catch NaTs, but no luck. Any help is much appreciated. I have attached the Python code below.
Column A Column B
1/1/1900 10:00 NaT
1/1/1900 10:30 NaT
1/1/1900 11:00 NaT
1/1/1900 9:00 2/6/2021 23:59
1/1/1900 11:00 2/6/2021 8:59
1/1/1900 9:30 2/6/2021 16:00
def convert(x):
    try:
        return x.strftime("%H:%M:%S")
    except ValueError:
        return x

df['B'].apply(convert) - df['A'].apply(convert)
I get the error TypeError: unsupported operand type(s) for -: 'NaTType' and 'str'
Convert both columns to pandas datetime using pd.to_datetime, then subtract them and inspect the result's Series.dt.components:
df['Column A'] = pd.to_datetime(df['Column A'])
df['Column B'] = pd.to_datetime(df['Column B'])
In [213]: (df['Column A'] - df['Column B']).dt.components
Out[213]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 -44232.0 9.0 1.0 0.0 0.0 0.0 0.0
4 -44231.0 2.0 1.0 0.0 0.0 0.0 0.0
5 -44232.0 17.0 30.0 0.0 0.0 0.0 0.0
From the above, you can extract hours, minutes, etc.. separately:
In [215]: (df['Column A'] - df['Column B']).dt.components.hours
Out[215]:
0 NaN
1 NaN
2 NaN
3 9.0
4 2.0
5 17.0
Name: hours, dtype: float64
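If you need the difference between only the time-of-day parts (ignoring the dates, which here are decades apart), one sketch is to subtract each column's own midnight first; NaT rows propagate to NaT automatically. The column names follow the example above:
# Strip the date part by subtracting midnight of each row's own day,
# leaving pure time-of-day timedeltas that can be compared directly.
a = df['Column A'] - df['Column A'].dt.normalize()
b = df['Column B'] - df['Column B'].dt.normalize()
diff = b - a   # e.g. 23:59:00 - 09:00:00 -> 0 days 14:59:00; NaT stays NaT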
I'm working with Python 3 on macOS 10.11.6 (El Capitan).
I have a .csv dataset consisting of about 3,700 time series sets (of unequal lengths). The data are currently formatted as follows:
Current Format
trade_date price_usd ticker
0 2016-01-01 434.33000 BTC
1 2016-01-02 433.44000 BTC
2 2016-01-03 430.01000 BTC
3 2016-01-04 433.09000 BTC
4 2016-01-05 431.96000 BTC
... ... ... ...
2347227 2020-10-19 74.13000 BRAIN
2347228 2020-10-20 71.97000 BRAIN
2347229 2020-10-21 76.64000 BRAIN
2347230 2020-10-22 80.90000 BRAIN
2347231 2020-10-19 0.15004 DAOFI
Ignoring the default numerical index for the moment, notice that the datetime column, trade_date, repeats its sequence of values with each new ticker group. My goal is to transform the data so that each ticker name becomes a column header under which its daily prices are listed in the correct order, against a datetime index that does not repeat:
Target Format
trade_date ticker1 ticker2 ... tickerN
day1 t1p1 t2p1 ... tNp1
day2 t1p2 t2p2 ... etc...
.
.
.
dayK
Thus far I've tried various approaches, including experiments with methods such as stack()/unstack() and groupby(), as well as custom functions that attempt to iterate through the values and assign them into a pre-built empty DataFrame, but to no avail (see the failed attempt below).
New, empty target data frame with the ticker symbols as columns and the trade_date range as index:
BTC ETH XRP MKR LTC USDT BCH XLM EOS BNB ... MTLX INDEX WOA HAUT THRM YFED NMT DOKI BRAIN DAOFI
2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Failed attempt to populate the above:
for element in crypto_df['ticker']:
    if element == new_df.column and crypto['trade_date'] == new_df.index:
        df['ticker'] = element

new_df.head()
My ultimate goal is to produce a multi-series time series forecast using FBProphet because of its ability to handle multiple time series forecasts in a "single" model.
One last thought: one could create a separate data frame for each ticker and then rejoin them along the datetime index, building the new columns along the way, but that seems a bit roundabout (I've literally just done this for a couple thousand .csv files of equities data, for example)... I'd still like to find a more direct solution, if there is one. Surely this scenario will arise again in the future!
Thanks for any thoughts ...
You can use set_index and unstack:
print(df.set_index(["trade_date", "ticker"]).unstack("ticker"))
price_usd
ticker BRAIN BTC DAOFI
trade_date
2016-01-01 NaN 434.33 NaN
2016-01-02 NaN 433.44 NaN
2016-01-03 NaN 430.01 NaN
2016-01-04 NaN 433.09 NaN
2016-01-05 NaN 431.96 NaN
2020-10-19 74.13 NaN 0.15004
2020-10-20 71.97 NaN NaN
2020-10-21 76.64 NaN NaN
2020-10-22 80.90 NaN NaN
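Assuming each (trade_date, ticker) pair occurs at most once, DataFrame.pivot is an equivalent one-liner (it raises on duplicate pairs, which doubles as a sanity check); its result has flat ticker columns rather than the price_usd level shown above:
df.pivot(index='trade_date', columns='ticker', values='price_usd')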
First use .groupby(), then use .unstack():
import pandas as pd
from io import StringIO
text = """
trade_date price_usd ticker
2016-01-01 434.33000 BTC
2016-01-02 433.44000 BTC
2016-01-02 430.01000 Google
2016-01-03 433.09000 BTC
2016-01-03 431.96000 Google
"""
df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
df.groupby(['trade_date', 'ticker'])['price_usd'].mean().unstack()
Resulting dataframe:
ticker         BTC  Google
trade_date
2016-01-01  434.33     NaN
2016-01-02  433.44  430.01
2016-01-03  433.09  431.96
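The mean() only matters if a (trade_date, ticker) pair repeats; as a sketch, pivot_table expresses the same intent in one call:
df.pivot_table(index='trade_date', columns='ticker', values='price_usd', aggfunc='mean')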
I'm new to Python and coding, and I'm struggling with how to approach this.
I have a dataframe formatted as such:
Timestamp A B C
00:00:00 NaN NaN 15.67
00:00:00 NaN 1.66 NaN
00:00:00 95.30 NaN NaN
00:10:00 NaN NaN 5.44
00:10:00 NaN 22.67 NaN
00:10:00 96.55 NaN NaN
and I want to combine the rows that share a timestamp while keeping the data in their respective columns, like so:
Timestamp A B C
00:00:00 95.30 1.66 15.67
00:10:00 96.55 22.67 5.44
I'm thinking of iterating through each row, removing the NaNs and replacing them with the value below, but I don't know whether that would keep the timestamps consistent.
Thanks!
If Timestamp is the index
A B C
Timestamp
00:00:00 NaN NaN 15.67
00:00:00 NaN 1.66 NaN
00:00:00 95.30 NaN NaN
00:10:00 NaN NaN 5.44
00:10:00 NaN 22.67 NaN
00:10:00 96.55 NaN NaN
Then
df.groupby('Timestamp').first()
A B C
Timestamp
00:00:00 95.30 1.66 15.67
00:10:00 96.55 22.67 5.44
If Timestamp is a column
Timestamp A B C
0 00:00:00 NaN NaN 15.67
1 00:00:00 NaN 1.66 NaN
2 00:00:00 95.30 NaN NaN
3 00:10:00 NaN NaN 5.44
4 00:10:00 NaN 22.67 NaN
5 00:10:00 96.55 NaN NaN
Then
df.groupby('Timestamp', as_index=False).first()
Timestamp A B C
0 00:00:00 95.30 1.66 15.67
1 00:10:00 96.55 22.67 5.44
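In both cases GroupBy.first returns the first non-null value per column within each group, which is exactly what collapses these staggered rows into one. If a timestamp could carry more than one non-null value in the same column, an explicit aggregation is safer, for example:
df.groupby('Timestamp', as_index=False).max()   # largest value per column instead of the first non-null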
I have a DataFrame, and I want to create new columns based on the values of one of its columns; in each of these new columns the values should be the count of repetitions of Plate over time.
So I have this DataFrame:
Val_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
The EURO column looks like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is a count of each time a Plate value repeats itself for a specific EURO type over a specific time.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem:
s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df = pd.concat([df, s], axis=1, sort=True)
df
Out[259]:
                                                Plate  EURO  EURO0  EURO6
2013-11-01 00:00:00                               NaN   NaN    NaN    NaN
2013-11-01 01:00:00  dcc2f657e897ffef752003469c688381   0.0    1.0    0.0
2013-11-01 02:00:00  a5ac0c2f48ea80707621e530780139ad   6.0    0.0    1.0
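A sketch of the same result using the top-level pd.get_dummies, which skips the string round-trip (recent pandas returns boolean dummies, so cast if you need the 0.0/1.0 floats shown above):
s = pd.get_dummies(df['EURO'].dropna().astype(int), prefix='EURO', prefix_sep='')
df = pd.concat([df, s], axis=1, sort=True)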
I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside a zip file. My attempt:
import urllib.request
import zipfile

import pandas as pd

urllib.request.urlretrieve(link, "GDP.zip")
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need the parameter skiprows, because the csv header is in row 5:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN
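For completeness, the whole flow can also run without a temporary file on disk; a sketch assuming the member name inside the archive has not changed:
import io
import urllib.request
import zipfile

import pandas as pd

link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
raw = urllib.request.urlopen(link).read()          # download the zip into memory
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    with zf.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv') as f:
        GDP = pd.read_csv(f, skiprows=4)           # header sits on row 5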