I'm new to Python and coding, and I'm struggling with how to approach this.
I have a dataframe formatted like this:
Timestamp A B C
00:00:00 NaN NaN 15.67
00:00:00 NaN 1.66 NaN
00:00:00 95.30 NaN NaN
00:10:00 NaN NaN 5.44
00:10:00 NaN 22.67 NaN
00:10:00 96.55 NaN NaN
and I want to combine rows that share the same timestamp while keeping the data in their respective columns, like this:
Timestamp A B C
00:00:00 95.30 1.66 15.67
00:10:00 96.55 22.67 5.44
I'm thinking of iterating through each row, removing each NaN and replacing it with the value below it, but I don't know whether that would keep the timestamps consistent.
Thanks!
If Timestamp is the index
A B C
Timestamp
00:00:00 NaN NaN 15.67
00:00:00 NaN 1.66 NaN
00:00:00 95.30 NaN NaN
00:10:00 NaN NaN 5.44
00:10:00 NaN 22.67 NaN
00:10:00 96.55 NaN NaN
Then
df.groupby('Timestamp').first()
A B C
Timestamp
00:00:00 95.30 1.66 15.67
00:10:00 96.55 22.67 5.44
If Timestamp is a column
Timestamp A B C
0 00:00:00 NaN NaN 15.67
1 00:00:00 NaN 1.66 NaN
2 00:00:00 95.30 NaN NaN
3 00:10:00 NaN NaN 5.44
4 00:10:00 NaN 22.67 NaN
5 00:10:00 96.55 NaN NaN
Then
df.groupby('Timestamp', as_index=False).first()
Timestamp A B C
0 00:00:00 95.30 1.66 15.67
1 00:10:00 96.55 22.67 5.44
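For reference, a minimal runnable sketch of the column case, rebuilding the sample data from the question (np.nan stands in for the blank cells):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['00:00:00'] * 3 + ['00:10:00'] * 3,
    'A': [np.nan, np.nan, 95.30, np.nan, np.nan, 96.55],
    'B': [np.nan, 1.66, np.nan, np.nan, 22.67, np.nan],
    'C': [15.67, np.nan, np.nan, 5.44, np.nan, np.nan],
})

# first() takes the first non-NaN value per column within each group,
# so the three partial rows collapse into one complete row per timestamp
print(df.groupby('Timestamp', as_index=False).first())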
I have a dataset of hourly weather observations in this format:
df = pd.DataFrame({ 'date': ['2019-01-01 09:30:00', '2019-01-01 10:00', '2019-01-02 04:30:00','2019-01-02 05:00:00','2019-07-04 02:00:00'],
'windSpeedHigh': [155,90,35,45,15],
'windSpeedHigh_Dir':['NE','NNW','SW','W','S']})
My goal is to find the highest wind speed each day and the wind direction associated with that maximum daily wind speed.
Using resample, I have successfully found the maximum wind speed for each day, but not its associated direction:
df['date'] = pd.to_datetime(df['date'])
df['windSpeedHigh'] = pd.to_numeric(df['windSpeedHigh'])
df_daily = df.resample('D', on='date')[['windSpeedHigh_Dir','windSpeedHigh']].max()
df_daily
Results in:
windSpeedHigh_Dir windSpeedHigh
date
2019-01-01 NNW 155.0
2019-01-02 W 45.0
2019-01-03 NaN NaN
2019-01-04 NaN NaN
2019-01-05 NaN NaN
... ... ...
2019-06-30 NaN NaN
2019-07-01 NaN NaN
2019-07-02 NaN NaN
2019-07-03 NaN NaN
2019-07-04 S 15.0
This is incorrect, because the resample also takes max() of 'windSpeedHigh_Dir' independently. For 2019-01-01 the direction associated with the maximum wind speed should be 'NE', not 'NNW', since df['windSpeedHigh_Dir'] == 'NE' in the row where the maximum wind speed occurred.
So my question is, is it possible for me to resample this dataset from half-hourly to daily maximum wind speed while keeping the wind direction associated with that speed?
First use DataFrameGroupBy.idxmax to get the index of the maximum-speed row for each date:
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
0 2019-01-01 09:30:00 155 NE
3 2019-01-02 05:00:00 45 W
4 2019-07-04 02:00:00 15 S
Then, to add a daily DatetimeIndex, use DataFrame.set_index with Series.dt.normalize and DataFrame.asfreq:
df_daily = df_daily.set_index(df_daily['date'].dt.normalize().rename('day')).asfreq('d')
print (df_daily)
date windSpeedHigh windSpeedHigh_Dir
day
2019-01-01 2019-01-01 09:30:00 155.0 NE
2019-01-02 2019-01-02 05:00:00 45.0 W
2019-01-03 NaT NaN NaN
2019-01-04 NaT NaN NaN
2019-01-05 NaT NaN NaN
... ... ...
2019-06-30 NaT NaN NaN
2019-07-01 NaT NaN NaN
2019-07-02 NaT NaN NaN
2019-07-03 NaT NaN NaN
2019-07-04 2019-07-04 02:00:00 15.0 S
[185 rows x 3 columns]
Your resample-based solution should also work with a custom function, because plain idxmax fails on empty groups; combine it with DataFrame.join:
import numpy as np

# idxmax raises on empty groups, so fall back to NaN for days with no data
f = lambda x: x.idxmax() if len(x) > 0 else np.nan
df_daily = df.resample('D', on='date')['windSpeedHigh'].agg(f).to_frame('idx').join(df, on='idx')
print (df_daily)
idx date windSpeedHigh windSpeedHigh_Dir
date
2019-01-01 0.0 2019-01-01 09:30:00 155.0 NE
2019-01-02 3.0 2019-01-02 05:00:00 45.0 W
2019-01-03 NaN NaT NaN NaN
2019-01-04 NaN NaT NaN NaN
2019-01-05 NaN NaT NaN NaN
... ... ... ...
2019-06-30 NaN NaT NaN NaN
2019-07-01 NaN NaT NaN NaN
2019-07-02 NaN NaT NaN NaN
2019-07-03 NaN NaT NaN NaN
2019-07-04 4.0 2019-07-04 02:00:00 15.0 S
[185 rows x 4 columns]
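For reference, a minimal sketch of the idxmax approach end to end, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'date': ['2019-01-01 09:30:00', '2019-01-01 10:00',
                            '2019-01-02 04:30:00', '2019-01-02 05:00:00',
                            '2019-07-04 02:00:00'],
                   'windSpeedHigh': [155, 90, 35, 45, 15],
                   'windSpeedHigh_Dir': ['NE', 'NNW', 'SW', 'W', 'S']})
df['date'] = pd.to_datetime(df['date'])

# idxmax returns the label of the row holding each day's maximum speed,
# and .loc pulls the whole row, so the direction stays paired with it
df_daily = df.loc[df.groupby(df['date'].dt.date)['windSpeedHigh'].idxmax()]
print(df_daily)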
I'm facing a weird behavior with the pandas set_index function. I initially have this dataframe:
Unnamed: 0 Timestamps PM10
0 NaN NaT PM10
1 NaN NaT µg/m³
2 NaN 2018-12-31 23:00:00 10.76
3 NaN 2018-12-31 22:00:00 9.46
4 NaN 2018-12-31 21:00:00 8.67
... ... ... ...
8682 NaN 2018-01-01 04:00:00 25.14
8683 NaN 2018-01-01 03:00:00 31.34
8684 NaN 2018-01-01 02:00:00 36.28
8685 NaN 2018-01-01 01:00:00 21.78
8686 NaN 2018-01-01 00:00:00 20.59
I want to drop the first two rows and set Timestamps as the index, so I do this:
df_final = df.drop([0,1]).set_index('Timestamps', drop=True)
and I get this dataframe:
Unnamed: 0 PM10
Timestamps
2018-12-31 23:00:00 NaN 10.76
2018-12-31 22:00:00 NaN 9.46
2018-12-31 21:00:00 NaN 8.67
2018-12-31 20:00:00 NaN 10.42
2018-12-31 19:00:00 NaN 10.04
... ... ...
2018-01-01 04:00:00 NaN 25.14
2018-01-01 03:00:00 NaN 31.34
2018-01-01 02:00:00 NaN 36.28
2018-01-01 01:00:00 NaN 21.78
2018-01-01 00:00:00 NaN 20.59
So far so good, but finally I want to re-index the PM10 column by a new time index I have created called t_index, so I do this:
data_write = df_final.PM10[-1::-1].reindex(t_index)
That is where I get an error:
TypeError: 'NoneType' object is not iterable
After some debugging I have concluded that set_index is causing this but I can't figure out why, any help is appreciated!
After some trial and error I managed to make this work, and here is the code that does it:
df = df.drop([0,1]).drop("Unnamed: 0", axis=1).set_index('Timestamps', drop=True)
df = df.sort_values(by="Timestamps", ascending=True)
year = 2018
start_index = '{}-01-01 00:00:00'.format(year) # define start of the year
end_index = '{}-12-31 23:00:00'.format(year) # define end of the year
t_index = pd.date_range(start=start_index, end=end_index, freq='1h').strftime("%Y-%m-%d %H:%M:%S")
df_final = pd.to_numeric(df.PM10).resample('H').mean().reindex(t_index)
Still not sure what was causing the error, or why the .asfreq method did not work.
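For reference, a minimal sketch of the same pipeline on current pandas, assuming the raw df from the top of the question and that the Timestamps column parses cleanly; keeping t_index as a DatetimeIndex (rather than formatted strings) avoids dtype mismatches when reindexing:

import pandas as pd

df = df.drop([0, 1]).drop("Unnamed: 0", axis=1)
df['Timestamps'] = pd.to_datetime(df['Timestamps'])
df = df.set_index('Timestamps').sort_index()

# full-year hourly index for 2018
t_index = pd.date_range('2018-01-01 00:00:00', '2018-12-31 23:00:00', freq='h')
df_final = pd.to_numeric(df['PM10']).resample('h').mean().reindex(t_index)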
My df looks like this:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the param column has repeating values, and the transposed column names are created from the per values. Also, a new output record starts whenever per restarts at 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group - first replace the missing values with a counter built from the cumulative sum of the NaN mask within each param group, and assign it to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex, building group ids by comparing per1 to 1 and taking the cumulative sum, and reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
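The key step is g: per1.eq(1).cumsum() starts a new group id every time the per sequence restarts at 1, so each run of 1..n becomes one output row. A quick check, assuming the df with per1 from above:

g = df['per1'].eq(1).cumsum()
print(g.tolist())
# [1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6]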
I have a DataFrame, and I want to create new columns based on the values of the EURO column; in each of these columns I want the value to be the count of repetitions of Plate over time.
So I have this DataFrame:
Val_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
The EURO column's value counts look like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is a count of each time a Plate value repeats for a specific EURO type at a given time.
Any suggestions would be much appreciated, thank you.
This is more like a get_dummies problem:
# one-hot encode the EURO codes, then align the dummies back to the original frame
s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df = pd.concat([df, s], axis=1, sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-0100:00:00 NaN NaN NaN NaN
2013-11-0101:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-0102:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
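Note that add_prefix('EURO') produces column names like EURO0 and EURO6; to match the EURO_1-style names in the desired output, use add_prefix('EURO_') instead:

s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO_')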
I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside of a zip file. My try:
import urllib.request
urllib.request.urlretrieve(link, "GDP.zip")
import zipfile
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need the skiprows parameter, because the csv header is in row 5:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN
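For reference, a minimal end-to-end sketch that keeps the download in memory, assuming the zip member name shown above:

import io
import urllib.request
import zipfile

import pandas as pd

link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"

# fetch the zip into memory and open the csv member directly
with urllib.request.urlopen(link) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

csv_file = archive.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')

# skip the four metadata rows above the real header
GDP = pd.read_csv(csv_file, skiprows=4)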