How in Python can I get dataframe coordinates that are not items in a column into columns within the same dataframe?

I am using xarray to read in two different netCDF files, using combine='by_coords'. The data read in is then converted to a dataframe, and the printed output is shown below.
                                     tag         p
lat        lon          time
23.025642  -110.925552  2010-01-01     0       NaN
                        2010-01-02     0       NaN
                        2010-01-03     0       NaN
                        2010-01-04     0       NaN
                        2010-01-05     0       NaN
...                                  ...       ...
29.974609  -90.084259   2010-12-20     0  9.711414
                        2010-12-21     0  8.313345
                        2010-12-22     0  6.525973
                        2010-12-23     0  1.124200
                        2010-12-24     0  0.000000

[64110060 rows x 2 columns]
The data variables are placed as columns, but the coordinate variables are not. I have tried pulling lat and lon out separately and appending them to the dataframe, but that does not work (the sizes differ).
How might I be able to get the lat and lon as columns, so I can then use pandas groupby function with these?
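One way to do this (a minimal sketch, assuming the dataframe came from xarray's to_dataframe(), so that lat, lon, and time form a MultiIndex): reset_index() moves the index levels into ordinary columns, after which groupby can use them.

# df is the dataframe shown above, with (lat, lon, time) as MultiIndex.
# Move the MultiIndex levels into regular columns.
df = df.reset_index()

# lat and lon are now ordinary columns, so groupby works on them,
# e.g. the mean of p per grid point:
means = df.groupby(['lat', 'lon'])['p'].mean()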

Related

How to unmerge cells and create a standard dataframe when reading excel file?

I would like to convert a dataframe read from an Excel sheet containing merged cells into a standard, unmerged dataframe (both were shown as screenshots in the original post). So far, reading the Excel file the standard way gives me the following result.
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
  Unnamed: 0   Unnamed: 1  T12006      T22006      T32006  \
0 Casablanca       Global     100   97.272520   93.464538
1        NaN  Résidentiel     100   95.883979   92.414063
2        NaN  Appartement     100   95.425152   91.674379
3        NaN       Maison     100  101.463607  104.039383
4        NaN        Villa     100  102.451320  101.996932
Thank you
You can try the .fillna() method with the parameter method='ffill'. According to the pandas documentation: ffill: propagate last valid observation forward to next valid.
So, your code would be like:
df.fillna(method='ffill', inplace=True)
And rename the first two columns like this:
df = df.rename(columns={"Unnamed: 0": "City", "Unnamed: 1": "Type"})
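Putting it together, a minimal end-to-end sketch (the file name and sheet are taken from the question; note that newer pandas versions deprecate fillna(method='ffill') in favour of the equivalent .ffill()):

import pandas as pd

# Read the sheet; merged cells arrive as NaN below their first row.
df = pd.read_excel('abcd.xlsx', sheet_name="Sheet1")

# Propagate the last valid value downward over the merged-cell gaps.
df = df.ffill()

# Give the two unnamed columns meaningful names.
df = df.rename(columns={"Unnamed: 0": "City", "Unnamed: 1": "Type"})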

Time series resampling with column of type object

Good evening,
I want to resample an irregular time series that has a column of type object, but it does not work.
Here is my sample data:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2006-07-30 LEVETIRACETAM 1
2008-03-19 FLAVOXATE HYDROCHLORIDE 1
2010-01-01 LEVOTHYROXINE SODIUM 1
2011-04-01 BIMATOPROST 1
I tried to resample my dataframe daily, but my code, which is as follows, does not give what I want:
df3 = df1.resample('D', on='Actual start date').sum()
and here is what it gives:

Actual start date  NumberShortage
2002-01-01                      1
2002-01-02                      0
2002-01-03                      0
2002-01-04                      0
2002-01-05                      0

and what I want as a result:

Actual start date  Ingredients                NumberShortage
2002-01-01         LEVOBUNOLOL HYDROCHLORIDE               1
2002-01-02         NaN                                     0
2002-01-03         NaN                                     0
2002-01-04         NaN                                     0
2002-01-05         NaN                                     0
Any ideas?
Details on the data: I start from an Excel file that contains several attributes, which I convert to a CSV file (the file can be downloaded from https://www.drugshortagescanada.ca/search?perform=0). I then group by 'Actual start date' and 'Ingredients' to obtain 'NumberShortage'.
and here is the source code:
import pandas as pd
df = pd.read_excel("Data/Data.xlsx")
df = df.dropna(how='any')
df = df.groupby(['Actual start date','Ingredients']).size().reset_index(name='NumberShortage')
Finally, after applying your code, here is the error it gives me (attached as a screenshot in the original post):
and here is the sample Excel file:

Brand name              Company Name       Ingredients            Actual start date
ACETAMINOPHEN           PHARMASCIENCE INC  ACETAMINOPHEN CODEINE  2017-03-23
PMS-METHYLPHENIDATE ER  PHARMASCIENCE INC  METHYLPHENIDATE        2017-03-28
You rather need to reindex, using date_range as the source of the new dates and the time series as a temporary index. (resample(...).sum() aggregates only the numeric columns, which is why the object column Ingredients disappears from its output, whereas reindex keeps it.)
df['Actual start date'] = pd.to_datetime(df['Actual start date'])

(df
 .set_index('Actual start date')
 .reindex(pd.date_range(df['Actual start date'].min(),
                        df['Actual start date'].max(), freq='D'))
 .fillna({'NumberShortage': 0}, downcast='infer')
 .reset_index()
)
output:

           index                Ingredients  NumberShortage
0     2002-01-01  LEVOBUNOLOL HYDROCHLORIDE               1
1     2002-01-02                        NaN               0
2     2002-01-03                        NaN               0
3     2002-01-04                        NaN               0
4     2002-01-05                        NaN               0
...          ...                        ...             ...
3373  2011-03-28                        NaN               0
3374  2011-03-29                        NaN               0
3375  2011-03-30                        NaN               0
3376  2011-03-31                        NaN               0
3377  2011-04-01                BIMATOPROST               1

[3378 rows x 3 columns]
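As an optional tweak (not part of the original answer): reset_index() calls the new column "index" because the date_range index has no name, so naming the axis first restores the original column name.

# Name the new datetime index before resetting it, so the resulting
# column is called 'Actual start date' instead of 'index'.
out = (df
       .set_index('Actual start date')
       .reindex(pd.date_range(df['Actual start date'].min(),
                              df['Actual start date'].max(), freq='D'))
       .rename_axis('Actual start date')
       .fillna({'NumberShortage': 0}, downcast='infer')
       .reset_index()
)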

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one representation to another. Each item in the time series has attributes "time", "id", and "value" (think of it as a measurement at "time" for sensor "id"). I'm storing all the items in a pandas dataframe with columns named by the attributes. The set of "time"s is a small set of integers (say, 32), but some of the "id"s are missing "time"s/"value"s. What I want to construct is an output dataframe of the form:

id  time0  time1  ...  timeN
     val0   val1  ...   valN

where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time  id  value
   0   0     13
   2   0     15
   3   0     20
   2   1     10
   3   1     12
Then, assuming the set of possible times is 0, 2, and 3, the desired output is:

id  time0  time1  time2  time3
 0     13    NaN     15     20
 1    NaN    NaN     10     12

I'm looking for a Pythonic way to do this, since there are several million rows in the input and around a quarter of a million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
>>> df
time     0     2     3
id
0     13.0  15.0  20.0
1      NaN  10.0  12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:

    time0  time1  time2  time3
id
0    13.0    NaN   15.0   20.0
1     NaN    NaN   10.0   12.0
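If the input can contain duplicate (id, time) pairs, pivot raises a ValueError; a sketch using the more general pivot_table would then be (aggregating duplicates with the mean is an assumption here, any reducer works):

# Collapse duplicate (id, time) measurements with their mean
# before spreading the times into columns.
df = df.pivot_table(index="id", columns="time", values="value",
                    aggfunc="mean")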

Subtract two ECDF time series

Hi, I have an ECDF plot made with seaborn (shown in the original post).
I can obtain this by doing sns.ecdfplot(data=df2, x='time', hue='seg_oper', stat='count').
My dataframe is very simple:
In [174]: df2
Out[174]:
           time           seg_oper
265       18475     1->0:ADD['TX']
2342      78007     0->1:ADD['RX']
2399      78613  1->0:DELETE['TX']
2961      87097     0->1:ADD['RX']
2994      87210     0->1:ADD['RX']
...         ...                ...
330823  1002281  1->0:DELETE['TX']
331256  1003545  1->0:DELETE['TX']
331629  1004961  1->0:DELETE['TX']
332375  1006663  1->0:DELETE['TX']
333083  1008644  1->0:DELETE['TX']

[834 rows x 2 columns]
How can I subtract the series 0->1:ADD['RX'] from 1->0:DELETE['TX']?
I like seaborn because most of this data mangling is done inside the library, but in this case I need to subtract these two series ...
Thanks.
So the first thing is to obtain what seaborn does, but manually. After that (because I need to) I can subtract one series from the other.
Cumulative Count
First we need to obtain a cumulative count per each series.
In [304]: df2['cum'] = df2.groupby(['seg_oper']).cumcount()
In [305]: df2
Out[305]:
           time           seg_oper  cum
265       18475     1->0:ADD['TX']    0
2961      87097     0->1:ADD['RX']    1
2994      87210     0->1:ADD['RX']    2
...         ...                ...  ...
332375  1006663  1->0:DELETE['TX']  413
333083  1008644  1->0:DELETE['TX']  414
Pivot the data
Rearrange the DF.
In [307]: df3 = df2.pivot(index='time', columns='seg_oper', values='cum').reset_index()
In [308]: df3
Out[308]:
seg_oper     time  0->1:ADD['RX']  1->0:ADD['TX']  1->0:DELETE['TX']
0           18475             NaN             0.0                NaN
1           78007             0.0             NaN                NaN
2           78613             NaN             NaN                0.0
3           87097             1.0             NaN                NaN
4           87210             2.0             NaN                NaN
..            ...             ...             ...                ...
828       1002281             NaN             NaN              410.0
829       1003545             NaN             NaN              411.0
830       1004961             NaN             NaN              412.0
831       1006663             NaN             NaN              413.0
832       1008644             NaN             NaN              414.0

[833 rows x 4 columns]
Fill the gaps
I'm assuming that each NaN can be filled with the previous valid value in its column, until the next value appears.
df3 = df3.fillna(method='ffill')
At this point, if you plot df3 you'll obtain the same figure as sns.ecdfplot(df2) gives with seaborn.
I still want to subtract one series from the other:
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')
The resulting plot is shown in the original post.
PS: I don't understand the downvote on the question. If someone can explain it, I'd appreciate it.
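For reference, the steps above condensed into one runnable sketch (assuming df2 has the time and seg_oper columns from the question, and matplotlib is available for the plot):

# Cumulative count per series, reproducing stat='count' ECDFs.
df2['cum'] = df2.groupby('seg_oper').cumcount()

# One column per series, indexed by time, gaps forward-filled.
df3 = (df2.pivot(index='time', columns='seg_oper', values='cum')
          .ffill()
          .reset_index())

# Difference of the two ECDF-like counts.
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')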

TypeError: '(slice(None, 59, None), slice(None, None, None))' is an invalid key

I have the table below, from which I want to remove the rows with NaN values.
                  date    Open  ...  Real Lower Band  Real Upper Band
0  2020-07-08 08:05:00  2.1200  ...              NaN              NaN
1  2020-07-08 09:00:00  2.1400  ...              NaN              NaN
2  2020-07-08 09:30:00  2.1800  ...              NaN              NaN
3  2020-07-08 09:35:00  2.2000  ...              NaN              NaN
4  2020-07-08 09:40:00  2.1710  ...              NaN              NaN
5  2020-07-08 09:45:00  2.1550  ...              NaN              NaN
These NaN values extend through row number 58.
To remove them I wrote the following code, but it raised the error in the title:
data.drop(data[:59,:],inplace= True)
print(data)
Please help me!
There are many options to choose from. (The error occurs because plain [] indexing on a dataframe does not accept a two-dimensional tuple of slices like data[:59, :]; only .loc and .iloc do.)
Drop rows by index label.
df.drop(list(range(59)), axis=0, inplace=True)
Drop if nans in selected columns.
df.dropna(axis=0, subset=['Real Upper Band'], inplace=True)
Select rows to keep by index label slice
df = df.loc[59:, :] # 59 is the label in index, if index was date then replace 59 with corresponding datetime
Select rows to keep by integer index slice (similar to slicing a list)
df = df.iloc[59:, :] # 59 is the 0-index row number, regardless of what index is set on df
Filter with .loc and boolean array returned by .isna()
df = df.loc[~df['Real Upper Band'].isna(), :]
Remember that .loc and .iloc work with two dimensions when applied to dataframes; it is recommended to use the full slice : to avoid ambiguity and improve performance, according to the docs: https://pandas.pydata.org/docs/user_guide/indexing.html
You want to keep rows from 59-th on, so the shortest code you can run is:
data = data[59:]
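If the number of leading NaN rows is not known in advance, a variant that finds the first valid row instead of hard-coding 59 (assuming, as in the sample, that 'Real Upper Band' is NaN exactly on the rows to drop):

# Keep everything from the first row where the band value is present.
first_valid = data['Real Upper Band'].first_valid_index()
data = data.loc[first_valid:]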
