Time series resampling with column of type object - python-3.x

Good evening,
I want to resample on an irregular time series with column of type object but it does not work
Here is my sample data:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2006-07-30 LEVETIRACETAM 1
2008-03-19 FLAVOXATE HYDROCHLORIDE 1
2010-01-01 LEVOTHYROXINE SODIUM 1
2011-04-01 BIMATOPROST 1
I tried to re-sample my data frame daily but it does not work with my code which is as follows:
df3 = df1.resample('D', on='Actual start date').sum()
and here is what it gives:
Actual start date NumberShortage
2002-01-01 1
2002-01-02 0
2002-01-03 0
2002-01-04 0
2002-01-05 0
and what I want as a result:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2002-01-02 NAN 0
2002-01-03 NAN 0
2002-01-04 NAN 0
2002-01-05 NAN 0
Any ideas?
details on the data
So I use an excel file which contains several attributes before it is a csv file (this file can be downloaded from this site web https://www.drugshortagescanada.ca/search?perform=0 ) then I group by 'Actual start date' and 'Ingredients'to obtain 'NumberShortage'
and here is the source code:
import pandas as pd
df = pd.read_excel("Data/Data.xlsx")
df = df.dropna(how='any')
df = df.groupby(['Actual start date','Ingredients']).size().reset_index(name='NumberShortage')
finally after having applied your source code here is the eureur which gives me :
and here is the sample excel file :
Brand name Company Name Ingredients Actual start date
ACETAMINOPHEN PHARMASCIENCE INC ACETAMINOPHEN CODEINE 2017-03-23
PMS-METHYLPHENIDATE ER PHARMASCIENCE INC METHYLPHENIDATE 2017-03-28

You rather need to reindex using date_range as a source of new dates, and the time series as temporary index:
df['Actual start date'] = pd.to_datetime(df['Actual start date'])
(df
.set_index('Actual start date')
.reindex(pd.date_range(df['Actual start date'].min(),
df['Actual start date'].max(), freq='D'))
.fillna({'NumberShortage': 0}, downcast='infer')
.reset_index()
)
output:
index Ingredients NumberShortage
0 2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
1 2002-01-02 NaN 0
2 2002-01-03 NaN 0
3 2002-01-04 NaN 0
4 2002-01-05 NaN 0
... ... ... ...
3373 2011-03-28 NaN 0
3374 2011-03-29 NaN 0
3375 2011-03-30 NaN 0
3376 2011-03-31 NaN 0
3377 2011-04-01 BIMATOPROST 1
[3378 rows x 3 columns]

Related

Convert one dataframe's format and check if each row exits in another dataframe in Python

Given a small dataset df1 as follow:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create date range in quarter from 2019-q2 to 2021-q1 as column names, then check if each row in df1's year and quarter for each city exist in df2.
If they exist, then return ys for that cell, otherwise, return NaNs.
The final result will like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True to be the correct values, reindex to add missing columns, and cleanup axes and index to get exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
.astype(bool) # Counts to boolean
.replace({False: np.NaN, True: 'y'}) # Fill values
.reindex(columns=col_names.strftime('%Y-q%q')) # Add missing columns
.rename_axis(columns=None) # Cleanup axis name
.reset_index() # reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
'year': [2019, 2020, 2020, 2020, 2020, 2021],
'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})

Display row with False values in validated pandas dataframe column [duplicate]

This question already has answers here:
Display rows with one or more NaN values in pandas dataframe
(5 answers)
Closed 2 years ago.
I was validating 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
And I'd love to display those rows with False values so I know whats wrong with them. How do I do that?
True 17984
False 13
Name: Price, dtype: int64
Thank you.
You could print your rows when price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean, which can be used to select the rows that cause errors.
This includes NaN or rows with strings
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column with .to_numeric(), and then checking for NaNs will not tell you why the rows had to be coerced.
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64

Converting timedeltas to integers for consecutive time points in pandas

Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time from the lowest time point for each time in the dataframe, I can use the apply function like
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values(inplace = True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?
I will recommend not use apply
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64

Reshape Pandas dataframe based on values in two columns

In Python, I would like to search through all rows in the dataframe with two possible paths (dataframe is populated from csv files). If the 'Group' column for a given row is zero, move that row's data to the next row of a new dataframe using the 'Channel_1' and 'Data_1' columns. If the 'Group' column for a given row is non-zero, then get all three rows with the same 'Group' column value (also identified by 'sub-group' column as 1, 2 or 3) and add to the next row of the new dataframe.
Code to generate dataframe from csv file:
for name in glob.glob(search_string):
r_file = pd.read_csv(name)
Current Data Format:
Channel_Num Group Sub_Group Data
1000 1 1 100
1001 1 2 105
1002 1 3 110
1003 0 0 200
1004 2 1 400
1005 2 2 405
1006 2 3 410
1007 0 0 500
Desired Data Format:
Group Channel_1 Data_1 Channel_2 Data_2 Channel_3 Data_3
1 1000 100 1001 105 1002 110
0 1003 200 NaN NaN NaN NaN
2 1004 400 1005 405 1006 410
0 1007 500 NaN NaN NaN NaN
I've tried the GroupBy and pivot_table methods but without success. After the data is in the desired format, there are other calculations that need run against the newly organized data but getting it in the desired format is the key.
This is more like a pivot problem after create the additional key by using diff and cumsum , so I am using pivot_table and multiple columns flatten
df.loc[df.Sub_Group==0,'Sub_Group']=1
df['newkey']=df.Group.diff().ne(0).cumsum()
s=df.pivot_table(index=['Group','newkey'],columns=['Sub_Group'],values=['Channel_Num','Data'],aggfunc='first').sort_index(level=1,axis=1)
s.columns=s.columns.map('{0[0]}_{0[1]}'.format)
s.reset_index(level=0).sort_index()
Out[25]:
Group Channel_Num_1 Data_1 ... Data_2 Channel_Num_3 Data_3
newkey ...
1 1 1000.0 100.0 ... 105.0 1002.0 110.0
2 0 1003.0 200.0 ... NaN NaN NaN
3 2 1004.0 400.0 ... 405.0 1006.0 410.0
4 0 1007.0 500.0 ... NaN NaN NaN
[4 rows x 7 columns]

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
df.loc[idx,'data-3'] = df.loc[idx, 'data-2']*df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)

Resources