Create a pandas column based on a lookup value from another dataframe - python-3.x

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply by the values from this lookup dataframe, matched on hour, to create a column in another dataframe which has a datetime index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the 'data-3' column from the 'data-2' column, where the weight given to 'data-2' depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?

Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
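As a quick self-contained check of the vectorized lookup (toy data copied from the question's samples; column names assumed):

```python
import pandas as pd

# Toy stand-ins shaped like the question's data
df_lookup = pd.Series([1.109248, 1.102435, 1.085014], index=[0, 1, 2])
df = pd.DataFrame(
    {'data-2': [2502, 2102, 1988, 2003]},
    index=pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 00:00:00',
                          '2015-08-09 01:00:00', '2015-08-09 02:00:00']),
)

# Pull the multiplier for every row's hour in one indexing call,
# then multiply element-wise
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
print(df['data-3'])
```

This reproduces the values shown in the answer's output above.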

I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
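An equivalent alternative, assuming df_lookup is a plain Series keyed by hour, is Series.map, which keeps everything index-aligned without a join (toy data assumed):

```python
import pandas as pd

# Hypothetical toy data mirroring the question
df_lookup = pd.Series({0: 1.109248, 1: 1.102435, 2: 1.085014})
df = pd.DataFrame(
    {'data-2': [2502, 1988, 2003]},
    index=pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 01:00:00',
                          '2015-08-09 02:00:00']),
)

# Map each row's hour to its multiplier, then multiply element-wise
multiplier = pd.Series(df.index.hour, index=df.index).map(df_lookup)
df['data-3'] = df['data-2'] * multiplier
```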


Time series resampling with column of type object

Good evening,
I want to resample an irregular time series that has a column of type object, but it does not work.
Here is my sample data:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2006-07-30 LEVETIRACETAM 1
2008-03-19 FLAVOXATE HYDROCHLORIDE 1
2010-01-01 LEVOTHYROXINE SODIUM 1
2011-04-01 BIMATOPROST 1
I tried to resample my data frame daily, but it does not work with the following code:
df3 = df1.resample('D', on='Actual start date').sum()
and here is what it gives:
Actual start date NumberShortage
2002-01-01 1
2002-01-02 0
2002-01-03 0
2002-01-04 0
2002-01-05 0
and what I want as a result:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2002-01-02 NAN 0
2002-01-03 NAN 0
2002-01-04 NAN 0
2002-01-05 NAN 0
Any ideas?
details on the data
So I start from an Excel file which contains several attributes, before it becomes a CSV file (the file can be downloaded from this web site: https://www.drugshortagescanada.ca/search?perform=0). Then I group by 'Actual start date' and 'Ingredients' to obtain 'NumberShortage'.
and here is the source code:
import pandas as pd
df = pd.read_excel("Data/Data.xlsx")
df = df.dropna(how='any')
df = df.groupby(['Actual start date','Ingredients']).size().reset_index(name='NumberShortage')
Finally, after applying your source code, here is the error it gives me:
and here is the sample excel file :
Brand name Company Name Ingredients Actual start date
ACETAMINOPHEN PHARMASCIENCE INC ACETAMINOPHEN CODEINE 2017-03-23
PMS-METHYLPHENIDATE ER PHARMASCIENCE INC METHYLPHENIDATE 2017-03-28
You rather need to reindex, using date_range as the source of new dates and the time series as a temporary index:
df['Actual start date'] = pd.to_datetime(df['Actual start date'])
(df
 .set_index('Actual start date')
 .reindex(pd.date_range(df['Actual start date'].min(),
                        df['Actual start date'].max(), freq='D'))
 .fillna({'NumberShortage': 0}, downcast='infer')
 .reset_index()
)
output:
index Ingredients NumberShortage
0 2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
1 2002-01-02 NaN 0
2 2002-01-03 NaN 0
3 2002-01-04 NaN 0
4 2002-01-05 NaN 0
... ... ... ...
3373 2011-03-28 NaN 0
3374 2011-03-29 NaN 0
3375 2011-03-30 NaN 0
3376 2011-03-31 NaN 0
3377 2011-04-01 BIMATOPROST 1
[3378 rows x 3 columns]
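A runnable sketch of the same reindex idea on a two-row toy frame (column names taken from the question, data made up):

```python
import pandas as pd

# Toy frame shaped like the grouped shortage data
df = pd.DataFrame({
    'Actual start date': pd.to_datetime(['2002-01-01', '2002-01-04']),
    'Ingredients': ['LEVOBUNOLOL HYDROCHLORIDE', 'LEVOTHYROXINE SODIUM'],
    'NumberShortage': [1, 1],
})

full_range = pd.date_range(df['Actual start date'].min(),
                           df['Actual start date'].max(), freq='D')
out = (df.set_index('Actual start date')
         .reindex(full_range)                 # missing days become NaN rows
         .fillna({'NumberShortage': 0})       # but the count should be 0
         .rename_axis('Actual start date')
         .reset_index())
print(out)
```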

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift value column by one year use df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value with df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also try with df['value'].shift(freq=relativedelta(years=+1)), but it generates: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by NaNs in the value column

No, I don't think so. It's more likely because your 2nd sample contains no leap day: the error appears when a Feb 29 is shifted into a non-leap year, collapsing two dates onto one.
Reproducible error with 2 leap years:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days, leap year)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017, 2018 and 2019 all have 365 days: no Feb 29 in the range
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove the duplicated index, in our case '2021-02-28', by using resample('D') with an aggregate function: first, last, min, max, mean, sum, or a custom one:
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note, you can replace .resample(...).func by .loc[lambda x: ~x.index.duplicated()] to keep the first value for each duplicated day.
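Putting the diagnosis and the fix together, a minimal end-to-end sketch (toy data with a fixed seed; the exact values are arbitrary):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
# 2018-01-01 through 2020-12-31 contains the leap day 2020-02-29
dates = pd.date_range('2018-01-01', periods=365 * 3 + 1, freq='D')
s = pd.Series(np.random.randint(0, 100, size=len(dates)), index=dates)

shifted = s.shift(freq=pd.DateOffset(years=1))
# 2020-02-28 and 2020-02-29 both land on 2021-02-28 after the shift
assert shifted.index.duplicated().sum() == 1

# De-duplicate by keeping the last value for each day
fixed = shifted.resample('D').last()
assert not fixed.index.duplicated().any()
```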

How to get the minimum time value in a dataframe with excluding specific value

I have a dataframe in the format shown below. I want to get the minimum time value of each column and save the minimums in a list, while excluding the specific value 00:00:00 from being the minimum of any column.
df =
10.0.0.155 192.168.1.240 192.168.0.242
0 19:48:46 16:23:40 20:14:07
1 20:15:46 16:23:39 20:14:09
2 19:49:37 16:23:20 00:00:00
3 20:15:08 00:00:00 00:00:00
4 19:48:46 00:00:00 00:00:00
5 19:47:30 00:00:00 00:00:00
6 19:49:13 00:00:00 00:00:00
7 20:15:50 00:00:00 00:00:00
8 19:45:34 00:00:00 00:00:00
9 19:45:33 00:00:00 00:00:00
I tried to use the code below, but it doesn't work:
minValues = []
for column in df:
    # print(df[column])
    if "00:00:00" in df[column]:
        minValues.append(df[column].nlargest(2).iloc[-1])
    else:
        minValues.append(df[column].min())
print(df)
print(minValues)
The idea is to replace the zeros with missing values and then take the minimal timedeltas:
df1 = df.astype(str).apply(pd.to_timedelta)
s1 = df1.mask(df1.eq(pd.Timedelta(0))).min()
print (s1)
10.0.0.155 0 days 19:45:33
192.168.1.240 0 days 16:23:20
192.168.0.242 0 days 20:14:07
dtype: timedelta64[ns]
Or get the minimal datetimes and then convert the output to HH:MM:SS strings:
df1 = df.astype(str).apply(pd.to_datetime)
s2 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.strftime('%H:%M:%S')
print (s2)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
Or to times:
df1 = df.astype(str).apply(pd.to_datetime)
s3 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.time
print (s3)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
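To land the minimums in a list, as the question asked, the timedelta variant can be wrapped up like this (toy data assumed):

```python
import pandas as pd

# Toy frame mirroring the question's time strings
df = pd.DataFrame({
    '10.0.0.155': ['19:48:46', '19:45:33', '00:00:00'],
    '192.168.1.240': ['16:23:40', '16:23:20', '00:00:00'],
})

df1 = df.apply(pd.to_timedelta)               # strings -> timedeltas
s1 = df1.mask(df1.eq(pd.Timedelta(0))).min()  # hide zeros, then take min
minValues = list(s1)
print(minValues)
```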

Convert a numerical relative index (=months) to datetime

Given is a Pandas DataFrame with a numerical index representing the relative number of months:
df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1,100))
df
A B
1 NaN NaN
2 NaN NaN
3 NaN NaN
...
How can the index be converted to a DatetimeIndex by specifying a start date (e.g., 2018-11-01)?
magic_function(df, start='2018-11-01', delta='month')
A B
2018-11-01 NaN NaN
2018-12-01 NaN NaN
2019-01-01 NaN NaN
...
I would favor a general solution that also works with arbitrary deltas, e.g. daily or yearly series.
Using date_range:
idx = pd.date_range(start='2018-11-01', periods=len(df), freq='MS')
df.index = idx
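For the general version the question asked for, the frequency can simply be a parameter; magic_function and its signature are hypothetical, not a pandas API:

```python
import pandas as pd
import numpy as np

def magic_function(df, start, freq='MS'):
    """Replace a positional index with dates from `start`, spaced by `freq`."""
    out = df.copy()
    out.index = pd.date_range(start=start, periods=len(df), freq=freq)
    return out

df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1, 100))
monthly = magic_function(df, '2018-11-01', freq='MS')  # month starts
daily = magic_function(df, '2018-11-01', freq='D')     # daily series instead
```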
I'm not sure with pandas, but with plain datetime can't you just do this?
import datetime
start = datetime.date(2018, 1, 1)
months = 15
total = start.month - 1 + months
adjusted = start.replace(year=start.year + total // 12, month=total % 12 + 1)
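If dateutil is available (pandas already depends on it), relativedelta does the month arithmetic, including end-of-month clamping, in one step; this sketch is an addition, not part of the original answer:

```python
import datetime
from dateutil.relativedelta import relativedelta

start = datetime.date(2018, 1, 1)
adjusted = start + relativedelta(months=15)
print(adjusted)  # 2019-04-01
```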

Generate daterange and insert in a new column of a dataframe

Problem statement: Create a dataframe with multiple columns and populate one column with daterange series of 5 minute interval.
Tried solution:
I initially created a dataframe with just one row / 5 columns (all "NaN").
Command used to generate daterange:
rf = pd.date_range('2000-1-1', periods=5, freq='5min')
O/P of rf :
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:05:00',
'2000-01-01 00:10:00', '2000-01-01 00:15:00',
'2000-01-01 00:20:00'],
dtype='datetime64[ns]', freq='5T')
When I try to assign rf to one of the columns of df (df['column1'] = rf), it throws the exception below (copying the last lines of the traceback).
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2879, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
Though I understand the issue, I don't know the solution. I'm looking for an easy way to achieve this.
I think I'm slowly understanding the power/usage of dataframes.
Initially create a dataframe :
df = pd.DataFrame(index=range(100),columns=['A','B','C'])
Then created a date_range.
date = pd.date_range('2000-1-1', periods=100, freq='5T')
Using the "assign" function, I added the date_range as a new column to the already created dataframe (df).
df = df.assign(D=date)
Final O/P of df:
df[:5]
A B C D
0 NaN NaN NaN 2000-01-01 00:00:00
1 NaN NaN NaN 2000-01-01 00:05:00
2 NaN NaN NaN 2000-01-01 00:10:00
3 NaN NaN NaN 2000-01-01 00:15:00
4 NaN NaN NaN 2000-01-01 00:20:00
Your dataframe has only one row and you try to insert data for five rows.
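A minimal sketch of that point: build the frame and the 5-minute column with matching lengths up front (names assumed):

```python
import pandas as pd

n = 5
df = pd.DataFrame(index=range(n), columns=['A', 'B', 'C'])
# The date_range length matches the number of rows, so assignment works
df['D'] = pd.date_range('2000-01-01', periods=n, freq='5min')
```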
