Key Error while subsetting Timeseries data using index - python-3.x

I have the following time series data.
price_per_year.head()
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
price_per_year.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 782 entries, 2013-01-02 to 2015-12-31
Data columns (total 1 columns):
price 756 non-null float64
dtypes: float64(1)
memory usage: 12.2 KB
I am trying to extract data for 3 years using the code below. Why am I getting KeyError: '2014' when the data, as shown below, clearly contains the year '2014'? I appreciate any inputs.
price_per_year['2014'].head()
price
date
2014-01-01 NaN
2014-01-02 39.59
2014-01-03 40.12
2014-01-06 39.93
2014-01-07 40.92
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)

KeyError: '2014'
The line price_per_year.loc['2014', ['price']] works fine when used independently outside the for loop, while price_per_year['price'][year] doesn't work inside the for loop.
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year['price'][year].reset_index(drop=True)

KeyError: 'price'
Both price_per_year.loc[price_per_year.index.year == 2014, ['price']] used independently outside the for loop and price_per_year.loc[price_per_year.index.year == year, ['price']] used inside the for loop give errors.
for year in ['2013', '2014', '2015']:
    price_per_year.loc[price_per_year.index.year == '2014', ['price']].reset_index(drop=True)

TypeError: Cannot convert input [False] of type <class 'bool'> to Timestamp

The problem in your first code is that you overwrite price_per_year inside the loop: after the first iteration it has a plain RangeIndex instead of a DatetimeIndex, so the next partial-string lookup fails with KeyError: '2014'. Select each year into a new variable instead, using partial string indexing on the 'price' column:
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    s = price_per_year['price'][year].reset_index(drop=True).rename(year)
    prices = pd.concat([prices, s], axis=1)

print (prices)
2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
Another, better solution is to reshape with cumcount and unstack:
print (df)
price
date
2013-01-02 20.08
2013-01-03 19.78
2014-01-02 19.86
2014-01-03 19.40
2015-01-02 19.66
2015-01-03 19.66
y = df.index.year
# rows: position within each year (via cumcount), columns: the year itself
df = df.set_index([df.groupby(y).cumcount(), y])['price'].unstack()
print (df)
date 2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
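
As a side note on the third error: DatetimeIndex.year yields integers, so the boolean mask must compare against an int, not a string. A minimal sketch of that variant, reusing the question's price_per_year (assumed unchanged):

prices = pd.DataFrame()
for year in [2013, 2014, 2015]:
    # integers here, because price_per_year.index.year is an integer array
    s = price_per_year.loc[price_per_year.index.year == year, 'price']
    prices = pd.concat([prices, s.reset_index(drop=True).rename(str(year))], axis=1)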

Related

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift the value column by one year using df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value via df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also tried df['value'].shift(freq=relativedelta(years=+1)), but it raises: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by NaNs in the value column

No, I don't think so. It's more likely because your second sample spans no leap day: shifting 29 February forward one year lands on 28 February, which collides with the shifted 28 February and duplicates the index.
Reproducible error with a range that includes a leap day:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days, includes 2020-02-29)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017, 2018 and 2019 (365 days each, plus 2020-01-01): no leap day in range
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
Solution
Obviously, the solution is to remove the duplicated index, in our case '2021-02-28', by using resample('D') with an aggregate function (first, last, min, max, mean, sum, or a custom one):
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note: you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()] to keep the first occurrence of each duplicated label.
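
A minimal runnable sketch of that variant, assuming the 2018-2020 frame from above reduced to a Series:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20180101', periods=365*3+1)  # includes 2020-02-29
s = pd.Series(np.random.randint(0, 100, size=len(dates)), index=dates)

shifted = s.shift(freq=pd.DateOffset(years=1))
# keep the first of each duplicated label instead of aggregating
deduped = shifted.loc[~shifted.index.duplicated(keep='first')]
assert deduped.index.is_unique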

Pandas dataframe groupby by day and find first value that exceeds value at fixed time

I have a datetime-indexed dataframe with several years of intraday data in 2-minute increments. I want to group by day and, within each day, find the first row whose price exceeds the price at 06:30:00.
df:
Price
2009-10-12 06:30:00 904
2009-10-12 06:32:00 904
2009-10-12 06:34:00 904.5
2009-10-12 06:36:00 905
2009-10-12 06:38:00 905.5
2009-10-13 06:30:00 901
2009-10-13 06:32:00 901
2009-10-13 06:34:00 901
2009-10-13 06:36:00 902
2009-10-13 06:38:00 903
I've tried using .groupby and .apply with a lambda function to group by day and include all rows that exceed the value at 06:30:00, but I get an error.
onh = pd.to_datetime('6:30:00').time()
onhBreak = df.groupby(df.index.date).apply(lambda x: x[x > x.loc[onh]])
ValueError: Can only compare identically-labeled Series objects
Desired output:
Price
2009-10-12 06:34:00 904.5
2009-10-13 06:36:00 902
*If these rows come out as values of a groupby, that would also be fine.
Any help is appreciated.
Here we need groupby with idxmax:
df = df.to_frame('value')  # if your data is a Series
df['check'] = df.index.time > onh
subdf = df.loc[df.groupby(df.index.date)['check'].idxmax()]
Out[237]:
value check
2009-10-12 00:00:00 900.0 False
2020-05-29 13:08:00 3052.0 True
subdf = subdf[subdf['check']]
We can do:
mask_date = df['Date'].dt.time.gt(pd.to_datetime('06:30:00').time())
df_filtered = df.loc[mask_date.groupby(df['Date'].dt.date).idxmax()]
print(df_filtered)
Output
Date Value
1 2009-10-12 06:32:00 904.0
6 2009-10-13 06:32:00 901.0
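
Both answers key on the time rather than the price, so here is a hedged sketch of the asker's stated goal, assuming 06:30:00 is the first row of each day: compare each price with that day's 06:30 price and keep the first row that exceeds it.

import pandas as pd

idx = pd.to_datetime(['2009-10-12 06:30:00', '2009-10-12 06:32:00',
                      '2009-10-12 06:34:00', '2009-10-12 06:36:00',
                      '2009-10-13 06:30:00', '2009-10-13 06:32:00',
                      '2009-10-13 06:34:00', '2009-10-13 06:36:00'])
df = pd.DataFrame({'Price': [904, 904, 904.5, 905,
                             901, 901, 901, 902]}, index=idx)

days = df.index.date
base = df.groupby(days)['Price'].transform('first')      # the 06:30 price per day
exceeds = df['Price'] > base
first_hit = exceeds.astype(int).groupby(days).idxmax()    # first True per day
result = df.loc[first_hit]
result = result[exceeds.loc[first_hit].values]            # drop days with no exceedance
print(result)                                             # 06:34 -> 904.5, 06:36 -> 902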

How to merge month, day, year columns into a date column?

The date is in separate columns
Month Day Year
8 12 1993
8 12 1993
8 12 1993
I want to merge them into one column:
Date
8/12/1993
8/12/1993
8/12/1993
I tried
df_date = df.Timestamp((df_filtered.Year*10000+df_filtered.Month*100+df_filtered.Day).apply(str),format='%Y%m%d')
I get this error
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
Using pd.to_datetime with astype(str)
1. as string type:
df['Date'] = pd.to_datetime(df['Month'].astype(str) + df['Day'].astype(str) + df['Year'].astype(str), format='%m%d%Y').dt.strftime('%m/%d/%Y')
Month Day Year Date
0 8 12 1993 08/12/1993
1 8 12 1993 08/12/1993
2 8 12 1993 08/12/1993
2. as datetime type:
df['Date'] = pd.to_datetime(df['Month'].astype(str) + df['Day'].astype(str) + df['Year'].astype(str), format='%m%d%Y')
Month Day Year Date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
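
One caveat on the concatenation above: unpadded single-digit components can be ambiguous (e.g. Month=1, Day=11, Year=1993 gives '1111993', which '%m%d%Y' parses as November 1). A hedged variant that zero-pads first:

date_str = (df['Month'].astype(str).str.zfill(2)
            + df['Day'].astype(str).str.zfill(2)
            + df['Year'].astype(str))
df['Date'] = pd.to_datetime(date_str, format='%m%d%Y')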
Here is the solution:
df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})

# This way dates will be a DataFrame
dates = df.apply(lambda row:
                 pd.Series(pd.Timestamp(row[2], row[0], row[1]), index=['Date']),
                 axis=1)

# And this way dates will be a Series:
# dates = df.apply(lambda row:
#                  pd.Timestamp(row[2], row[0], row[1]),
#                  axis=1)
The apply method builds a new Series or DataFrame by applying the provided function (here a lambda) to each row and joining the results. You can read about the apply method in the official pandas documentation, and about lambda expressions in the Python docs.
EDIT:
@JohnClements suggested a better solution, using the pd.to_datetime method:
dates = pd.to_datetime(df).to_frame('Date')
Also, if you want your output to be a string, you can use
dates = df.apply(lambda row: f"{row[2]}/{row[0]}/{row[1]}",
                 axis=1)
You can try:
df = pd.DataFrame({'Month': [8,8,8], 'Day': [12,12,12], 'Year': [1993, 1993, 1993]})
df['date'] = pd.to_datetime(df)
Result:
Month Day Year date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
Month 3 non-null int64
Day 3 non-null int64
Year 3 non-null int64
date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 176.0 bytes
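
For reference, pd.to_datetime also accepts an explicit mapping, which avoids relying on column-name detection; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})
# explicit year/month/day mapping, independent of the frame's column names
df['date'] = pd.to_datetime({'year': df['Year'], 'month': df['Month'], 'day': df['Day']})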

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply values from this lookup dataframe to create a column in another dataframe, which has a datetime index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to 'data-2' depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]

Can someone suggest a faster way?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
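
A third option, assuming df_lookup is a Series indexed by hour (0-23), maps the hours directly:

# build an hour Series aligned with df's index, then look up the multiplier
hours = pd.Series(df.index.hour, index=df.index)
df['data-3'] = df['data-2'] * hours.map(df_lookup)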

Error during column conversion in a pandas data frame (Python 3)

I have a big problem with pandas. I have a large data frame containing:
Ref_id PRICE YEAR MONTH BRAND
100000 '5000' '2012' '4' 'FORD'
100001 '10000' '2015' '5' 'MERCEDES'
...
I want to convert my PRICE, YEAR and MONTH columns, but when I use .astype(int) or .apply(lambda x: int(x)) on a column I receive a ValueError. My data frame has 1.8 million rows.
ValueError: invalid literal for int() with base 10: 'PRICE'
So I don't understand why pandas wants to convert the name of the column.
Could you explain why?
Best,
C.
Try this:
In [59]: cols = 'PRICE YEAR MONTH'.split()
In [60]: cols
Out[60]: ['PRICE', 'YEAR', 'MONTH']
In [61]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [62]: df
Out[62]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012 4 FORD
1 100001 10000.0 2015 5 MERCEDES
2 100002 NaN 2016 6 AUDI
Reproducing your error:
In [65]: df
Out[65]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 100002 PRICE 2016 6 AUDI # note the `PRICE` value!
In [66]: df['PRICE'].astype(int)
...
skipped
...
ValueError: invalid literal for int() with base 10: 'PRICE'
As @jezrael noted in a comment, you most probably have "bad" (unexpected) values in your data set.
You can use one of the following techniques to clean it up:
In [155]: df
Out[155]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 Ref_id PRICE YEAR MONTH BRAND
3 100002 15000 2016 5 AUDI
In [156]: df.dtypes
Out[156]:
Ref_id object
PRICE object
YEAR object
MONTH object
BRAND object
dtype: object
In [157]: df = df.drop(df.loc[df.PRICE == 'PRICE'].index)
In [158]: df
Out[158]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [160]: df
Out[160]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [161]: df.dtypes
Out[161]:
Ref_id object
PRICE int64
YEAR int64
MONTH int64
BRAND object
dtype: object
or simply:
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [165]: df
Out[165]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
2 Ref_id NaN NaN NaN BRAND
3 100002 15000.0 2016.0 5.0 AUDI
and then .dropna(how='any') if you know there were no NaNs in your original data set:
In [166]: df = df.dropna(how='any')
In [167]: df
Out[167]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
3 100002 15000.0 2016.0 5.0 AUDI
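
A common source of such stray header rows is concatenating exported CSV files; a hypothetical reproduction and cleanup (file contents invented for illustration):

import io
import pandas as pd

# two CSV exports joined with `cat`, so the second header lands in the data
raw = ("Ref_id,PRICE,YEAR,MONTH,BRAND\n"
       "100000,5000,2012,4,FORD\n"
       "Ref_id,PRICE,YEAR,MONTH,BRAND\n"
       "100001,10000,2015,5,MERCEDES\n")
df = pd.read_csv(io.StringIO(raw))

df = df[df['PRICE'] != 'PRICE']  # drop the embedded header row
for c in ['PRICE', 'YEAR', 'MONTH']:
    df[c] = pd.to_numeric(df[c])
print(df.dtypes)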
