Python Error: truth value of Series is ambiguous in if-statement - python-3.x

I have a problem comparing the values of two dataframes on a per-day basis.
df1 contains the minimum temperature values per day and df2 contains the maximum temperature values per day.
The dfs look like this (TS_TIMESTAMP is the index column):
df1:
> TS_TIMESTAMP Date TREND_VALUE
> 2019-04-03 18:48:10.970 2019-04-02 8.340000
> 2019-04-04 16:49:23.320 2019-04-03 7.840000
> 2019-04-05 13:19:33.550 2019-04-04 7.480000
df2:
> TS_TIMESTAMP Date TREND_VALUE
> 2019-04-03 18:48:10.970 2019-04-02 19.340000
> 2019-04-04 16:49:23.320 2019-04-03 18.840000
> 2019-04-05 13:19:33.550 2019-04-04 18.480000
I would like to calculate the difference between max_value and min_value per day with a function (so that I can simply run the calculation on a number of different files).
This is what I came up with:
def temp_diff(df1, df2):
    for row in df1, df2:
        if df1.Date == df2.Date:
            print(df2.TREND_VALUE - df1.TREND_VALUE)
If I run this function I get this error message for the if-statement:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'm not sure how to change my def appropriately.
Thanks for your help!

Assuming you want an all condition, this will print only when all the values in df1['Date'] and df2['Date'] are equal. If that's what you want, here you go:
def temp_diff(df1, df2):
    if (df1.Date == df2.Date).all():
        print(df1.TREND_VALUE - df2.TREND_VALUE)
I have a feeling you want to iterate over each row and check whether there's a match between df1['Date'] and df2['Date'], so that if there is, the difference is printed, and otherwise that row is skipped. Let me know in the comments if that's what you want, then I'll edit this answer.
import pandas as pd

a = {'Date': ['2019-04-02', '2019-04-03', '2019-04-04'], 'Values': [8, 7, 4]}
b = {'Date': ['2019-04-02', '2019-04-03', '2019-04-04'], 'Values': [19, 18, 17]}
df_1 = pd.DataFrame(a)
df_2 = pd.DataFrame(b)

def temp_diff(df1, df2):
    if (df1.Date == df2.Date).all():
        print(df1.Values - df2.Values)

temp_diff(df_1, df_2)
Output:
0 -11
1 -11
2 -13
Name: Values, dtype: int64
Edit
Maybe this is what you are looking for?
import pandas as pd
import numpy as np

a = {'Date': ['2019-04-02', '2019-04-03', '2019-04-04'], 'TREND_VALUE': [8, 7, 4]}
b = {'Date': ['2019-04-02', '2019-04-03', '2019-04-05'], 'TREND_VALUE': [19, 18, 17]}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)

df1['T_Amplitude'] = np.where((df1['Date'] == df2['Date']), df1['TREND_VALUE'] - df2['TREND_VALUE'], np.nan)
print(df1)
Output:
Date TREND_VALUE T_Amplitude
0 2019-04-02 8 -11.0
1 2019-04-03 7 -11.0
2 2019-04-04 4 NaN

Assuming the number of rows in the two dataframes is the same, you can simply subtract the two columns:
df1.TREND_VALUE - df2.TREND_VALUE
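If the rows are not guaranteed to line up one-to-one, pairing the values on Date before subtracting is more robust. A minimal sketch of that idea (not taken from the answers above; column names follow the question's frames):
import pandas as pd

# daily minima and maxima, keyed by Date
df1 = pd.DataFrame({'Date': ['2019-04-02', '2019-04-03', '2019-04-04'],
                    'TREND_VALUE': [8.34, 7.84, 7.48]})
df2 = pd.DataFrame({'Date': ['2019-04-02', '2019-04-03', '2019-04-04'],
                    'TREND_VALUE': [19.34, 18.84, 18.48]})

# pair rows by Date, then subtract; dates present in only one frame are dropped
merged = df1.merge(df2, on='Date', suffixes=('_min', '_max'))
merged['T_Amplitude'] = merged['TREND_VALUE_max'] - merged['TREND_VALUE_min']
print(merged[['Date', 'T_Amplitude']])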

Related

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the column labels after reading the Excel file. There are 10 columns in the df variable after reading the Excel file, and 7 of the column labels are dates.
My input data set is an Excel file which changes every day, and I want to update it automatically. In Excel, some column labels are dates like 11-Nov-2020, 12-Nov-2020, but after reading the Excel file they become 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep the column labels as 11-Nov-2020, 12-Nov-2020 while reading the Excel file with pd.read_excel if possible, or else I need to convert them later.
I am very new to Python. Looking forward to your support.
Thanks to those who have already come forward to help me.
You can of course use the standard Python methods to parse the date values, but I would not recommend it, because this way you end up with Python datetime objects and not with the pandas representation of dates. That means it consumes more space, is probably not as efficient, and you can't use the pandas methods to access e.g. the year. I'll show you what I mean below.
In case you want to avoid the naming issue with your column names, you might want to prevent pandas from automatically assigning the names, read the first line as data, and fix the names yourself (a sketch follows; see also the column naming section below for another way to do it).
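To illustrate that idea, a minimal sketch (the file name dates.xlsx is just a placeholder, and it assumes the title cells are stored as real dates in the sheet): read the sheet without a header so the title row stays as data, build the names yourself, then drop that row:
import pandas as pd

raw = pd.read_excel('dates.xlsx', header=None)  # placeholder file name
first_row = raw.iloc[0]
# format date-like titles, keep everything else as-is
raw.columns = [c.strftime('%d-%b-%Y') if hasattr(c, 'strftime') else c
               for c in first_row]
df = raw.iloc[1:].reset_index(drop=True)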
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta

df = pd.DataFrame(dict(id=range(10),
                       date_string=[str(datetime.now() + timedelta(days=d)) for d in range(10)]))

# test the python way: parse the strings into datetime objects
df['date_val_python'] = df['date_string'].map(lambda s: datetime.fromisoformat(s))

# use the pandas way (btw. if you want to explicitly
# specify the format, you can use the format= keyword)
df['date_val_pandas'] = pd.to_datetime(df['date_string'])

df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see, date_val_python has type object; this is because it contains Python objects of class datetime, while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
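For completeness, a small sketch of the format= keyword mentioned above (the format string matches what str(datetime.now()) produces, assuming the microsecond part is present):
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame(dict(date_string=[str(datetime.now() + timedelta(days=d)) for d in range(3)]))
df['date_val_pandas'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d %H:%M:%S.%f')
print(df.dtypes)  # date_val_pandas is datetime64[ns]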
The column naming part:
# read your dataframe as usual
df = pd.read_excel('c:/scratch/tmp/dates.xlsx')

rename_dict = dict()
for old_name in df.columns:
    if hasattr(old_name, 'strftime'):
        new_name = old_name.strftime('%d-%b-%Y')
        rename_dict[old_name] = new_name
if len(rename_dict) > 0:
    df.rename(columns=rename_dict, inplace=True)
This works in case your column titles are stored as actual dates, which I suppose is true, because you get a time part after importing them.
strftime of the datetime module is the function you need:
If dt is a datetime object, you can do
dt.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
    if col in datetime_cols_list:
        dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas, this is just what came out the top of my head.
Example:
import pandas as pd
import numpy as np

np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2016-02-24', periods=5, freq='T')
other_dt_col = rng
df = pd.DataFrame({'Date': rng, 'Date2': other_dt_col, 'Val': np.random.randn(len(rng))})
print (df)
# Output:
# Date Date2 Val
# 0 2016-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2016-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2016-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2016-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2016-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
    if col in datetime_cols_list:
        df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2016 24-Feb-2016 1.764052
# 1 24-Feb-2016 24-Feb-2016 0.400157
# 2 24-Feb-2016 24-Feb-2016 0.978738
# 3 24-Feb-2016 24-Feb-2016 2.240893
# 4 24-Feb-2016 24-Feb-2016 1.867558

Formatting date from a dataframe of strings

I have data that looks like this:
0 2/18/2020
1 9/30/2019
Each line is a str
I want it to keep its type (string, not datetime) but to look like this:
0 2020-02-18
1 2019-09-30
How can I achieve this format for all rows in the dataframe?
Thank you!
Use split to separate the MM, DD, and YYYY and then change the order of the lists as preferred. Make sure single digit months and days receive a leading zero, before finally joining the lists back into one string using - instead of /:
df = pd.DataFrame({'col': ['2/18/2020', '9/30/2019']})
df['col'] = (df.col
             .apply(lambda x: x.split('/')[2:] + x.split('/')[:2])
             .apply(lambda x:
                    x[0:1]
                    + (x[1:2] if len(x[1]) > 1 else ['0' + x[1]])
                    + (x[2:3] if len(x[2]) > 1 else ['0' + x[2]]))
             .apply(lambda x: '-'.join(x)))
df
col
0 2020-02-18
1 2019-09-30
You can use:
import pandas as pd
df = pd.DataFrame({'date': ['2/18/2020', '9/30/2019']})
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')
print(df)
Result:
date
0 2020-02-18
1 2019-09-30

Pandas timestamp

I'd like to group my data per day and calculate the daily mean of the sentiment.
I have a problem with the pandas dataframe because I am not able to transform my date column into a timestamp to use the groupby() function. Here is my data sample:
sentiment date
0 1 2018-01-01 07:37:07+00:00
1 0 2018-02-12 06:57:27+00:00
2 -1 2018-09-18 06:23:07+00:00
3 1 2018-09-18 07:23:10+00:00
4 0 2018-02-12 06:21:08+00:00
I think you need resample - it creates a full DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('D',on='date')['sentiment'].mean()
# if you want to remove NaN rows
df1 = df.resample('D',on='date')['sentiment'].mean().dropna()
Or use groupby and aggregate mean with dates, or with floor to remove the times:
df2 = df.groupby(df['date'].dt.date)['sentiment'].mean()
#DatetimeIndex in output
df2 = df.groupby(df['date'].dt.floor('d'))['sentiment'].mean()
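A quick check of the groupby variant against the sample data from the question (the commented output is just the per-day arithmetic mean):
import pandas as pd

df = pd.DataFrame({'sentiment': [1, 0, -1, 1, 0],
                   'date': ['2018-01-01 07:37:07+00:00', '2018-02-12 06:57:27+00:00',
                            '2018-09-18 06:23:07+00:00', '2018-09-18 07:23:10+00:00',
                            '2018-02-12 06:21:08+00:00']})
df['date'] = pd.to_datetime(df['date'])
print(df.groupby(df['date'].dt.date)['sentiment'].mean())
# 2018-01-01    1.0
# 2018-02-12    0.0
# 2018-09-18    0.0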

Roll-off profile stacking data frames

I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df= pd.DataFrame({'date':['2017-12-31','2017-12-31'],'type':['Asset','Liab'],'Amount':[100,-100],'Maturity Date':['2019-01-02','2018-01-01']})
df
I am trying to build a roll-off profile by checking if the 'Maturity Date' is greater than a 'date' in the future. I am trying to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrarily large number of dates without repeating code.
I think you need numpy.tile to repeat the indices and assign them to a new column, then filter by boolean indexing and sort by sort_values:
import numpy as np

d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])

#generate first months and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])

#repeat rows by indexing the repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
print (df1)
Amount Maturity Date date type
0 100 2019-01-02 2017-12-31 Asset
1 -100 2018-01-01 2017-12-31 Liab
0 100 2019-01-02 2018-01-31 Asset
0 100 2019-01-02 2018-02-28 Asset
0 100 2019-01-02 2018-03-31 Asset
0 100 2019-01-02 2018-06-30 Asset
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
# Amount Maturity Date date type
# 1 200 2018-03-15 2017-12-31 Asset
# 0 100 2019-01-02 2017-12-31 Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
# date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date', direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater, so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')

dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date', direction='forward')

result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
Amount Maturity Date date type
1 200 2018-03-15 2017-12-31 Asset
0 200 2018-03-15 2018-01-31 Asset
1 200 2018-03-15 2018-02-28 Asset
0 100 2019-01-02 2017-12-31 Asset
2 100 2019-01-02 2018-03-31 Asset
3 100 2019-01-02 2018-06-30 Asset

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'letter': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give the dates in df1 an event_id from df2, but only if the date falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? Meaning, can't df1 and df2 just be lists of tuples or lists of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(); what you get is an array of tuples, basically with the same structure as the dataframe.
Now run your algorithm, but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2.
Use a third list of tuples d3 where you store the newly created data (each tuple is a row).
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, myKwArgs**)
This will speed up your code significantly, I'd assume by more than 100-1000%. It does increase memory usage though, so if you are low on memory, try to avoid the pandas dataframes altogether or dereference the unused pandas frames df1, df2 once you have used them to create the records (and if you run into problems, call gc manually).
EDIT: Here is a version of your code using the procedure above:
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
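A slightly fuller sketch of the same idea, including the to_records / from_records round trip described above (it reuses the df1 and df2 built in the question; index=False keeps the tuple positions aligned with the column order, and the column names passed to from_records are my own choice):
import pandas as pd

d1 = df1.to_records(index=False)  # tuples of (Date, measurement)
d2 = df2.to_records(index=False)  # tuples of (Date, event label)

d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if d2[n - 1][0] < date <= d2[n][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1

df3 = pd.DataFrame.from_records(d3, columns=['Date', 'event_id', 'measurement'])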
You can try the df.apply() method to achieve this; refer to pandas.DataFrame.apply. I think my code will work faster than yours.
My approach:
Merge the two dataframes df1 and df2 and create a new one, df3, with
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse.
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() method to apply to each row in df3.
new_event_id = np.nan
def set_event_date(df3):
    global new_event_id
    if df3.event_id is not np.nan:
        new_event_id = df3.event_id
    return new_event_id
Apply set_event_date() to each row in df3.
df3['new_event_id'] = df3.apply(set_event_date,axis=1)
Final Output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
Let me know once you have tried my solution whether it works faster than yours.
Thanks.
