I have a dataframe which looks like this:
            date      symbol  numerator  denominator
4522  2021-10-06      PAG.SG        1.0           18
1016  2020-11-23       IPA.V        1.0            5
412   2020-04-17      LRK.AX        1.0           30
1884  2021-06-03  BOUVETO.ST        1.0            1
2504  2021-04-28    VKGYO.IS        1.0          100
3523  2021-07-08   603355.SS        1.0            1
3195  2021-08-23        IDAI        1.0            1
3238  2021-08-19    6690.TWO        1.0         1000
3430  2021-07-19       CAXPD        1.0           10
2642  2021-04-15   035720.KS        1.0            1
dtypes:
date: object
symbol: object
numerator: float64
denominator: int64
I am trying to use DataFrame.assign to add a classifier column to this df in the form of
df = df.assign(category = ['forward' if numerator > denominator else 'reverse' for numerator in df[['numerator', 'denominator']]])
But I'm receiving a TypeError stating:
TypeError: Invalid comparison between dtype=int64 and str
I have tried casting them explicitly, with:
df = df.assign(category = ['forward' if df['numerator'] > df['denominator'] else 'reverse' for df['numerator'] in df])
But receive another TypeError stating:
TypeError: '>' not supported between instances of 'str' and 'int'
Which is confusing because I'm not comparing strings, I'm comparing int and float.
Any help would be greatly appreciated.
You can still do that with np.where. Note that your list comprehension iterates over the DataFrame subset, and iterating over a DataFrame yields its column names (strings), which is where the string-vs-int comparison comes from.
import numpy as np

df = df.assign(category=np.where(df['numerator'] > df['denominator'],
                                 'forward',
                                 'reverse'))
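If you prefer the list-comprehension form, iterate over the two columns' values in parallel rather than over the frame; a sketch equivalent to the np.where version:
df = df.assign(category=['forward' if n > d else 'reverse'
                         for n, d in zip(df['numerator'], df['denominator'])])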
I would like to vectorize my dataframe with NumPy arrays, but I got an error.
Here is the code:
Here I initialize my dataframe
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': [True, False, True, False],
                    'E': pd.Categorical(["test", "Draft", "test", "Draft"]),
                    'F': 'foo'})
df2
output:
     A          B    C      D      E    F
0  1.0 2013-01-02  1.0   True   test  foo
1  1.0 2013-01-02  1.0  False  Draft  foo
2  1.0 2013-01-02  1.0   True   test  foo
3  1.0 2013-01-02  1.0  False  Draft  foo
Here I define the function to apply to dataframe's columns
def IsBillingValid2(xE, yBilling):
    if xE not in ['Draft', 'Cancelled'] and yBilling == True:  # Order Edited
        return True
    else:
        return False
Here I launch my function
df2['BillingPostalCode_Det_StageName_Det']=IsBillingValid(df2['E'].values,df2['D'].values)
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<command-2946139111570059> in <module>
16 return False
17
---> 18 df2['BillingPostalCode_Det_StageName_Det']=IsBillingValid(df2['E'].values,df2['D'].values)
19
<command-2041881674588848> in IsBillingValid(xStageName, yBilling)
207 def IsBillingValid(xStageName,yBilling):
208
--> 209 if(xStageName not in ['Draft','Cancelled'] and yBilling==True): #Order Edited
210 return True
211
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thanks for your help
You don't need apply, especially when you want a vectorized operation. Evaluating xE not in ['Draft', 'Cancelled'] against a whole array is what raises the ambiguous-truth-value error.
Use pandas.Series.isin:
df2['BillingPostalCode_Det_StageName_Det'] = ~df2["E"].isin({"Draft", "Cancelled"}) & df2["D"]
print(df2)
Output:
A B C D E F BillingPostalCode_Det_StageName_Det
0 1.0 2013-01-02 1.0 True test foo True
1 1.0 2013-01-02 1.0 False Draft foo False
2 1.0 2013-01-02 1.0 True test foo True
3 1.0 2013-01-02 1.0 False Draft foo False
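For context, a minimal reproduction sketch in plain NumPy, using the question's values, showing where the error comes from and the element-wise equivalent:
import numpy as np

e = np.array(["test", "Draft", "test", "Draft"])
d = np.array([True, False, True, False])

# `e not in ['Draft', 'Cancelled']` tries to reduce a whole boolean array to a
# single True/False, which raises the ambiguous-truth-value ValueError.
# The vectorized equivalent keeps everything element-wise:
result = ~np.isin(e, ['Draft', 'Cancelled']) & d
print(result)  # [ True False  True False]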
I'm trying to clip a rioxarray dataset to a shapefile, but get the following error:
> data_clipped = data.rio.clip(shape.geometry.apply(mapping))
MissingCRS: CRS not found. Please set the CRS with 'set_crs()' or 'write_crs()'. Data variable: precip
This error seems straightforward, but I can't figure out which CRS needs to be set. Both the dataset and the shapefile have CRS values that rio can find:
> print(data.rio.crs)
EPSG:4326
> print(shape.crs)
epsg:4326
The dataarray within the dataset, called 'precip', does not have a CRS, but it also doesn't seem to respond to the set_crs() command:
> print(data.precip.rio.crs)
None
> data.precip.rio.set_crs(data.rio.crs)
> print(data.precip.rio.crs)
None
What am I missing here?
For reference, the rioxarray set_crs() documentation shows set_crs() working on data arrays, unlike my experience with data.precip.
My data, in case I have something unusual:
> print(data)
<xarray.Dataset>
Dimensions: (x: 541, y: 411)
Coordinates:
* y (y) float64 75.0 74.9 74.8 74.7 74.6 ... 34.3 34.2 34.1 34.0
* x (x) float64 -12.0 -11.9 -11.8 -11.7 ... 41.7 41.8 41.9 42.0
time object 2020-01-01 00:00:00
spatial_ref int64 0
Data variables:
precip (y, x) float64 nan nan nan ... 1.388e-17 1.388e-17 1.388e-17
Attributes:
Conventions: CF-1.6
history: 2021-01-05 01:36:52 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...
> print(shape)
ID name orgn_name geometry
0 Albania Shqipëria MULTIPOLYGON (((19.50115 40.96230, 19.50563 40...
1 Andorra Andorra POLYGON ((1.43992 42.60649, 1.45041 42.60596, ...
2 Austria Österreich POLYGON ((16.00000 48.77775, 16.00000 48.78252...
This issue is resolved if set_crs() is chained into the same statement as the clip operation; evidently set_crs() returns a new object rather than modifying in place, which is why the earlier standalone call appeared to have no effect:
data_clipped = data.precip.rio.set_crs('WGS84').rio.clip(shape.geometry.apply(mapping))
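As an alternative (a sketch assuming the same data): the error message also mentions write_crs(), which records the CRS on the object it returns via the spatial_ref coordinate, so the setting persists across statements:
# write_crs() also returns a new object, so capture the result; the CRS is
# then stored on the dataset and inherited by the 'precip' variable.
data = data.rio.write_crs("EPSG:4326")
data_clipped = data.precip.rio.clip(shape.geometry.apply(mapping))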
I have this dataframe:
bprice cprice strike irate tleft
0 26436.90 91.70 26400 10 0.5
1 26423.00 98.90 26400 10 0.5
2 26436.00 90.00 26400 10 0.5
3 26416.35 103.60 26400 10 0.5
I am applying a function from the mibian library (http://code.mibian.net/), e.g. c = mibian.BS([1.4565, 1.45, 1, 30], callPrice=0.0359), to my dataframe:
df2['IV'] = df2.apply(lambda df: mb.BS([df2['bprice'],26400,10,df2['tleft']],
callPrice=df2['cprice']).impliedVolatility, axis=1)
But I am getting the error
TypeError: ("cannot convert the series to <class 'float'>", 'occurred at index 0')
I suggest you wrap the function and do something like this:
def my_func(x):
    return mb.BS([x['bprice'], 26400, 10, x['tleft']],
                 callPrice=x['cprice']).impliedVolatility

df2['IV'] = df2.apply(my_func, axis=1)
The way you implemented it, you actually called mb.BS with whole columns (pd.Series) of the dataframe rather than with each row's values. See the pandas apply documentation here.
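A minimal illustration of the difference, using the question's columns:
row = df2.iloc[0]           # what apply(..., axis=1) hands to the function
print(type(df2['bprice']))  # pandas Series -- the whole column, which
                            # mibian cannot convert to a float
print(type(row['bprice']))  # a scalar (numpy.float64), which it can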
I'm trying to run a BG/NBD model using the lifetimes library.
All my analysis is based on the following example, but with my own data:
https://towardsdatascience.com/whats-a-customer-worth-8daf183f8a4f
Somehow I receive the following error, and after reading 50+ Stack Overflow posts without finding any answer, I'd like to ask my own question:
What am I doing wrong? :(
Thanks in Advance! :)
I tried changing the dtype of every column in my dataframe, with no change in the outcome.
df2 = df
df2.head()
person_id effective_date accounting_sales_total
0 219333 2018-08-04 1049.89
1 333219 2018-12-21 4738.97
2 344405 2018-07-16 253.99
3 455599 2017-07-14 2199.96
4 766665 2017-08-15 1245.00
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-85-cdcb400098dc> in <module>()
7 summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
8 calibration_period_end='2017-12-31',
----> 9 observation_period_end='2018-12-31')
10
11 print(summary_cal_holdout.head())
/usr/local/envs/py3env/lib/python3.5/site-packages/lifetimes/utils.py in calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end, freq, datetime_format, monetary_value_col)
122 combined_data.fillna(0, inplace=True)
123
--> 124 delta_time = (to_period(observation_period_end) - to_period(calibration_period_end)).n
125 combined_data["duration_holdout"] = delta_time
126
AttributeError: 'int' object has no attribute 'n'
This actually runs fine as it is :)
data = {'person_id':[219333, 333219, 344405, 455599, 766665],
'effective_date':['2018-08-04', '2018-12-21', '2018-07-16', '2017-07-14', '2017-08-15'],
'accounting_sales_total':[1049.89, 4738.97, 253.99, 2199.96, 1245.00]}
df2 = pd.DataFrame(data)
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
Returns:
frequency_cal recency_cal T_cal frequency_holdout \
person_id
455599 0.0 0.0 170.0 0.0
766665 0.0 0.0 138.0 0.0
duration_holdout
person_id
455599 365
766665 365
Which means your issue is probably with package versioning, try:
pip install lifetimes --upgrade
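To confirm which version is installed before and after upgrading, a generic check (if your version doesn't expose the attribute, pip show lifetimes from the shell gives the same information):
import lifetimes
print(lifetimes.__version__)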
This is what I do:
import pandas as pd
t = pd.DataFrame(data={'i1': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i2': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'x': [1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)
t.index.values.searchsorted( (1,1) )
This is the error I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'tuple' and 'int'
Please, help me to understand what I am doing wrong.
Index values are tuples: type(t.index.values[0]) correctly gives <class 'tuple'>, and I pass a tuple to searchsorted. So where does the tuple-to-int comparison come from?
>>> print(t)
x
i1 i2
0 0 1.0
1 2.0
2 3.0
3 4.0
1 0 5.0
1 6.0
2 7.0
3 8.0
2 0 9.0
1 10.0
2 11.0
3 12.0
searchsorted doesn't work with tuples. There's an open GitHub issue for this: "Multiarray searchsorted fails".
On that issue, one of the participants suggests using get_indexer
With your code
t.index.get_indexer([(1,1)])[0]
# outputs:
5
I have found a solution:
>>> t.index.get_loc( (1,1) )
5
This solution is ~200 times faster than using t.index.get_indexer:
>>> import time
>>> time.clock()
168.56
>>> for i in range(10000): a = t.index.get_indexer([(1,1)])[0]
...
>>> time.clock()
176.76
>>> (176.76 - 168.56) / 10000
0.0008199999999999989 # 820e-6 sec per call
>>> time.clock()
176.76
>>> for i in range(1000000): a = t.index.get_loc( (1,1) )
...
>>> time.clock()
180.94
>>> (180.94-176.76)/1000000
4.1800000000000066e-06 # 4.2e-6 sec per call
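The same comparison can be made with the stdlib timeit module (a sketch; note that time.clock() was deprecated in Python 3.3 and removed in 3.8):
import timeit

# timeit returns total seconds for `number` executions of the callable.
t_indexer = timeit.timeit(lambda: t.index.get_indexer([(1, 1)])[0], number=10000)
t_get_loc = timeit.timeit(lambda: t.index.get_loc((1, 1)), number=10000)
print(t_indexer / 10000, t_get_loc / 10000)  # per-call seconds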