calibration_and_holdout_data: AttributeError: 'int' object has no attribute 'n'

I'm trying to run a BG/NBD model using the lifetimes library.
All my analysis is based on the following example, but with my own data:
https://towardsdatascience.com/whats-a-customer-worth-8daf183f8a4f
Somehow I receive the following error, and after reading 50+ Stack Overflow articles without finding any answer, I'd like to ask my own question:
What am I doing wrong? :(
Thanks in advance! :)
I tried changing the type of every column in my dataframe, without any change.
df2 = df
df2.head()

   person_id effective_date  accounting_sales_total
0     219333     2018-08-04                 1049.89
1     333219     2018-12-21                 4738.97
2     344405     2018-07-16                  253.99
3     455599     2017-07-14                 2199.96
4     766665     2017-08-15                 1245.00
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-85-cdcb400098dc> in <module>()
7 summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
8 calibration_period_end='2017-12-31',
----> 9 observation_period_end='2018-12-31')
10
11 print(summary_cal_holdout.head())
/usr/local/envs/py3env/lib/python3.5/site-packages/lifetimes/utils.py in calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end, freq, datetime_format, monetary_value_col)
122 combined_data.fillna(0, inplace=True)
123
--> 124 delta_time = (to_period(observation_period_end) - to_period(calibration_period_end)).n
125 combined_data["duration_holdout"] = delta_time
126
AttributeError: 'int' object has no attribute 'n'

This actually runs fine as it is :)
import pandas as pd
from lifetimes.utils import calibration_and_holdout_data

data = {'person_id': [219333, 333219, 344405, 455599, 766665],
        'effective_date': ['2018-08-04', '2018-12-21', '2018-07-16', '2017-07-14', '2017-08-15'],
        'accounting_sales_total': [1049.89, 4738.97, 253.99, 2199.96, 1245.00]}
df2 = pd.DataFrame(data)

summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
Returns:

           frequency_cal  recency_cal  T_cal  frequency_holdout  \
person_id
455599               0.0          0.0  170.0                0.0
766665               0.0          0.0  138.0                0.0

           duration_holdout
person_id
455599                  365
766665                  365
Which means your issue is probably a package versioning problem; try:
pip install lifetimes --upgrade
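If upgrading doesn't obviously change anything, it helps to confirm which versions are actually loaded: the failing .n lookup sits in lifetimes but operates on a pandas Period difference, so a lifetimes/pandas version mismatch is the likely culprit. A minimal check, assuming Python 3.8+ (on older interpreters, pip show lifetimes pandas from the shell gives the same information):

import importlib.metadata

# Print the installed versions of both packages; the traceback blames
# lifetimes/utils.py, but the object missing .n comes out of pandas,
# so compare both against a known-good environment.
for pkg in ("lifetimes", "pandas"):
    print(pkg, importlib.metadata.version(pkg))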

Related

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using pandas in Python 3 to calculate spatial statistics.
When I run a for loop over the range of latitude and longitude using .loc, I get KeyError: (slice(None, None, None), ) for any combination of latitude and longitude that has no values in the input file. Instead of skipping those combinations, the code raises the error and stops. Here is my code.
import numpy as np
import pandas as pd
from scipy import stats

filename = 'input.txt'
df = pd.read_csv(filename, delim_whitespace=True, header=None,
                 names=['year', 'month', 'lat', 'lon', 'aod'],
                 index_col=['year', 'month', 'lat', 'lon'])
idx = pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and produces the following output, which I use for further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code for tmp = df.loc[idx[:,1,0.0,32.75],:], where no values are available in the input file for that latitude and longitude, instead of skipping it I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that raised a too many indexers error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month', 'lat', 'lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
    pass
This has three benefits (the robustness point is demonstrated in the sketch after this list).
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching again for the second group, and so on.
Simplicity. A single groupby call replaces the three nested loops and the manual index slicing.
Robustness. If a dataframe doesn't have, for example, any rows matching "month=1, lat=0.0, lon=32.75", then groupby simply never creates that group, so there is nothing to skip.
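A minimal sketch of that robustness point, using a tiny made-up frame whose column names match the question (the values are invented):

import pandas as pd

df = pd.DataFrame({'year': [2003, 2006, 2003],
                   'month': [1, 1, 2],
                   'lat': [0.0, 0.0, 0.25],
                   'lon': [34.0, 34.0, 20.0],
                   'aod': [0.032, 0.114, 0.05]})
df = df.set_index(['year', 'month', 'lat', 'lon'])

# .loc raises KeyError for a (month, lat, lon) combination with no rows...
idx = pd.IndexSlice
try:
    df.loc[idx[:, 1, 0.0, 32.75], :]
except KeyError as e:
    print('KeyError:', e)

# ...while groupby only ever yields combinations that actually occur
for key, tmp in df.groupby(['month', 'lat', 'lon']):
    print(key, len(tmp))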
More information: User guide on grouping
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of each group, consider looking for an aggregation function.
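For instance, with the small frame above, a built-in reduction and a custom one via .agg look like this (the range lambda is just an illustration):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]],
                  columns=['a', 'b', 'c'])

# built-in aggregation: per-group mean of every remaining column
print(df.groupby('b').mean())

# custom aggregation: per-group range (max - min) of column c
print(df.groupby('b')['c'].agg(lambda s: s.max() - s.min()))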

Why am I not able to generate a schema using tfdv.infer_schema()?

"TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto." error shows when I am generating schema using tfdv.infer_schema() option but I am not able to do when I filter relevant feature using tfdv.StatsOptions class using feature_allowlist. So can anyone help me in this ?
features_remove = {"region", "fiscal_week"}
columns = [col for col in df.columns if col not in features_remove]
stat_Options = tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)
schema = tfdv.infer_schema(stat_Options)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-e61b2454028e> in <module>
----> 1 schema= tfdv.infer_schema(stat_Options)
2 schema
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_data_validation\api\validation_api.py in infer_schema(statistics, infer_feature_shape, max_string_domain_size, schema_transformations)
95 """
96 if not isinstance(statistics, statistics_pb2.DatasetFeatureStatisticsList):
---> 97 raise TypeError(
98 'statistics is of type %s, should be '
99 'a DatasetFeatureStatisticsList proto.' % type(statistics).__name__)
TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto.
For the very simple reason that you have to pass a statistics_pb2.DatasetFeatureStatisticsList object to tfdv.infer_schema, not the StatsOptions. The StatsOptions only configure how statistics are generated; you still have to generate them first.
You should go this way:
features_remove = {"region", "fiscal_week"}
columns = [col for col in df.columns if col not in features_remove]
stat_Options = tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)

stats = tfdv.generate_statistics_from_dataframe(df, stat_Options)
schema = tfdv.infer_schema(stats)
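If you want to eyeball the result afterwards, TFDV ships a display helper for schemas:

# render the inferred schema as a table (in a notebook)
tfdv.display_schema(schema)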

Operands could not be broadcast together with shapes Error for Pandas Dataframe

I have looked through the other answers for operand errors and none seem to fit this example.
The mathematics/equation works, whether I hard-code the X values or import them from the DataFrame.
Using the same equation inside an np.where expression causes the operand error.
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np

data = pd.read_csv('miniDF.csv')
df = pd.DataFrame(data, columns=['X', 'Z'])
df['y'] = df['Z'] * 0.01
df['y'] = (14.6413819224756*(df['X']**0.5) + 64.4092780704338*(np.log(df['X'])**-2)
           + 1675.7498523727*(np.exp(-df['X'])) + 3.07221083927051*np.cos(df['X']))
print(df)
df['y'] = np.where(df['Z'] >= (14.6413819224756*(df['X']**0.5) + 64.4092780704338*(np.log(df['X'])**-2)
                   + 1675.7498523727*(np.exp(-df['X'])) + 3.07221083927051*np.cos(df['X']), 8, 9))
print(df)
The values in my Dataframe, the output from the first print(df) and the error are as follows.
X Z y
0 1.4 1 999.999293
1 2.0 2000 380.275104
2 3.0 3 159.114194
3 4.0 4 91.481930
4 5.0 5 69.767368
5 6.0 6 63.030212
6 7.0 70 59.591631
7 8.0 8 56.422723
8 9.0 9 54.673108
9 10.0 10 55.946732
Traceback (most recent call last):
File "/Users/willhutchins/Desktop/minitest.py", line 17, in <module>
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1229, in wrapper
res = na_op(values, other)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1115, in na_op
result = method(y)
ValueError: operands could not be broadcast together with shapes (10,) (3,)
The answer was simply a misplaced parenthesis, as answered here:
https://stackoverflow.com/questions/63046213/complex-curve-equation-giving-error-in-np-where-usage
Corrected code is:
df['y'] = np.where(df['Z'] >= (14.6413819224756*(df['X']**0.5) + 64.4092780704338*(np.log(df['X'])**-2)
                   + 1675.7498523727*(np.exp(-df['X'])) + 3.07221083927051*np.cos(df['X'])), 8, 9)
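To see why the parenthesis matters: in the broken version the comparison's closing parenthesis comes too late, so the right-hand side of >= is the 3-tuple (expression, 8, 9), and pandas tries to compare a length-10 Series against a length-3 sequence, which is exactly the shape (10,) vs (3,) complaint. A minimal sketch of the same failure with invented data (the exact error message varies across pandas versions):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Misplaced parenthesis: the right-hand side of >= is the length-3
# tuple (s * 2, 8, 9), so the elementwise comparison cannot line up.
try:
    np.where(s >= (s * 2, 8, 9))
except ValueError as e:
    print(e)

# Correct: close the comparison first, then pass 8 and 9 to np.where
print(np.where(s >= s * 2, 8, 9))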

Pandas & Dataframe: ValueError: can only convert an array of size 1 to a Python scalar

I tried to find a similar question, but still couldn't find the solution.
I am working on a dataframe with pandas.
The following code is not working. It only works for the first row of the dataframe; already for the second row I get the error below. Maybe somebody sees the mistake and can help :)
census2 = census_df[census_df["SUMLEV"] == 50]
list = census2["CTYNAME"].tolist()
max = 0
for county1 in list:
    countylist = []
    df1 = census2[census2["CTYNAME"] == county1]
    countylist.append(df1["POPESTIMATE2010"].item())
    countylist.append(df1["POPESTIMATE2011"].item())
    countylist.append(df1["POPESTIMATE2012"].item())
    countylist.append(df1["POPESTIMATE2013"].item())
    countylist.append(df1["POPESTIMATE2014"].item())
    countylist.append(df1["POPESTIMATE2015"].item())
    countylist.sort()
    difference = countylist[5] - countylist[0]
    if difference > max:
        max = difference
        maxcounty = county1
print(maxcounty)
print(max)
[54660, 55253, 55175, 55038, 55290, 55347]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-340aeaf28039> in <module>()
12 countylist=[]
13 df1=census2[census2["CTYNAME"]==county1]
---> 14 countylist.append(df1["POPESTIMATE2010"].item())
15 countylist.append(df1["POPESTIMATE2011"].item())
16 countylist.append(df1["POPESTIMATE2012"].item())
/opt/conda/lib/python3.6/site-packages/pandas/core/base.py in item(self)
829 """
830 try:
--> 831 return self.values.item()
832 except IndexError:
833 # copy numpy's message here because Py26 raises an IndexError
ValueError: can only convert an array of size 1 to a Python scalar
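No answer is recorded here, but the error itself is unambiguous: Series.item() only succeeds when the selection holds exactly one value, and county names repeat across states (there are many "Washington County" rows), so the second CTYNAME already matches more than one row. A minimal sketch of the failure and one workaround, using invented data (keying on state plus county is a suggestion, not something confirmed in the thread):

import pandas as pd

census2 = pd.DataFrame({"CTYNAME": ["Autauga County", "Washington County", "Washington County"],
                        "POPESTIMATE2010": [54660, 53000, 210000]})

df1 = census2[census2["CTYNAME"] == "Washington County"]
try:
    df1["POPESTIMATE2010"].item()  # two matching rows -> ValueError
except ValueError as e:
    print(e)

# One workaround: take the first match explicitly; a sturdier fix is to
# filter on a unique key, e.g. state name plus county name together.
print(df1["POPESTIMATE2010"].iloc[0])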

Python pandas search in DataFrame MultiIndex

This is what I do:
import pandas as pd

t = pd.DataFrame(data={'i1': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                       'i2': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                       'x': [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]})
t.set_index(['i1', 'i2'], inplace=True)
t.sort_index(inplace=True)
t.index.values.searchsorted((1, 1))
This is the error I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'tuple' and 'int'
Please help me understand what I am doing wrong.
Index values are tuples: type(t.index.values[0]) correctly gives <class 'tuple'>, and I pass a tuple to searchsorted. So where does the 'tuple' to 'int' comparison come from?
>>> print(t)
x
i1 i2
0 0 1.0
1 2.0
2 3.0
3 4.0
1 0 5.0
1 6.0
2 7.0
3 8.0
2 0 9.0
1 10.0
2 11.0
3 12.0
searchsorted doesn't work with tuples. There's an open GitHub issue for this: Multiarray searchsorted fails.
On that issue, one of the participants suggests using get_indexer.
With your code
t.index.get_indexer([(1,1)])[0]
# outputs:
5
I have found a solution:
>>> t.index.get_loc( (1,1) )
5
This solution is ~200 times faster than using t.index.get_indexer:
>>> import time
>>> time.clock()
168.56
>>> for i in range(10000): a = t.index.get_indexer([(1,1)])[0]
...
>>> time.clock()
176.76
>>> (176.76 - 168.56) / 10000
0.0008199999999999989 # 820e-6 sec per call
>>> time.clock()
176.76
>>> for i in range(1000000): a = t.index.get_loc( (1,1) )
...
>>> time.clock()
180.94
>>> (180.94-176.76)/1000000
4.1800000000000066e-06 # 4.2e-6 sec per call
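As an aside, time.clock() was removed in Python 3.8; a present-day rerun of the same comparison could use timeit instead (a sketch only; the ~200x ratio above is the original author's measurement and will vary by machine and pandas version):

import timeit

import pandas as pd

t = pd.DataFrame({'i1': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                  'i2': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                  'x': [float(v) for v in range(1, 13)]})
t = t.set_index(['i1', 'i2']).sort_index()

# Per-call cost of each lookup over 10,000 iterations; smaller is faster.
print(timeit.timeit(lambda: t.index.get_indexer([(1, 1)])[0], number=10_000) / 10_000)
print(timeit.timeit(lambda: t.index.get_loc((1, 1)), number=10_000) / 10_000)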
