Why am I not able to generate a schema using tfdv.infer_schema()? - python-3.x

"TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto." error shows when I am generating schema using tfdv.infer_schema() option but I am not able to do when I filter relevant feature using tfdv.StatsOptions class using feature_allowlist. So can anyone help me in this ?
features_remove = {"region", "fiscal_week"}
columns = [col for col in df.columns if col not in features_remove]
stat_Options = tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)
schema = tfdv.infer_schema(stat_Options)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-e61b2454028e> in <module>
----> 1 schema= tfdv.infer_schema(stat_Options)
2 schema
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_data_validation\api\validation_api.py in infer_schema(statistics, infer_feature_shape, max_string_domain_size, schema_transformations)
95 """
96 if not isinstance(statistics, statistics_pb2.DatasetFeatureStatisticsList):
---> 97 raise TypeError(
98 'statistics is of type %s, should be '
99 'a DatasetFeatureStatisticsList proto.' % type(statistics).__name__)
TypeError: statistics is of type StatsOptions, should be a DatasetFeatureStatisticsList proto.

For the very simple reason that tfdv.infer_schema expects a statistics_pb2.DatasetFeatureStatisticsList proto, not a StatsOptions object. You should go this way instead: generate the statistics from the dataframe first, then infer the schema from those statistics:
features_remove = {"region", "fiscal_week"}
columns = [col for col in df.columns if col not in features_remove]
stat_Options = tfdv.StatsOptions(feature_allowlist=columns)
print(stat_Options.feature_allowlist)

# generate_statistics_from_dataframe returns the DatasetFeatureStatisticsList
# proto that infer_schema expects
stats = tfdv.generate_statistics_from_dataframe(df, stat_Options)
schema = tfdv.infer_schema(stats)
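As an optional sanity check, the inferred schema can be rendered in a notebook with tfdv.display_schema, which is part of the same library:

import tensorflow_data_validation as tfdv

# Show the inferred schema as a feature/type/domain table in the notebook
tfdv.display_schema(schema)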

Related

Is there a way to get the sum of non-zero elements in a numpy array? I keep getting a TypeError

I googled this question extensively and I can't figure out what's wrong. I keep getting:
"TypeError: '>' not supported between instances of 'numpy.ndarray' and 'int'".
I searched for it, and the websites I found led me to download a package, but that still didn't solve my issue. My code looks as follows:
result = []
FracPos = np.array(result)
for x in lines:
    result.append(x.split())
TotalCells = np.array(result)[:,2]
print(type(TotalCells))
print(type(FracPos))
FracPos = np.sum((TotalCells)>0)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-0972db6a45c4> in <module>
17 TotalCells = np.array(result)[:,2]
18 print(type(TotalCells))
---> 19 FracPos = np.sum((TotalCells)>0)
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'int'
I can't figure out why I get the error on the last line or how to fix it. I am very new to Python, so I'm sure I'm missing something obvious, but the other questions like this are about numpy.ndarray and strings or lists, where I understand why you can't compare them.
It is because result holds the str tokens produced by split(), so TotalCells is an array of strings, and '>' is not supported between a string array and an int. If you only have numbers in your array, then do:
result = [float(x) for x in result]
TotalCells = ...
Then, in order to get the sum of positive elements, you would have to do:
FracPos = np.sum(TotalCells[TotalCells > 0])
The line you wrote counts the number of positive elements; it does not sum them.
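Putting it together, a minimal sketch of the corrected flow (assuming lines is the list of text lines from the question and that column 2 of each split line is numeric):

import numpy as np

result = []
for x in lines:
    result.append(x.split())

# np.array(result)[:, 2] has a string dtype; convert to float before comparing
TotalCells = np.array(result)[:, 2].astype(float)

num_pos = np.sum(TotalCells > 0)              # count of positive elements
sum_pos = np.sum(TotalCells[TotalCells > 0])  # sum of positive elements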

How can I turn a list of three sentences into a string?

I have a target word and the left and right context that I have to join together. I am using pandas, and I try to join the sentences and the target word into a list, which I can then turn into a string so that it works with my vectorizer. Basically, I am just trying to turn a list of three sentences into a string.
This is the error that I get:
AttributeError Traceback (most recent call last)
<ipython-input-195-ae09731d3572> in <module>()
3
4 vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
----> 5 feature_matrix=vectorizer.fit_transform(trainTexts)
6 print("shape=",feature_matrix.shape)
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'list' object has no attribute 'lower'
I have tried using .join and .split, but they are not working for me, so I am doing something wrong.
import sys
import csv
import random

csv.field_size_limit(sys.maxsize)

trainLabels = []
trainTexts = []

with open("myTsvFile.tsv") as train:
    trainData = [row for row in csv.reader(train, delimiter='\t')]

random.shuffle(trainData)
for example in trainData:
    trainLabels.append(example[1])
    trainTexts.append(example[3:6])
The indexes example[3:6] mean that 3 is the left context, 4 is the target word, and 5 is the right context.
print('Text:', trainTexts[3])
print('Label:', trainLabels[1])
Edit: a few printed lines from the code:
['Visa electron käy aika monessa paikassa luottokortista . Mukaanlukien ', 'Paypal', ' , mikä avaa taas lisää ovia .']
['Nyt pistän pääni pölkyllä : ', 'WinForms', ' on ihan ok .']
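A minimal sketch of one possible fix, assuming example[3:6] always holds three strings: join them into a single string before appending, so every entry of trainTexts is a str rather than a list (which is what CountVectorizer's .lower() call expects):

for example in trainData:
    trainLabels.append(example[1])
    # ' '.join turns ['left context', 'target', 'right context'] into one string
    trainTexts.append(' '.join(example[3:6]))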

Pandas & Dataframe: ValueError: can only convert an array of size 1 to a Python scalar

I tried to find a similar question, but I still couldn't find a solution.
I am working on a dataframe with pandas.
The following code is not working: it only works for the first row of the dataframe, and already for the second row I get the error below. Maybe somebody sees the mistake and can help :)
census2 = census_df[census_df["SUMLEV"] == 50]
list = census2["CTYNAME"].tolist()
max = 0
for county1 in list:
    countylist = []
    df1 = census2[census2["CTYNAME"] == county1]
    countylist.append(df1["POPESTIMATE2010"].item())
    countylist.append(df1["POPESTIMATE2011"].item())
    countylist.append(df1["POPESTIMATE2012"].item())
    countylist.append(df1["POPESTIMATE2013"].item())
    countylist.append(df1["POPESTIMATE2014"].item())
    countylist.append(df1["POPESTIMATE2015"].item())
    countylist.sort()
    difference = countylist[5] - countylist[0]
    if difference > max:
        max = difference
        maxcounty = county1
print(maxcounty)
print(max)
[54660, 55253, 55175, 55038, 55290, 55347]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-340aeaf28039> in <module>()
12 countylist=[]
13 df1=census2[census2["CTYNAME"]==county1]
---> 14 countylist.append(df1["POPESTIMATE2010"].item())
15 countylist.append(df1["POPESTIMATE2011"].item())
16 countylist.append(df1["POPESTIMATE2012"].item())
/opt/conda/lib/python3.6/site-packages/pandas/core/base.py in item(self)
829 """
830 try:
--> 831 return self.values.item()
832 except IndexError:
833 # copy numpy's message here because Py26 raises an IndexError
ValueError: can only convert an array of size 1 to a Python scalar
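.item() only succeeds when the Series has exactly one element, so the loop breaks as soon as a county name occurs more than once in census2 (df1 then has several rows). A sketch of one way around it, computing the same maximum difference without the loop (column names taken from the question):

pop_cols = ["POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE2012",
            "POPESTIMATE2013", "POPESTIMATE2014", "POPESTIMATE2015"]

census2 = census_df[census_df["SUMLEV"] == 50]

# Per row: largest minus smallest of the six yearly estimates
diff = census2[pop_cols].max(axis=1) - census2[pop_cols].min(axis=1)

# County (row) with the biggest change, and the change itself
print(census2.loc[diff.idxmax(), "CTYNAME"])
print(diff.max())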

Combining 3 data frames based on common field 'Country'

I have 3 dataframes with a common index of Country, and I need to combine all 3 on that Country field.
My first try was to combine two and then the third, and this is how far I got:
pd.merge(energy, GDP, how='outer', left_index=True, right_index=True)
I have tried 3 options highly rated on this site:
import functools
dfs = [energy, GDP, ScimEn]
df_final = functools.reduce(lambda left, right: pd.merge(left, right, on='Country'), dfs)

energy.merge(GDP, on='Country').merge(ScimEn, on='Country')

pd.concat([energy.set_index('Country'), GDP.set_index('Country'), ScimEn.set_index('Country')], axis=1)
KeyError: 'Country'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in ()
     40 #df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='Country'), dfs)
     41 #energy.merge(GDP,on='Country').merge(ScimEn,on='Country')
---> 42 pd.concat([energy.set_index('Country'), GDP.set_index('Country'), ScimEn.set_index('Country')], axis=1)
You need to do multiple merges. Try the following:
merge1 = pd.merge(energy, GDP, how='inner', on='Country')
final = pd.merge(merge1, ScimEn, how='inner', on='Country')
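As a side note, the KeyError: 'Country' from the concat attempt usually means one of the frames has no Country column at that point, often because Country is already its index. A small defensive check, assuming the frame names from the question:

# Move 'Country' back into the columns wherever it has become the index,
# so that on='Country' and set_index('Country') both work.
for df in (energy, GDP, ScimEn):
    if 'Country' not in df.columns:
        df.reset_index(inplace=True)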

calibration_and_holdout_data: AttributeError: 'int' object has no attribute 'n'

I'm trying to run a BG/NBD model using the lifetimes library.
All my analysis is based on the following example, yet with my own data:
https://towardsdatascience.com/whats-a-customer-worth-8daf183f8a4f
Somehow I receive the following error, and after reading 50+ Stack Overflow articles without finding an answer, I'd like to ask my own question:
What am I doing wrong? :(
Thanks in advance! :)
I tried changing the type of all columns in my dataframe, without any effect.
df2 = df
df2.head()
   person_id effective_date  accounting_sales_total
0     219333     2018-08-04                 1049.89
1     333219     2018-12-21                 4738.97
2     344405     2018-07-16                  253.99
3     455599     2017-07-14                 2199.96
4     766665     2017-08-15                 1245.00
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-85-cdcb400098dc> in <module>()
7 summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
8 calibration_period_end='2017-12-31',
----> 9 observation_period_end='2018-12-31')
10
11 print(summary_cal_holdout.head())
/usr/local/envs/py3env/lib/python3.5/site-packages/lifetimes/utils.py in calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end, freq, datetime_format, monetary_value_col)
122 combined_data.fillna(0, inplace=True)
123
--> 124 delta_time = (to_period(observation_period_end) - to_period(calibration_period_end)).n
125 combined_data["duration_holdout"] = delta_time
126
AttributeError: 'int' object has no attribute 'n'
This actually runs fine as it is :)
import pandas as pd
from lifetimes.utils import calibration_and_holdout_data

data = {'person_id': [219333, 333219, 344405, 455599, 766665],
        'effective_date': ['2018-08-04', '2018-12-21', '2018-07-16', '2017-07-14', '2017-08-15'],
        'accounting_sales_total': [1049.89, 4738.97, 253.99, 2199.96, 1245.00]}
df2 = pd.DataFrame(data)

summary_cal_holdout = calibration_and_holdout_data(df2, 'person_id', 'effective_date',
                                                   calibration_period_end='2017-12-31',
                                                   observation_period_end='2018-12-31')
print(summary_cal_holdout.head())
Returns:
           frequency_cal  recency_cal  T_cal  frequency_holdout  \
person_id
455599               0.0          0.0  170.0                0.0
766665               0.0          0.0  138.0                0.0

           duration_holdout
person_id
455599                  365
766665                  365
Which means your issue is probably with package versioning, try:
pip install lifetimes --upgrade
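If you want to confirm what is installed before and after upgrading, you can print the package version (assuming your lifetimes release exposes __version__, which recent ones do):

import lifetimes

# Compare against the latest release on PyPI
print(lifetimes.__version__)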
