ColumnTransformer object has no attribute shape error - python-3.x

My data file (CSV) contains categorical and non-categorical variables. To fit a Cox proportional hazards (CPH) model, I applied OneHotEncoder to two categorical variables (study_category and patient_category). I get the following error on the line where I try to fit the CPH model. I am passing three arguments to the cph.fit() method: the dataframe, the duration column, and the event column. I googled the error but could not find anything useful. I am using CPH for the first time, so any help fixing the issue will be appreciated.
Error:
AttributeError: 'ColumnTransformer' object has no attribute 'shape'
My python code:
def meth():
    dataset = pd.read_csv("C:/Users/XYZ/CTR_Project/CPH.csv")
    dataset = dataset.loc[:, ['study_Category','patient_Category','Diff_time','Events']]
    X = dataset.loc[:, ['study_Category','patient_Category','Diff_time','Events']]
    colm_transf = make_column_transformer((OneHotEncoder(), ['study_Category','patient_Category']), remainder='passthrough')
    colm_transf.fit_transform(X)
    cph = CoxPHFitter()
    cph.fit(colm_transf, duration_col='Diff_time', event_col='Events')
    cph.print_summary()

Related

AttributeError: 'float' object has no attribute 'log' / TypeError: ufunc 'log' not supported for the input types

I have a series of fluorescence intensity data in a column ('2.4M'). I tried to create a new column 'ln_2.4M' by taking the ln of column '2.4M', but I got an error:
AttributeError: 'float' object has no attribute 'log'
df["ln_2.4M"] = np.log(df["2.4M"])
I tried using a for loop to iterate the log over each fluorescence data in the column "2.4M":
ln2_4M = []
for x in df["2.4M"]:
    ln2_4M = np.log(x)
    print(ln2_4M)
Although it printed ln2_4M as the log of column "2.4M" correctly, I am unable to use the data because it also raised a TypeError:
ufunc 'log' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I am not sure why. Any help understanding what is happening and how to fix this problem is appreciated. Thanks.
I then tried using the method below and it worked:
df["2.4M"] = pd.to_numeric(df["2.4M"],errors = 'coerce')
df["ln_24M"] = np.log(df["2.4M"])
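A likely explanation for why this works: the column had dtype object (for example because one cell holds text), so NumPy tried to call a log method on each Python object instead of using its numeric loop. A small sketch with made-up readings, assuming a stray "n/a" string is what forced the object dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the stray "n/a" string forces the column to dtype object.
df = pd.DataFrame({"2.4M": [1.0, 2.5, "n/a"]})
print(df["2.4M"].dtype)

try:
    np.log(df["2.4M"])
except (AttributeError, TypeError) as exc:
    # The exception type and message vary with the NumPy version.
    print(type(exc).__name__)

# Coerce to a numeric dtype; values that cannot be parsed become NaN.
df["2.4M"] = pd.to_numeric(df["2.4M"], errors="coerce")
df["ln_2.4M"] = np.log(df["2.4M"])
print(df)
```

Checking df.dtypes right after loading the file is a quick way to spot such columns before applying NumPy functions.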

Customise train_step in model.fit() "OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function"

I am trying to write a custom train_step to use with the tf.keras.Model.fit() function. I am following the TensorFlow tutorial. As I understand it, the data argument of train_step is supposed to be the training dataset that I pass to Model.fit(). My dataset is a TFRecordDataset that yields three features: the image, the labels, and the box. So, in train_step I first try to unpack the img, label and box values from the data argument.
def train_step(self, data):
    print("printing data fed to train_step")
    print(data)
    img, label, gt_boxes = data
    if self.DEBUG:
        if(img == None):
            print("img input in train step is none")
    with tf.GradientTape() as tape:
        rpn_classification, rpn_regression = self(img, training=True)
        self.tf_rpn_target_generation_layer(gt_boxes, rpn_regression)
        loss = self.rpn_loss_function(rpn_classification)
    trainable_vars = self.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))
    loss_tracker.update_state(loss)
    #mae_metric.update_state()
    return [loss_tracker]
The above is the code I use for my custom train_step function. When I run fit, I get the following error:
OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
I have used shuffle, cache, and repeat operations on my training dataset. Can anyone please help me understand why exactly this error appears?
From my previous experience, I generally create an iterator for the dataset, followed by a get_next operation to obtain the features.
Edit:
I have tried the following, without success.
Since the data being sent into train_step is a dataset object, I used the tf.raw_ops.IteratorGetNext method to access the elements of the iterator, which returned an error saying
"TypeError: Input 'iterator' of 'IteratorGetNext' Op has type string that does not match the expected type of resource."
To fix this error, I assumed that TensorFlow was returning an iterator graph and was therefore unable to access the elements, so I added the run_eagerly=True argument to the model.compile() call. This printed the following output and raised the same error.
Epoch 1/5
printing data fed to train_step
Tensor("Shape:0", shape=(0,), dtype=int32)
Tensor("IteratorGetNext:0", shape=(), dtype=string)
I have found the solution. The data passed to my step function is an iterator, so I have to use the tf.raw_ops.IteratorGetNext method to access its contents.
When doing this, I initially got another error saying that the iterator type does not match the expected type of resource. After careful debugging, I understood that the read_tfrecords mapping I had to apply to the dataset was unsuccessful. This led to the dataset still containing unmapped TFRecords of type tf.string, which is not the type of resource that train_step expects.

AttributeError: 'str' object has no attribute 'parameters' due to new version of sklearn

I am doing topic modeling using sklearn. While trying to get the log-likelihood from Grid Search output, I am getting the below error:
AttributeError: 'str' object has no attribute 'parameters'
I think I understand the issue: 'parameters' was used in an older version, and I am using the newer version (0.22) of sklearn, which gives the error. I also searched for the term used in the new version but couldn't find it. Below is the code:
# Get Log Likelyhoods from Grid Search Output
n_components = [10, 15, 20, 25, 30]
log_likelyhoods_5 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.5]
log_likelyhoods_7 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.7]
log_likelyhoods_9 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.9]
# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_components, log_likelyhoods_5, label='0.5')
plt.plot(n_components, log_likelyhoods_7, label='0.7')
plt.plot(n_components, log_likelyhoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelyhood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()
Thanks in advance!
There is a key 'params' that stores a list of parameter-setting dicts, one for each parameter candidate. See the GridSearchCV page in the sklearn documentation.
In your code, gscore is a string key of cv_results_.
cv_results_ is a dictionary with string keys such as 'params', 'split0_test_score', etc. (refer to the doc), whose values are lists or arrays.
So, you need to make the following change to your code:
log_likelyhoods_5 = [round(model.cv_results_['mean_test_score'][index]) for index, gscore in enumerate(model.cv_results_['params']) if gscore['learning_decay']==0.5]
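To make the structure concrete, here is a small sketch that uses a plain dictionary as a stand-in for model.cv_results_. The scores and the helper name scores_for_decay are made up for illustration; only the 'params' and 'mean_test_score' keys mirror the real attribute.

```python
# Hypothetical stand-in for model.cv_results_: entry i of every list
# describes the same parameter candidate.
cv_results = {
    "params": [
        {"learning_decay": 0.5, "n_components": 10},
        {"learning_decay": 0.5, "n_components": 15},
        {"learning_decay": 0.7, "n_components": 10},
        {"learning_decay": 0.7, "n_components": 15},
    ],
    "mean_test_score": [-1200.4, -1250.6, -1190.2, -1245.9],
}


def scores_for_decay(results, decay):
    """Return the rounded mean test scores of all candidates with the given learning_decay."""
    return [
        round(score)
        for params, score in zip(results["params"], results["mean_test_score"])
        if params["learning_decay"] == decay
    ]


log_likelyhoods_5 = scores_for_decay(cv_results, 0.5)
log_likelyhoods_7 = scores_for_decay(cv_results, 0.7)
print(log_likelyhoods_5, log_likelyhoods_7)
```

Pairing the two lists with zip does the same job as the enumerate-based index lookup and avoids repeating the dictionary access for each candidate.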

AttributeError: 'numpy.ndarray' object has no attribute 'sqrt'

I am trying to split a dataframe into equal samples and apply a function to calculate a value for each sample. If any sample's value is greater than 0.3, I want to save the filename in the result dataframe.
df=pd.DataFrame({'Value':[-0.016,-0.006,0.003,-0.011,-0.036,-0.031,-0.014,-0.006,-0.01 ,-0.009,0.004,0.001,-0.012,-0.021,-0.008,0.001,-0.011,-0.01,-0.006,0.002,0.004],'Nmae':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]})
x=pd.DataFrame([x.values.sqrt(np.mean(df2['Value']**2)) for x in np.array_split(df2, (len(df2)/10))])
I am getting this error:
AttributeError: 'numpy.ndarray' object has no attribute 'sqrt'
Does anyone have another effective way to do this task?
This is a working version of your Code:
res= [np.sqrt(np.mean((x.Value**2))) for x in np.array_split(df, (len(df)/10))]
An alternative way of approaching this with Pandas would be to define a new column 'Split_variable' that labels the sample each row belongs to, and use it to apply your calculations:
df.groupby('Split_variable')['Value'].apply(lambda x: np.sqrt(np.mean((x**2))))
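The groupby line assumes that 'Split_variable' already exists. One way to build it, sketched here under the assumption that samples are consecutive runs of 10 rows (the chunk_size name is introduced only for this example), is integer division of the row position:

```python
import numpy as np
import pandas as pd

values = [-0.016, -0.006, 0.003, -0.011, -0.036, -0.031, -0.014, -0.006,
          -0.01, -0.009, 0.004, 0.001, -0.012, -0.021, -0.008, 0.001,
          -0.011, -0.01, -0.006, 0.002, 0.004]
df = pd.DataFrame({"Value": values})

chunk_size = 10  # assumed sample length
# Rows 0-9 get label 0, rows 10-19 get label 1, the leftover row gets label 2.
df["Split_variable"] = np.arange(len(df)) // chunk_size

# Root mean square of 'Value' within each sample.
rms = df.groupby("Split_variable")["Value"].apply(lambda v: np.sqrt(np.mean(v ** 2)))
print(rms)

# Labels of the samples whose RMS exceeds the 0.3 threshold from the question.
print(rms[rms > 0.3].index.tolist())
```

Unlike np.array_split, this puts leftover rows into a final, shorter sample instead of spreading them over the earlier ones, so the two approaches can give different groupings when the length is not a multiple of the chunk size.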

How can I generate classification report by removing this error?

I want to generate a classification report for the movie_reviews dataset from the corpus, which already has the target names [pos, neg], but I get an error.
Code:
movie_train_clf = Pipeline([('vect',CountVectorizer(stop_words='english')),('tfidf',TfidfTransformer()),('clas',BernoulliNB(fit_prior=True))])
movie_train_clas = movie_train_clf.fit(movie_train.data ,movie_train.target)
predict = movie_train_clas.predict(movie_train.data)
np.mean(predict==movie_train.target)
Now I use classification report
from sklearn.metrics import classification_report
print(classification_report(predict, movie_train_clas,target_names==target_names))
Error:
TypeError: iteration over a 0-d array.
Please help me with the correct syntax.
There are multiple errors in your code:
1) You have the wrong order of arguments in classification_report. As per the documentation:
classification_report(y_true, y_pred, ...
First argument is the true labels and second one is the predicted labels.
2) You are using movie_train_clas in place of the true labels. As per your code, movie_train_clas is the return value of movie_train_clf.fit(), so it is movie_train_clf itself. fit() returns the estimator itself, so you cannot use it in place of the ground truth labels.
3) As @AmiTavory spotted, the current error is due to the comparison operator (==) being used in place of assignment (=). The correct call to classification_report should be:
classification_report(movie_train.target, predict, target_names=target_names)
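A self-contained sketch of the corrected call, using a handful of made-up labels in place of the movie_reviews data (0 stands for 'neg' and 1 for 'pos' here, which is an assumption of this example):

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
target_names = ["neg", "pos"]

# True labels first, predictions second, target_names passed as a keyword argument.
print(classification_report(y_true, y_pred, target_names=target_names))

# output_dict=True returns the same numbers as a nested dictionary.
report_dict = classification_report(
    y_true, y_pred, target_names=target_names, output_dict=True
)
print(report_dict["accuracy"])
```

The order of target_names has to match the sorted order of the label values, so it is worth checking how the dataset encodes its classes before relying on the printed row names.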

Resources