I'm using the Pycaret library in Colab to make a simple prediction on this dataset:
https://www.kaggle.com/andrewmvd/fetal-health-classification
When I run my code:
from pycaret.utils import enable_colab
enable_colab()
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
from pycaret.classification import *
from pandas_profiling import ProfileReport
df= pd.read_csv("/content/drive/MyDrive/Pycaret/fetal_health.csv")
df2 = df.iloc[:,:11]
df2['fetal_health'] = df['fetal_health']
test = df2.sample(frac=0.10, random_state=42, weights='fetal_health')
train = df2.drop(test.index)
test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)
clf = setup(data =train, target = 'fetal_health', session_id=42,
log_experiment=True, experiment_name='fetal', normalize=True)
best = compare_models(sort="Accuracy")
rf = create_model('rf', fold=30)
tuned_rf = tune_model(rf, optimize='Accuracy')
predict_model(tuned_rf)
I get this error:
I think this is because my target variable is imbalanced (see img) and is causing the predictions to be incorrect.
Can someone please help me understand?
Thanks in advance.
Have you run each step in a separate cell to check the outputs?
Run
clf = setup(data =train, target = 'fetal_health', session_id=42,
log_experiment=True, experiment_name='fetal', normalize=True)
and check:
Are all variable types correctly inferred? (E.g., using your code with the Kaggle dataset of the same name, all variables show as numeric except for severe_decelerations, which shows as "Categorical". Is that correct?)
Is there any preprocessing configuration that needs to change? I'm sure your issue has nothing to do with an imbalanced target variable, but you can test it yourself by adding fix_imbalance = True to your setup (the default is False, as you can see in the setup output).
You can learn more about the available preprocessing configurations here:
https://pycaret.gitbook.io/docs/get-started/preprocessing
Also, while troubleshooting, you can save yourself some work by using
best_model = create_model(best, fold=30)
predict_model(best_model)
(No need to look up the best model and add it manually to create_model(),
or to use tune_model() until you get the model working.)
I found what the problem was:
My target variable starts at the value 1 and has 3 different values. This causes an error when PyCaret builds a list comprehension over the classes (because it starts at index zero).
To solve that, I just transformed my target variable so that it starts at zero, and it worked fine.
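A minimal sketch of that remapping, assuming the labels are 1, 2 and 3 as in this dataset. The helper name zero_based_labels is made up for illustration and is not part of PyCaret. With a DataFrame, the resulting mapping could be applied to the target column with Series.map before calling setup().

```python
def zero_based_labels(labels):
    """Map arbitrary class labels to consecutive integers starting at 0."""
    # Sort the distinct labels so the mapping is stable across runs.
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Hypothetical target values with classes 1.0, 2.0, 3.0 (fetal_health-style).
target = [1.0, 2.0, 3.0, 1.0, 1.0, 2.0]
encoded, mapping = zero_based_labels(target)
print(encoded)  # [0, 1, 2, 0, 0, 1]
print(mapping)  # {1.0: 0, 2.0: 1, 3.0: 2}
```

Keeping the mapping around makes it possible to translate predictions back to the original labels afterwards.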
Leandro,
thank you so much for your solution! I was having the same problem with the same dataset!
A. Beal, I tried your solution, but still the same error message appeared, so I tried Leandro's solution, and the problem was, in fact, the target beginning with 1, and not 0. Thank you for your suggestion on how to reduce the code!
Related
I am trying to use Recursive Feature Elimination with CV and get reproducible results. I have tried fixing the randomness by passing random_state = SEED to the components used, and I have also tried setting the random seed globally with np.random.seed(SEED). However, I am still unable to control the randomness and cannot reproduce results. The code segment is below.
estimator = GradientBoostingClassifier(random_state=SEED, n_estimators=2*df.shape[1])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=SEED)
selector = RFECV(estimator, n_jobs=-1,step=STEP, cv=cv)
selector = selector.fit(df, y)
df = df.loc[:, selector.support_]
print("Shape of final data AFTER FEATURE SELECTION")
print(df.shape, y.shape)
Specifically, if I run this segment of code, it returns a different number of selected features on each run. Any help would be appreciated.
As part of a school assignment on DSL and code generation, I have to translate the following program written in Python/Scikit-learn into the R language (the topic of the exercise is a hypothetical Machine Learning DSL).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('boston.csv', sep=',')
df.head()
y = df["medv"]
X = df.drop(columns=["medv"])
clf = DecisionTreeRegressor()
scoring = ['neg_mean_absolute_error','neg_mean_squared_error']
results = cross_validate(clf, X, y, cv=6,scoring=scoring)
print('mean_absolute_errors = '+str(results['test_neg_mean_absolute_error']))
print('mean_squared_errors = '+str(results['test_neg_mean_squared_error']))
Since I'm a complete newbie in Machine Learning, and especially in R, I can't do it.
Could someone help me?
Sorry for the late answer, probably you have already finished your school assignment. Of course we cannot just do it for you, you probably have to figure it out by yourself. Moreover, I don't get exactly what you need to do. But some tips are:
Read a csv file
data <-read.csv(file="name_of_the_file", header=TRUE, sep=",")
data <-as.data.frame(data)
The header=TRUE indicates that the first row of the file contains the column names; the sep=',' is the same as in Python (the separator in the file is ',').
The as.data.frame makes sure that your data is kept in a dataframe format.
Add/delete a column
data <- data[, names(data) != "name_of_the_column_to_be_deleted"] #delete a column
data$name_of_column_to_be_added<- c(1:10) #add column
In order to add a column you will need to add the elements it will include. Also the # symbol indicates the beginning of a comment.
Modelling
For the modelling part I am not sure what you want to achieve, but R offers a huge selection of algorithms to choose from (e.g., if you want to grow a tree, take a look at the page https://www.statmethods.net/advstats/cart.html, which uses the following script to grow a tree):
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method="class", data=kyphosis)
I've trained a logistic regression in Spark using pipeline. It ran and I am looking at model diagnostics.
I created my model summary (lr_summary = lrModel.stages[-1].summary).
After that I pretty much copied the code from this webpage. It all works until I try to determine the best threshold based on F-measure using this example Python code:
# Set the model threshold to maximize F-Measure
fMeasure = lr_summary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
Unfortunately, I am getting an error in line 3 (bestThreshold = ):
TypeError: 'NoneType' object has no attribute '__getitem__'
Any advice?
Thank you so much!
I cannot reproduce this problem, but it is possible that the model doesn't have a summary (in that case I would expect an attribute error in the maxFMeasure = ... line). You can check whether the model has one:
lrModel.stages[-1].hasSummary
Also you can make this code much simpler:
bestThreshold = fMeasure.orderBy(fMeasure['F-Measure'].desc()).first().threshold
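To see what that one-liner selects, here is a plain-Python sketch of the same logic without Spark. The rows are made-up (threshold, F-Measure) pairs, not output from a real model, and the variable names are only illustrative.

```python
# Made-up (threshold, F-Measure) rows standing in for fMeasureByThreshold.
rows = [
    (0.2, 0.61),
    (0.4, 0.74),
    (0.6, 0.70),
    (0.8, 0.52),
]

# Same idea as ordering by F-Measure descending and taking the first row:
# pick the row whose F-Measure is largest, then read its threshold.
best_threshold, best_f_measure = max(rows, key=lambda row: row[1])
print(best_threshold)  # 0.4
```

Sorting and taking the first row avoids the second pass that compares every row against the maximum with ==.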
I'd like to build a TensorFlow graph in a separate function get_graph(), and print out a simple op a in the main function. It turns out that I can print the value of a if I return a from get_graph(). However, if I use get_operation_by_name() to retrieve a, it prints None. What did I do wrong here? Any suggestions to fix it? Thank you!
import tensorflow as tf

def get_graph():
    graph = tf.Graph()
    with graph.as_default():
        a = tf.constant(5.0, name='a')
    return graph, a

if __name__ == '__main__':
    graph, a = get_graph()
    with tf.Session(graph=graph) as sess:
        print(sess.run(a))
        a = sess.graph.get_operation_by_name('a')
        print(sess.run(a))
it prints out
5.0
None
p.s. I'm using python 3.4 and tensorflow 1.2.
Naming conventions in TensorFlow are subtle and a bit off-putting at first.
The thing is, when you write
a = tf.constant(5.0, name='a')
a is not the constant op, but its output. The name of an op output is derived from the op name by appending a colon and the index of that output. Here, the constant op has only one output, so its name is
print(a.name)
# `a:0`
When you run sess.graph.get_operation_by_name('a') you do get the constant op. But what you actually wanted is to get 'a:0', the tensor that is the output of this operation, and whose evaluation returns an array.
a = sess.graph.get_tensor_by_name('a:0')
print(sess.run(a))
# 5.0
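The naming rule described above can be sketched in plain Python. The two helpers below are hypothetical and are not part of the TensorFlow API; they only illustrate how an op name such as 'a' relates to the name of its output tensor, 'a:0'.

```python
def tensor_name(op_name, output_index=0):
    """Build the name of an op's output tensor, e.g. 'a' -> 'a:0'."""
    return "{}:{}".format(op_name, output_index)


def split_tensor_name(name):
    """Split a tensor name like 'a:0' into the op name and output index."""
    op_name, _, index = name.rpartition(":")
    return op_name, int(index)


print(tensor_name("a"))          # a:0
print(split_tensor_name("a:0"))  # ('a', 0)
```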
According to the documentation and other SO questions, ElasticNetCV accepts multiple output regression. When I try it, though, it fails. Code:
from sklearn import linear_model
import numpy as np
import numpy.random as rnd
nsubj = 10
nfeat_train = 5
nfeat_predict = 20
x = rnd.random((nsubj, nfeat_train))
y = rnd.random((nsubj, nfeat_predict))
lm = linear_model.LinearRegression()
lm.fit(x,y) # works
el = linear_model.ElasticNetCV()
el.fit(x,y) # fails
Error message:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
This is with scikit-learn version 0.14.1. Is this a mismatch between the documentation and implementation?
You may want to take a look at sklearn.linear_model.MultiTaskElasticNetCV. But beware, this object assumes that your multiple targets share features. Thus, a feature is either active for all tasks (with variable activation for each, which can be small), or active for none of them. Before using this object, make sure this is the functionality you need.
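A minimal sketch of that suggestion, reusing the shapes from the question and assuming a scikit-learn release that includes MultiTaskElasticNetCV. The fixed seed and cv=3 are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

rng = np.random.RandomState(0)  # fixed seed so the example is repeatable
nsubj, nfeat_train, nfeat_predict = 10, 5, 20
x = rng.random_sample((nsubj, nfeat_train))
y = rng.random_sample((nsubj, nfeat_predict))

# cv=3 is an arbitrary choice that keeps the folds usable with only 10 rows.
el = MultiTaskElasticNetCV(cv=3)
el.fit(x, y)  # y is 2-D: one column per target

print(el.coef_.shape)       # (20, 5): one row of coefficients per target
print(el.predict(x).shape)  # (10, 20)
```

Because the data here is random noise, the fitted coefficients carry no meaning; the point is only that a 2-D y is accepted.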