I am using scikit-learn ensemble classifiers for classification.I have separate training and testing data sets.When I use the same data sets and classify using machine learning algorithms I am getting consistent accuracies. Inconsistency is only in case of ensemble classifiers. I have even set random_state to 0.
bag_classifier = BaggingClassifier(n_estimators=10,random_state=0)
bag_classifier.fit(train_arrays,train_labels)
bag_predict = bag_classifier.predict(test_arrays)
bag_accuracy = bag_classifier.score(test_arrays,test_labels)
bag_cm = confusion_matrix(test_labels,bag_predict)
print("The Bagging Classifier accuracy is : " ,bag_accuracy)
print("The Confusion Matrix is ")
print(bag_cm)
You will normally find different results for same model because every time when the model is executed during training, the train/test split is random. You can reproduce the same results by giving the seed value to the train/test split.
train, test = train_test_split(your data , test_size=0.3, random_state=57)
Keep the same random_state value in each turn of training.
Related
I have a retail dataset with product_description, price, supplier, category as columns.
I used product_description as feature:
from sklearn import model_selection, preprocessing, naive_bayes
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y) # ~20%, very low
Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?
I have tried other classifiers like LR, SVM, and Random Forrest, but they had (almost) the same outcome.
The TF-IDF vectorizer returns a matrix: one row per example with the scores. You can modify this matrix as you wish before feeding it into the classifier.
Prepare your additional features as a NumPy array of shape: number of examples × number of features.
Use np.concatenate with axis=1.
Fit the classifier as you did before.
It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: Logistic Regression or SVM might do a better job for real-valued features than Naive Bayes.
Are the hyper-parameters in Gaussian Process Regresor optimized during the fitting in scikit-learn?
In the page
https://scikit-learn.org/stable/modules/gaussian_process.html
it is said:
"The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer"
So, it is not required, for instance, to optimize it by using grid earch?
A hyperparameter is something that you need to specify, usually, the best way to do it is within a pipeline ( series of steps) in which you try many hyperparameters and get the best one. Here is an example of just trying different hyperparameters for a k-means in which you give a list of hyperparameters (n_neighbors for K-Means) in order to see which ones work best! Hope it helps you!
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
# Setup a k-NN Classifier with k neighbors: knn
knn = KNeighborsClassifier(n_neighbors= k)
# Fit the classifier to the training data
knn.fit(X_train,y_train)
#Compute accuracy on the training set
train_accuracy[i] = knn.score(X_train, y_train)
knn.predict(X_test)
#Compute accuracy on the testing set
test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
I already fit the equation. Now I want the RMSE value
q3_1=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']]
q3_2=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors','zipcode','condition','grade','waterfront','view','sqft_above','sqft_basement','yr_built','yr_renovated',
'lat', 'long','sqft_living15','sqft_lot15']]
reg = LinearRegression()
reg.fit(q3_1,data1.price)
reg.fit(q3_2,data1.price)
I am not able to proceed from here. I need the RMSE value in both the cases.
As I can understand, you are using TensorFlow on Google Colab.
I don't know exactly what is your LinearRegression object, but I suppose that it is a Keras model with a single node.
Hence, I have a question, how do you train the same model (your reg instance) with datasets with different schema -- one with 6 columns, the other with 16?
By the way, during training/fitting, keras is able to give you the MSE of your epoch, as well as a validation MSE if you provide a validation dataset. Finally, you can use the evaluate method which:
Returns the loss value & metrics values for the model [...]
Just use the "mean_squared_error" metric.
Edit
As you are using scikit-learn you have to take care of the metric yourself.
You can use the predict method to get the predictions from your trained model against a dataset.
Then, there is the mean_squared_error metric which is straighforward to use.
train_x, train_y = data1.features[:-100], data1.price[:-100]
test_x, test_y = data1.features[-100:], data1.price[-100:]
reg = LinearRegression()
reg.fit(train_x, train_y)
predictions = reg.predict(test_x)
mse = sklearn.metrics.mean_squared_error(test_y, predictions)
print("RMSE: %s" % math.sqrt(mse))
I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important however prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
classifier = LogisticRegression(penalty = 'l2',solver = 'saga', multi_class = 'ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, procedure of selecting threshold is done after training. You can find threshold that maximizes utility function of your choice, for example:
from sklearn import metrics
preds = classifier.predict_proba(test_data)
tpr, tpr, thresholds = metrics.roc_curve(test_y,preds[:,1])
print (thresholds)
accuracy_ls = []
for thres in thresholds:
y_pred = np.where(preds[:,1]>thres,1,0)
# Apply desired utility function to y_preds, for example accuracy.
accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose threshold that maximizes chosen utility function. In your case choose threshold that maximizes 1 in y_pred.
for the same training and test datset, the accuracy of KNN is 0.53, for RandomForest and AdaBoost ,the accuracy is 1, can anyone help?
codes:
## prepare data
begin_date='20140101'
end_date='20160908'
stock_code='000001' #平安银行
data=ts.get_hist_data(stock_code,start=begin_date,end=end_date)
close=data.loc[:,'close']
df=data[:-1]
diff=np.array(close[1:])-np.array(close[:-1])
label=1*(diff>=0)
df.loc[:,'diff']=diff
df.loc[:,'label']=label
#split dataset into trainging and test
df_train=df[df.index<'2016-07-08']
df_test=df[df.index>='2016-07-08']
x_train=df_train[df_train.columns[:-1]]
y_train=df_train['label']
x_test=df_test[df_test.columns[:-1]]
y_test=df_test['label']
##KNN
clf2 = neighbors.KNeighborsClassifier()
clf2.fit(x_train, y_train)
accuracy2 = clf2.score(x_test, y_test)
pred_knn=np.array(clf2.predict(x_test))
#RandomForest
clf3 = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf3.fit(x_train, y_train)
accuracy3 = clf3.score(x_test, y_test)
pred_rf=np.array(clf3.predict(x_test))
print accuracy1,accuracy2,accuracy3
Different models give different accuracy on same dataset in most of the cases. For example, if you are trying to train and test a dataset with LogisticRegression and SVM, then it is highly likely that both the models will give different score. In order to choose best model for your data, you need to first explore the dataset and then choose some algorithm that performs better in the case.
Also, as your RandomForest and AdaBoost are giving an accuracy of 1, it is very likely that your model is 'overfitted'.