I have already fit the equation. Now I want the RMSE value.
from sklearn.linear_model import LinearRegression

q3_1 = data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']]
q3_2 = data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade',
              'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated',
              'lat', 'long', 'sqft_living15', 'sqft_lot15']]
reg = LinearRegression()
reg.fit(q3_1, data1.price)
reg.fit(q3_2, data1.price)
I am not able to proceed from here. I need the RMSE value in both cases.
As far as I can understand, you are using TensorFlow on Google Colab.
I don't know exactly what your LinearRegression object is, but I suppose that it is a Keras model with a single node.
Hence I have a question: how do you train the same model (your reg instance) on datasets with different schemas -- one with 6 columns, the other with 18?
By the way, during training/fitting, Keras can give you the MSE for each epoch, as well as a validation MSE if you provide a validation dataset. Finally, you can use the evaluate method, which:
Returns the loss value & metrics values for the model [...]
Just use the "mean_squared_error" metric.
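For illustration, here is a minimal sketch of that workflow, assuming reg really is a single-node Keras model and reusing q3_1 and data1 from your snippet (untested, just to show the evaluate call):

import tensorflow as tf

# a single-node linear model, compiled with MSE as both loss and metric
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", metrics=["mean_squared_error"])
model.fit(q3_1, data1.price, epochs=10, validation_split=0.1)

# evaluate returns the loss value and the metric values
loss, mse = model.evaluate(q3_1, data1.price)
print("RMSE:", mse ** 0.5)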
Edit
As you are using scikit-learn, you have to take care of the metric yourself.
You can use the predict method to get the predictions from your trained model on a dataset.
Then there is the mean_squared_error metric, which is straightforward to use.
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# hold out the last 100 rows as a test set; use q3_1 (or q3_2) as the feature set
train_x, train_y = q3_1[:-100], data1.price[:-100]
test_x, test_y = q3_1[-100:], data1.price[-100:]

reg = LinearRegression()
reg.fit(train_x, train_y)

predictions = reg.predict(test_x)
mse = mean_squared_error(test_y, predictions)
print("RMSE: %s" % math.sqrt(mse))
Related
I have a logistic regression model housed in a scikit-learn pipeline using the following:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        solver='lbfgs',
        cv=10,
        scoring='roc_auc',
        class_weight='balanced'
    )
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
import pandas as pd
import matplotlib.pyplot as plt

# Look at the model's coefficients to see which features are most important
plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10, 12))
coefficients.sort_values().plot.barh(color='grey');
Which returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
I recalled that I can achieve this via SHAP values using something along the lines of the following, but using LinearExplainer instead of TreeExplainer (see the sketch after this snippet):
from sklearn.ensemble import RandomForestClassifier
import shap

# Instantiate model and encoder outside of pipeline for use with shap
model = RandomForestClassifier(random_state=25)

# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)

# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]

# Why did the model predict this?
# Look at a Shapley Values Force Plot
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value[1],
    shap_values=shap_values[1],
    features=row
)
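For reference, here is a rough, untested sketch of what the LinearExplainer variant might look like for the pipeline above. It assumes you pull the fitted StandardScaler and LogisticRegressionCV back out of the pipeline and explain the model in the scaled feature space (the X_test name is the test split from the question):

import pandas as pd
import shap

# pull the fitted steps back out of the pipeline
scaler = pipeline.named_steps['standardscaler']
log_reg = pipeline.named_steps['logisticregressioncv']

# explain in the scaled feature space, keeping the column names
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# LinearExplainer takes the fitted linear model plus background data
explainer = shap.LinearExplainer(log_reg, X_train_scaled)

row = X_test_scaled.iloc[[-3]]            # one observation to explain
shap_values = explainer.shap_values(row)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values,
    features=row
)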
I am doing hyperparameter optimization using GridSearchCV
scoring_functions = {'mcc': make_scorer(matthews_corrcoef), 'accuracy': make_scorer(accuracy_score), 'balanced_accuracy': make_scorer(balanced_accuracy_score)}
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_functions, n_jobs=-1, cv=splitter, refit='mcc')
I set the refit parameter to 'mcc' so I expect GridSearchCV to choose the best model to maximize this metric. Then I calculate some of the scores
preds = best_model.predict(test_df)
metrics['accuracy'] = round(accuracy_score(test_labels, preds),3)
metrics['balanced_accuracy'] = round(balanced_accuracy_score(test_labels, preds),3)
metrics['mcc'] = round(matthews_corrcoef(test_labels, preds),3)
And I get these results
"accuracy": 0.891, "balanced_accuracy": 0.723, "mcc": 0.871
Now if I get the score of the model on the same test set directly (without calculating the predictions first), like this
best_model = grid_search.best_estimator_
score = best_model.score(test_df, test_labels)
The score I get is this
"score": 0.891
Which, as you can see, is the accuracy but not the MCC score. The documentation of the score function says:
Returns the score on the given data, if the estimator has been refit.
This uses the score defined by scoring where provided, and the
best_estimator_.score method otherwise.
I don't understand this correctly. I thought that if I refit the model as specified with the refit parameter in GridSearchCV, the score should be computed with the scoring function used to refit the model. Am I missing something?
When you access the attribute best_estimator_ you are going to the underlying base model, ignoring all the setup you have done on the GridSearchCV object:
best_model = grid_search.best_estimator_
score = best_model.score(test_df, test_labels)
You should use grid_search.score() instead and, in general, interact with that object. For example, when predicting, use grid_search.predict().
The signature of those methods is the same as that of a standard estimator (fit, predict, score, etc.).
You can use the underlying model, but it won't necessarily have inherited the configuration you applied to the grid search object itself.
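As a quick illustration (reusing the names from the question), going through the GridSearchCV object applies the scorer selected by refit='mcc':

# score() on the GridSearchCV object uses the scorer selected by refit='mcc'
mcc_on_test = grid_search.score(test_df, test_labels)

# predictions still come from the refitted best_estimator_ under the hood
preds = grid_search.predict(test_df)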
I am trying to apply machine learning to stock prediction, and I ran into a problem regarding scaling when future unseen (much higher) stock close values appear.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform.
Then I use the regressor to fit.
For the test set, I use StandardScaler and do transform.
Then I use the regressor to predict, and compare to the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fit on the train set; the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data points.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

def test(X, y):
    # split the data, keeping the time order (no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing: fit and transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)
    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and the actual close price vs. time.
The problem mainly comes from the model you are using. A RandomForestRegressor is built from decision trees, and a decision tree predicts by averaging the target values of the training examples that fall in each leaf. Consequently the random forest works fine for values in the middle of the training range, but for extreme values it has never seen during training its predictions are capped at the training maximum, which is exactly the ceiling your plot shows.
What you want is to learn a function that can extrapolate, using linear/polynomial regression or more advanced time series algorithms like ARIMA.
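As a quick sketch (untested, reusing the names from the question's test function), putting the scaler and a linear model in one pipeline lets the predictions extend beyond the training range:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# scaler + a model that can extrapolate beyond the training targets
model = Pipeline([
    ('std_scaler', StandardScaler()),
    ('linreg', LinearRegression()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)   # no longer capped at the training maximum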
In a binary classification setting, tuning a model based on area under the ROC requires a model output that can be thresholded.
However, in scikit-learn, support vector classifiers do not generate class probabilities by default.
So for example, using GridSearchCV with scoring=make_scorer(roc_auc_score, needs_threshold=False) to tune an SVC model is incorrect because the AUC scores will be calculated based on predicted classes in each CV fold. This will occur regardless of whether we use SVC(probability=True) or SVC(probability=False). On the other hand, scoring=make_scorer(roc_auc_score, needs_threshold=True) will tune correctly.
Therefore SVC must be passing some "thresholdable" output to the scoring function in GridSearchCV. How can we know what this thresholdable output is for a given model?
For SVC I assume that the decision_function() method is called. (I assume it is not calculating class probabilities because you cannot run predict_proba() on the fitted GridSearchCV object when using SVC(probability=False).) But it is not clear (to me at least) from the documentation that this is definitely what is happening.
Yes, you are correct.
From the source code of make_scorer:
....
elif needs_threshold:
cls = _ThresholdScorer
....
So when needs_threshold = True, a _ThresholdScorer scorer is used. Now looking into the source code of _ThresholdScorer, we see this:
....
....
try:
    y_pred = clf.decision_function(X)

    # For multi-output multi-class estimator
    if isinstance(y_pred, list):
        y_pred = np.vstack(p for p in y_pred).T

except (NotImplementedError, AttributeError):
    y_pred = clf.predict_proba(X)
So this will first call the estimator's decision_function() to get the continuous (thresholdable) scores, and only fall back to predict_proba() if decision_function is not available.
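To see this directly, here is a small sketch (variable names are placeholders) reproducing what the scorer does for a binary SVC:

from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# the "thresholdable" output is the signed distance to the separating hyperplane
svc = SVC(probability=False).fit(X_train, y_train)
scores = svc.decision_function(X_test)

# roc_auc_score works on these continuous scores, no probabilities needed
auc = roc_auc_score(y_test, scores)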
I wasn't able to find the information I am looking for, so I will post my question here.
I am just venturing into machine learning. I did my first multiple regression for a time series using the scikit-learn library. My code is shown below:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

X = df[feature_cols]
y = df[['scheduled_amount']]
index = y.reset_index().drop('scheduled_amount', axis=1)

linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li = []
for train_index, test_index in tscv.split(X):
    train = index.iloc[train_index]
    train_start, train_end = train.iloc[0, 0], train.iloc[-1, 0]
    test = index.iloc[test_index]
    test_start, test_end = test.iloc[0, 0], test.iloc[-1, 0]
    X_train, X_test = X[train_start:train_end], X[test_start:test_end]
    y_train, y_test = y[train_start:train_end], y[test_start:test_end]
    linreg.fit(X_train, y_train)
    y_predict = linreg.predict(X_test)
    # score() returns the R^2 of the fold, not the RSS
    print('R^2: ' + str(linreg.score(X_test, y_test)))
    y_test['predicted_amount'] = y_predict
    y_test.plot()
Note that my data is time series data and I want to keep the datetime index in my DataFrame when fitting my model.
I am using TimeSeriesSplit for cross-validation, but I still don't really understand how cross-validation works here.
First, is there a need for cross-validation with a time series dataset? Second, should I use the coef_ from the last fold or should I average the coefficients across all folds for my future predictions?
Yes, there is a need for cross-validation with a time series dataset. Basically, you need to ensure your model does not overfit the data it was trained on and is able to capture past seasonal changes, so you can have some confidence that it will do the same in the future. This method is also used to choose model hyperparameters (e.g. alpha in a Ridge regression).
In order to make future predictions, you should refit your regressor on the whole dataset with the best hyperparameters or, as @Marcus V. mentioned in the comments, it may be best to train it only on the most recent data.
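As a minimal sketch of that final refit (X_future is a hypothetical placeholder for the feature rows you want to forecast):

from sklearn.linear_model import LinearRegression

# refit one final model on all of the data (or only the most recent slice)
final_model = LinearRegression()
final_model.fit(X, y)

# X_future: feature rows for the dates you want to forecast (placeholder)
future_predictions = final_model.predict(X_future)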