for the same training and test datset, the accuracy of KNN is 0.53, for RandomForest and AdaBoost ,the accuracy is 1, can anyone help?
codes:
## prepare data
begin_date='20140101'
end_date='20160908'
stock_code='000001' #平安银行
data=ts.get_hist_data(stock_code,start=begin_date,end=end_date)
close=data.loc[:,'close']
df=data[:-1]
diff=np.array(close[1:])-np.array(close[:-1])
label=1*(diff>=0)
df.loc[:,'diff']=diff
df.loc[:,'label']=label
#split dataset into trainging and test
df_train=df[df.index<'2016-07-08']
df_test=df[df.index>='2016-07-08']
x_train=df_train[df_train.columns[:-1]]
y_train=df_train['label']
x_test=df_test[df_test.columns[:-1]]
y_test=df_test['label']
##KNN
clf2 = neighbors.KNeighborsClassifier()
clf2.fit(x_train, y_train)
accuracy2 = clf2.score(x_test, y_test)
pred_knn=np.array(clf2.predict(x_test))
#RandomForest
clf3 = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf3.fit(x_train, y_train)
accuracy3 = clf3.score(x_test, y_test)
pred_rf=np.array(clf3.predict(x_test))
print accuracy1,accuracy2,accuracy3
Different models give different accuracy on same dataset in most of the cases. For example, if you are trying to train and test a dataset with LogisticRegression and SVM, then it is highly likely that both the models will give different score. In order to choose best model for your data, you need to first explore the dataset and then choose some algorithm that performs better in the case.
Also, as your RandomForest and AdaBoost are giving an accuracy of 1, it is very likely that your model is 'overfitted'.
Related
I already fit the equation. Now I want the RMSE value
q3_1=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']]
q3_2=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors','zipcode','condition','grade','waterfront','view','sqft_above','sqft_basement','yr_built','yr_renovated',
'lat', 'long','sqft_living15','sqft_lot15']]
reg = LinearRegression()
reg.fit(q3_1,data1.price)
reg.fit(q3_2,data1.price)
I am not able to proceed from here. I need the RMSE value in both the cases.
As I can understand, you are using TensorFlow on Google Colab.
I don't know exactly what is your LinearRegression object, but I suppose that it is a Keras model with a single node.
Hence, I have a question, how do you train the same model (your reg instance) with datasets with different schema -- one with 6 columns, the other with 16?
By the way, during training/fitting, keras is able to give you the MSE of your epoch, as well as a validation MSE if you provide a validation dataset. Finally, you can use the evaluate method which:
Returns the loss value & metrics values for the model [...]
Just use the "mean_squared_error" metric.
Edit
As you are using scikit-learn you have to take care of the metric yourself.
You can use the predict method to get the predictions from your trained model against a dataset.
Then, there is the mean_squared_error metric which is straighforward to use.
train_x, train_y = data1.features[:-100], data1.price[:-100]
test_x, test_y = data1.features[-100:], data1.price[-100:]
reg = LinearRegression()
reg.fit(train_x, train_y)
predictions = reg.predict(test_x)
mse = sklearn.metrics.mean_squared_error(test_y, predictions)
print("RMSE: %s" % math.sqrt(mse))
I am trying to apply machine learning on stock prediction, and I run into problem regarding scaling on future unseen (much higher) stock close value.
Lets say I use random forrest regression on predicting stock price. I break the data into train set and test set.
For the train set, I use standardscaler, and do fit and transform
And then I use regressor to fit
For the test set, I use standardscaler, and do transform
And then I use regressor to predict, and compare to test label
If I plot predict and test label on a graph, predict seems to max out or ceiling. The problem is that standardscaler fit on train set, test set (later in the timeline) have much higher value, the algorithm does not know what to do with these extreme data
def test(X, y):
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
# preprocess the data
pipeline = Pipeline([
('std_scaler', StandardScaler()),
])
# model = LinearRegression()
model = RandomForestRegressor(n_estimators=20, random_state=0)
# preprocessing fit transform on train data
X_train = pipeline.fit_transform(X_train)
# fit model on train data with train label
model.fit(X_train, y_train)
# transform on test data
X_test = pipeline.transform(X_test)
# predict on test data
y_pred = model.predict(X_test)
# print(np.sqrt(mean_squared_error(y_test, y_pred)))
d = {'actual': y_test, 'predict': y_pred}
plot_data = pd.DataFrame.from_dict(d)
sns.lineplot(data=plot_data)
plt.show()
What should be done with the scaling?
This is what I got for plotting prediction, actual close price vs time
The problem mainly comes from the model you are using. RandomForest regressor is created upon Decision Trees. It is learning to map an input to an output for every examples in the training set. Consequently RandomForest regressor will work for middle values but for extreme values that it hasn't seen during training it will of course perform has your picture is showing.
What you want, is to learn a function directly using linear/polynomial regression or more advanced algorithms like ARIMA.
I wasn't able to find information I am looking for so I will post my question here.
I am just venturing into machine learning. I did my first multiple regression for a time series using scikit learn library. My code is as shown below
X = df[feature_cols]
y = df[['scheduled_amount']]
index= y.reset_index().drop('scheduled_amount', axis=1)
linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li=[]
for train_index, test_index in tscv.split(X):
train = index.iloc[train_index]
train_start, train_end = train.iloc[0,0], train.iloc[-1,0]
test = index.iloc[test_index]
test_start, test_end = test.iloc[0,0], test.iloc[-1,0]
X_train, X_test = X[train_start:train_end], X[test_start:test_end]
y_train, y_test = y[train_start:train_end], y[test_start:test_end]
linreg.fit(X_train, y_train)
y_predict = linreg.predict(X_test)
print('RSS:' + str(linreg.score(X_test, y_test)))
y_test['predictec_amount'] = y_predict
y_test.plot()
Not that my data is a time series data and I want to keep the datetime index in my Dataframe when I'm fitting my model.
I am using the TimeSeriesSplit for cross-validation. I still don't really understand the cross validation thing.
First is there a need for a cross-validation in a time series dataset. Second should I use the last linear_coeff_ or should I get the average of all of them to use for my future prediction.
Yes there is a need for cross-validation in a timeseries dataset. Basically you need to ensure your model does not overfit your current test and is able to capture past seasonal changes so you can have some confidence in the model doing the same in the future. This method is also used to choose model hyperparameters (i.e. alpha in a Ridge regression).
In order to make future predictions, you should refit your regressor with the whole data and the best hyperparameters or, as #Marcus V. mentioned in the coments, maybe is best to train it only with the most recent data.
I'm do some text classification tasks. What I have observed is that if fed tfidf matrix(from sklearn's TfidfVectorizer), Logistic Regression model is always outperforming MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB is using the tfidf matrix since the term frequency in tfidf is calculated based on no class information. Any chance that I should not feed tfidf matrix to MultinomialNB the same way I did to LogisticRegression?
Update: I understand the difference between results of TfidfVectorizer and CountVectorizer. And I also just checked the sources code of sklearn's MultinomialNB.fit() function, looks like it does expect a count as oppose to frequency. This will also explain the performance boost mentioned in my comment below. However, I'm still wondering if under any circumstances pass tfidf into MultinomialNB makes sense. The sklearn documentation briefly mentioned the possibility, but not much details.
Any advice would be much appreciated!
I'm using scikit-learn cross_validation(http://scikit-learn.org/stable/modules/cross_validation.html) and get for example 0.82 mean score(r2_scorer).
How could I know do I have over-fitting or under-fitting using scikit-learn functions?
Unfortunately I confirm that there is no built-in tool to compare train and test scores in a CV setup. The cross_val_score tool only reports test scores.
You can setup your own loop with the train_test_split function as in Ando's answer but you can also use any other CV scheme.
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import SCORERS
scorer = SCORERS['r2']
cv = KFold(5)
train_scores, test_scores = [], []
for train, test in cv:
regressor.fit(X[train], y[train])
train_scores.append(scorer(regressor, X[train], y[train]))
test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2)
Overfitting: the train and test scores are not close from on another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
You should compare your scores when testing on training and testing data. If the scores are close to equal, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both train and test data, you can use something along the following (assuming your data is in variables X and Y):
from sklearn import cross_validation
#do five iterations
for i in range(5):
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, Y, test_size=0.4)
#Your predictor, linear SVM in this example
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print "Test score", clf.score(X_test, y_test)
print "Train score", clf.score(X_train, y_train)