RandomForestRegressor spitting out 1 prediction only - python-3.x

I am trying to work with the RandomForestRegressor. Using the RandomForestClassifier I seemed to be able to receive variable outcome of +/-1. However using the RandomForestRegressor I only get a constant value when I try to predict.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pandas_datareader import data
import csv
import statsmodels.api as sm
data = pd.read_csv('C:\H\XPA.csv')
data['pct move']=data['XP MOVE']
# Features construction
data.dropna(inplace=True)
# X is the input variable
X = data[[ 'XPSpread', 'stdev300min']]
# Y is the target or output variable
y = data['pct move']
# Total dataset length
dataset_length = data.shape[0]
# Training dataset length
split = int(dataset_length * 0.75)
# Splitiing the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
clf = RandomForestRegressor(n_estimators=1000)
# Create the model on train dataset
model = clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
data['strategy_returns'] = data['pct move'].shift(-1) * -model.predict(X)
print(model.predict(X_test))
Output:
[4.05371547e-07 4.05371547e-07 4.05371547e-07 ... 4.05371547e-07
4.05371547e-07 4.05371547e-07]
The output is stationary while the y data is this:
0 -0.0002
1 0.0000
2 -0.0002
3 0.0002
4 0.0003
...
29583 0.0014
29584 0.0010
29585 0.0046
29586 0.0018
29587 0.0002
x-data:
XPSpread stdev300min
0 1.0 0.0002
1 1.0 0.0002
2 1.0 0.0002
3 1.0 0.0002
4 1.0 0.0002
... ... ...
29583 6.0 0.0021
29584 6.0 0.0021
29585 19.0 0.0022
29586 9.0 0.0022
29587 30.0 0.0022
Now when I change this problem to a classification problem I do get a relative good prediction of the sign. However when I change it to a regression I get a stationary outcome.
Any suggestions how I can improve this?

It may very well be the case that, with only two features, there is not enough information there for a numeric prediction (i.e. regression); while in a "milder" classification setting (predicting just the sign, as you say) you have some success.
The low number of features is not the only possible issue; judging from the few samples you have posted, one can easily see that, for example, your first 5 samples have identical features ([1.0, 0.0002]), while their corresponding y values can be anywhere in [-0.0002, 0.0003] - and the situation is similar for your samples #29583 & 29584. On the other hand, your samples #3 ([1.0, 0.0002]) and #29587 ([30.0, 0.0022]) look very dissimilar, but they end up having the same y value of 0.0002.
If the rest of your dataset has similar characteristics, it may just not be amenable to a decent regression modeling.
Last but not least, If your data are in any way "ordered" along some feature (they look like, but of course I cannot be sure with that small a sample), the situation is getting worse. What I suggest is to split your data using train_test_split, instead of doing it manually:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, shuffle=True)
which hopefully, due to shuffling, will result in a more favorable split. You may want to remove duplicate rows from the dataframe before shuffling and splitting (they are never a good idea) - see pandas.DataFrame.drop_duplicates.

Related

How to get the threshold from a specific precision and recall

I'm trying to get the threshold for a specific precision and recall. Let's say I want to get the threshold at the precision of 60% and recall of 40%. Are there any straightforward way to do it using the sklearn package?
precision, recall, threshold = precision_recall_curve(y_val, y_e)
df_pr = pd.DataFrame()
df_pr['precision'] = precision
df_pr['recall'] = recall
df_pr['threshold'] = list(threshold) + [1]
precision recall threshold
0 0.247543 1.000000 0.059483
1 0.247486 0.999692 0.059489
2 0.247504 0.999692 0.059512
3 0.247523 0.999692 0.059542
Provided that I've properly understood your question, imo, the point to highlight is that precision and recall are not necessarily coupled as you seem to imply. Here's a toy example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_scores = lr.predict_proba(X_test)
precision, recall, threshold = precision_recall_curve(y_test, y_scores[:, 1])
plt.plot(threshold, precision[:-1], 'b--', label='Precision')
plt.plot(threshold, recall[:-1], 'r--', label='Recall')
plt.xlabel('Threshold')
plt.legend(loc='lower left')
plt.ylim([0,1])
This said, the problem becomes something you can easily solve either with numpy or pandas, depending on your "setting". For instance, here's a toy function returning precision, recall and threshold at the index where the condition is attained.
def prt(arr, value):
array = np.asarray(arr)
idx = np.where(array[:-1] == value)[0][0]
return precision[idx], recall[idx], threshold[idx]
prt(precision, 0.6) # I checked ex-ante that precision=0.6 is attained. Differently you'll have to go with something custom.
(0.6, 0.9622641509433962, 0.052229434776723364)
Otherwise, to resemble your setting with a pandas DataFrame:
df = pd.DataFrame()
df['precision'] = precision[:-1]
df['recall'] = recall[:-1]
df['threshold'] = threshold
df[df.loc[:, 'precision'] == 0.6]
I would suggest you sklearn precision_recall_curve and threshold that tries to explain how .precision_recall_curve() works under the hood and Why does precision_recall_curve() return different values than confusion matrix? which might be somehow related.

How to split data into 3 parts, one of which wont be used? [duplicate]

This question already has answers here:
How to split data into 3 sets (train, validation and test)?
(11 answers)
Closed 4 years ago.
I've got a csv that I want to split 80% into training, 10% into dev-test and 10% into test set. The dev-test wont be used further.
I've got it set up like:
import sklearn
import csv
with open('Letter.csv') as f:
reader = csv.reader(f)
annotated_data = [r for r in reader]
and for splitting:
import random
random.seed(1234)
random.shuffle(annotated_data)
But all the splitting I've seen only slips into 2 sets, and I can't see where to specify how much partition to split it with, eg I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.
Also once I split it, how do I access the sets separately? For eg I can read each record as a whole and count the amount of entries, but once I split it I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.
No, it's possible in scikit-learn to split into three sets directly.
The typical approach is two split twice.in 80/20 and then split the 20 percent 50/50. You want to check the train_test_split-function.
Essentially, the code with data X and y could look like this:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(100).reshape((5, 2)), range(5)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)
Now you would want to work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test)
You can use train_test_split twice:
Split the data into a ratio 0.8 : 0.2
Split the smaller set into a ratio 0.5 : 0.5

h2o vs scikit learn confusion matrix

Anyone able to match the sklearn confusion matrix to h2o?
They never match....
Doing something similar with Keras produces a perfect match.
But in h2o they are always off. Tried it every which way...
Borrowed some code from:
Any difference between H2O and Scikit-Learn metrics scoring?
# In[30]:
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# In[31]:
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# In[32]:
# Generate predictions on a test set
pred = model.predict(test)
# In[33]:
from sklearn.metrics import roc_auc_score, confusion_matrix
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
#pred_df.head()
# In[36]:
y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))
# In[37]:
print(cm)
0 1
0 1354 961
1 540 2145
# In[38]:
model.model_performance(test).confusion_matrix()
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.353664307031828:
0 1 Error Rate
0 964.0 1351.0 0.5836 (1351.0/2315.0)
1 274.0 2411.0 0.102 (274.0/2685.0)
Total 1238.0 3762.0 0.325 (1625.0/5000.0)
# In[39]:
h2o.cluster().shutdown()
This does the trick, thx for the hunch Vivek. Still not an exact match but extremely close.
perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
I also meet the same issue. Here is what I would do to make a fair comparison:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)
Since we train the model using training data and validation data, then the pred['predict'] will use the threshold which maximizes the F1 score of validation data. To make sure, one can use these lines:
threshold = perf.find_threshold_by_max_metric(metric='F1', valid=True)
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)
To get another confusion matrix from scikit learn:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true, pred_df['predict'])
In my case, I don't understand why I get slightly different results. Something like, for example:
print(cm1)
>> [[3063 176]
[ 94 146]]
print(cm2)
>> [[3063 176]
[ 95 145]]

ValueError: Unknown label type: while implementing MLPClassifier

I have dataframe with columns Year, month, day,hour, minute, second, Daily_KWH. I need to predict Daily KWH using neural netowrk. Please let me know how to go about it
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
I'm getting the Value Error, when I'm fitting the model.
code so far:
X = df[['year','month','day','hour','minute','second']]
y = df['Daily_KWH_System']
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
#y_train.shape
#X_train.shape
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
#y_train = np.asarray(df['Daily_KWH_System'], dtype="|S6")
mlp.fit(X_train,y_train)
Error:
ValueError: Unknown label type: (array([ 2.27016856e+02, 3.02173014e+03, 4.29404190e+03,
2.41273427e+02, 1.76714247e+02, 4.23374425e+03,
First of all, this is a regression problem and not a classification problem, as the values in the Daily_KWH_System column do not form a set of labels. Instead, they seem to be (at least based on the provided example) real numbers.
If you want to approach it as a classification problem regardless, then according to sklearn documentation:
When doing classification in scikit-learn, y is a vector of integers
or strings.
In your case, y is a vector of floats, and therefore you get the error. Thus, instead of the line
y = df['Daily_KWH_System']
write the line
y = np.asarray(df['Daily_KWH_System'], dtype="|S6")
and this will resolve the issue. (You can read more about this approach here: Python RandomForest - Unknown label Error)
Yet, as regression is more appropriate in this case, then instead of the above change, replace the lines
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
with
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
The code will run without throwing an error (but there certainly isn't enough data to check whether the model that we get performs well).
With that being said, I don't think that this is the right approach for choosing features for this problem.
In this problem we deal with a sequence of real numbers that form a time series. One reasonable feature that we could choose is the number of seconds (or minutes\hours\days etc) that passed since the starting point. Since this particular data contains only days, months and years (other values are always 0), we could choose as a feature the number of days that passed since the beginning. Then your data frame will look like:
Daily_KWH_System days_passed
0 4136.900384 0
1 3061.657187 1
2 4099.614033 2
3 3922.490275 3
4 3957.128982 4
You could take the values in the column days_passed as features and the values in Daily_KWH_System as targets. You may also add some indicator features. For example, if you think that the end of the year may affect the target, you can add an indicator feature that indicates whether the month is December or not.
If the data is indeed daily (at least in this example you have one data point per day) and you want to tackle this problem with neural networks, then another reasonable approach would be to handle it as a time series and try to fit recurrent neural network. Here are couple of great blog posts that describe this approach:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
The fit() function expects y to be 1D list. By slicing a Pandas dataframe you always get a 2D object. This means that for your case, you need to convert the 2D object you got from slicing the DataFrame into an actual 1D list, as expected by fit function:
y = list(df['Daily_KWH_System'])
Use a regressor instead. This will solve float 2D data issue.
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(solver='lbfgs',alpha=0.001,hidden_layer_sizes=(10,10))
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
Instead of
mlp.fit(X_train,y_train)
use this
mlp.fit(X_train,y_train.values)

Accelerating the prediction

Prediction with the SVM model created with 5 features and 3000 samples using default parameters is taking unexpectedely longer time (more than hour) with 5 features and 100000 samples. Is there way of accelerating the prediction?
A few issues to consider here:
Have you standardized your input matrix X? SVM is not scale-invariant, so it could be difficult for the algo to do classification if they takes a large number of raw inputs without proper scaling.
The choice of parameter C: Higher C allows a more complicated non-smooth decision boundary and it takes much more time to fit under this complexity. So decreasing the value C from default 1 to a lower value could accelerate the process.
It's also recommended to choose a proper value of gamma. This could be done via Grid-Search-Cross-Validation.
Here is the code to do grid-search cross validation. I ignore the test set here for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421
Note: the SVC is still not running very fast. It takes more than 40s to compute 50 possible combinations of params.
%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s
Because the number of features is relatively low, I would start with decreasing the penalty parameter. It controls the penalty for mislabeled samples in the train data, and as your data contains 5 features, I guess it is not exactly linearly separable.
Generally, this parameter (C) allows the classifier to have larger margin on the account of higher accuracy (see this for more information)
By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.
One reason might be that the parameter gamma is not the same.
By default sklearn.svm.SVC uses RBF kernel and gamma is 0.0, in which case 1/n_features will be used instead. So gamma is different given different number of features.
In terms of suggestions, I agree with Jianxun's answer.

Resources