Split Train dataset based on labels - python-3.x

I would like to know how to split my multi-class training dataset with a specific label ratio, such as 80% (class_2), 15% (class_1) and 5% (class_0).
I have a balanced dataset. I originally split the pandas dataset into 80% train and 20% test via:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
However, I want to further specify the label ratio within the test set as 80% (class_2), 15% (class_1) and 5% (class_0). How can this be accomplished?
Here is a snippet of my dataset:
Feat1 Feat2 Feat3 Feat4 Feat5 Label
-58.422504 37.966175 -4.8636584 1.6725544 1.9571232 0
-16.001776 12.794211 -1.1406443 1.3552929 -3.1035073 1
-35.907864 19.15079 -1.4540794 4.7229285 -1.3495653 0
-40.63919 11.879825 0.26731083 4.509876 -0.3005377 1
-82.577805 38.87009 -0.6941721 0.41522327 -3.7065275 0
-91.21994 13.109437 -7.270507 2.081625 -4.206697 0
-47.69479 17.02262 -24.102415 -0.9498974 -6.126767 2
-76.956795 17.869856 -1.6058419 4.2835464 -1.3354894 0
-52.443146 46.593403 -3.4466643 1.1810641 -1.9001787 2
-67.86523 14.28042 0.71933913 2.1071763 1.3627108 1
-47.336437 9.525495 -20.755278 6.523259 -3.422134 2
-42.978676 12.458537 0.07322929 1.3635784 0.09735282 1
-24.21139 38.562397 0.042716235 6.6496754 -1.9689865 2
-48.612396 11.766575 -0.748889 3.8106124 2.109056 1
-49.890644 14.508443 0.36204648 1.7602062 -0.42747113 1
-58.165733 18.751013 -3.8809242 5.257564 -1.4671975 0
-31.926224 8.061624 -0.9180617 3.1844578 1.3856677 1
-49.51432 13.603332 1.1162373 0.88059276 0.8680044 1
-38.187065 22.042477 -9.74126 3.464233 -1.4608487 2
-36.763634 11.885029 -0.3559528 1.2861489 -0.006563603 1
-59.474194 17.596613 -13.849893 2.5668569 -7.367901 2
-20.775812 8.021951 -5.8948507 -1.76145 -3.0236924 1
-44.744774 42.550343 -2.8213162 1.496162 -5.367485 2
-59.297913 15.10593 -15.805616 -0.8902338 -2.0228894 2
-43.05664 17.326857 -21.520315 -0.544733 -5.821276 2
-113.831566 10.970723 -1.0806333 2.6965592 -0.50331205 0
-67.71741 37.033604 -7.5146904 4.7712235 -0.88289934 0
-51.200836 20.278473 -9.158655 4.746186 -5.2653203 2
-43.760933 13.239898 -5.1588607 2.5003295 -2.2052805 0
-53.52218 12.309539 -0.24887963 4.237159 0.52248794 0
How do I correctly split the training dataset based on the label values at specific ratios?
Thanks for your help and time!

Sampling might be better for your purpose:
import numpy as np
# shuffle, then cut at the 5% and 20% marks -> chunks of 5%, 15% and 80% of the rows
class_0, class_1, class_2 = np.split(df.sample(frac=1, random_state=42),
                                     [int(.05 * len(df)), int(.20 * len(df))])
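If the goal is a test set with a fixed label mix (80% class 2, 15% class 1, 5% class 0), here is a minimal sketch using per-label sampling; it assumes the frame is called df and the label column is named Label, as in the snippet above:
import pandas as pd
test_size = int(0.2 * len(df))             # overall size of the test set
mix = {2: 0.80, 1: 0.15, 0: 0.05}          # desired share of each label in the test set
test_df = pd.concat([
    df[df['Label'] == label].sample(n=int(share * test_size), random_state=42)
    for label, share in mix.items()
])
train_df = df.drop(test_df.index)          # everything else stays in the training set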

Related

ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2 ... 1387 1388 1389], got [0 1 2 ... 18609 24127 41850]

Situation: I am trying to use the XGBoost classifier, but this error pops up:
"ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2 ... 1387 1388 1389], got [0 1 2 ... 18609 24127 41850]".
Unlike this solved question (Invalid classes inferred from unique values of `y`. Expected: [0 1 2 3 4 5], got [1 2 3 4 5 6]), my scenario seems to be different and not just a matter of labels not starting from 0.
Code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = data_concat
y = data_concat[['forward_count','comment_count','like_count']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=72)  # Train/test split
print('Train set:', X_train.shape, y_train.shape)  # Check the sizes after the split
print('Test set:', X_test.shape, y_test.shape)
xgb = XGBClassifier()
clf = xgb.fit(X_train, y_train, eval_metric='auc')  # HERE IS WHERE I GET THE ERROR
The DataFrame and its info are shown as images in the original post.
I have tried different choices of y; when y has fewer or more columns, the list "[0 1 2 ... 1387 1388 1389]" shrinks or expands accordingly.
If you need further info, please let me know. Appreciate your help :)
You need to transform the y_train values to fit XGBoost: it expects class labels that start from 0 (consecutive integers 0, 1, 2, ...), not your raw label values.
Here is the code:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
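If you later need the original label values back from the model's 0-based predictions, the same encoder can invert the mapping. A small sketch, reusing the le fitted above (y_pred is a hypothetical name for the model's output):
y_pred = clf.predict(X_test)
y_pred_original = le.inverse_transform(y_pred)  # map encoded classes back to the original label values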

How to apply accuracy_score function to two columns in group by

I have the following data frame:
wn Ground_truth Prediction
A 1 1
A 1 1
A 1 0
A 1 1
B 0 1
B 1 1
B 0 0
For each group (A, B) I would like to calculate accuracy_score(Ground_truth, Prediction).
Specifically for accuracy you can actually do something simpler:
df.assign(x=df['Ground_truth'] == df['Prediction']).groupby('wn')['x'].mean()
You can use the accuracy_score function from sklearn; you can check its documentation here.
from sklearn.metrics import accuracy_score
ground_truth = df["Ground_truth"].values
predictions = df["Prediction"].values
accuracy = accuracy_score(ground_truth, predictions)
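To get one score per group rather than a single overall accuracy, one option (a sketch, assuming the frame is named df as above) is to apply accuracy_score inside a groupby:
per_group_accuracy = df.groupby('wn').apply(
    lambda g: accuracy_score(g['Ground_truth'], g['Prediction'])
)
print(per_group_accuracy)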

RandomForestRegressor spitting out 1 prediction only

I am trying to work with RandomForestRegressor. With RandomForestClassifier I was able to get varying outcomes of +/-1, but with RandomForestRegressor I only get a constant value when I try to predict.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pandas_datareader import data
import csv
import statsmodels.api as sm
data = pd.read_csv(r'C:\H\XPA.csv')
data['pct move']=data['XP MOVE']
# Features construction
data.dropna(inplace=True)
# X is the input variable
X = data[[ 'XPSpread', 'stdev300min']]
# Y is the target or output variable
y = data['pct move']
# Total dataset length
dataset_length = data.shape[0]
# Training dataset length
split = int(dataset_length * 0.75)
# Splitting the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
clf = RandomForestRegressor(n_estimators=1000)
# Create the model on train dataset
model = clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
data['strategy_returns'] = data['pct move'].shift(-1) * -model.predict(X)
print(model.predict(X_test))
Output:
[4.05371547e-07 4.05371547e-07 4.05371547e-07 ... 4.05371547e-07
4.05371547e-07 4.05371547e-07]
The output is constant while the y data looks like this:
0 -0.0002
1 0.0000
2 -0.0002
3 0.0002
4 0.0003
...
29583 0.0014
29584 0.0010
29585 0.0046
29586 0.0018
29587 0.0002
x-data:
XPSpread stdev300min
0 1.0 0.0002
1 1.0 0.0002
2 1.0 0.0002
3 1.0 0.0002
4 1.0 0.0002
... ... ...
29583 6.0 0.0021
29584 6.0 0.0021
29585 19.0 0.0022
29586 9.0 0.0022
29587 30.0 0.0022
Now, when I turn this into a classification problem I get a relatively good prediction of the sign. However, when I frame it as a regression I get a constant outcome.
Any suggestions on how I can improve this?
It may very well be the case that, with only two features, there is not enough information there for a numeric prediction (i.e. regression); while in a "milder" classification setting (predicting just the sign, as you say) you have some success.
The low number of features is not the only possible issue; judging from the few samples you have posted, one can easily see that, for example, your first 5 samples have identical features ([1.0, 0.0002]), while their corresponding y values can be anywhere in [-0.0002, 0.0003] - and the situation is similar for your samples #29583 & 29584. On the other hand, your samples #3 ([1.0, 0.0002]) and #29587 ([30.0, 0.0022]) look very dissimilar, but they end up having the same y value of 0.0002.
If the rest of your dataset has similar characteristics, it may just not be amenable to a decent regression modeling.
Last but not least, if your data are in any way "ordered" along some feature (they look like they might be, but of course I cannot be sure from that small a sample), the situation gets worse. What I suggest is to split your data using train_test_split instead of doing it manually:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
which hopefully, due to shuffling, will result in a more favorable split. You may want to remove duplicate rows from the dataframe before shuffling and splitting (they are never a good idea) - see pandas.DataFrame.drop_duplicates.
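For example, a one-line sketch, assuming it is applied to the data dataframe before X and y are built:
data = data.drop_duplicates()  # drop fully duplicated rows before shuffling and splitting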

Ensemble model in H2O with fold_column argument

I am new to H2O in Python. I am trying to model my data with a stacked ensemble, following the example code from H2O's website (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html).
I applied GBM and RF as base models and then used stacking to merge them into an ensemble model. In addition, in my training data I created one extra column named 'fold' to be used as fold_column = "fold".
I applied 10-fold CV and observed that I only get results from the first fold; the predictions from the other 9 folds are empty. What am I missing here?
Here is my sample data (shown as an image in the original post) and my code:
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init(port=23, nthreads=6)
train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)
x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'
cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()
my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
                                      ntrees=10,
                                      max_depth=3,
                                      min_rows=2,
                                      learn_rate=0.2,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold")
Then I check the CV results with my_gbm.cross_validation_predictions() (output shown in the original post). Also, when I try the ensemble on the test set I get the warning below:
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
pred = ensemble.predict(test)
pred
/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
warnings.warn(w)
Am I missing something about fold_column?
Here is an example of how to use a custom fold column (created from a list). This is a modified version of the example Python code in the Stacked Ensemble page in the H2O User Guide.
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Add a fold column, generate from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)
# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
To answer your second question about how to view the cross-validated predictions of a model: they are stored in two places, but the method you probably want is .cross_validation_holdout_predictions(). It returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:
In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
1 0.248131 0.751869
1 0.288241 0.711759
1 0.407768 0.592232
1 0.507294 0.492706
0 0.6417 0.3583
1 0.253329 0.746671
1 0.289916 0.710084
1 0.524328 0.475672
1 0.252006 0.747994
[10000 rows x 3 columns]
The second method, .cross_validation_predictions(), returns a list that stores the predictions from each fold in an H2OFrame with the same number of rows as the original training frame, where rows that are not active in that fold have a value of zero. This is not usually the format people find most useful, so I'd recommend the other method instead.
In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list
In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10
In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
[10000 rows x 3 columns]

TensorFlow - Simple Feed Forward NN not training

I'm new to TensorFlow and just constructed my first very small network!! My code runs, but it has the same accuracy all along; it doesn't change with training. My data has 15 features and 6 classes. Maybe I'll add more features if that makes it easier and better. In short, my question is:
What's a general procedure for debugging TensorFlow code?
My network architecture was determined arbitrarily, so maybe I should change the number of neurons per layer; I'm not entirely sure.
import tensorflow as tf

sess1 = tf.Session()
num_predictors = len(training_predictors_tf.columns)
num_classes = len(training_classes_tf.columns)
feature_data = tf.placeholder(tf.float32, [None, num_predictors])
actual_classes = tf.placeholder(tf.float32, [None, num_classes])
weights1 = tf.Variable(tf.truncated_normal([num_predictors, 50], stddev=0.0001))
biases1 = tf.Variable(tf.ones([50]))
weights2 = tf.Variable(tf.truncated_normal([50, 45], stddev=0.0001))
biases2 = tf.Variable(tf.ones([45]))
weights3 = tf.Variable(tf.truncated_normal([45, 25], stddev=0.0001))
biases3 = tf.Variable(tf.ones([25]))
weights4 = tf.Variable(tf.truncated_normal([25, num_classes], stddev=0.0001))
biases4 = tf.Variable(tf.ones([num_classes]))
hidden_layer_1 = tf.nn.relu(tf.matmul(feature_data, weights1) + biases1)
hidden_layer_2 = tf.nn.relu(tf.matmul(hidden_layer_1, weights2) + biases2)
hidden_layer_3 = tf.nn.relu(tf.matmul(hidden_layer_2, weights3) + biases3)
out = tf.matmul(hidden_layer_3, weights4) + biases4
model = tf.nn.softmax_cross_entropy_with_logits(labels=actual_classes, logits=out)
# cost = -tf.reduce_sum(actual_classes*tf.log(model))
cross_entropy = tf.reduce_mean( model)
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)
# train_step = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cross_entropy)
sess1.run(tf.global_variables_initializer())
correct_prediction = tf.equal(tf.argmax(out, 1), tf.argmax(actual_classes, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
for i in range(1, 30001):
    sess1.run(
        train_step,
        feed_dict={
            feature_data: training_predictors_tf.values,
            actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
        }
    )
    if i % 5000 == 0:
        print(i, sess1.run(
            accuracy,
            feed_dict={
                feature_data: training_predictors_tf.values,
                actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
            }
        ))
And this is my output:
5000 0.3627
10000 0.3627
15000 0.3627
20000 0.3627
25000 0.3627
30000 0.3627
EDIT: I scaled my data as explained here, to the range [-5, 0], but it still does not train the network any better :(
Snippet of unscaled data (one-hot encoding first 6 columns):
2017-06-27 0 0 0 1 0 0 20120.0 20080.0 20070.0 20090.0 ... 20050.0 20160.0 20130.0 20160.0 20040.0 20040.0 20040.0 31753.0 36927.0 41516.0
2017-06-28 0 0 1 0 0 0 20150.0 20120.0 20080.0 20150.0 ... 20060.0 20220.0 20160.0 20130.0 20130.0 20040.0 20040.0 39635.0 31753.0 36927.0
2017-06-29 0 0 0 1 0 0 20140.0 20150.0 20120.0 20140.0 ... 20090.0 20220.0 20220.0 20160.0 20100.0 20130.0 20040.0 50438.0 39635.0 31753.0
2017-06-30 0 1 0 0 0 0 20210.0 20140.0 20150.0 20130.0 ... 20150.0 20270.0 20220.0 20220.0 20050.0 20100.0 20130.0 58983.0 50438.0 39635.0
2017-07-03 0 0 0 1 0 0 20020.0 20210.0 20140.0 20210.0 ... 20140.0 20250.0 20270.0 20220.0 19850.0 20050.0 20100.0 88140.0 58983.0 50438.0
Debugging your network and improving it are two different things. To improve it, once you've chosen a type of classifier (for instance a neural network), you should track the training and validation accuracies and adjust your hyperparameters as a function of both. See, for instance, the chapter "Practical Methodology" of Goodfellow et al.'s book to learn how to tune hyperparameters (it's a bit long, but pure gold!).
As to debugging, this is the harder part. You usually do it by printing some "key tensor" values every once in a while. You clearly have a bug somewhere, or your accuracy would change at least a bit during training. A common problem causing this is exploding gradients, which make NaNs appear very early in training (sometimes infinities, or even zeros in strange places) and then basically stop any updates in your network. I'd suggest printing your loss, and maybe the norm of your gradients; that should tell you whether this is the problem. If so, the quick-and-dirty solution is to start with a smaller learning rate and increase it later. The real solution is to use gradient clipping.
Example of how to print multiple tensor values:
if i % 5000 == 0:
    acc_val, loss_val, predictions = sess1.run(
        [accuracy, cross_entropy, tf.argmax(out, 1)],
        feed_dict={
            feature_data: training_predictors_tf.values,
            actual_classes: training_classes_tf.values.reshape(len(training_classes_tf.values), num_classes)
        })
    print(i, acc_val, loss_val)  # You can also print predictions, but the output will be very big so it'll be harder to see the others. Doing it would let you check whether, for instance, the model always predicts class 0...
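For the gradient-norm check and the clipping mentioned above, here is a minimal sketch using the TF1-style API from this question; it reuses the cross_entropy loss defined earlier, and the 5.0 threshold is an arbitrary choice:
optimizer = tf.train.GradientDescentOptimizer(0.05)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
grads = [g for g, v in grads_and_vars if g is not None]
variables = [v for g, v in grads_and_vars if g is not None]
# Global norm of all gradients; run and print it alongside the loss to spot exploding gradients
grad_norm = tf.global_norm(grads)
# Clip by global norm before applying the update
clipped_grads, _ = tf.clip_by_global_norm(grads, 5.0)
train_step = optimizer.apply_gradients(zip(clipped_grads, variables))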
