Stratifying folds with StratifiedKFold in sklearn - scikit-learn

I do not quite understand the logic behind sklearn's train_test_split and StratifiedKFold for obtaining splits that are balanced according to multiple "columns", and not only according to the target distribution. I know the previous sentence is a bit obscure, so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset whose target distribution is governed by prob; in this case we have 20% positive examples. There is also a binary categorical column cat, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
With train_test_split(), stratifying on both target and cat and then studying the frequencies, we get:
from sklearn.model_selection import train_test_split, StratifiedKFold

print("* dataset")
print(ds["target"].value_counts() / len(ds))
print(ds[["target", "cat"]].value_counts() / len(ds))

# with train_test_split
training, valid = train_test_split(range(n_samples),
                                   test_size=20,
                                   stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training))  # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))  # balanced
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except ValueError:
    print("! does not work")

for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain with StratifiedKFold what I got with train_test_split? I know there may be data distributions that do not allow such stratification in k-fold cross-validation, but I cannot understand why train_test_split accepts two or more columns while the other method does not.

This doesn't seem readily possible currently.
Multilabel stratification isn't exactly what you're looking for, but it is related. That's been asked here before, and was also an issue on sklearn's GitHub (not sure why it got closed).
As a bit of a hack, you should be able to just combine your two columns into a new one of ordered pairs, and stratify on that.
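A minimal sketch of that hack, reusing the ds frame and the StratifiedKFold import from the question (the underscore-joined label is just one illustrative way to encode the pairs):
combined = ds["target"].astype(str) + "_" + ds["cat"].astype(str)

skf = StratifiedKFold(n_splits=5)
for train, valid in skf.split(X=ds[["f1", "f2"]], y=combined):
    # each fold should now roughly preserve the joint target/cat proportions
    print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))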

How to apply accuracy_score function to two columns in group by

I have the following data frame:
wn Ground_truth Prediction
A 1 1
A 1 1
A 1 0
A 1 1
B 0 1
B 1 1
B 0 0
For each group (A, B) I would like to calculate accuracy_score(Ground_truth, Prediction).
Specifically for accuracy you can actually do something simpler:
df.assign(correct=df['Ground_truth'] == df['Prediction']).groupby('wn')['correct'].mean()
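With the sample frame above this should give 0.75 for group A (3 of 4 rows match) and about 0.667 for group B (2 of 3 rows match).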
You can use the accuracy_score function from sklearn (see its documentation for details):
from sklearn.metrics import accuracy_score
ground_truth = df["Ground_truth"].values
predictions = df["Prediction"].values
accuracy = accuracy_score(ground_truth, predictions)
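The snippet above computes a single overall accuracy. To get one score per group as the question asks, a minimal sketch wrapping accuracy_score in a groupby (column names taken from the question's frame):
from sklearn.metrics import accuracy_score

per_group = df.groupby('wn').apply(
    lambda g: accuracy_score(g['Ground_truth'], g['Prediction']))
print(per_group)  # one accuracy value per wn group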

Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module

I was trying automatic feature engineering and selection, and for that I used the Boston house price dataset available in sklearn.
from sklearn.datasets import load_boston
import pandas as pd

data = load_boston()
x = data.data
y = data.target
y = pd.DataFrame(y)
Then I applied the autofeat feature transformation library to the dataset.
import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)
After this, I used another function to score each feature in relation to the label.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))
This gave the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
1 from sklearn.feature_selection import SelectKBest, chi2
2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
4 dfscores = pd.DataFrame(X_new.scores_)
5 dfcolumns = pd.DataFrame(X_new_done.columns)
ValueError: Input X must be non-negative.
I had a few negative numbers in my dataset. So how can I overcome this problem?
Note: df has no transformations of y; it only contains transformations of x.
You have a feature with all negative values:
df['exp(x005)*log(x000)']
returns
0 -3630.638503
1 -2212.931477
2 -4751.790753
3 -3754.508972
4 -3395.387438
...
501 -2022.382877
502 -1407.856591
503 -2998.638158
504 -1973.273347
505 -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
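A quick way to list every column that contains at least one negative value (a sketch assuming the df from the question):
print(df.columns[(df < 0).any()])  # columns with at least one negative entry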
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message Input X must be non-negative says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. This is logical, because the chi-square test assumes a frequency distribution, and a frequency can't be negative. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.
In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to [0, 1] interval as suggested by EdChum.
If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
sklearn.feature_selection.f_regression computes ANOVA f-value
sklearn.feature_selection.mutual_info_classif computes the mutual information
Since the whole point of this procedure is to prepare the features for another method, it's not a big deal to pick any one of them; the end result is usually the same or very close.
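A minimal sketch of both workarounds, assuming the df and y from the question (MinMaxScaler shifts every feature into [0, 1] so chi2 accepts it; f_regression has no non-negativity requirement):
from sklearn.feature_selection import SelectKBest, chi2, f_regression
from sklearn.preprocessing import MinMaxScaler

# Option 1: rescale the features to [0, 1], then score with chi2 as before
df_scaled = MinMaxScaler().fit_transform(df)
chi2_scores = SelectKBest(chi2, k=20).fit(df_scaled, y.values.ravel()).scores_

# Option 2: switch to a score function that accepts negative values
f_scores = SelectKBest(f_regression, k=20).fit(df, y.values.ravel()).scores_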

RandomForestRegressor spitting out 1 prediction only

I am trying to work with the RandomForestRegressor. With the RandomForestClassifier I was able to get variable outcomes of +/-1; however, using the RandomForestRegressor I only get a constant value when I try to predict.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv(r'C:\H\XPA.csv')
data['pct move'] = data['XP MOVE']

# Features construction
data.dropna(inplace=True)

# X is the input variable
X = data[['XPSpread', 'stdev300min']]
# y is the target or output variable
y = data['pct move']

# Total dataset length
dataset_length = data.shape[0]
# Training dataset length
split = int(dataset_length * 0.75)

# Splitting the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

clf = RandomForestRegressor(n_estimators=1000)
# Create the model on train dataset
model = clf.fit(X_train, y_train)

data['strategy_returns'] = data['pct move'].shift(-1) * -model.predict(X)
print(model.predict(X_test))
Output:
[4.05371547e-07 4.05371547e-07 4.05371547e-07 ... 4.05371547e-07
4.05371547e-07 4.05371547e-07]
The output is constant, while the y data looks like this:
0 -0.0002
1 0.0000
2 -0.0002
3 0.0002
4 0.0003
...
29583 0.0014
29584 0.0010
29585 0.0046
29586 0.0018
29587 0.0002
x-data:
XPSpread stdev300min
0 1.0 0.0002
1 1.0 0.0002
2 1.0 0.0002
3 1.0 0.0002
4 1.0 0.0002
... ... ...
29583 6.0 0.0021
29584 6.0 0.0021
29585 19.0 0.0022
29586 9.0 0.0022
29587 30.0 0.0022
Now, when I change this problem to a classification one, I do get a relatively good prediction of the sign. However, when I change it to a regression I get a constant outcome.
Any suggestions how I can improve this?
It may very well be the case that, with only two features, there is not enough information there for a numeric prediction (i.e. regression); while in a "milder" classification setting (predicting just the sign, as you say) you have some success.
The low number of features is not the only possible issue; judging from the few samples you have posted, one can easily see that, for example, your first 5 samples have identical features ([1.0, 0.0002]), while their corresponding y values can be anywhere in [-0.0002, 0.0003] - and the situation is similar for your samples #29583 & 29584. On the other hand, your samples #3 ([1.0, 0.0002]) and #29587 ([30.0, 0.0022]) look very dissimilar, but they end up having the same y value of 0.0002.
If the rest of your dataset has similar characteristics, it may just not be amenable to a decent regression modeling.
Last but not least, if your data are in any way "ordered" along some feature (they look like they are, but of course I cannot be sure with that small a sample), the situation gets worse. What I suggest is to split your data using train_test_split, instead of doing it manually:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True)
which hopefully, due to the shuffling, will result in a more favorable split. You may also want to remove duplicate rows from the dataframe before shuffling and splitting (they are never a good idea); see pandas.DataFrame.drop_duplicates.
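A short sketch of that de-duplication step, assuming the data frame and column names from the question:
# drop rows whose features and target are all identical, keeping the first
data = data.drop_duplicates(subset=['XPSpread', 'stdev300min', 'pct move'])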

KMeans clustering - ValueError: n_samples=1 should be >= n_clusters

I am running an experiment with three time-series datasets of different characteristics, in the following format:
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
The first column is a timestamp. For reproducibility, I am sharing the data here. From column 2, I read the current row and compare it with the value of the previous row: if it is greater, I keep comparing; if the current value is smaller than the previous row's value, I divide the current value (smaller) by the previous value (larger). Accordingly, here is the code:
import numpy as np
import matplotlib.pyplot as plt

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    plt.figure()
    plt.clf()
    plt.plot(quotient_times, quotient, ".", label=protname, color="blue")
    plt.ylim(0, 1.0001)
    plt.title(protname)
    plt.xlabel("time")
    plt.ylabel("quotient")
    plt.legend()
    plt.show()
And this produces the following three points - one for each dataset I shared.
As we can see from the plots produced by the code above, data1 is pretty consistent, with values around 1; data2 has two quotients (concentrating either around 0.5 or 0.8); and the values of data3 concentrate around two values (either around 0.5 or 0.7). This way, given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, building each dataset by stacking the two transformed features quotient and quotient_times. I am trying KMeans clustering as follows:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?
Update: sample quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129,
0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 ,
0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581,
0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
As is, your quotient variable is now one single sample; here I get a different error message, probably due to different Python/scikit-learn version, but the essence is the same:
import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)
This gives the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[0.7 0.7 0.4973262 0.7008547 0.71287129 0.704
0.49723757 0.49723757 0.70676692 0.5 0.5 0.70754717
0.5 0.49723757 0.70322581 0.5 0.49723757 0.49723757
0.5 0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
which, despite the different wording, is not different from yours - essentially it says that your data look like a single sample.
Following the first advice (i.e. considering that quotient contains a single feature/column) resolves the issue:
k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
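As a quick follow-up under the same setup, the fitted object then exposes one cluster label per quotient value:
labels = k_means.labels_  # array of length 20 with values in {0, 1, 2}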
Please try the code below. A brief explanation of what I've done:
First I built the dataset sample = np.vstack((quotient_times, quotient)).T and standardized it, so it would become easier to cluster. Next, I applied DBSCAN with multiple hyperparameters (eps and min_samples) until I found the ones that separated the points best. Finally, I plotted the data with its respective labels; since you are working with 2-dimensional data, it is easy to visualize how good the clustering is.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

dataset = np.empty((0, 2))
for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    sample = np.vstack((quotient_times, quotient)).T
    dataset = np.append(dataset, sample, axis=0)

scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)

dbscan = DBSCAN(eps=0.6, min_samples=1)
dbscan.fit(dataset)

colors = [i for i in dbscan.labels_]

plt.figure()
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.show()
You are trying to make 3 clusters while you effectively have only 1 sample (n_samples=1), because the 1D array is interpreted as a single sample. You could try:
increasing the number of samples,
decreasing the number of clusters, or
reshaping the array (as shown in the answer above).
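For completeness, a hedged sketch of the asker's stated goal, running KMeans on the two stacked features rather than on the 1D quotient alone (variable names reused from the question's loop):
import numpy as np
from sklearn.cluster import KMeans

sample = np.vstack((quotient_times, quotient)).T  # shape (n_samples, 2)
k_means = KMeans(n_clusters=3, random_state=0).fit(sample)
print(k_means.labels_)  # cluster assignment for each (time, quotient) pair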

Ensemble model in H2O with fold_column argument

I am new to H2O in Python. I am trying to model my data with a stacked ensemble, following the example code on H2O's website. (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html)
I applied GBM and RF as base models and then tried to merge them into an ensemble via stacking. In addition, in my training data I created one additional column named 'fold' to be used as fold_column="fold".
I applied 10-fold CV and observed that I only received results from the first fold; the predictions from the other 9 folds are empty. What am I missing here?
Here is my sample data and code:
from __future__ import print_function  # must come before the other imports

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init(port=23, nthreads=6)

train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)

x = train.drop(['Date', 'EQUITY', 'fold'], axis=1).columns
y = 'EQUITY'

cat_cols = ['A', 'B', 'C', 'D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()

my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
                                      ntrees=10,
                                      max_depth=3,
                                      min_rows=2,
                                      learn_rate=0.2,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold")
Then, when I check the CV results with my_gbm.cross_validation_predictions(), only the first fold's predictions are populated. Also, when I run the ensemble on the test set I get the warning below:
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)
pred = ensemble.predict(test)
pred
/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
  warnings.warn(w)
Am I missing something about fold_column?
Here is an example of how to use a custom fold column (created from a list). This is a modified version of the example Python code in the Stacked Ensemble page in the H2O User Guide.
from __future__ import print_function  # must come before the other imports

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, the response should be a factor
train[y] = train[y].asfactor()

# Add a fold column, generated from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)

# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")

# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
To answer your second question about how to view the cross-validated predictions of a model: they are stored in two places, but the method that you probably want is .cross_validation_holdout_predictions(). It returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:
In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
1 0.248131 0.751869
1 0.288241 0.711759
1 0.407768 0.592232
1 0.507294 0.492706
0 0.6417 0.3583
1 0.253329 0.746671
1 0.289916 0.710084
1 0.524328 0.475672
1 0.252006 0.747994
[10000 rows x 3 columns]
The second method, .cross_validation_predictions(), is a list that stores the predictions from each fold in an H2OFrame with the same number of rows as the original training frame, where the rows that are not active in that fold have a value of zero. This is usually not the format people find most useful, so I'd recommend using the other method instead.
In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list
In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10
In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
predict p0 p1
--------- -------- --------
1 0.323155 0.676845
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
[10000 rows x 3 columns]
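As a quick usage sketch (hedged; the frame and model names follow the example above, and the cv_pred column is just illustrative), the holdout predictions line up row-for-row with the training frame, so you can attach them directly:
# one cross-validated prediction per training row, in the original order
cv_preds = my_gbm.cross_validation_holdout_predictions()
train['cv_pred'] = cv_preds['predict']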
