Behaviour of train_test_split() from Scikit-learn

I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario:
An imaginary dataset:
id, count, size
1, 4, 8
2, 5, 9
3, 6, 0
say I would divide it into two separate sets like this (keeping 'id' in both):
id, count | id, size
1, 4 | 1, 8
2, 5 | 2, 9
3, 6 | 3, 0
And then split both with train_test_split() using the same random_state of 0. Would the order of both be the same, with 'id' as the reference? (Since you are shuffling the same dataset, just with different columns left out.)
I am curious how this works because I have two models. The first one gets trained with the dataset and adds its results to the dataset, part of which is then used to train the second model.
When doing this, it is important that no data points used to test the generalization of the second model were also used to train the first model. Such data has been 'seen before', and the model will already know what to do with it, so you would no longer be testing generalization to new data.
It would be great if train_test_split() shuffled both sets the same way, since then one would not need to keep track of which data points were used to train the first model in order to prevent contamination of the test results.

They should have the same resulting indices if you use the same random_state parameter in each call.
However, you could also just reverse your order of operations: call train_test_split on the parent dataset, then create the two sub-sets from the resulting train and test sets (a sketch of that follows the example below).
Example:
print(df)
   id  count  size
0   1      4     8
1   2      5     9
2   3      6     0

from sklearn.model_selection import train_test_split

dfa = df[['id', 'count']].copy()
dfb = df[['id', 'size']].copy()

rstate = 123
traina, testa = train_test_split(dfa, random_state=rstate)
trainb, testb = train_test_split(dfb, random_state=rstate)

# The same random_state shuffles the rows identically, so the indices match:
print(traina.index.equals(trainb.index))  # True
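For completeness, here is a minimal sketch of the reversed order of operations suggested above: split the parent dataframe once, then take the column subsets from the resulting train and test frames. Variable names are illustrative.
train, test = train_test_split(df, random_state=rstate)

# Column subsets taken after a single split are guaranteed to contain the same rows,
# so no bookkeeping is needed to keep the two models' test sets aligned.
traina, testa = train[['id', 'count']], test[['id', 'count']]
trainb, testb = train[['id', 'size']], test[['id', 'size']]

print(traina.index.equals(trainb.index))  # True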

Related

What is meant by id's and labels in keras data generator?

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
The link above is the documentation for the custom Keras data generator.
I have a doubt about the "NOTATION" heading in the above link, which says the following:
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let ID be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:
1. Create a dictionary called partition where you gather:
a) in partition['train'] a list of training IDs
b) in partition['validation'] a list of validation IDs
2. Create a dictionary called labels where for each ID of the dataset, the associated label is given by labels[ID]
For example, let's say that our training set contains id-1, id-2 and id-3 with respective labels 0, 1 and 2, with a validation set containing id-4 with label 1. In that case, the Python variables partition and labels look like
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
and
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
I'm really not able to understand what labels and IDs mean here.
For example: say I have a data frame with 1000 columns. Does each row correspond to an ID, i.e. is each ID meant to be just a single data point?
OR
Say I have multiple data frames. Does each data frame represent a different ID?
It also seems that labels is not meant to be the number of classes of the target variable.
I would like a clear understanding of IDs and labels, with some examples.
The mentioned article describes a good practice for organizing your data between training and validation. The idea is to store the row indices of your dataframe (called IDs here) and the corresponding target values (called labels here) in independent objects, so that if the input is transformed you don't lose track of which sample is which.
Here is a basic example using a train/test split
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame([[0.1, 1, 'label_a'], [0.2, 2, 'label_a'], [0.3, 3, 'label_a'], [0.4, 4, 'label_b']], columns=['feature_a', 'feature_b', 'target'])
# df.index.tolist() results in [0, 1, 2, 3] (4 rows)
partitions = dict()
labels = dict()
X_train, X_test, y_train, y_test = train_test_split(df[['feature_a', 'feature_b']], df['target'], test_size=0.25, random_state=42)
partitions['train'] = X_train.index.tolist()
partitions['validation'] = X_test.index.tolist()
# partitions['train'] results in [3, 0, 2]
# partitions['validation'] results in [1]
labels = df['target'].to_dict()
# labels is {0: 'label_a', 1: 'label_a', 2: 'label_a', 3: 'label_b'}
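As a rough follow-up sketch (reusing df, partitions and labels from above), this is how you could rebuild the training inputs and targets from those two objects; the point is that IDs are just row identifiers you can always look up later:
train_ids = partitions['train']                                  # [3, 0, 2]
X_train_rebuilt = df.loc[train_ids, ['feature_a', 'feature_b']]  # rows fetched by ID
y_train_rebuilt = [labels[i] for i in train_ids]                 # ['label_b', 'label_a', 'label_a']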

Using a subset of classes in ImageNet

I'm aware that subsets of ImageNet exist, however they don't fulfill my requirement. I want 50 classes at their native ImageNet resolutions.
To this end, I used torch.utils.data.dataset.Subset to select specific classes from ImageNet. However, it turns out that class labels/indices must be at least 0 and less than num_classes.
Since ImageNet contains 1000 classes, the indices of my selected classes quickly go over 50. How can I reassign the class indices, and do so in a way that still allows for evaluation later down the road?
Is there a more elegant way to select a subset?
I am not sure I understood your conclusion about labels needing to be greater than zero and less than num_classes. The torch.utils.data.Subset helper takes a torch.utils.data.Dataset and a sequence of indices; these correspond to the indices of the data points from the Dataset that you would like to keep in the subset, and they have nothing to do with the classes those points belong to.
Here's how I would approach this:
Load your dataset through torchvision.datasets (custom datasets would work the same way). Here I will demonstrate it with FashionMNIST since ImageNet's data is not made available directly through torchvision's API.
>>> import torchvision
>>> ds = torchvision.datasets.FashionMNIST('.', download=True)
>>> len(ds)
60000
Define the classes you want to select for the subset dataset. And retrieve all indices from the main dataset which correspond to these classes:
>>> targets = [1, 3, 5, 9]
>>> indices = [i for i, label in enumerate(ds.targets) if label in targets]
You have your subset:
>>> from torch.utils.data import Subset
>>> ds_subset = Subset(ds, indices)
>>> len(ds_subset)
24000
At this point, you can use a dictionary to remap the original labels in targets to new indices in [0, len(targets)):
>>> remap = {x: i for i, x in enumerate(targets)}
>>> remap
{1: 0, 3: 1, 5: 2, 9: 3}
For example:
>>> x, y = ds_subset[10]
>>> y, remap[y]  # original label, remapped label
(1, 0)
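If you want the remapping applied automatically during training (and easy to invert again for evaluation), one option is a small wrapper dataset. This is only a sketch; the class name is made up:
from torch.utils.data import Dataset

class RemappedSubset(Dataset):
    """Wraps a dataset and maps its original class labels to 0..num_classes-1."""
    def __init__(self, base, class_map):
        self.base = base
        self.class_map = class_map                            # e.g. {1: 0, 3: 1, 5: 2, 9: 3}
        self.inverse = {v: k for k, v in class_map.items()}   # new -> old, for evaluation

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        return x, self.class_map[y]                           # label now in [0, num_classes)

ds_remapped = RemappedSubset(ds_subset, remap)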

how to maintain natural order when label encoding with scikit learn

I'm trying to fit a decision tree classifier with the scikit-learn module. I have 5 features, and one of them is categorical rather than numerical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv()

labelEncoders = {}
for column in df.dtypes[df.dtypes == 'object'].index:
    labelEncoders[column] = LabelEncoder()
    df[column] = labelEncoders[column].fit_transform(df[column])
    print(labelEncoders[column].inverse_transform([0, 1, 2]))  # ['High', 'Low', 'Normal']
I'm new to ML and I've been reading about the need to encode categorical features before feeding the data frame to the model, and how there are encoding variants like label encoding and one hot encoding.
Now, according to most literature, label encoding should or could be used when the values of the feature can be naturally ordered, for instance, 'Low', 'Normal', 'High'; otherwise one should use one hot encoding so the model doesn't establish a misleading order relationship between the values when there is none that would make sense semantically, for example, 'Brazil', 'Congo', 'Czech Republic'.
So that's where I'm at with the logic behind choosing an encoding strategy, and that's why I'm asking this:
how can I make scikit-learn's LabelEncoder keep the natural order of the values, so it encodes like this:
Low -> 0
Normal -> 1
High -> 2
and NOT the way it's doing it now:
High -> 0
Low -> 1
Normal -> 2
Can this be done at all? Is it actually the encoder's task? Do I have to do it somewhere else before the encoding?
Thanks
You can use pandas' replace function pandas.DataFrame.replace() to explicitly pass in the encodings you want to use. As an example:
import pandas as pd
df = pd.DataFrame(data={
    "ID": [1, 2, 3, 4, 5],
    "Label": ["Low", "High", "Low", "High", "Normal"],
})
print("Original:")
print(df)
label_mapping = {"Low": 0, "Normal": 1, "High": 2}
df = df.replace({"Label": label_mapping})
print("Mapped:")
print(df)
Output:
Original:
   ID   Label
0   1     Low
1   2    High
2   3     Low
3   4    High
4   5  Normal
Mapped:
   ID  Label
0   1      0
1   2      2
2   3      0
3   4      2
4   5      1
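If you would rather keep this inside scikit-learn, OrdinalEncoder accepts an explicit category order, unlike LabelEncoder, which always encodes in sorted (alphabetical) order. A minimal sketch on the same toy column:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Label": ["Low", "High", "Low", "High", "Normal"]})

# Pass the categories in the order you want: Low -> 0, Normal -> 1, High -> 2.
encoder = OrdinalEncoder(categories=[["Low", "Normal", "High"]])
encoded = encoder.fit_transform(df[["Label"]]).ravel()
print(encoded.tolist())  # [0.0, 2.0, 0.0, 2.0, 1.0]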

how to "normalize" vectors values when using Spark CountVectorizer?

CountVectorizer and CountVectorizerModel often create a sparse feature vector that looks like this:
(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])
This basically says the total vocabulary size is 10, the current document has 5 unique elements, and in the feature vector these 5 unique elements sit at positions 0, 1, 4, 6 and 8. One of the elements shows up twice, hence the 2.0 value.
Now, I would like to "normalize" the above feature vector and make it look like this,
(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])
i.e., each value is divided by 6, the total number of all elements together. For example, 0.3333 = 2.0/6.
So is there a way to do this efficiently here?
Thanks!
You can use Normalizer
class pyspark.ml.feature.Normalizer(*args, **kwargs)
Normalize a vector to have unit norm using the given p-norm.
with 1-norm
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer
df = spark.createDataFrame([
    (SparseVector(10, [0, 1, 4, 6, 8], [2.0, 1.0, 1.0, 1.0, 1.0]), )
], ["features"])

Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features |features_norm |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
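If the counts come straight out of CountVectorizer, the two stages can also be chained in a Pipeline. A minimal sketch with an illustrative tokenized input (column names are assumptions, not from the question):
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, Normalizer

docs = spark.createDataFrame([
    (["a", "a", "b", "c"], ),
    (["b", "c", "c", "d"], ),
], ["tokens"])

pipeline = Pipeline(stages=[
    CountVectorizer(inputCol="tokens", outputCol="counts"),
    Normalizer(inputCol="counts", outputCol="tf", p=1.0),  # divide each count by the document's total count
])
pipeline.fit(docs).transform(docs).select("tf").show(truncate=False)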

ValueError: Unknown label type: while implementing MLPClassifier

I have a dataframe with columns year, month, day, hour, minute, second and Daily_KWH_System. I need to predict daily KWH using a neural network. Please let me know how to go about it.
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
I'm getting a ValueError when I fit the model.
Code so far:
X = df[['year','month','day','hour','minute','second']]
y = df['Daily_KWH_System']
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
#y_train.shape
#X_train.shape
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
#y_train = np.asarray(df['Daily_KWH_System'], dtype="|S6")
mlp.fit(X_train,y_train)
Error:
ValueError: Unknown label type: (array([ 2.27016856e+02, 3.02173014e+03, 4.29404190e+03,
2.41273427e+02, 1.76714247e+02, 4.23374425e+03,
First of all, this is a regression problem and not a classification problem, as the values in the Daily_KWH_System column do not form a set of labels. Instead, they seem to be (at least based on the provided example) real numbers.
If you want to approach it as a classification problem regardless, then according to sklearn documentation:
When doing classification in scikit-learn, y is a vector of integers
or strings.
In your case, y is a vector of floats, and therefore you get the error. Thus, instead of the line
y = df['Daily_KWH_System']
write the line
y = np.asarray(df['Daily_KWH_System'], dtype="|S6")
and this will resolve the issue. (You can read more about this approach here: Python RandomForest - Unknown label Error)
Yet, as regression is more appropriate in this case, then instead of the above change, replace the lines
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
with
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
The code will run without throwing an error (but there certainly isn't enough data to check whether the model that we get performs well).
With that being said, I don't think that this is the right approach for choosing features for this problem.
In this problem we deal with a sequence of real numbers that form a time series. One reasonable feature we could choose is the number of seconds (or minutes/hours/days etc.) that have passed since the starting point. Since this particular data contains only days, months and years (the other values are always 0), we could choose as a feature the number of days that have passed since the beginning. Then your data frame will look like:
Daily_KWH_System days_passed
0 4136.900384 0
1 3061.657187 1
2 4099.614033 2
3 3922.490275 3
4 3957.128982 4
You could take the values in the column days_passed as features and the values in Daily_KWH_System as targets. You may also add some indicator features. For example, if you think that the end of the year may affect the target, you can add an indicator feature that indicates whether the month is December or not.
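A rough sketch of how that days_passed feature could be computed with pandas (assuming df holds the columns shown in the question):
import pandas as pd

# pd.to_datetime understands a frame with year/month/day columns.
dates = pd.to_datetime(df[['year', 'month', 'day']])
df['days_passed'] = (dates - dates.min()).dt.days

X = df[['days_passed']]
y = df['Daily_KWH_System']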
If the data is indeed daily (at least in this example you have one data point per day) and you want to tackle this problem with neural networks, then another reasonable approach would be to treat it as a time series and try to fit a recurrent neural network. Here are a couple of great blog posts that describe this approach:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
The fit() function expects y to be a 1D array-like. If your slice of the DataFrame comes back as a 2D object, convert it into a plain 1D list, as expected by the fit function:
y = list(df['Daily_KWH_System'])
Use a regressor instead. This will resolve the issue with float-valued targets.
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(solver='lbfgs', alpha=0.001, hidden_layer_sizes=(10, 10))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Instead of
mlp.fit(X_train, y_train)
use this
mlp.fit(X_train, y_train.values)

Resources