Currently my code looks like:
clf = RandomForestClassifier(n_estimators=10, criterion='entropy')
clf = clf.fit(X, Y)
However X is an array like:
X = [[0, 1], [1, 1]]
I would prefer to use X like:
X = [{'avg': 0, 'stddev': 1}, {'avg': 1, 'stddev': 1}]
Simply because a plotted tree (as described here: http://scikit-learn.org/stable/modules/tree.html#classification ) makes much more sense when you read X[0]['avg'] rather than X[0][0]. Is this possible, using a dictionary or pandas?
You can use the DictVectorizer class to convert such a list of dicts to sparse matrices or dense numpy arrays.
scikit-learn will never use dict objects as its primary internal data structure for storing records, as they are not at all memory efficient compared to numpy arrays or scipy sparse matrices.
Here's a great example by 'larsmans' on how to build a feature dict and use DictVectorizer before fitting a model on the data. Note that the DictVectorizer class returns a scipy.sparse matrix by default (instead of a numpy.ndarray) so that the resulting data structure can still fit in memory. As not all sklearn models support sparse matrices, you may want to pass the sparse=False option to the constructor to obtain a dense array:
dv = DictVectorizer(sparse=False)
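For the data in the question, a minimal sketch could look like the following (the labels Y are made up for illustration, and get_feature_names_out assumes scikit-learn >= 1.0):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

X_dicts = [{'avg': 0, 'stddev': 1}, {'avg': 1, 'stddev': 1}]
Y = [0, 1]                                   # illustrative labels

dv = DictVectorizer(sparse=False)            # dense numpy array instead of scipy.sparse
X = dv.fit_transform(X_dicts)                # columns are ordered by feature name

clf = RandomForestClassifier(n_estimators=10, criterion='entropy')
clf = clf.fit(X, Y)

print(dv.get_feature_names_out())            # ['avg' 'stddev']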
Alternatively, you can specify feature names when using export_graphviz. This will generate
a tree with more meaningful labels at test nodes.
See the feature_names parameter at http://scikit-learn.org/dev/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz
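As a hedged sketch of that alternative, one of the trees of the fitted forest could be exported like this (the file name 'tree.dot' is an arbitrary choice):
from sklearn.tree import export_graphviz

export_graphviz(
    clf.estimators_[0],               # a single DecisionTreeClassifier from the fitted forest
    out_file='tree.dot',              # illustrative output file name
    feature_names=['avg', 'stddev'],  # names matching the columns of X
)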
This is my code:
import numpy as np
from sklearn.metrics import dcg_score
true_relevance = np.asarray([[10]])
scores = np.asarray([[.1]])
dcg_score(true_relevance, scores)
The above code should produce 10 as the dcg_score: the formula from Wikipedia gives 10/log2(2) = 10. But instead I get:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead
Has anyone encountered this?
Since computing DCG on a single element is not meaningful, sklearn requires at least two elements in the corresponding y_true and y_score arrays.
You can check this by exploring the sklearn code (or through debugging): https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/utils/multiclass.py#L158
Like:
true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])
dcg_score(true_relevance, scores)
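As an optional sanity check (not part of the original answer), the two-element result can be reproduced by hand from the DCG formula, the sum of rel_i / log2(rank_i + 1) over the ranking induced by the scores:
import numpy as np
from sklearn.metrics import dcg_score

true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])

# rank items by descending score, then apply rel / log2(rank + 1)
order = np.argsort(scores[0])[::-1]
manual = np.sum(true_relevance[0][order] / np.log2(np.arange(2, true_relevance.shape[1] + 2)))

print(dcg_score(true_relevance, scores))  # ~11.31
print(manual)                             # same value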
I am pretty new to ML and I am studying the k-nearest neighbors classifier using the Python documentation.
I am somewhat confused about the training part. Let's say my training data is some points in 1D:
training = [[1], [4], [3]]
and I want to use the k-nearest neighbors classifier to label them into two "teams":
labels = [[0], [1]]
Why doesn't that make sense?
I get an error that the target values' size does not match the input.
If I instead put labels = [[0], [1], [1]] or labels = [[0], [0], [1]]
it works.
Also, a side-note question: does the permutation of the labels matter?
It looks like you're trying to cluster your data, which is what sklearn.neighbors.NearestNeighbors() is for. sklearn.neighbors.KNeighborsClassifier() is a supervised model which requires supplying the actual class for every observation during training, and can then predict the class of previously unseen data.
However, NearestNeighbors() does not let you limit the number of clusters, IIRC, so you should probably try something like sklearn.cluster.KMeans(n_clusters=2), as in the sketch below.
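A minimal sketch of both options on the 1-D points from the question (the particular labels, n_neighbors value and random_state are just illustrative choices):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

training = np.array([[1], [4], [3]])

# Unsupervised: KMeans finds the two groups itself, no labels required.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(training)
print(kmeans.labels_)             # one cluster id per point, e.g. [1, 0, 0]

# Supervised: KNeighborsClassifier needs exactly one label per training point.
labels = [0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=1).fit(training, labels)
print(knn.predict([[3.6]]))       # predicts [1], the class of the nearest point (4)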
By default, scalers from sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])
#%% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T  # scaling row-wise
x_test = scaler.transform(x_test)  # <-------- Error here
But I am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand what's wrong here. Why does it say it is expecting 2 features, when all of my X arrays (x, x_train and x_test) have 3 features? How can I fix this?
Scalers like MinMaxScaler (and StandardScaler) are stateful: when you fit one, it calculates and saves per-column statistics (the min and max, or the mean and standard deviation); when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: each row gets saved statistics, and then your test set doesn't have the same rows, so transform cannot work correctly (throwing an error if there's a different number of rows, or silently mis-scaling if the number of rows happens to match).
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of every other. So you could just do this kind of transformation before splitting, or use fit_transform on the test set('s transpose).
For L2 normalization of rows, there's a built-in for this: Normalizer (docs). I don't think there's an analogue for row-wise min-max scaling, but you could write a FunctionTransformer to do it, along the lines of the sketch below.
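A minimal sketch of that idea, assuming x_train and x_test are the unscaled splits from the question and that no row is constant (a constant row would give a zero denominator); the helper name minmax_rows is made up:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def minmax_rows(X):
    # scale each row independently to the [0, 1] range
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=1, keepdims=True)
    maxs = X.max(axis=1, keepdims=True)
    return (X - mins) / (maxs - mins)

row_scaler = FunctionTransformer(minmax_rows)    # stateless: fit learns nothing
x_train_scaled = row_scaler.fit_transform(x_train)
x_test_scaled = row_scaler.transform(x_test)     # no shape mismatch, rows scaled independently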
This is possible to do, and I can think of a scenario where it would be useful. Normally, MinMaxScaler would scale each of x, y, and z with respect to the other observations of that feature; that's the column-wise ("series") scaling. Now imagine that instead, you wanted to map each point under the constraint x + y + z = 1. I think this is what the OP is asking for. I have done this in the past, and I will describe how I did it.
You need to treat your individual observations as a column multi-index, i.e. as a higher-dimensional feature. Then you build a pipeline within which the observations are transformed from column-wise to row-wise, after which you do the min/max scaling. This gets you to x + y + z = 1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline, you'll need to use something like the DataframeFunctionTransformer I have seen online, reproduced below. This way you can use pandas functions to reshape the data and merge back in with the indices.
class DataframeFunctionTransformer():
    """Applies an arbitrary function to a DataFrame inside a pipeline."""

    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        # apply the wrapped function to the incoming DataFrame
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        # stateless: nothing to learn
        return self
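A possible usage sketch for the class above (the column names and the simple transpose/scale/transpose round-trip are illustrative; in a real project these steps would be wired into a Pipeline as described):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
df = pd.DataFrame(x, columns=["x", "y", "z"])

to_rows = DataframeFunctionTransformer(lambda d: d.T)   # observations become columns
transposed = to_rows.fit(df).transform(df)

scaled = MinMaxScaler().fit_transform(transposed).T     # scale each original row, restore orientation
print(pd.DataFrame(scaled, index=df.index, columns=df.columns))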
Regarding the statefulness of MinMaxScaler: I think in a scenario such as this, the state of MinMaxScaler doesn't get used; it is purely acting as a transformer that maps these points to a different space, meeting the constraint that x, y, and z are scaled such that they add up to 1.
@Murilo, hope this gets you started with a solution. Must be an interesting problem.
I have 400 sensor recordings, each of length 5000. I want to convert them into a tensor of shape [400, 1, 5000], i.e. [batch_size, input_channels, signal_length], for an ML problem: training a 1D CNN with PyTorch's nn.Conv1d.
This operation is often referred to as unsqueezing a dimension. There are multiple ways of achieving this, either with an explicit reshape, or with slicing tricks.
Using torch.Tensor.unsqueeze, either out-of-place:
>>> x.unsqueeze(dim=1) # won't affect x
Or in-place with torch.Tensor.unsqueeze_:
>>> x.unsqueeze_(dim=1) # will mutate x
Using indexing:
>>> x[:, None] # will insert a singleton at dim=1
Reshaping the tensor with torch.Tensor.reshape:
>>> x.reshape(len(x), 1, -1)
This is not the recommended method as it doesn't generalize. In my opinion, you should not use reshape or view if you are not actually reshaping the tensor.
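A quick end-to-end sketch with the shapes from the question (the Conv1d hyperparameters are arbitrary placeholders):
import torch
import torch.nn as nn

x = torch.randn(400, 5000)    # [batch_size, signal_length]
x = x.unsqueeze(dim=1)        # -> [400, 1, 5000] = [batch_size, input_channels, signal_length]

conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3)
out = conv(x)
print(out.shape)              # torch.Size([400, 16, 4998])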
I'm a total novice at scikit-learn.
I want to know whether or not I should use the same LabelEncoder instance that was used on the training dataset when I convert the same feature's categorical data in the test dataset. What I mean is the following:
from sklearn import preprocessing
# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_training[ 'BLOOD_TYPE' ] ) # labeling from string
....
1. Using same label encoder
df_test[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
2. Using different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test[ 'BLOOD_TYPE' ] = le_for_test_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
Which one is the right code?
Or does it not make any difference which of the above I choose,
because the training dataset's categorical data and the test dataset's categorical data should end up encoded the same way as a result?
The problem is, in fact, the way you use it.
Since LabelEncoder maps each nominal value to a numeric code, you should fit it once and then only transform after the object has been fitted. Don't forget that you need all your nominal values to be present in the training phase.
The right way to use it is to take your nominal feature, do a fit on it, and then only use the transform method:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
From the official docs.
I think RPresle has already given the answer. I just wanted to relate it a little more directly to the situation in the question:
In general, you just need to fit the LabelEncoder once (on the feature in the training set) and then only transform the feature in the test set. But if your test set has feature values that are not in the training set, fit the label encoder on the union of the training-set and test-set values.
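A hedged sketch of that union approach with the BLOOD_TYPE feature from the question, assuming the DataFrames df_training and df_test already exist:
import pandas as pd
from sklearn import preprocessing

le_blood_type = preprocessing.LabelEncoder()

# fit on the union of the values seen in both sets ...
all_values = pd.concat([df_training['BLOOD_TYPE'], df_test['BLOOD_TYPE']]).unique()
le_blood_type.fit(all_values)

# ... then only transform, so both sets share the same mapping
df_training['BLOOD_TYPE'] = le_blood_type.transform(df_training['BLOOD_TYPE'])
df_test['BLOOD_TYPE'] = le_blood_type.transform(df_test['BLOOD_TYPE'])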