Scaling row-wise with MinMaxScaler from Sklearn - scikit-learn

By default, scalers from sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])
#%% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T # scaling line-wise
x_test = scaler.transform(x_test)  # <-- error here
But I am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand what's wrong here. Why does it say it is expecting 2 features, when all my X arrays (x, x_train and x_test) have 3 features? How can I fix this?

MinMaxScaler is stateful (as is StandardScaler): when you fit it, it calculates and saves per-column statistics (the minimum and maximum for MinMaxScaler, the mean and standard deviation for StandardScaler); when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: the statistics are saved per training row, and your test set doesn't have the same rows, so transform cannot work correctly (it throws an error if the number of rows differs, or silently mis-scales if the number of rows is the same). Here x_train has 2 rows, so after the transpose the scaler is fitted on 2 "features", while x_test still has 3 columns; hence the error.
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of every other row. So you could just do this kind of transformation before splitting, or use fit_transform on the test set('s transpose).
For l2 normalization of rows, there's a builtin for this: Normalizer. I don't think there's an analogue for min-max normalization, but I think you could write a FunctionTransformer to do it.
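A minimal sketch of the FunctionTransformer idea follows. The helper name rowwise_minmax is hypothetical (it is not part of scikit-learn), and mapping constant rows to 0 is an assumption made here to avoid dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def rowwise_minmax(X):
    """Scale every row to [0, 1] using only that row's own min and max."""
    X = np.asarray(X, dtype=float)
    row_min = X.min(axis=1, keepdims=True)
    row_max = X.max(axis=1, keepdims=True)
    span = row_max - row_min
    span[span == 0] = 1.0  # assumption: constant rows are mapped to 0
    return (X - row_min) / span


# Stateless transformer: fit() learns nothing, so train and test rows
# are scaled independently of each other.
row_scaler = FunctionTransformer(rowwise_minmax)

x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
x_scaled = row_scaler.fit_transform(x)
```

Because nothing is learned during fit, calling row_scaler.transform on x_train and on x_test separately gives the same result as transforming the full array before the split.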

This is possible to do, and I can think of a scenario where it would be useful. Normally, MinMaxScaler scales each of x, y, and z with respect to the other observations of that feature. That's the "series" scaling. Now imagine that instead you wanted to map each point so that it is constrained by x+y+z = 1. I think this is what the OP is asking for. I have done this in the past, so I will describe how I did it.
You need to treat your individual observations as a column multi-index and treat each one like a higher-dimensional feature. Then you need to build a pipeline in which the observations are transformed from column-wise to row-wise, after which you do the min/max scaling. This gets you to x+y+z=1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline, you'll need to use something like a DataframeFunctionTransformer, which I have seen on the interwebs and reproduce below. This way you can use pandas functions to shape the data and merge it back in with the indices.
class DataframeFunctionTransformer():
    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        return self
Regarding the statefulness of MinMaxScaler, I think that in a scenario such as this the state of MinMaxScaler doesn't get used. It is purely acting as a transformer that maps these points to a different space, meeting the constraint that x, y, and z are scaled such that they add up to 1.
@Murilo, I hope this gets you started with a solution. Must be an interesting problem.

Related

Building vocabulary using document vector

I am not able to build the vocabulary and am getting an error:
TypeError: 'int' object is not iterable
Here is my code, which is based on this Medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried passing both a pandas Series and a list to the build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
    "sci-fi": 1,
    "action": 2,
    "comedy": 3,
    "fantasy": 4,
    "animation": 5,
    "romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
    mylist.append(
        TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
    )
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
    dm=1,
    vector_size=300,
    negative=5,
    hs=0,
    min_count=2,
    sample=0,
    workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But, it's unclear if what you've fed into the dataframe, then out via dataframe-bracket-access, then through the train_test_split(), is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
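The checks above can be sketched as a small helper. This is only an illustration: looks_like_tagged_document is a hypothetical name, and the namedtuple below is a stand-in with the same two fields (words, tags) as gensim's TaggedDocument, used so the snippet runs without gensim installed. The question's code passes tags=q["tindex"], a single value per row; a bare int there would fail this check, which may be the source of the TypeError, though the full traceback would be needed to confirm that.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (assumption:
# swap in the real class from gensim.models.doc2vec in actual code).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def looks_like_tagged_document(doc):
    """Return True if doc.words is a list of str and doc.tags is a list of str/int."""
    words = getattr(doc, "words", None)
    tags = getattr(doc, "tags", None)
    words_ok = isinstance(words, list) and all(isinstance(w, str) for w in words)
    tags_ok = isinstance(tags, (list, tuple)) and all(
        isinstance(t, (str, int)) for t in tags
    )
    return words_ok and tags_ok


bad = TaggedDocument(["a", "plot"], tags=3)     # bare int, not iterable
good = TaggedDocument(["a", "plot"], tags=[3])  # tag wrapped in a list

print(looks_like_tagged_document(bad), looks_like_tagged_document(good))
```

Running the same check on mylist[0] and on X_train["tdoc"].iloc[0] would show at which step the data stops being what Doc2Vec expects.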
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.
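The learning-rate arithmetic mentioned above can be checked with a few lines of plain Python. The numbers (starting alpha 0.025, a deduction of 0.002, 30 passes) are the ones quoted in this answer; the variable names are just for illustration.

```python
start_alpha = 0.025  # starting default alpha quoted above
step = 0.002         # amount deducted after each pass
passes = 30

alpha = start_alpha
first_negative_pass = None
for i in range(1, passes + 1):
    alpha -= step
    # Record the first deduction after which alpha has dropped below zero
    if alpha < 0 and first_negative_pass is None:
        first_negative_pass = i

final_alpha = alpha
print(first_negative_pass, round(final_alpha, 3))
```

Alpha turns negative at the 13th deduction and ends near -0.035, so more than half of the passes would run with a negative learning rate.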

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on this curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
Results in: [figure: ROC curve]
Here's an example of an F1 score curve by threshold value if you aren't married to ROC: [figure: F1 score by threshold]
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels (see the note at the end):
preds = predictions.select('label','probability')\
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
    .collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label is at index 1. However, in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).
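Once fpr, tpr and thresholds are available, a cutoff still has to be chosen. One common criterion, which is an assumption here and not something prescribed in this thread, is Youden's J statistic (TPR minus FPR). The sketch below uses toy scores and labels standing in for the collected preds.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy data standing in for the (score, label) pairs collected from Spark
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)

# Youden's J = TPR - FPR; the threshold maximising it is one possible cutoff.
# The first entry of thresholds is an artificial value above every score,
# where TPR = FPR = 0, so it never wins unless the curve is degenerate.
j_scores = tpr - fpr
best = int(np.argmax(j_scores))
best_threshold = thresholds[best]

print(best_threshold, tpr[best], fpr[best])
```

Other criteria (for example, the point closest to the top-left corner, or a cost-weighted choice) would use the same three arrays with a different scoring line.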

plot_decision_regions with error "Filler values must be provided when X has more than 2 training features."

I am plotting a 2D plot for SVC Bernoulli output.
I converted the text to vectors with average word2vec and standardised the data.
I split the data into train and test sets.
Through grid search I found the best C and gamma (rbf).
clf = SVC(C=100,gamma=0.0001)
clf.fit(X_train1,y_train)
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I receive this error:
ValueError: y must be a NumPy array. Found
I also tried to convert y to a NumPy array. Then it raises this error:
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
Finally I converted it to an integer array.
Now it raises this error:
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your multi-dimensional data to two dimensions. Then pass the result to plot_decision_regions, and there will be no need for filler values.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions
clf = SVC(C=100, gamma=0.0001)
pca = PCA(n_components=2)
X_train2 = pca.fit_transform(X_train)  # project onto the first two principal components
clf.fit(X_train2, y_train)  # refit the classifier on the 2D data
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel('Principal component 1', size=14)  # axes are PCA components, not original columns
plt.ylabel('Principal component 2', size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time on this too, because plot_decision_regions was then complaining with ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values, and one more parameter is needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] = 4
We can refer to each feature by its index: 0, 1, 2, 3. You can only plot 2 features at a time, so say you want 0 and 2.
You'll need to specify one additional parameter (in addition to those specified in @sos.cott's answer), feature_index, and fill the rest with fillers:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
                            feature_index=[0, 2],  # these will be plotted
                            filler_feature_values={1: value, 3: value},  # these will be ignored
                            filler_feature_ranges={1: width, 3: width})
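The "fill the rest with fillers" step can be sketched as a small helper that builds both dictionaries for every feature index that is not plotted. The name make_fillers is hypothetical and not part of mlxtend; using one shared value and width for all filler features is an assumption carried over from the example above.

```python
def make_fillers(n_features, feature_index, value, width):
    """Build filler dicts covering every feature index that is not plotted."""
    others = [i for i in range(n_features) if i not in feature_index]
    filler_values = {i: value for i in others}
    filler_ranges = {i: width for i in others}
    return filler_values, filler_ranges


# 4 features, plotting features 0 and 2 as in the example above
values, ranges = make_fillers(n_features=4, feature_index=[0, 2],
                              value=1.5, width=0.75)
print(values, ranges)
```

The two results can then be passed as filler_feature_values=values and filler_feature_ranges=ranges alongside feature_index.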
For the NumPy array problem, you can just do the following (assuming X_train and y_train are still pandas DataFrames):
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to provide filler values for the features that are not plotted, so you do the following:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
                            filler_feature_values={2: value, 3: value, 4: value},
                            filler_feature_ranges={2: width, 3: width, 4: width},
                            legend=2, ax=ax)
You need to add one filler value for each feature beyond the two being plotted.

How to normalize time series data with multiple features by using sklearn?

For data with the shape (num_samples,features), MinMaxScaler from sklearn.preprocessing can be used to normalize it easily.
However, when using the same method for time series data with the shape (num_samples, time_steps,features), sklearn will give an error.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
#Making artificial time data
x1 = np.linspace(0,3,4).reshape(-1,1)
x2 = np.linspace(10,13,4).reshape(-1,1)
X1 = np.concatenate((x1*0.1,x2*0.1),axis=1)
X2 = np.concatenate((x1,x2),axis=1)
X = np.stack((X1,X2))
#Trying to normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)  # <-- error here
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
This post suggests something like
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
Yet, it only works for data with only 1 feature. Since my data has more than 1 feature, this method doesn't work.
How to normalize time series data with multiple features?
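To normalize a 3D tensor of shape (n_samples, timesteps, n_features) so that each feature is scaled independently, reduce over the sample and time axes and keep the feature axis:
(timeseries-timeseries.min(axis=(0, 1)))/(timeseries.max(axis=(0, 1))-timeseries.min(axis=(0, 1)))
Using the argument axis=(0, 1) computes the minimum and maximum over all samples and time steps, leaving one value per feature, which then broadcasts against the last (feature) axis. Thus each feature will be normalized independently. Note that axis=2 on its own would reduce over the features instead, and the result would not broadcast back against the original shape.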
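As a minimal sketch of per-feature min-max scaling on the array from the question, the snippet below reduces over the sample and time axes with NumPy only. The names feat_min and feat_max are just illustrative.

```python
import numpy as np

# Rebuild the artificial data from the question: shape (2, 4, 2)
x1 = np.linspace(0, 3, 4).reshape(-1, 1)
x2 = np.linspace(10, 13, 4).reshape(-1, 1)
X1 = np.concatenate((x1 * 0.1, x2 * 0.1), axis=1)
X2 = np.concatenate((x1, x2), axis=1)
X = np.stack((X1, X2))

# One min and one max per feature, taken over all samples and time steps.
# keepdims=True gives shape (1, 1, n_features), which broadcasts against X.
feat_min = X.min(axis=(0, 1), keepdims=True)
feat_max = X.max(axis=(0, 1), keepdims=True)
X_norm = (X - feat_min) / (feat_max - feat_min)

print(X_norm.shape)
```

An equivalent route that keeps MinMaxScaler is to reshape to 2D with X.reshape(-1, X.shape[-1]), call fit_transform, and reshape the result back to X.shape; that also makes it possible to reuse the fitted scaler on a test set.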

How to split data into 3 parts, one of which wont be used? [duplicate]

I've got a csv that I want to split 80% into training, 10% into dev-test and 10% into a test set. The dev-test won't be used further.
I've got it set up like:
import sklearn
import csv
with open('Letter.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]
and for splitting:
import random
random.seed(1234)
random.shuffle(annotated_data)
But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the proportion of each partition, e.g. I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.
Also once I split it, how do I access the sets separately? For eg I can read each record as a whole and count the amount of entries, but once I split it I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.
No, it's not possible in scikit-learn to split into three sets directly.
The typical approach is to split twice: first 80/20, and then split the 20 percent 50/50. You want to check the train_test_split function.
Essentially, the code with data X and y could look like this:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(100).reshape((50, 2)), range(50)  # 50 samples with 2 features each
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)
Now you would want to work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test)
You can use train_test_split twice:
Split the data into a ratio 0.8 : 0.2
Split the smaller set into a ratio 0.5 : 0.5
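Since the question starts from a shuffled plain list and asks how to access and count each set, here is an alternative sketch that slices the list directly instead of calling train_test_split. The stand-in rows below are made up for illustration; in the question they would come from Letter.csv.

```python
import random

# Stand-in for the rows read from the CSV (hypothetical data)
annotated_data = [[i, "label_%d" % (i % 3)] for i in range(100)]

random.seed(1234)
random.shuffle(annotated_data)

# Integer arithmetic avoids float rounding when computing the cut points
n = len(annotated_data)
train_end = n * 8 // 10  # first 80%
dev_end = n * 9 // 10    # next 10%

train_set = annotated_data[:train_end]
dev_set = annotated_data[train_end:dev_end]  # kept aside, not used further
test_set = annotated_data[dev_end:]

print(len(train_set), len(dev_set), len(test_set))
```

Each set is an ordinary list, so len() gives the number of records and indexing or iteration works as on the original data. The same len() calls apply to the arrays returned by train_test_split.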
