How do I standardize a test dataset using StandardScaler in PySpark? - apache-spark

I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the model's features, produced by applying the VectorAssembler class to 3 separate columns.
When I try to transform the test data using the StandardScaler as below, I get an error saying it doesn't have the transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler on the training data only once and use those parameters to transform the test set, so it is not accurate to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?

Here is the correct syntax to use the scaler. You need to call transform on a fitted model, not on the scaler itself.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)
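To make the "fit once, reuse the parameters" idea concrete, here is a minimal NumPy sketch of what the fitted model holds. It is not the PySpark API. It assumes StandardScaler's defaults (withStd=True, withMean=False) and the corrected sample standard deviation described in the PySpark documentation. The arrays reuse the sample rows from the question, and the variable names are made up for illustration.

```python
import numpy as np

# Sample rows from the question, as plain arrays
x_train = np.array([[2.0, 5.0, 10.0], [4.0, 6.0, 12.0]])
x_test = np.array([[7.0, 8.0, 14.0], [5.0, 5.0, 7.0]])

# "fit": learn per-column statistics from the training rows only
train_std = x_train.std(axis=0, ddof=1)  # corrected sample standard deviation

# "transform": apply the same training statistics to both datasets
scaled_train = x_train / train_std
scaled_test = x_test / train_std

print(train_std)
print(scaled_test)
```

The test rows are divided by the training standard deviations, so no statistic is ever computed from x_test. That is the role scaler_model plays in the answer above.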

Related

Length of feature names mismatches the actual size of input X when using sklearn ColumnTransformer

I have designed the following pipelines to train my models:
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='constant',fill_value='missing')
num_imputer = SimpleImputer(strategy='constant',fill_value=0,add_indicator=True)
categorical_pipeline = Pipeline([
    ('imputer',cat_imputer),
    ('encoder',OneHotEncoder())
])
numerical_pipeline = Pipeline([
    ('imputer',num_imputer)
])
# Assumed selector definitions (not shown in the original post)
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
def get_column_types(X):
    numerical_columns = numerical_columns_selector(X)
    categorical_columns = categorical_columns_selector(X)
    return numerical_columns, categorical_columns
def get_transformer(X,y):
    numerical_columns, categorical_columns = get_column_types(X)
    pre_transformer = ColumnTransformer([
        ('cat_pipe', categorical_pipeline, categorical_columns),
        ('num_pipe', numerical_pipeline, numerical_columns)
    ])
    return pre_transformer
When I fit the transformer on my data, I get an inconsistency in the number of features when I extract the names. The code is as follows:
transformer = models_and_pipelines.get_transformer(X,y)
X = transformer.fit_transform(X)
# this extracts the feature names. I also used an alternative function listed below which yields the same results
starting_features = list(transformer.transformers_[0][1]['encoder'].get_feature_names()) + list(transformer.transformers_[1][2])
print(X.shape[1])
print(len(starting_features))
With the following output:
1094
1090
Where does this inconsistency in the number of feature names come from?
other links: function to extract feature names
If you're using scikit-learn v1.1 or later, get_feature_names_out is fully fleshed out, so you won't need the manual approach you're trying or the one from your link.
It's possible some of your columns were all-missing? From the docs for SimpleImputer (which I guess is what your num_imputer is?):
Notes
Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".

Issues with One Hot Encoding for model with values not in training data

I would like to use one-hot encoding for my simple model, yet it seems to trigger an error no matter how I set it up. First, OneHotEncoder is not converting strings to floats even though I have version 1.0.2 of sklearn. Second, the values in my training data are not the same as in my test data: training has only 2 distinct values, testing has all three. How do I fix that? The exact error is "The truth value of a Series is ambiguous". With another approach I tried, the error says to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
    transformers=[('ohc', ohc, [0])],
    remainder='passthrough')
model = lgbm.LGBMClassifier()
mymodel = Pipeline(steps=[('preprocessor', pp),
                          ('model', model)])
params = {'model__learning_rate': [0.1],
          'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator=mymodel, param_grid=params, n_jobs=-1,
    cv=2, scoring='accuracy',
    verbose=-1)
lgbm_gs.fit(X,y)
The issue is most likely that you're passing categories as a flat list rather than as a list of array-likes (e.g. a list of lists), as the docs require. Therefore, the following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # observe you were also misspelling categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps=[('preprocessor', pp),
                          ('model', model)])
params = {'model__learning_rate': [0.1], 'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator=mymodel, param_grid=params, n_jobs=-1,
    cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
Finally, you can observe that the attribute categories_ (which specifies the categories of each feature determined during fitting) is also a list of array(s) (a single array here, as you're one-hot-encoding one column only). Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]
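To see what handle_unknown='ignore' does with a category that was absent from training, here is a small sketch. The arrays are hypothetical and reuse the fruit names from the question. Only the fruit column is encoded, outside of any pipeline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data contains only two categories
train_fruit = np.array([["apple"], ["banana"], ["apple"], ["banana"]], dtype=object)
# Test data contains an unseen category ("pineapple")
test_fruit = np.array([["pineapple"], ["banana"], ["apple"]], dtype=object)

ohc = OneHotEncoder(handle_unknown="ignore")
ohc.fit(train_fruit)

# The encoder returns a sparse matrix by default; convert it for inspection
encoded = ohc.transform(test_fruit).toarray()
print(ohc.categories_)
print(encoded)
```

The row for "pineapple" comes out as all zeros instead of raising an error, while the known categories are encoded as usual.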

train test data split using stratify on two columns in scikit-learn

I have a dataset that I want to split into train and test so that I have data in the test set from each data source (specified in column "source") and from each class (specified in column "class"). I read about using the parameter stratify with scikit-learn's train_test_split function, but how can I use it on two columns?
Stratifying on multiple columns is easily done with sklearn's train_test_split since v0.19.0.
Proof
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(1000000, 10, n_classes=2, n_labels=1)
train_X, test_X, train_Y, test_Y =train_test_split(X,Y,stratify=Y, train_size=.8, random_state=42)
Y.shape
(1000000, 2)
Then you can compare simple column means of resulting stratifications:
train_Y[:,0].mean(), test_Y[:,0].mean()
(0.45422, 0.45422)
train_Y[:,1].mean(), test_Y[:,1].mean()
(0.23472375, 0.234725)
Run statistical t-test on the equality of means:
from scipy.stats import ttest_ind
ttest_ind(train_Y[:,0],test_Y[:,0])
Ttest_indResult(statistic=0.0, pvalue=1.0)
And finally do the same for conditional means to prove that you indeed achieved what you wanted:
train_Y[train_Y[:,0].astype("bool"),1].mean(), test_Y[test_Y[:,0].astype("bool"),1].mean()
(0.43959149751221877, 0.43958874554180793)
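For a DataFrame with "source" and "class" columns like the one described in the question, one way to sketch this is to combine the two columns into a single stratification key. The toy data below is hypothetical, and the combined-key approach is an illustrative alternative to passing a two-column array as in the proof above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 2 sources x 2 classes, 10 rows per combination
df = pd.DataFrame({
    "source": ["A"] * 20 + ["B"] * 20,
    "class": ([0] * 10 + [1] * 10) * 2,
    "value": range(40),
})

# One key per row that identifies its (source, class) combination
strata = df["source"].astype(str) + "_" + df["class"].astype(str)

train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=strata, random_state=42
)

# Count of test rows per (source, class) combination
print(test_df.groupby(["source", "class"]).size())
```

Every combination needs at least two rows for this to work, since a stratified split must place each stratum in both the train and test sets.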

Features for Support Vector Machine (SVM)

I have to classify some texts with a support vector machine. In my train file I have 5 different categories. I have to classify first with a "Bag of Words" feature, and then with an SVD feature that keeps 90% of the total variance.
I'm using Python and sklearn, but I don't know how to create the above SVD feature.
My train set is separated with tab (\t), my texts are in 'Content' column and the categories are in 'Category' column.
The high level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
pca = PCA(.9)  # keep 90% of the variance, as the question asks
X = pca.fit_transform(X_tfidf.toarray())  # PCA needs a dense ndarray, not an np.matrix
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, do the same preprocessing steps for test dataset and make predictions.
PS
If you wish, you may combine preprocessing steps into Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([('tfidf',TfidfVectorizer())
    ,('todense', FunctionTransformer(lambda x: x.toarray(), validate=False))  # toarray() returns an ndarray, which PCA accepts
    ,('pca', PCA(.9))])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.
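Here is a minimal end-to-end sketch of the "same preprocessing for test data" step, with the classifier added as the last pipeline stage. The toy corpus and labels are hypothetical stand-ins for the 'Content' and 'Category' columns. As in the answer, the question's "SVD feature" is interpreted here as PCA on the dense tf-idf matrix.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(x):
    # PCA needs a dense array; TfidfVectorizer returns a sparse matrix
    return x.toarray()


# Hypothetical toy corpus
train_text = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal",
    "the bank raised interest rates",
    "markets fell as rates rose",
    "investors sold bank shares",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]
test_text = ["the goal came late in the match", "the bank cut interest rates"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("todense", FunctionTransformer(to_dense)),
    ("pca", PCA(0.9)),  # keep enough components for 90% of the variance
    ("svm", SVC(kernel="linear")),
])

# fit() learns the vocabulary, the PCA projection and the SVM from training data only
model.fit(train_text, train_labels)

# predict() reuses the fitted vocabulary and projection on the test texts
predictions = model.predict(test_text)
kept_variance = model.named_steps["pca"].explained_variance_ratio_.sum()
print(predictions, kept_variance)
```

Calling predict on the pipeline applies transform, not fit_transform, at each preprocessing stage, so the test texts are projected with the parameters learned from the training set.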

pyspark Model interpretation from pipeline model

I am implementing DecisionTreeClassifier in pyspark using the Pipeline module as I have several feature engineering steps to perform on my dataset.
The code is similar to the example from Spark documentation:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")  # Spark 2.0+ uses "accuracy"; "precision" was the Spark 1.x name
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)
The question is: how do I perform model interpretation on this? The pipeline model object does not have a toDebugString() method like the model returned by DecisionTree.trainClassifier.
And I cannot use DecisionTree.trainClassifier in my pipeline because trainClassifier() takes the training data as a parameter,
whereas the pipeline accepts the training data as an argument to fit() and calls transform() on the test data.
Is there a way to use the pipeline and still perform the model interpretation & find attribute importance?
Yes, I have used the method below in almost all my model interpretations in pyspark. The line below uses the naming conventions from your code excerpt.
dtm = model.stages[-1] # your estimator is the last stage in the pipeline
# hence the DecisionTreeClassificationModel will be the last transformer in the PipelineModel object
dtm.explainParams()
Now you have access to all the methods of the DecisionTreeClassificationModel. All the available methods and attributes are listed in the PySpark API documentation. Code was not tested on your example.
