Remove input dimensions in a sklearn pipeline? - scikit-learn

I have the following pipeline in sklearn:
pipe = sklearn.pipeline.Pipeline(steps=[
    ('scalar', StandardScaler()),
    ('pca', utils.PCA(n_components=n_pca_components)),
    ('reduce', umap.UMAP(n_neighbors=umap_n_neighbors, min_dist=umap_min_dist, metric=umap_metric)),
    ('model', utils.DBSCAN(eps=dbscan_eps, min_samples=dbscan_min_samples)),
])
Is there a simple way to drop one of the dimensions in a step between the PCA and the UMAP?
So if my PCA output has shape (0:100, 0:10), I want, for example, to remove the first channel before the pipeline passes the data on to the UMAP, leaving (0:100, 1:10).

You could potentially add an intermediate step between the two stages with a FunctionTransformer to achieve what you want:
from sklearn.preprocessing import FunctionTransformer

def custom_function(x):
    # drop the first channel/column produced by the PCA step
    return x[:, 1:]

pipe = sklearn.pipeline.Pipeline(steps=[
    ('scalar', StandardScaler()),
    ('pca', utils.PCA(n_components=n_pca_components)),
    ('remove_dimension', FunctionTransformer(custom_function)),
    ('reduce', umap.UMAP(n_neighbors=umap_n_neighbors, min_dist=umap_min_dist, metric=umap_metric)),
    ('model', utils.DBSCAN(eps=dbscan_eps, min_samples=dbscan_min_samples)),
])
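As a quick standalone sanity check (using random data as a stand-in for the real PCA output), the transformer drops the first column as intended:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.random.rand(100, 10)                 # stand-in for the PCA output
remove_first = FunctionTransformer(custom_function)
print(remove_first.fit_transform(X).shape)  # (100, 9)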

Related

How do I standardize the test dataset using StandardScaler in PySpark?

I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the model's features, produced by applying the VectorAssembler class to 3 separate columns.
When I try to transform the test data using the StandardScaler as below, I get an error saying it doesn't have the transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler on the training data only once and use those parameters to transform the test set, so it is not accurate to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?
Here is the correct syntax to use the scaler. You need to call transform on a fitted model, not on the scaler itself.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)
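If there are more preprocessing stages than just the scaler, the same fit-once / transform-both pattern can be wrapped in a pyspark Pipeline. A minimal sketch, assuming raw_train and raw_test are the DataFrames before the VectorAssembler step and f1, f2, f3 are your three raw feature columns (substitute your own names):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="inputs")
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")

pipeline_model = Pipeline(stages=[assembler, scaler]).fit(raw_train)  # fit on training data only
scaledTrainDF = pipeline_model.transform(raw_train)
scaledTestDF = pipeline_model.transform(raw_test)   # reuses the parameters learned from training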

get feature names from Sklearn LabelBinarizer

The documentation and a few related posts lead me to believe I should be able to get the feature names for the labels that scikit-learn's LabelBinarizer is one-hot encoding.
I have a pipeline defined as following:
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])
This works just fine (note that DataFrameSelector is a custom class); however, it seems like I should be able to extract the feature names like this:
feature_names = cat_pipeline.named_steps['label_binarizer'].get_feature_names()
I also tried substituting get_feature_names() with get_support(), to no avail.
This is possible when using LabelEncoder on its own, outside of a pipeline, before one-hot encoding:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
print(encoder.classes_)
For further context please see the notebook I am working through:
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
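One thing that may help: like LabelEncoder, a fitted LabelBinarizer stores its labels in a classes_ attribute, and those labels correspond, in order, to the one-hot columns it emits. A minimal sketch, assuming cat_pipeline has already been fitted:
lb = cat_pipeline.named_steps['label_binarizer']
feature_names = list(lb.classes_)  # one output column per class, in this order
print(feature_names)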

Getting feature names from within a FeatureUnion + Pipeline

I am using a FeatureUnion to join features found from the title and description of events:
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),
        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],
    transformer_weights={
        'title': 1.0,
        'description': 0.2,
    },
)
However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different Vectorizers. How do I do this?
It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?
You are going to have to implement this method within your custom transformer if you want this to work.
Here is a concrete example for you:
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd
dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']
# define first custom transformer
class first_transform(TransformerMixin):
    def fit(self, X, y=None):
        self.cols_ = X.columns.tolist()
        return self
    def transform(self, df):
        return df
    def get_feature_names(self):
        return self.cols_

class second_transform(TransformerMixin):
    def fit(self, X, y=None):
        self.cols_ = X.columns.tolist()
        return self
    def transform(self, df):
        return df
    def get_feature_names(self):
        return self.cols_

pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform())
    ]))
])

>>> pipe.fit(X)  # fit first, so each transformer records its column names
>>> pipe.named_steps['features'].get_feature_names()
['custom_transform_first__CRIM',
'custom_transform_first__ZN',
'custom_transform_first__INDUS',
'custom_transform_first__CHAS',
'custom_transform_first__NOX',
'custom_transform_first__RM',
'custom_transform_first__AGE',
'custom_transform_first__DIS',
'custom_transform_first__RAD',
'custom_transform_first__TAX',
'custom_transform_first__PTRATIO',
'custom_transform_first__B',
'custom_transform_first__LSTAT',
'custom_transform_second__CRIM',
'custom_transform_second__ZN',
'custom_transform_second__INDUS',
'custom_transform_second__CHAS',
'custom_transform_second__NOX',
'custom_transform_second__RM',
'custom_transform_second__AGE',
'custom_transform_second__DIS',
'custom_transform_second__RAD',
'custom_transform_second__TAX',
'custom_transform_second__PTRATIO',
'custom_transform_second__B',
'custom_transform_second__LSTAT']
Keep in mind that FeatureUnion is going to concatenate the lists emitted by the respective get_feature_names of each of your transformers. This is why you are getting an error when one or more of your transformers do not have this method.
However, I can see that this alone will not fix your problem, as Pipeline objects don't have a get_feature_names method, and you have nested pipelines (pipelines within FeatureUnions). So you have two options:
Subclass Pipeline and add a get_feature_names method yourself, which gets the feature names from the last transformer in the chain (a sketch follows after the second option below).
Extract the feature names yourself from each of the transformers, which will require you to grab those transformers out of the pipeline yourself and call get_feature_names on them.
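A minimal sketch of the first option (the name PipelineWithNames is just illustrative, and it assumes the last step of each inner pipeline, here a vectorizer, provides get_feature_names):
from sklearn.pipeline import Pipeline

class PipelineWithNames(Pipeline):
    # hypothetical Pipeline subclass that exposes get_feature_names
    def get_feature_names(self):
        # delegate to the final transformer in the chain (e.g. the CountVectorizer)
        return self.steps[-1][1].get_feature_names()

# then use it in place of Pipeline inside the FeatureUnion, e.g.:
# ('title', PipelineWithNames([
#     ('selector', TextSelector(key='title')),
#     ('count', CountVectorizer(stop_words='english')),
# ])),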
Also, keep in mind that many of sklearn's built-in transformers don't operate on DataFrames but pass numpy arrays around, so watch out for that if you are going to chain lots of transformers together. But I think this gives you enough information to get an idea of what is happening.
One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.
You can reach your different Vectorizers nested inside the FeatureUnion like this (thanks edesz):
pipevect = dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']
And then you have the vectorizer instance (here the title's CountVectorizer) to pass to another function:
Show_most_informative_features(pipevect,
                               pipeline.named_steps['classifier'], n=MostIF)
# or directly
print(pipevect.get_feature_names())

How to train a SparkML gradient boosting classifier given an RDD

Given the following rdd
training_rdd = rdd.select(
    # Categorical features
    col('device_os'),  # 'ios', 'android'
    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
    # label
    col('did_click').alias('label'),
)
I am confused about the syntax for training a gradient boosting classifier.
I am following this tutorial:
https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier
However, I am unsure how to get my 4 feature columns into a vector, because VectorIndexer assumes that all the features are already in one column.
You can use VectorAssembler to generate the feature vector. Please note that you will have to convert your rdd to a DataFrame first.
from pyspark.ml.feature import VectorAssembler

vectorizer = VectorAssembler()
vectorizer.setInputCols(["device_os",
                         "30day_click_count",
                         "30day_impression_count",
                         "30day_click_through_rate"])
vectorizer.setOutputCol("features")
And consequently, you will need to put vectorizer as the first stage in the Pipeline:
pipeline = Pipeline(stages=[vectorizer, ...])
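Putting it together, a rough sketch of the full flow (the names and parameters here are illustrative; note that a string column like device_os needs to be indexed, e.g. with StringIndexer, before it can be assembled):
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler

os_indexer = StringIndexer(inputCol="device_os", outputCol="device_os_index")
assembler = VectorAssembler(
    inputCols=["device_os_index",
               "30day_click_count",
               "30day_impression_count",
               "30day_click_through_rate"],
    outputCol="features")
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

pipeline = Pipeline(stages=[os_indexer, assembler, gbt])
model = pipeline.fit(training_df)  # training_df: the DataFrame version of training_rdd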

pyspark Model interpretation from pipeline model

I am implementing DecisionTreeClassifier in pyspark using the Pipeline module as I have several feature engineering steps to perform on my dataset.
The code is similar to the example from Spark documentation:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)
The question is: how do I perform model interpretation on this? The PipelineModel object does not have a toDebugString() method like the model produced by the DecisionTree.trainClassifier class.
And I cannot use DecisionTree.trainClassifier in my pipeline, because trainClassifier() takes the training data as a parameter,
whereas the pipeline takes the training data as an argument to fit() and applies transform() to the test data.
Is there a way to use the pipeline and still perform the model interpretation & find attribute importance?
Yes, I have used the method below in almost all my model interpretations in pyspark. The line below uses the naming conventions from your code excerpt.
dtm = model.stages[-1]  # your estimator is the last stage in the pipeline,
# hence the DecisionTreeClassificationModel will be the last transformer in the PipelineModel object
dtm.explainParams()
Now you have access to all the methods of the DecisionTreeClassificationModel. All the available methods and attributes can be found here. Code was not tested on your example.
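For the tree structure and the attribute importances specifically, a minimal sketch against the fitted pipeline from your excerpt (toDebugString and featureImportances are properties of the fitted tree model):
dtm = model.stages[-1]          # the fitted DecisionTreeClassificationModel
print(dtm.toDebugString)        # full textual description of the tree splits
print(dtm.featureImportances)   # SparseVector of per-feature importances
print(dtm.numNodes, dtm.depth)  # basic size / depth summary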
