sklearn passthrough feature selector - python-3.x

I am working with a sklearn pipeline and would like to have a feature selection step that could possibly be set to no feature selection. Is there a sklearn.feature_selection.SelectorMixin object that does nothing?
EDIT: or is there at least a template to develop one, like there can be an estimator?

In the end I went with something like this, which seems to fit my current purpose. I am not sure whether it would pass as a proper sklearn.feature_selection selector, though:
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_is_fitted


class PassThroughSelector():
    """
    Simply selects all columns of the dataframe, allowing
    to have the equivalent of no selector without changing
    too much of the structure.
    """

    def __init__(self):
        pass

    @staticmethod
    def check_x(x):
        # Assumed helper (not shown in the original post): require a DataFrame.
        if not isinstance(x, pd.DataFrame):
            raise TypeError("x must be a pandas DataFrame")

    def fit(self, x, y=None):  # pylint:disable=unused-argument, arguments-differ
        """
        Stores a list of selected columns.
        Args:
            x: training data
            y: training y (no effect)
        Returns:
            self
        """
        self.check_x(x)
        # Every column is selected, so the mask is all True.
        mask = np.ones(len(x.columns), dtype=bool)
        self.support_ = mask
        self.selected_features_ = x.columns[self.support_].tolist()
        return self

    def get_support(self, indices=False) -> np.ndarray:
        """Provides a boolean mask (or integer indices) of the selected features."""
        check_is_fitted(self)
        if indices:
            return np.flatnonzero(self.support_)
        return self.support_

    def transform(self, x: pd.DataFrame):
        """Selects the features selected in `fit` from a provided dataframe."""
        check_is_fitted(self)
        self.check_x(x)
        return x.loc[:, self.support_]

    def get_params(self, deep=True):
        return {}
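For what it's worth, scikit-learn has two built-in ways to get "no selection" without a custom class: SelectKBest accepts k="all", and a Pipeline step can be replaced by the string "passthrough". Below is a minimal sketch; the synthetic data, the step names and the LogisticRegression final step are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Option 1: a real selector configured to keep every feature.
keep_all = SelectKBest(score_func=f_classif, k="all")
X_all = keep_all.fit_transform(X, y)

# Option 2: swap the selection step for the string "passthrough".
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
n_selected = pipe.named_steps["clf"].coef_.shape[1]     # features seen by clf

pipe.set_params(select="passthrough")
pipe.fit(X, y)
n_passthrough = pipe.named_steps["clf"].coef_.shape[1]  # all features again
```

Because "passthrough" is an ordinary parameter value, it can also go into a GridSearchCV parameter grid next to real selectors.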

Related

How to retrieve one row at a time from the csv file using generator functions

I need to take one row at a time from a CSV file and use it as an observation tuple in a reinforcement learning environment class. I tried a generator function, but first, it does not retrieve any data, and second, it would yield all the rows in one loop, which does not match the requirement of my problem. I also need the currently selected observation (CSV row) to be available in multiple methods of the environment class, for instance in the reward function.
Any idea or suggestion on how to do this is highly appreciated. Thanks.
class Environment1:
    def __init__(self, data, max_ticks=300):
        self.data = data
        self.application_latency=1342
        self.reward = 0
        #self.done = False
        self.MAX_TICKS = max_ticks
        self.episode_over = False

    def step(self, act):
        self.take_action(action)
        reward = self.get_reward()
        ob = self.get_state()
        return ob, reward, self.episode_over
        #return ob, reward, self.done # obs, reward, done

    def get_state(self):
        """Get the observation. it is a tuple """
        lst = [tuple(x) for x in data.values]
        def gen(last):
            for i in last:
                print(yield i)
                #observation_space= yield i
                #ob = (observation_space.Edge_Latency, observation_space.Cloud_latency )
                #print(ob)
                #return ob
From what I gathered from your question, you want to create a generator of observation tuples from your CSV data. Specifically, you want to pass each tuple of the edge latency and cloud latency columns to another function. The example below builds a list of tuples, one per row of your data, and yields them one at a time. It is meant to be a method of your environment class, where self.data is the DataFrame.
import pandas as pd
import numpy as np

def createGenerator(self):
    # One (Edge_Latency, Cloud_latency) tuple per row of the DataFrame.
    obs_data = [tuple(x) for x in self.data[['Edge_Latency', 'Cloud_latency']].to_numpy()]
    for obs in obs_data:
        yield obs
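To get exactly one row per step and keep it available to other methods, one option is to create the generator once, advance it with next(), and store the result on the instance. This is a minimal sketch rather than a drop-in replacement: the inline CSV text, the Environment class and the reward formula are all made up for illustration.

```python
import csv
import io

# Inline stand-in for the CSV file; column names follow the question.
CSV_TEXT = """Edge_Latency,Cloud_latency
10,100
20,200
30,300
"""


def observations(rows):
    """Yield one (edge, cloud) tuple per CSV row."""
    for row in rows:
        yield (float(row["Edge_Latency"]), float(row["Cloud_latency"]))


class Environment:
    def __init__(self, rows):
        self._obs_iter = observations(rows)  # created once, consumed lazily
        self.current_obs = None
        self.episode_over = False

    def get_state(self):
        """Advance by exactly one row and remember it on the instance."""
        try:
            self.current_obs = next(self._obs_iter)
        except StopIteration:
            self.episode_over = True  # no rows left; keep the last observation
        return self.current_obs

    def get_reward(self):
        # Hypothetical reward: any method can read self.current_obs.
        edge, cloud = self.current_obs
        return cloud - edge

    def step(self, action):
        obs = self.get_state()
        reward = self.get_reward()
        return obs, reward, self.episode_over


env = Environment(csv.DictReader(io.StringIO(CSV_TEXT)))
first = env.step(action=0)
second = env.step(action=0)
```

Each call to step consumes a single row, and get_reward sees the same row through self.current_obs.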

How to properly pass self in stacked decorators on class methods in python?

First, I would like to have a class method that is concerned only with manipulating a dataframe column, so that I can really focus on the manipulation itself rather than the background stuff. This background stuff is actually applying this simple function over specified columns (e.g. all numeric ones, stated explicitly by their column name).
To separate this from the nasty bits, I tried using decorators and actually succeeded.
However, the difficulty arose when I wanted to use a second decorator that is in fact a plotting method for each of those manipulated columns, to keep track of the manipulations.
The code below is a simplified working version: plotter merely prints each column's name rather than actually plotting it.
Note the commented self, which allows this code to work properly. I don't understand why.
import pandas as pd
import numpy as np


class test_class:
    def __init__(self):
        self.df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

    def plotter(fn):
        def printer(self, **kwargs):
            print('printer call')
            for col in kwargs['column']:
                print(col)
            return fn(self, **kwargs)  # this self actually allows applyer to reference self; otherwise: applyer positional argument missing
        return printer

    def wrapapply(fn):
        def applyer(self, **kwargs):
            print('applyer call')
            fnkwargs = {k: v for k, v in kwargs.items() if k != 'column'}  # clean up the decorators' arguments
            self.df[kwargs['column']] = pd.DataFrame.apply(self.df[kwargs['column']], func=fn, axis=0, **fnkwargs)
        return applyer

    @plotter
    @wrapapply
    def norm(column):
        return (column - np.mean(column)) / np.std(column)


if __name__ == '__main__':
    a = test_class()
    a.norm(column=['A', 'B'])
    a.norm(column=['D'])
    print(a.df)
The result I expect is a silent in-place manipulation of columns A, B and D in the DataFrame, with the column names of each call printed by a separate decorator function (in my application, this is in fact a plotting method).
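A minimal sketch of why the outer wrapper has to forward self, using hypothetical names (outer, inner, Demo) that are not from the question. Stacked decorators are applied bottom-up, so the attribute stored on the class is the outermost wrapper. Python binds the instance to that wrapper only; each function it calls after that receives exactly the arguments it is handed.

```python
import functools

calls = []  # records the order in which the wrappers run


def outer(fn):
    @functools.wraps(fn)
    def wrapper(self, **kwargs):
        calls.append("outer")
        # fn is inner's wrapper, a plain function that also expects self,
        # so self must be passed on explicitly.
        return fn(self, **kwargs)
    return wrapper


def inner(fn):
    @functools.wraps(fn)
    def wrapper(self, **kwargs):
        calls.append("inner")
        # fn is the undecorated function, which takes no self at all.
        return fn(kwargs["value"]) * self.factor
    return wrapper


class Demo:
    factor = 10

    @outer   # applied second: this wrapper becomes the method
    @inner   # applied first: wraps the raw function
    def double(value):
        return 2 * value


result = Demo().double(value=3)
```

Only the outermost wrapper is looked up through the instance, so it is the only one that gets self for free.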

How to apply template method pattern in Python data science process while not knowing exactly the number of repeating steps

I would like to apply the template method pattern to a data science project in which I need to select or identify target subjects from a large pool of original subjects. I will create tags based on different characteristics of these subjects, e.g. age, sex, disease status, etc.
I would like this code to be reusable for future projects of a similar nature. But all projects are somewhat different, and the criteria for selecting subjects for the final filtered pool differ from one project to another. How do I structure subject_selection_steps so that it is flexible and customizable based on project needs? Currently I have only included three tags in my code, but I may need more or fewer in different projects.
import sys
from abc import ABC, abstractmethod
import pandas as pd
import datetime
import ctypes
import numpy as np
import random
import pysnooper
import var_creator.var_creator as vc
import feature_tagger.feature_tagger as ft
import data_descriptor.data_descriptor as dd
import data_transformer.data_transformer as dt
import helper_functions.helper_functions as hf
import sec1_data_preparation as data_prep
import sec2_prepped_data_import as prepped_data_import


class SubjectGrouping(ABC):
    def __init__(self):
        pass

    def subject_selection_steps(self):
        self._pandas_output_setting()
        self.run_data_preparation()
        self.import_processed_main_data()
        self.inject_test_data()
        self.create_all_subject_list()
        self.CREATE_TAG1()
        self.FILTER_SUBJECT_BY_TAG1()
        self.CREATE_TAG2()
        self.FILTER_SUBJECT_BY_TAG2()
        self.CREATE_TAG3()
        self.FILTER_SUBJECT_BY_TAG3()
        self.finalize_data()

    def _pandas_output_setting(self):
        '''Set pandas output display setting'''
        pd.set_option('display.max_rows', 500)
        pd.set_option('display.max_columns', 500)
        pd.set_option('display.width', 180)

    @abstractmethod
    def run_data_preparation(self):
        '''Run data_preparation_steps from base class'''
        pass

    @abstractmethod
    def import_processed_main_data(self):
        '''Import processed main data'''
        pass

    def inject_test_data(self):
        '''For unitest, by injecting mock cases that for sure fulfill/fail the defined subject selection criteria'''
        pass

    def create_all_subject_list(self):
        '''Gather all the unique subject ids from all datasets and create a full subject list'''
        pass

    def CREATE_TAG1(self): pass
    def FILTER_SUBJECT_BY_TAG1(self): pass
    def CREATE_TAG2(self): pass
    def FILTER_SUBJECT_BY_TAG2(self): pass
    def CREATE_TAG3(self): pass
    def FILTER_SUBJECT_BY_TAG3(self): pass

    def finalize_data(self):
        pass


class SubjectGrouping_Project1(SubjectGrouping, data_prep.DataPreparation_Project1):
    def __init__(self):
        self.df_dad = None
        self.df_pc = None
        self.df_nacrs = None
        self.df_pin = None
        self.df_reg = None
        self.df_final_subject_group1 = None
        self.df_final_subject_group2 = None
        self.df_final_subject_group3 = None
        self.control_panel = {
            'save_file_switch': False,  # WARNING: Will overwrite existing files
            'df_subsampling_switch': True,  # WARNING: Only switch to True when testing
            'df_subsampling_n': 8999,
            'random_seed': 888,
            'df_remove_dup_switch': True,
            'parse_date_switch': True,
            'result_printout_switch': True,
            'comp_loc': 'office',
            'show_df_n_switch': False,  # To be implemented. Show df length before and after record removal
            'done_switch': False,
        }

    def run_data_preparation(self):
        self.data_preparation_steps()

    def import_processed_main_data(self):
        x = prepped_data_import.PreppedDataImport_Project1()
        x.data_preparation_steps()
        x.prepped_data_import_steps()
        df_dict = x.return_all_dfs()
        self.df_d, self.df_p, self.df_n, self.df_p, self.df_r = (df_dict['DF_D'], df_dict['DF_P'],
                                                                 df_dict['DF_N'], df_dict['DF_P'], df_dict['DF_R'])
        del x


if __name__=='__main__':
    x = SubjectGrouping_Project1()
    x.subject_selection_steps()
Consider the Filter pattern. It filters a list of objects through a set of defined filters, and you can introduce a new filter later with minimal changes to your code.
Create a Criteria interface or abstract class.
class Criteria():
    def filter(self, request):
        raise NotImplementedError("Should have implemented this")
and have each of your filters extend the Criteria class. Let's say one of the filters is an age filter:
class AgeFilter(Criteria):
    def __init__(self, age=20):
        self.age = age

    def filter(self, items):
        filteredList = []
        for item in items:
            if item.age > self.age:
                filteredList.append(item)
        return filteredList
Similarly, you can define other filters such as DiseaseFilter or GenderFilter by extending the Criteria interface.
You can also combine filters logically by defining And or Or filters. For example:
class AndFilter(Criteria):
    def __init__(self, filter1, filter2):
        self.filter1 = filter1
        self.filter2 = filter2

    def filter(self, items):
        # An item passes only if it passes both filters.
        filteredList1 = self.filter1.filter(items)
        filteredList2 = self.filter2.filter(filteredList1)
        return filteredList2
Assuming you have already defined your filters, your subject_selection_steps method will look like this:
def subject_selection_steps(self):
    # define list of filters
    filterList = [ageFilter1, maleFilter, MalariaAndJaundiceFilter]
    result = personList
    for criteria in filterList:
        result = criteria.filter(result)
    return result
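Put together, the idea can be sketched as a small runnable example in which each project simply supplies its own list of criteria, so the number of steps is no longer fixed. The Subject record, the SexFilter class and the sample pool are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """Hypothetical subject record, for illustration only."""
    subject_id: int
    age: int
    sex: str


class Criteria:
    def filter(self, subjects):
        raise NotImplementedError("Should have implemented this")


class AgeFilter(Criteria):
    def __init__(self, min_age):
        self.min_age = min_age

    def filter(self, subjects):
        return [s for s in subjects if s.age > self.min_age]


class SexFilter(Criteria):
    def __init__(self, sex):
        self.sex = sex

    def filter(self, subjects):
        return [s for s in subjects if s.sex == self.sex]


def select_subjects(subjects, criteria_list):
    """Apply any number of criteria in order; the list length is up to the project."""
    result = list(subjects)
    for criteria in criteria_list:
        result = criteria.filter(result)
    return result


pool = [
    Subject(1, 17, "F"),
    Subject(2, 34, "M"),
    Subject(3, 52, "F"),
    Subject(4, 45, "M"),
]

# Two "projects" with a different number of selection steps.
project1 = select_subjects(pool, [AgeFilter(20), SexFilter("F")])
project2 = select_subjects(pool, [AgeFilter(40)])
```

The template method then only needs one "apply all criteria" step, and each subclass overrides a method that returns its criteria list.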

How to add an image to summary during evaluation when using Estimator?

I run an evaluation at the end of each epoch and need to show an image calculated from the features and labels arguments of the model function model_fn. Including a tf.summary.image(name, image) in evaluation part of the model function does not help and it looks to me that the only way to do so is to pass the correct eval_metric_ops to construct the EstimatorSpec for mode EVAL. So I first sub-class Estimator so that it considers images. The following code is mostly from estimator.py; the only change is the few lines marked by "my change" inside _write_dict_to_summary:
import logging
import io
import numpy as np
import matplotlib.pyplot as plt
import six
from google.protobuf import message
import tensorflow as tf
from tensorflow.python.training import evaluation
from tensorflow.python import ops
from tensorflow.python.estimator.estimator import _dict_to_str, _write_checkpoint_path_to_summary
from tensorflow.core.framework import summary_pb2
from tensorflow.python.framework import tensor_util
from tensorflow.python.summary.writer import writer_cache


def dump_as_image(a):
    # Scale the array to 0..255 and encode it as a grayscale PNG.
    vmin = np.min(a)
    vmax = np.max(a)
    img = np.squeeze((a - vmin) / (vmax - vmin) * 255).astype(np.uint8)
    s = io.BytesIO()
    plt.imsave(s, img, format='png', vmin=0, vmax=255, cmap='gray')
    return s.getvalue()


# see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/estimator.py
def _write_dict_to_summary(output_dir, dictionary, current_global_step):
    logging.info('Saving dict for global step %d: %s', current_global_step, _dict_to_str(dictionary))
    summary_writer = writer_cache.FileWriterCache.get(output_dir)
    summary_proto = summary_pb2.Summary()
    for key in dictionary:
        if dictionary[key] is None:
            continue
        if key == 'global_step':
            continue
        if (isinstance(dictionary[key], np.float32) or
                isinstance(dictionary[key], float)):
            summary_proto.value.add(tag=key, simple_value=float(dictionary[key]))
        elif (isinstance(dictionary[key], np.int64) or
              isinstance(dictionary[key], np.int32) or
              isinstance(dictionary[key], int)):
            summary_proto.value.add(tag=key, simple_value=int(dictionary[key]))
        elif isinstance(dictionary[key], six.binary_type):
            try:
                summ = summary_pb2.Summary.FromString(dictionary[key])
                for i, img_bytes in enumerate(summ.value):
                    summ.value[i].tag = '%s/%d' % (key, i)
                summary_proto.value.extend(summ.value)
            except message.DecodeError:
                logging.warn('Skipping summary for %s, cannot parse string to Summary.', key)
                continue
        elif isinstance(dictionary[key], np.ndarray):
            value = summary_proto.value.add()
            value.tag = key
            value.node_name = key
            array = dictionary[key]
            # my change begins
            if array.ndim == 2:
                buffer = dump_as_image(array)
                value.image.encoded_image_string = buffer
            # my change ends
            else:
                tensor_proto = tensor_util.make_tensor_proto(array)
                value.tensor.CopyFrom(tensor_proto)
                logging.info(
                    'Summary for np.ndarray is not visible in Tensorboard by default. '
                    'Consider using a Tensorboard plugin for visualization (see '
                    'https://github.com/tensorflow/tensorboard-plugin-example/blob/master/README.md'
                    ' for more information).')
        else:
            logging.warn(
                'Skipping summary for %s, must be a float, np.float32, np.int64, '
                'np.int32 or int or np.ndarray or a serialized string of Summary.',
                key)
    summary_writer.add_summary(summary_proto, current_global_step)
    summary_writer.flush()


class ImageMonitoringEstimator(tf.estimator.Estimator):
    def __init__(self, *args, **kwargs):
        tf.estimator.Estimator._assert_members_are_not_overridden = lambda self: None
        super(ImageMonitoringEstimator, self).__init__(*args, **kwargs)

    def _evaluate_run(self, checkpoint_path, scaffold, update_op, eval_dict, all_hooks, output_dir):
        eval_results = evaluation._evaluate_once(
            checkpoint_path=checkpoint_path,
            master=self._config.evaluation_master,
            scaffold=scaffold,
            eval_ops=update_op,
            final_ops=eval_dict,
            hooks=all_hooks,
            config=self._session_config)
        current_global_step = eval_results[ops.GraphKeys.GLOBAL_STEP]
        _write_dict_to_summary(
            output_dir=output_dir,
            dictionary=eval_results,
            current_global_step=current_global_step)
        if checkpoint_path:
            _write_checkpoint_path_to_summary(
                output_dir=output_dir,
                checkpoint_path=checkpoint_path,
                current_global_step=current_global_step)
        return eval_results
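As a side note on the scaling step inside dump_as_image, here is a minimal NumPy-only sketch of min-max scaling to uint8. The function name to_uint8 and the guard for constant arrays are assumptions added for illustration; they are not part of the code above.

```python
import numpy as np


def to_uint8(a):
    """Min-max scale an array to the 0..255 range and cast to uint8."""
    a = np.asarray(a, dtype=np.float64)
    vmin, vmax = a.min(), a.max()
    if vmax == vmin:
        # Constant input: avoid dividing by zero and return an all-black image.
        return np.zeros(a.shape, dtype=np.uint8)
    return ((a - vmin) / (vmax - vmin) * 255).astype(np.uint8)


img = to_uint8([[0.0, 0.5], [1.0, 2.0]])
flat = to_uint8(np.full((2, 2), 7.0))
```

The cast truncates towards zero, so 63.75 becomes 63 and 127.5 becomes 127.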
The model function looks like this:
def model_func(features, labels, mode):
    # calculate network_output
    if mode == tf.estimator.ModeKeys.TRAIN:
        ...  # training
    elif mode == tf.estimator.ModeKeys.EVAL:
        # make_image consists of slicing and concatenations
        images = tf.map_fn(make_image, (features, network_output, labels), dtype=features.dtype)
        eval_metric_ops = images, tf.no_op()  # not working
        return tf.estimator.EstimatorSpec(mode, loss=loss,
                                          eval_metric_ops={'images': eval_metric_ops})
    else:
        ...  # prediction
And the main part:
# mon_features and mon_labels are np.ndarray
estimator = ImageMonitoringEstimator(model_fn=model_func,...)
mon_input_func = tf.estimator.inputs.numpy_input_fn(mon_features,
                                                    mon_labels,
                                                    shuffle=False,
                                                    num_epochs=num_epochs,
                                                    batch_size=len(mon_features))
for _ in range(num_epochs):
    estimator.train(...)
    estimator.evaluate(input_fn=mon_input_func)
The code above will give a warning (later an error):
WARNING:tensorflow:An OutOfRangeError or StopIteration exception is
raised by the code in FinalOpsHook. This typically means the Ops
running by the FinalOpsHook have a dependency back to some input
source, which should not happen. For example, for metrics in
tf.estimator.Estimator, all metrics functions return two Ops:
value_op and update_op. Estimator.evaluate calls the update_op
for each batch of the data in input source and, once it is exhausted,
it call the value_op to get the metric values. The value_op here
should have dependency back to variables reading only, rather than
reading another batch from input. Otherwise, the value_op, executed
by FinalOpsHook, triggers another data reading, which ends
OutOfRangeError/StopIteration. Please fix that.
Looks like I didn't set the eval_metric_ops correctly. I guess tf.map_fn touches another batch as the warning message hints; maybe I need some stacking operation as the update_op to build the images used for monitoring incrementally? But I am not sure how to do that. So how to add an image to summary during evaluation when using Estimator?
The way I made it work is by creating a tf.train.SummarySaverHook in evaluation mode and passing it to tf.estimator.EstimatorSpec through evaluation_hooks=.
images is a list of the tf.summary.image ops you want to write during evaluation.
Example:
eval_summary_hook = tf.train.SummarySaverHook(output_dir=params['eval_save_path'], summary_op=images, save_secs=120)
spec = tf.estimator.EstimatorSpec(mode=mode, predictions=y_pred, loss=loss, eval_metric_ops=eval_metric_ops,
                                  evaluation_hooks=[eval_summary_hook])

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

I found the same discussion in comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.apache.org/jira/browse/SPARK-17025.
Given that there is no option provided by Pyspark ML pipeline for saving a custom transformer written in python, what are the other options to get it done? How can I implement the _to_java method in my python class that returns a compatible java object?
As of Spark 2.3.0 there's a much, much better way to do this.
Simply extend DefaultParamsWritable and DefaultParamsReadable and your class will automatically have write and read methods that will save your params and will be used by the PipelineModel serialization system.
The docs were not really clear, and I had to do a bit of source reading to understand this was the way that deserialization worked.
PipelineModel.read instantiates a PipelineModelReader
PipelineModelReader loads metadata and checks if language is 'Python'. If it's not, then the typical JavaMLReader is used (what most of these answers are designed for)
Otherwise, PipelineSharedReadWrite is used, which calls DefaultParamsReader.loadParamsInstance
loadParamsInstance will find the class from the saved metadata. It will instantiate that class and call .load(path) on it. You can extend DefaultParamsReadable and get a load method (backed by DefaultParamsReader.load) automatically. If you do have specialized deserialization logic you need to implement, I would look at that load method as a starting place.
On the opposite side:
PipelineModel.write will check if all stages are Java (implement JavaMLWritable). If so, the typical JavaMLWriter is used (what most of these answers are designed for)
Otherwise, PipelineWriter is used, which checks that all stages implement MLWritable and calls PipelineSharedReadWrite.saveImpl
PipelineSharedReadWrite.saveImpl will call .write().save(path) on each stage.
You can extend DefaultParamsWritable to get the DefaultParamsWritable.write method that saves metadata for your class and params in the right format. If you have custom serialization logic you need to implement, I would look at that and DefaultParamsWriter as a starting point.
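The read and write paths described above boil down to this: the writer stores the class name and params as metadata, and the reader looks the class up from that metadata and calls its load. Here is a plain-Python analogy of that idea. It is not Spark's implementation; every name in it (REGISTRY, ParamsWritable, ParamsReadable, load_params_instance, SetValue, metadata.json) is hypothetical.

```python
import json
import os
import tempfile

# class name -> class; stands in for resolving a class from its saved name
REGISTRY = {}


class ParamsWritable:
    """Mixin: save the class name and params as JSON metadata."""

    def save(self, path):
        os.makedirs(path, exist_ok=True)
        meta = {"class": type(self).__name__, "params": self.params()}
        with open(os.path.join(path, "metadata.json"), "w") as fh:
            json.dump(meta, fh)


class ParamsReadable:
    """Mixin: rebuild an instance from the saved params."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls  # make the class discoverable by name

    @classmethod
    def load(cls, path):
        with open(os.path.join(path, "metadata.json")) as fh:
            meta = json.load(fh)
        return cls(**meta["params"])


def load_params_instance(path):
    """Generic loader: find the class from the metadata, then call its load."""
    with open(os.path.join(path, "metadata.json")) as fh:
        meta = json.load(fh)
    return REGISTRY[meta["class"]].load(path)


class SetValue(ParamsWritable, ParamsReadable):
    def __init__(self, output_cols=None, value=0.0):
        self.output_cols = list(output_cols or [])
        self.value = value

    def params(self):
        return {"output_cols": self.output_cols, "value": self.value}


with tempfile.TemporaryDirectory() as tmp:
    stage_dir = os.path.join(tmp, "stage0")
    SetValue(["a", "b"], 123.0).save(stage_dir)
    restored = load_params_instance(stage_dir)
```

A stage whose whole state lives in its params gets persistence from the mixins alone, which is the property the Spark mixins rely on too.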
Ok, so finally, you have a pretty simple transformer that extends Params and all your parameters are stored in the typical Params fashion:
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasOutputCols, Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import lit  # for the dummy _transform


class SetValueTransformer(
    Transformer, HasOutputCols, DefaultParamsReadable, DefaultParamsWritable,
):
    value = Param(
        Params._dummy(),
        "value",
        "value to fill",
    )

    @keyword_only
    def __init__(self, outputCols=None, value=0.0):
        super(SetValueTransformer, self).__init__()
        self._setDefault(value=0.0)
        kwargs = self._input_kwargs
        self._set(**kwargs)

    @keyword_only
    def setParams(self, outputCols=None, value=0.0):
        """
        setParams(self, outputCols=None, value=0.0)
        Sets params for this SetValueTransformer.
        """
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setValue(self, value):
        """
        Sets the value of :py:attr:`value`.
        """
        return self._set(value=value)

    def getValue(self):
        """
        Gets the value of :py:attr:`value` or its default value.
        """
        return self.getOrDefault(self.value)

    def _transform(self, dataset):
        for col in self.getOutputCols():
            dataset = dataset.withColumn(col, lit(self.getValue()))
        return dataset
Now we can use it:
from pyspark.ml import Pipeline, PipelineModel
svt = SetValueTransformer(outputCols=["a", "b"], value=123.0)
p = Pipeline(stages=[svt])
df = sc.parallelize([(1, None), (2, 1.0), (3, 0.5)]).toDF(["key", "value"])
pm = p.fit(df)
pm.transform(df).show()
pm.write().overwrite().save('/tmp/example_pyspark_pipeline')
pm2 = PipelineModel.load('/tmp/example_pyspark_pipeline')
print('matches?', pm2.stages[0].extractParamMap() == pm.stages[0].extractParamMap())
pm2.transform(df).show()
Result:
+---+-----+-----+-----+
|key|value|    a|    b|
+---+-----+-----+-----+
|  1| null|123.0|123.0|
|  2|  1.0|123.0|123.0|
|  3|  0.5|123.0|123.0|
+---+-----+-----+-----+
matches? True
+---+-----+-----+-----+
|key|value|    a|    b|
+---+-----+-----+-----+
|  1| null|123.0|123.0|
|  2|  1.0|123.0|123.0|
|  3|  0.5|123.0|123.0|
+---+-----+-----+-----+
I am not sure this is the best approach, but I too need the ability to save custom Estimators, Transformers and Models that I have created in Pyspark, and also to support their use in the Pipeline API with persistence. Custom Pyspark Estimators, Transformers and Models may be created and used in the Pipeline API but cannot be saved. This poses an issue in production when the model training takes longer than an event prediction cycle.
In general, Pyspark Estimators, Transformers and Models are just wrappers around the Java or Scala equivalents and the Pyspark wrappers just marshal the parameters to and from Java via py4j. Any persisting of the model is then done on the Java side. Because of this current structure, this limits Custom Pyspark Estimators, Transformers and Models to living only in the python world.
In a previous attempt, I was able to save a single Pyspark model by using Pickle/dill serialization. This worked well, but still did not allow saving or loading it from within the Pipeline API. Another SO post pointed me to the OneVsRest classifier, where I inspected the _to_java and _from_java methods. They do all the heavy lifting on the Pyspark side. After looking at them I thought: if one had a way to save the pickle dump into an existing, supported, savable Java object, then it should be possible to save a custom Pyspark Estimator, Transformer and Model with the Pipeline API.
To that end, I found the StopWordsRemover to be the ideal object to hijack because it has an attribute, stopwords, that is a list of strings. The dill.dumps method returns a pickled representation of the object as a string. The plan was to turn the string into a list and then set the stopwords parameter of a StopWordsRemover to this list. Though it is a list of strings, I found that some of the characters would not marshal to the Java object. So the characters get converted to integers, and the integers to strings. This all works great for saving a single instance, and also when saving within a Pipeline, because the Pipeline dutifully calls the _to_java method of my python class (we are still on the Pyspark side, so this works). But coming back to Pyspark from Java did not work in the Pipeline API.
Because I am hiding my python object in a StopWordsRemover instance, the Pipeline, when coming back to Pyspark, does not know anything about my hidden class object; it only knows it has a StopWordsRemover instance. Ideally, it would be great to subclass Pipeline and PipelineModel, but alas this brings us back to trying to serialize a Python object. To combat this, I created a PysparkPipelineWrapper that takes a Pipeline or PipelineModel and just scans the stages, looking for a coded ID in the stopwords list (remember, this is just the pickled bytes of my python object) that tells it to unwrap the list into my instance and store it back in the stage it came from. Below is code that shows how this all works.
For any Custom Pyspark Estimator, Transformer and Model, just inherit from Identifiable, PysparkReaderWriter, MLReadable, MLWritable. Then when loading a Pipeline and PipelineModel, pass such through PysparkPipelineWrapper.unwrap(pipeline).
This method does not address using the Pyspark code in Java or Scala, but at least we can save and load Custom Pyspark Estimators, Transformers and Models and work with Pipeline API.
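The encoding trick described above can be tried on its own, without Spark: serialize the object, turn every byte into a decimal string, and append a marker id so the list can be recognized later. This is a minimal Python 3 sketch that uses the standard-library pickle in place of dill and a plain list in place of the StopWordsRemover stop words. The function names are made up for illustration.

```python
import pickle

# Marker id appended to the list; the value is the one used in the post.
MARKER = "4c1740b00d3c4ff6806a1402321572cb"


def encode(obj):
    """Pickle obj and store each byte as a decimal string, plus the marker."""
    payload = pickle.dumps(obj)
    words = [str(b) for b in payload]  # iterating over bytes yields ints
    words.append(MARKER)
    return words


def is_wrapped(words):
    """True if the list looks like an encoded object rather than real stop words."""
    return bool(words) and words[-1] == MARKER


def decode(words):
    """Strip the marker and rebuild the original object."""
    if not is_wrapped(words):
        raise ValueError("not an encoded object")
    return pickle.loads(bytes(int(w) for w in words[:-1]))


state = {"dataset_count": 999, "name": "MyTransformer"}
words = encode(state)
restored = decode(words)
```

Every element is a short ASCII string, which is what sidesteps the marshalling problem with arbitrary characters.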
# Note: this is Python 2 code (print statements, xrange, str-based pickles).
import dill
from pyspark.ml import Transformer, Pipeline, PipelineModel
from pyspark.ml.param import Param, Params
from pyspark.ml.util import Identifiable, MLReadable, MLWritable, JavaMLReader, JavaMLWriter
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.wrapper import JavaParams
from pyspark.context import SparkContext
from pyspark.sql import Row


class PysparkObjId(object):
    """
    A class to specify constants used to identify and set up python
    Estimators, Transformers and Models so they can be serialized on their
    own and from within a Pipeline or PipelineModel.
    """
    def __init__(self):
        super(PysparkObjId, self).__init__()

    @staticmethod
    def _getPyObjId():
        return '4c1740b00d3c4ff6806a1402321572cb'

    @staticmethod
    def _getCarrierClass(javaName=False):
        return 'org.apache.spark.ml.feature.StopWordsRemover' if javaName else StopWordsRemover


class PysparkPipelineWrapper(object):
    """
    A class to facilitate converting the stages of a Pipeline or PipelineModel
    that were saved from PysparkReaderWriter.
    """
    def __init__(self):
        super(PysparkPipelineWrapper, self).__init__()

    @staticmethod
    def unwrap(pipeline):
        if not (isinstance(pipeline, Pipeline) or isinstance(pipeline, PipelineModel)):
            raise TypeError("Cannot recognize a pipeline of type %s." % type(pipeline))
        stages = pipeline.getStages() if isinstance(pipeline, Pipeline) else pipeline.stages
        for i, stage in enumerate(stages):
            if (isinstance(stage, Pipeline) or isinstance(stage, PipelineModel)):
                stages[i] = PysparkPipelineWrapper.unwrap(stage)
            if isinstance(stage, PysparkObjId._getCarrierClass()) and stage.getStopWords()[-1] == PysparkObjId._getPyObjId():
                swords = stage.getStopWords()[:-1]  # strip the id
                lst = [chr(int(d)) for d in swords]
                dmp = ''.join(lst)
                py_obj = dill.loads(dmp)
                stages[i] = py_obj
        if isinstance(pipeline, Pipeline):
            pipeline.setStages(stages)
        else:
            pipeline.stages = stages
        return pipeline


class PysparkReaderWriter(object):
    """
    A mixin class so custom pyspark Estimators, Transformers and Models may
    support saving and loading directly or be saved within a Pipeline or PipelineModel.
    """
    def __init__(self):
        super(PysparkReaderWriter, self).__init__()

    def write(self):
        """Returns an MLWriter instance for this ML instance."""
        return JavaMLWriter(self)

    @classmethod
    def read(cls):
        """Returns an MLReader instance for our carrier class."""
        return JavaMLReader(PysparkObjId._getCarrierClass())

    @classmethod
    def load(cls, path):
        """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
        swr_java_obj = cls.read().load(path)
        return cls._from_java(swr_java_obj)

    @classmethod
    def _from_java(cls, java_obj):
        """
        Get the dummy stopwords that are the characters of the dill dump plus our guid
        and convert, via dill, back to our python instance.
        """
        swords = java_obj.getStopWords()[:-1]  # strip the id
        lst = [chr(int(d)) for d in swords]  # convert from string integer list to bytes
        dmp = ''.join(lst)
        py_obj = dill.loads(dmp)
        return py_obj

    def _to_java(self):
        """
        Convert this instance to a dill dump, then to a list of strings with the unicode integer values of each character.
        Use this list as a set of dummy stopwords and store in a StopWordsRemover instance
        :return: Java object equivalent to this instance.
        """
        dmp = dill.dumps(self)
        pylist = [str(ord(d)) for d in dmp]  # convert bytes to string integer list
        pylist.append(PysparkObjId._getPyObjId())  # add our id so PysparkPipelineWrapper can id us.
        sc = SparkContext._active_spark_context
        java_class = sc._gateway.jvm.java.lang.String
        java_array = sc._gateway.new_array(java_class, len(pylist))
        for i in xrange(len(pylist)):
            java_array[i] = pylist[i]
        _java_obj = JavaParams._new_java_obj(PysparkObjId._getCarrierClass(javaName=True), self.uid)
        _java_obj.setStopWords(java_array)
        return _java_obj


class HasFake(Params):
    def __init__(self):
        super(HasFake, self).__init__()
        self.fake = Param(self, "fake", "fake param")

    def getFake(self):
        return self.getOrDefault(self.fake)


class MockTransformer(Transformer, HasFake, Identifiable):
    def __init__(self):
        super(MockTransformer, self).__init__()
        self.dataset_count = 0

    def _transform(self, dataset):
        self.dataset_count = dataset.count()
        return dataset


class MyTransformer(MockTransformer, Identifiable, PysparkReaderWriter, MLReadable, MLWritable):
    def __init__(self):
        super(MyTransformer, self).__init__()


def make_a_dataframe(sc):
    df = sc.parallelize([Row(name='Alice', age=5, height=80), Row(name='Alice', age=5, height=80), Row(name='Alice', age=10, height=80)]).toDF()
    return df


def test1():
    trA = MyTransformer()
    trA.dataset_count = 999
    print trA.dataset_count
    trA.save('test.trans')
    trB = MyTransformer.load('test.trans')
    print trB.dataset_count


def test2():
    trA = MyTransformer()
    pipeA = Pipeline(stages=[trA])
    print type(pipeA)
    pipeA.save('testA.pipe')
    pipeAA = PysparkPipelineWrapper.unwrap(Pipeline.load('testA.pipe'))
    stagesAA = pipeAA.getStages()
    trAA = stagesAA[0]
    print trAA.dataset_count


def test3():
    dfA = make_a_dataframe(sc)
    trA = MyTransformer()
    pipeA = Pipeline(stages=[trA]).fit(dfA)
    print type(pipeA)
    pipeA.save('testB.pipe')
    pipeAA = PysparkPipelineWrapper.unwrap(PipelineModel.load('testB.pipe'))
    stagesAA = pipeAA.stages
    trAA = stagesAA[0]
    print trAA.dataset_count
    dfB = pipeAA.transform(dfA)
    dfB.show()
I couldn't get @dmbaker's ingenious solution to work using Python 2 on Spark 2.2.0; I kept getting pickling errors. After several blind alleys I got a working solution by modifying his (her?) idea to write and read the parameter values as strings directly into StopWordsRemover's stop words.
Here's the base class you need if you want to save and load your own estimators or transformers:
from pyspark import SparkContext
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.util import Identifiable, MLWritable, JavaMLWriter, MLReadable, JavaMLReader
from pyspark.ml.wrapper import JavaWrapper, JavaParams


class PysparkReaderWriter(Identifiable, MLReadable, MLWritable):
    """
    A base class for custom pyspark Estimators and Models to support saving and loading directly
    or within a Pipeline or PipelineModel.
    """
    def __init__(self):
        super(PysparkReaderWriter, self).__init__()

    @staticmethod
    def _getPyObjIdPrefix():
        return "_ThisIsReallyA_"

    @classmethod
    def _getPyObjId(cls):
        return PysparkReaderWriter._getPyObjIdPrefix() + cls.__name__

    def getParamsAsListOfStrings(self):
        raise NotImplementedError("PysparkReaderWriter.getParamsAsListOfStrings() not implemented for instance: %r" % self)

    def write(self):
        """Returns an MLWriter instance for this ML instance."""
        return JavaMLWriter(self)

    def _to_java(self):
        # Convert all our parameters to strings:
        paramValuesAsStrings = self.getParamsAsListOfStrings()
        # Append our own type-specific id so PysparkPipelineLoader can detect this algorithm when unwrapping us.
        paramValuesAsStrings.append(self._getPyObjId())
        # Convert the parameter values to a Java array:
        sc = SparkContext._active_spark_context
        java_array = JavaWrapper._new_java_array(paramValuesAsStrings, sc._gateway.jvm.java.lang.String)
        # Create a Java (Scala) StopWordsRemover and give it the parameters as its stop words.
        _java_obj = JavaParams._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover", self.uid)
        _java_obj.setStopWords(java_array)
        return _java_obj

    @classmethod
    def _from_java(cls, java_obj):
        # Get the stop words, ignoring the id at the end:
        stopWords = java_obj.getStopWords()[:-1]
        return cls.createAndInitialisePyObj(stopWords)

    @classmethod
    def createAndInitialisePyObj(cls, paramsAsListOfStrings):
        raise NotImplementedError("PysparkReaderWriter.createAndInitialisePyObj() not implemented for type: %r" % cls)

    @classmethod
    def read(cls):
        """Returns an MLReader instance for our carrier class."""
        return JavaMLReader(StopWordsRemover)

    @classmethod
    def load(cls, path):
        """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
        swr_java_obj = cls.read().load(path)
        return cls._from_java(swr_java_obj)
Your own pyspark algorithm must then inherit from PysparkReaderWriter and override the getParamsAsListOfStrings() method which saves your parameters to a list of strings. Your algorithm must also override the createAndInitialisePyObj() method for converting a list of strings back into your parameters. Behind the scenes the parameters are converted to and from the stop words used by StopWordsRemover.
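To make the round trip concrete, here is a minimal Spark-free sketch of what the two overrides have to agree on. It is only an illustration in Python 3: the function names `params_to_strings` and `strings_to_params` are made up for this example and are not part of the base class; only the `_ThisIsReallyA_` prefix comes from the code above.

```python
# Spark-free sketch of the string round trip; all function names are hypothetical.
ID_PREFIX = "_ThisIsReallyA_"  # same marker idea as _getPyObjIdPrefix() above


def params_to_strings(name, tags, count, class_name):
    """Flatten three typed parameters into a list of strings plus a trailing id."""
    return [name, ",".join(tags), str(count), ID_PREFIX + class_name]


def strings_to_params(words):
    """Invert params_to_strings(): strip the id, then restore each type in order."""
    *values, marker = words
    if not marker.startswith(ID_PREFIX):
        raise ValueError("not a wrapped object: %r" % marker)
    name, tags, count = values
    return name, tags.split(","), int(count), marker[len(ID_PREFIX):]


words = params_to_strings("Hello", ["A", "B"], 42, "MyEstimator")
print(words)
print(strings_to_params(words))
```

Two things to watch with this kind of encoding: a list element that itself contains a comma will not survive the join/split, and an empty list comes back as `['']` rather than `[]`, so you may want a different separator or an explicit empty-list check for your own parameters.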
Example estimator with 3 parameters of different type:
from pyspark.ml.param.shared import Param, Params, TypeConverters
from pyspark.ml.base import Estimator

class MyEstimator(Estimator, PysparkReaderWriter):
    def __init__(self):
        super(MyEstimator, self).__init__()

    # 3 sample parameters, deliberately of different types:
    stringParam = Param(Params._dummy(), "stringParam", "A dummy string parameter", typeConverter=TypeConverters.toString)

    def setStringParam(self, value):
        return self._set(stringParam=value)

    def getStringParam(self):
        return self.getOrDefault(self.stringParam)

    listOfStringsParam = Param(Params._dummy(), "listOfStringsParam", "A dummy list of strings.", typeConverter=TypeConverters.toListString)

    def setListOfStringsParam(self, value):
        return self._set(listOfStringsParam=value)

    def getListOfStringsParam(self):
        return self.getOrDefault(self.listOfStringsParam)

    intParam = Param(Params._dummy(), "intParam", "A dummy int parameter.", typeConverter=TypeConverters.toInt)

    def setIntParam(self, value):
        return self._set(intParam=value)

    def getIntParam(self):
        return self.getOrDefault(self.intParam)

    def _fit(self, dataset):
        model = MyModel()
        # Just some changes to verify we can modify the model (and also it's something we can expect to see when restoring it later):
        model.setAnotherStringParam(self.getStringParam() + " World!")
        model.setAnotherListOfStringsParam(self.getListOfStringsParam() + ["E", "F"])
        model.setAnotherIntParam(self.getIntParam() + 10)
        return model

    def getParamsAsListOfStrings(self):
        paramValuesAsStrings = []
        paramValuesAsStrings.append(self.getStringParam())  # Parameter is already a string
        paramValuesAsStrings.append(','.join(self.getListOfStringsParam()))  # ...convert from a list of strings
        paramValuesAsStrings.append(str(self.getIntParam()))  # ...convert from an int
        return paramValuesAsStrings

    @classmethod
    def createAndInitialisePyObj(cls, paramsAsListOfStrings):
        # Convert back into our parameters. Make sure you do this in the same order you saved them!
        py_obj = cls()
        py_obj.setStringParam(paramsAsListOfStrings[0])
        py_obj.setListOfStringsParam(paramsAsListOfStrings[1].split(","))
        py_obj.setIntParam(int(paramsAsListOfStrings[2]))
        return py_obj
Example Model (also a Transformer) which has 3 different parameters:
from pyspark.ml.base import Model

class MyModel(Model, PysparkReaderWriter):
    def __init__(self):
        super(MyModel, self).__init__()

    # 3 sample parameters, deliberately of different types:
    anotherStringParam = Param(Params._dummy(), "anotherStringParam", "A dummy string parameter", typeConverter=TypeConverters.toString)

    def setAnotherStringParam(self, value):
        return self._set(anotherStringParam=value)

    def getAnotherStringParam(self):
        return self.getOrDefault(self.anotherStringParam)

    anotherListOfStringsParam = Param(Params._dummy(), "anotherListOfStringsParam", "A dummy list of strings.", typeConverter=TypeConverters.toListString)

    def setAnotherListOfStringsParam(self, value):
        return self._set(anotherListOfStringsParam=value)

    def getAnotherListOfStringsParam(self):
        return self.getOrDefault(self.anotherListOfStringsParam)

    anotherIntParam = Param(Params._dummy(), "anotherIntParam", "A dummy int parameter.", typeConverter=TypeConverters.toInt)

    def setAnotherIntParam(self, value):
        return self._set(anotherIntParam=value)

    def getAnotherIntParam(self):
        return self.getOrDefault(self.anotherIntParam)

    def _transform(self, dataset):
        # Dummy transform code:
        return dataset.withColumn('age2', dataset.age + self.getAnotherIntParam())

    def getParamsAsListOfStrings(self):
        paramValuesAsStrings = []
        paramValuesAsStrings.append(self.getAnotherStringParam())  # Parameter is already a string
        paramValuesAsStrings.append(','.join(self.getAnotherListOfStringsParam()))  # ...convert from a list of strings
        paramValuesAsStrings.append(str(self.getAnotherIntParam()))  # ...convert from an int
        return paramValuesAsStrings

    @classmethod
    def createAndInitialisePyObj(cls, paramsAsListOfStrings):
        # Convert back into our parameters. Make sure you do this in the same order you saved them!
        py_obj = cls()
        py_obj.setAnotherStringParam(paramsAsListOfStrings[0])
        py_obj.setAnotherListOfStringsParam(paramsAsListOfStrings[1].split(","))
        py_obj.setAnotherIntParam(int(paramsAsListOfStrings[2]))
        return py_obj
Below is a sample test case showing how you can save and load your model. It's similar for the estimator so I omit that for brevity.
def createAModel():
    m = MyModel()
    m.setAnotherStringParam("Boo!")
    m.setAnotherListOfStringsParam(["P", "Q", "R"])
    m.setAnotherIntParam(77)
    return m

def testSaveLoadModel():
    modA = createAModel()
    print(modA.explainParams())
    savePath = "/whatever/path/you/want"
    #modA.save(savePath) # Can't overwrite, so...
    modA.write().overwrite().save(savePath)
    modB = MyModel.load(savePath)
    print(modB.explainParams())

testSaveLoadModel()
Output:
anotherIntParam: A dummy int parameter. (current: 77)
anotherListOfStringsParam: A dummy list of strings. (current: ['P', 'Q', 'R'])
anotherStringParam: A dummy string parameter (current: Boo!)
anotherIntParam: A dummy int parameter. (current: 77)
anotherListOfStringsParam: A dummy list of strings. (current: [u'P', u'Q', u'R'])
anotherStringParam: A dummy string parameter (current: Boo!)
Notice how the parameters have come back in as unicode strings. This may or may not make a difference to your underlying algorithm that you implement in _transform() (or _fit() for the estimator). So be aware of this.
Finally, because the Scala algorithm behind the scenes is really a StopWordsRemover, you need to unwrap it back into your own class when loading the Pipeline or PipelineModel from disk. Here's the utility class that does this unwrapping:
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StopWordsRemover

class PysparkPipelineLoader(object):
    """
    A class to facilitate converting the stages of a Pipeline or PipelineModel
    that were saved from PysparkReaderWriter.
    """
    def __init__(self):
        super(PysparkPipelineLoader, self).__init__()

    @staticmethod
    def unwrap(thingToUnwrap, customClassList):
        if not (isinstance(thingToUnwrap, Pipeline) or isinstance(thingToUnwrap, PipelineModel)):
            raise TypeError("Cannot recognize an object of type %s." % type(thingToUnwrap))
        stages = thingToUnwrap.getStages() if isinstance(thingToUnwrap, Pipeline) else thingToUnwrap.stages
        for i, stage in enumerate(stages):
            if (isinstance(stage, Pipeline) or isinstance(stage, PipelineModel)):
                # Nested pipelines need the same class list to unwrap their own stages.
                stages[i] = PysparkPipelineLoader.unwrap(stage, customClassList)
            if isinstance(stage, StopWordsRemover) and stage.getStopWords()[-1].startswith(PysparkReaderWriter._getPyObjIdPrefix()):
                lastWord = stage.getStopWords()[-1]
                className = lastWord[len(PysparkReaderWriter._getPyObjIdPrefix()):]
                stopWords = stage.getStopWords()[:-1]  # Strip the id
                # Create and initialise the appropriate class:
                py_obj = None
                for clazz in customClassList:
                    if clazz.__name__ == className:
                        py_obj = clazz.createAndInitialisePyObj(stopWords)
                if py_obj is None:
                    raise TypeError("I don't know how to create an instance of type: %s" % className)
                stages[i] = py_obj
        if isinstance(thingToUnwrap, Pipeline):
            thingToUnwrap.setStages(stages)
        else:
            # PipelineModel
            thingToUnwrap.stages = stages
        return thingToUnwrap
Test for saving and loading a pipeline:
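def createAnEstimator():
    # Not shown in the original post; these values are inferred from the output printed below.
    est = MyEstimator()
    est.setStringParam("Hello")
    est.setListOfStringsParam(["A", "B", "C", "D"])
    est.setIntParam(42)
    return est

def testSaveAndLoadUnfittedPipeline():
    estA = createAnEstimator()
    #print(estA.explainParams())
    pipelineA = Pipeline(stages=[estA])
    savePath = "/whatever/path/you/want"
    #pipelineA.save(savePath) # Can't overwrite, so...
    pipelineA.write().overwrite().save(savePath)
    pipelineReloaded = PysparkPipelineLoader.unwrap(Pipeline.load(savePath), [MyEstimator])
    estB = pipelineReloaded.getStages()[0]
    print(estB.explainParams())

testSaveAndLoadUnfittedPipeline()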
Output:
intParam: A dummy int parameter. (current: 42)
listOfStringsParam: A dummy list of strings. (current: [u'A', u'B', u'C', u'D'])
stringParam: A dummy string parameter (current: Hello)
Test for saving and loading a pipeline model:
from pyspark.sql import Row

def make_a_dataframe(sc):
    df = sc.parallelize([
        Row(name='Alice', age=5, height=80),
        Row(name='Bob', age=7, height=85),
        Row(name='Chris', age=10, height=90),
    ]).toDF()
    return df

def testSaveAndLoadPipelineModel():
    dfA = make_a_dataframe(sc)
    estA = createAnEstimator()
    #print(estA.explainParams())
    pipelineModelA = Pipeline(stages=[estA]).fit(dfA)
    savePath = "/whatever/path/you/want"
    #pipelineModelA.save(savePath) # Can't overwrite, so...
    pipelineModelA.write().overwrite().save(savePath)
    pipelineModelReloaded = PysparkPipelineLoader.unwrap(PipelineModel.load(savePath), [MyModel])
    modB = pipelineModelReloaded.stages[0]
    print(modB.explainParams())
    dfB = pipelineModelReloaded.transform(dfA)
    dfB.show()

testSaveAndLoadPipelineModel()
Output:
anotherIntParam: A dummy int parameter. (current: 52)
anotherListOfStringsParam: A dummy list of strings. (current: [u'A', u'B', u'C', u'D', u'E', u'F'])
anotherStringParam: A dummy string parameter (current: Hello World!)
+---+------+-----+----+
|age|height| name|age2|
+---+------+-----+----+
|  5|    80|Alice|  57|
|  7|    85|  Bob|  59|
| 10|    90|Chris|  62|
+---+------+-----+----+
When unwrapping a pipeline or pipeline model you have to pass in a list of the classes that correspond to your own pyspark algorithms that are masquerading as StopWordsRemover objects in the saved pipeline or pipeline model. The last stop word in your saved object is used to identify your own class's name and then createAndInitialisePyObj() is called to create an instance of your class and initialise its parameters with the remaining stop words.
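The lookup described above can be tried without Spark. The following is only a rough sketch in Python 3: `FakeCarrier` stands in for a loaded StopWordsRemover stage, and `MyThing` and `unwrap_stage` are hypothetical names invented for the illustration. Only the prefix string and the `createAndInitialisePyObj()` convention come from the code in this answer.

```python
ID_PREFIX = "_ThisIsReallyA_"  # marker prefix used by PysparkReaderWriter above


class FakeCarrier(object):
    """Hypothetical stand-in for a loaded StopWordsRemover stage."""
    def __init__(self, stop_words):
        self._stop_words = list(stop_words)

    def getStopWords(self):
        return self._stop_words


class MyThing(object):
    """Hypothetical custom stage that rebuilds itself from a list of strings."""
    def __init__(self, label=None):
        self.label = label

    @classmethod
    def createAndInitialisePyObj(cls, params_as_strings):
        return cls(label=params_as_strings[0])


def unwrap_stage(stage, custom_classes):
    """Return the custom object hidden in `stage`, or `stage` itself if it is not a carrier."""
    if not isinstance(stage, FakeCarrier):
        return stage
    words = stage.getStopWords()
    if not words or not words[-1].startswith(ID_PREFIX):
        return stage  # an ordinary stage with real stop words: leave it alone
    class_name = words[-1][len(ID_PREFIX):]
    by_name = {c.__name__: c for c in custom_classes}
    if class_name not in by_name:
        raise TypeError("I don't know how to create an instance of type: %s" % class_name)
    # The remaining words are the saved parameters.
    return by_name[class_name].createAndInitialisePyObj(words[:-1])


restored = unwrap_stage(FakeCarrier(["hello", ID_PREFIX + "MyThing"]), [MyThing])
print(type(restored).__name__, restored.label)
```

Because classes are matched by `__name__` only, two custom classes with the same name in different modules would be ambiguous; if that matters to you, storing a fully qualified name in the marker is one option.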
Various refinements could be made. But hopefully this will enable you to save and load custom estimators and transformers, both inside and outside pipelines, until SPARK-17025 is resolved and available to you.
Similar to the working answer by @dmbaker, I wrapped my custom transformer, called Aggregator, inside a built-in Spark transformer (Binarizer in this example), though I'm sure you can inherit from other transformers, too. That allowed my custom transformer to inherit the methods necessary for serialization.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, Binarizer
from pyspark.ml.regression import LinearRegression

class Aggregator(Binarizer):
    """A huge hack to allow serialization of custom transformer."""

    def transform(self, input_df):
        agg_df = input_df\
            .groupBy('channel_id')\
            .agg({
                'foo': 'avg',
                'bar': 'avg',
            })\
            .withColumnRenamed('avg(foo)', 'avg_foo')\
            .withColumnRenamed('avg(bar)', 'avg_bar')
        return agg_df

# Create pipeline stages.
aggregator = Aggregator()
vector_assembler = VectorAssembler(...)
linear_regression = LinearRegression()

# Create pipeline.
pipeline = Pipeline(stages=[aggregator, vector_assembler, linear_regression])

# Train.
pipeline_model = pipeline.fit(input_df)

# Save model file to S3.
pipeline_model.save('s3n://example')
The @dmbaker solution didn't work for me, I believe because of the Python version (2.x versus 3.x). I made some updates to his solution and now it works on Python 3. My setup is listed below:
python: 3.6.3
spark: 2.2.1
dill: 0.2.7.1
import dill
from pyspark import SparkContext, keyword_only
from pyspark.ml import Pipeline, PipelineModel, Transformer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import Identifiable, JavaMLReader, JavaMLWriter, MLReadable, MLWritable
from pyspark.ml.wrapper import JavaParams

class PysparkObjId(object):
    """
    A class to specify constants used to identify and set up Python
    Estimators, Transformers and Models so they can be serialized on their
    own and from within a Pipeline or PipelineModel.
    """
    def __init__(self):
        super(PysparkObjId, self).__init__()

    @staticmethod
    def _getPyObjId():
        return '4c1740b00d3c4ff6806a1402321572cb'

    @staticmethod
    def _getCarrierClass(javaName=False):
        return 'org.apache.spark.ml.feature.StopWordsRemover' if javaName else StopWordsRemover

class PysparkPipelineWrapper(object):
    """
    A class to facilitate converting the stages of a Pipeline or PipelineModel
    that were saved from PysparkReaderWriter.
    """
    def __init__(self):
        super(PysparkPipelineWrapper, self).__init__()

    @staticmethod
    def unwrap(pipeline):
        if not (isinstance(pipeline, Pipeline) or isinstance(pipeline, PipelineModel)):
            raise TypeError("Cannot recognize a pipeline of type %s." % type(pipeline))
        stages = pipeline.getStages() if isinstance(pipeline, Pipeline) else pipeline.stages
        for i, stage in enumerate(stages):
            if (isinstance(stage, Pipeline) or isinstance(stage, PipelineModel)):
                stages[i] = PysparkPipelineWrapper.unwrap(stage)
            if isinstance(stage, PysparkObjId._getCarrierClass()) and stage.getStopWords()[-1] == PysparkObjId._getPyObjId():
                swords = stage.getStopWords()[:-1]  # strip the id
                # each stop word is the decimal value of one byte of the dill dump
                dmp = bytes(int(d) for d in swords)
                py_obj = dill.loads(dmp)
                stages[i] = py_obj
        if isinstance(pipeline, Pipeline):
            pipeline.setStages(stages)
        else:
            pipeline.stages = stages
        return pipeline

class PysparkReaderWriter(object):
    """
    A mixin class so custom pyspark Estimators, Transformers and Models may
    support saving and loading directly or be saved within a Pipeline or PipelineModel.
    """
    def __init__(self):
        super(PysparkReaderWriter, self).__init__()

    def write(self):
        """Returns an MLWriter instance for this ML instance."""
        return JavaMLWriter(self)

    @classmethod
    def read(cls):
        """Returns an MLReader instance for our carrier class."""
        return JavaMLReader(PysparkObjId._getCarrierClass())

    @classmethod
    def load(cls, path):
        """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
        swr_java_obj = cls.read().load(path)
        return cls._from_java(swr_java_obj)

    @classmethod
    def _from_java(cls, java_obj):
        """
        Get the dummy stop words, which are the bytes of the dill dump plus our guid,
        and convert them, via dill, back to our Python instance.
        """
        swords = java_obj.getStopWords()[:-1]  # strip the id
        # the stop words come back as strings, so convert to int before rebuilding the bytes
        dmp = bytes(int(d) for d in swords)
        py_obj = dill.loads(dmp)
        return py_obj

    def _to_java(self):
        """
        Convert this instance to a dill dump, then to a list of strings with the integer value of each byte.
        Use this list as a set of dummy stop words and store it in a StopWordsRemover instance.
        :return: Java object equivalent to this instance.
        """
        dmp = dill.dumps(self)
        pylist = [str(int(d)) for d in dmp]  # convert bytes to string integer list
        pylist.append(PysparkObjId._getPyObjId())  # add our id so PysparkPipelineWrapper can id us.
        sc = SparkContext._active_spark_context
        java_class = sc._gateway.jvm.java.lang.String
        java_array = sc._gateway.new_array(java_class, len(pylist))
        for i in range(len(pylist)):
            java_array[i] = pylist[i]
        _java_obj = JavaParams._new_java_obj(PysparkObjId._getCarrierClass(javaName=True), self.uid)
        _java_obj.setStopWords(java_array)
        return _java_obj

class HasFake(Params):
    def __init__(self):
        super(HasFake, self).__init__()
        self.fake = Param(self, "fake", "fake param")

    def getFake(self):
        return self.getOrDefault(self.fake)

class CleanText(Transformer, HasInputCol, HasOutputCol, Identifiable, PysparkReaderWriter, MLReadable, MLWritable):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(CleanText, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
    # setParams() and _transform() are not shown in this excerpt.
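The byte encoding used by `_to_java` and `_from_java` above can be checked in isolation. This is a minimal sketch under two assumptions: it uses the standard library's `pickle` in place of `dill` so it runs without extra packages, and the helper names `object_to_words` and `words_to_object` are made up for the example. The marker string is the id from `PysparkObjId`.

```python
import pickle


def object_to_words(obj, marker):
    """Serialise obj and return one decimal string per byte, plus a trailing marker."""
    dump = pickle.dumps(obj)        # the answer above uses dill.dumps here
    words = [str(b) for b in dump]  # iterating over bytes yields ints in Python 3
    words.append(marker)
    return words


def words_to_object(words, marker):
    """Check the marker, rebuild the byte string and deserialise it."""
    if not words or words[-1] != marker:
        raise ValueError("marker not found")
    # The words are strings, so convert each one to int before building bytes.
    dump = bytes(int(w) for w in words[:-1])
    return pickle.loads(dump)


MARKER = "4c1740b00d3c4ff6806a1402321572cb"  # the id string from PysparkObjId above
original = {"inputCol": "text", "threshold": 3}
words = object_to_words(original, MARKER)
restored = words_to_object(words, MARKER)
print(restored == original)
```

Note that this scheme stores one stop word per byte of the dump, so the saved stage grows linearly with the size of the serialised object.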
I wrote some base classes to make this easier. Basically, I moved all the complexity of the code and initialisation into base classes that expose a much simpler API for building custom transformers and estimators. This includes taking care of the serialisation/deserialisation problem and of saving and loading SparkML objects. You can then concentrate on the __init__ and transform/fit functions. You can find a full explanation with examples here.

Resources