How can I use Hyperopt with MLflow within a pandas_udf?


I'm building multiple Prophet models where each model is passed to a pandas_udf function which trains the model and stores the results with MLflow.
    @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
    def forecast(data):
        ......
        with mlflow.start_run() as run:
            ......
Then I call this UDF, which trains a model for each KPI.
    df.groupBy('KPI').apply(forecast)
The idea is that, for each KPI, a model will be trained with multiple hyperparameter settings and the best params for each model will be stored in MLflow. I would like to use Hyperopt to make the search more efficient.
In this case, where should I place the objective function? Since the data is passed to the UDF for each model, I thought of creating an inner function within the UDF that uses the data for each run. Does this make sense?

If I remember correctly, you can't do this, because it would amount to nested Spark execution, which doesn't work in Spark. You'll need to change the approach to something like:
    for kpi in list_of_kpis:
        run_hyperopt_tuning(kpi)
This applies if you need to tune the parameters of every KPI model separately, since each call optimizes the parameters for one KPI independently.
If the KPI is more like a hyperparameter of the model, then you can just include the list of KPIs in the search space and load the necessary data inside the function that does the training and evaluation.
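The per-KPI loop above could be sketched on the driver as follows. This is only an illustration under stated assumptions: `load_kpi_data`, `objective`, and `tune_one_kpi` are hypothetical names, the data is synthetic, the "model" is a toy linear trend rather than Prophet, and a plain random search stands in for Hyperopt so that the sketch runs without extra dependencies. With Hyperopt installed, the search loop in `tune_one_kpi` would be replaced by a call to `hyperopt.fmin` using the same kind of objective, and the best parameters would be logged inside an `mlflow.start_run()` block.

```python
import random


def load_kpi_data(kpi):
    # Hypothetical loader: returns a small synthetic series for one KPI.
    rng = random.Random(kpi)
    return [10 + 0.5 * t + rng.gauss(0, 1) for t in range(30)]


def objective(params, series):
    # Toy loss: mean squared error of a linear trend with the candidate slope.
    # A real objective would fit and evaluate a Prophet model here.
    base = series[0]
    errors = [(base + params["slope"] * t - y) ** 2 for t, y in enumerate(series)]
    return sum(errors) / len(errors)


def tune_one_kpi(series, max_evals=50, seed=0):
    # Random search, used here as a stand-in for Hyperopt's search.
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {"slope": rng.uniform(0.0, 1.0)}
        loss = objective(params, series)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


list_of_kpis = ["revenue", "visits"]  # hypothetical KPI names

# One independent search per KPI, executed sequentially on the driver.
best_by_kpi = {kpi: tune_one_kpi(load_kpi_data(kpi)) for kpi in list_of_kpis}

for kpi, (params, loss) in best_by_kpi.items():
    # This is where the best params would be logged to MLflow.
    print(kpi, params, round(loss, 3))
```

Because the objective receives the data as an argument, the same function can be reused for every KPI; only the data loaded in the loop changes.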

Related

combine multiple spacy textcat_multilabel models into a single textcat_multilabel model

Problem: I have millions of records that need to be transformed using a bunch of spacy textcat_multilabel models.
    # pseudocode
    for model in models:
        nlp = spacy.load(model)
        for groups_of_records in records:  # millions of records
            new_data = nlp.pipe(groups_of_records)  # data is processed in bulk
            # process data
            bulk_create_records(new_data)
My current loop is as follows:
load a model
loop through records / transform data using model / save
As you can imagine, the more records I process and the more models I include, the longer this entire process takes. The idea is to make a single model and process my data just once, instead of (n * num_of_models) times.
Question: is there a way to combine multiple textcat_multilabel models created from the same spacy config, into a single textcat_multilabel model?

There is no basic feature to just combine models, but there are a couple of ways you can do this.
One is to source all your components into the same pipeline. This is very easy to do, see the double NER project for an example. The disadvantage is that this might not save you much processing time, since separately trained models will still have their own tok2vec layers.
You could combine your training data and train one big model. But if your models are actually separate that would almost certainly cause a reduction in accuracy.
If speed is the primary concern, you could train each of your textcats separately while freezing your tok2vec. That would result in decreased accuracy, though maybe not too bad, and it would allow you to then combine the textcat models in the same pipeline while removing a bunch of tok2vec processing. (This is probably the method I've listed with the best balance of implementation complexity, speed advantage, and accuracy sacrificed.)
One thing that I don't think has been tested is that you could try training separate textcat models at the same time with separate sets of labels by manually specifying the labels to each component in their configs. I am not completely sure that would work but you could try it.

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionality of the lime Python library around Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as input and build a function that calls the Spark model, applies the lime algorithm, and gives an explanation for each single row.
Some context
The lime algorithm consists in approximating locally a trained machine learning model. In its implementation, lime just basically needs a function that, given a feature vector as input, evaluates the predictions of the model. With this function, lime can perturb slightly the feature input, see how the model predictions change and then give an explanation. So, theoretically, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Using StringIndexer and similar tools, it's pretty easy to build such a vector from the "raw" values of the data. Then, I built a function that, from such a vector (or a 2d array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel, and returns the model predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, build and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied to the above DataFrame, and the results are collected and brought back to the lime explainer, which continues its work
The problem
As you can see (if you've read this far!), for each row of the DataFrame you build another DataFrame. So you cannot define a udf, since you are not allowed to call Spark functions inside a udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?

I think you can still use udfs in this case, followed by explode() to retrieve all the results on separate rows. You just have to make sure the input column is already the vector you want to feed to lime.
That way you don't even have to collect out of Spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed (not sure).
    from pyspark.sql.functions import explode, udf
    from pyspark.sql.types import ArrayType, DoubleType
    def function(base_case):
        # limefunction is a placeholder for the lime perturbation step
        list_similarCases = limefunction(base_case)
        return list_similarCases
    # assumes each similar case is returned as a plain list of floats
    f_udf = udf(function, ArrayType(ArrayType(DoubleType())))
    df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))

sklearn pipeline + keras sequential model - how to get history?

Keras models return a history object when .fit is called. Is it possible to retrieve it if I use the model as one step of a sklearn pipeline?
By the way, I'm using Python 3.6.
Thanks in advance!

The History callback records training metrics for each epoch. This includes the loss and the accuracy (for classification problems) as well as the loss and accuracy for the validation dataset, if one is set.
The history object is returned from calls to the fit() function used to train the model. Metrics are stored in a dictionary in the history member of the object returned.
This also means that the values are only available in the scope of the fit() call or of the Sequential model itself. When the model is a step in a sklearn pipeline, the pipeline doesn't have access to those values, and it can't store or return what it can't see.
As of right now I am not aware of a history callback in sklearn, so the only option I see for you is to manually record the metrics you want to track. One way to do so would be to have the pipeline return the transformed data and then simply fit your model on it. If you can't figure that out, leave a comment.
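The workaround described above (let the pipeline prepare the data, then fit the model directly so that the return value of fit() stays in your hands) might look like the sketch below. `TinyLinearModel` and `History` are hypothetical stand-ins for a Keras Sequential model and the object its fit() returns, used only so the example runs without Keras; the synthetic data and the StandardScaler step are assumptions as well.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class History:
    """Stand-in for the object returned by Keras' fit()."""

    def __init__(self):
        self.history = {"loss": []}


class TinyLinearModel:
    """Hypothetical stand-in for a Keras Sequential model."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = None

    def fit(self, X, y, epochs=5):
        hist = History()
        self.w = np.zeros(X.shape[1])
        for _ in range(epochs):
            err = X @ self.w - y
            # Record the loss at the start of each epoch, then take one
            # gradient-descent step on the mean squared error.
            hist.history["loss"].append(float(np.mean(err ** 2)))
            grad = 2 * (X.T @ err) / len(y)
            self.w -= self.lr * grad
        return hist


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
y = X @ np.array([1.0, -2.0])

# The pipeline only handles preprocessing and returns the transformed data.
preprocess = Pipeline([("scale", StandardScaler())])
X_t = preprocess.fit_transform(X)

# The model is fitted outside the pipeline, so the history is returned to us.
history = TinyLinearModel().fit(X_t, y, epochs=5)
print(history.history["loss"])
```

With a real Keras model, the last step would be `history = model.fit(X_t, y, ...)`, and `history.history` would hold the per-epoch metrics.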

Aggregate training results to predict

When training a model, the results depend on the sampling. To obtain something better, you could repeat the training (on other randomly created training samples, using K-folds, StratifiedKFold, ...), somehow aggregate the results, and in this way get a result that is more robust than one created from a single sample alone. Question: is this already implemented in sklearn or similar? Apologies if this is a straightforward question; I haven't seen a simple solution.
I see that there is a function called cross_val_predict. However, my first impression from a quick look at the source code is that it predicts as many times as it trains, and I would like to predict only once, so that I can pickle the (somehow aggregated) results and predict later instead of repeating the whole training again.

So far I think the best option is the ensemble methods in sklearn.
I'm leaving here the solution I was using before. I am pretty sure it could be improved (as mentioned above, the ensemble methods in sklearn are better). I have placed it at https://github.com/rafaelvalero/aggreating_predictions_sklearn, where there is a notebook with an example (using the iris dataset), in case anyone wants to play around and see in detail how it can be done.
That solution trains models (in parallel, using joblib), pickles the trained models (models from sklearn), stores the results (using joblib dump), and later recovers them to create predictions (in parallel, using joblib) that are then aggregated.
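As a rough illustration of the ensemble route mentioned above, the sketch below uses sklearn's BaggingClassifier, which trains several copies of a base estimator on random resamples of the training data and aggregates their predictions. The choice of decision trees, 25 estimators, and the iris split are assumptions made for the example, not part of the linked notebook. The fitted ensemble is pickled once and can then predict later without retraining.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Each of the 25 trees is trained on its own bootstrap sample of X_train;
# predict() aggregates the individual trees' outputs.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
model.fit(X_train, y_train)

# Serialize the whole fitted ensemble once...
blob = pickle.dumps(model)

# ...and restore it later to predict without repeating the training.
restored = pickle.loads(blob)
predictions = restored.predict(X_test)
print(restored.score(X_test, y_test))
```

Passing `n_jobs` to BaggingClassifier would train the members in parallel, which covers the joblib part of the manual solution.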

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle

Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, using the original sample index as stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
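One way to try the idea of splitting by original sample index is sketched below. Note that it swaps in sklearn's GroupKFold rather than the stratified splitters named above, because GroupKFold keeps all rows that share a group label (here, the original sample index) in the same fold. The sizes (20 originals, 3 rows each), the noise used to generate extra rows, and the `is_original` mask are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

n_original, n_per_sample = 20, 3
rng = np.random.default_rng(0)
X_orig = rng.normal(size=(n_original, 4))
y_orig = np.arange(n_original) % 2

# Expanded data: every original row is followed by two noisy copies.
groups = np.repeat(np.arange(n_original), n_per_sample)
is_original = np.tile([True, False, False], n_original)
X = np.repeat(X_orig, n_per_sample, axis=0)
noise = rng.normal(scale=0.01, size=X.shape)
noise[is_original] = 0.0  # keep the original rows unchanged
X = X + noise
y = np.repeat(y_orig, n_per_sample)

scores, leaks = [], []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # Check that no original sample contributes rows to both sides.
    leaks.append(bool(set(groups[train_idx]) & set(groups[test_idx])))

    clf = RandomForestClassifier(n_estimators=20, random_state=0)
    clf.fit(X[train_idx], y[train_idx])

    # Score on the original rows of the held-out fold only.
    test_orig = test_idx[is_original[test_idx]]
    scores.append(clf.score(X[test_orig], y[test_orig]))

print(scores)
```

The same `groups` array can be passed to GridSearchCV via `fit(X, y, groups=groups)` when `cv` is a GroupKFold instance; the custom scorer would then still need to filter out the generated rows.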
