Databricks: save a fitted logistic regression object so it can be called from another notebook

I fitted a logistic regression to my data, and in another notebook I am trying to predict probabilities with the previously fitted model.
I would rather not call the whole notebook and rerun everything every single time; I just want to save the fitted logreg in an object (or something like that) and call only that from the other notebook on Databricks.
This is my function, from the "gaming" notebook:
log_test_var = logreg.fit(X_train, y_train)
I would like to save this log_test_var somewhere so that I can predict my X column values from the clean_text notebook:
y_pred = log_test_var.predict(X)
Thanks

Unfortunately, it is not possible to call a particular variable from another notebook directly.
Workaround:
Use the %run command to run a notebook inside another notebook.
For example, create the variable in a parent notebook.
To access the variable from the parent notebook, use %run notebook_name in the child notebook.
This method is suitable if you want to define a notebook that holds all your constant variables, or a centralized shared function library, and refer to them from the calling (child) notebook.
You can refer to this article by PAVAN BANGAD.
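A minimal sketch of that workaround, using the notebook and variable names from the question (the relative path ./gaming is an assumption; adjust it to where your notebook actually lives).
In the "gaming" (parent) notebook:
from sklearn.linear_model import LogisticRegression

# fit the model and leave it bound to a variable in this notebook's scope
logreg = LogisticRegression()
log_test_var = logreg.fit(X_train, y_train)
In the "clean_text" (child) notebook, put the %run magic alone in its own cell, so that it executes the parent notebook in this notebook's context:
%run ./gaming
Then, in a later cell, log_test_var is available:
y_pred = log_test_var.predict(X)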

Related

Same sklearn pipeline different results

I have created a pipeline based on:
A custom TfidfVectorizer that returns the TF-IDF matrix as a dataframe (600 features)
A custom feature generator that creates new features (5)
A FeatureUnion that joins the two dataframes; I checked that the output is an array, so there are no feature names (605 features in total)
An XGBoost classifier, with seed and random state set (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the test input is always the same, and so is the Python environment.
Could you help me with some suggestions?
Thanks!
Things I have tried: saving the pipeline with different libraries, adding a DenseTransformer at some points, using ColumnTransformer instead of FeatureUnion, and so on (I cannot use the pmml library due to some restrictions). The problem is always the same.
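For reference, a minimal sketch of the kind of pipeline and save/load round trip described above, with a trivial stand-in for the custom feature generator (the stand-in class, the file name and the toy data are assumptions, not the asker's actual code):
import pickle
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

class ExtraFeatures(BaseEstimator, TransformerMixin):
    # hypothetical stand-in for the question's custom feature generator
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(t), t.count(" ")] for t in X])

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(max_features=600)),
        ("extra", ExtraFeatures()),
    ])),
    ("clf", XGBClassifier(random_state=42)),
])

texts = ["some text", "other text", "more words here", "and another one"]
labels = [0, 1, 0, 1]
pipeline.fit(texts, labels)

# Save, reload, and check that predictions on the same input match.
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)
with open("pipeline.pkl", "rb") as f:
    reloaded = pickle.load(f)
print((pipeline.predict(texts) == reloaded.predict(texts)).all())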

How to slice Kinetics400 training dataset? (pytorch)

I am trying to run the official script for video classification.
I want to tweak some functions and running through all examples would cost me too much time.
I wonder how I can slice the Kinetics training dataset based on that script.
This is the code I added before
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)
in the script (let's say I just want to run 100 examples):
tr_split_len = 100
dataset = torch.utils.data.random_split(dataset, [tr_split_len, len(dataset)-tr_split_len])[0]
Then, when hitting
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)
it throws this error:
AttributeError: 'Subset' object has no attribute 'video_clips'
So the type of dataset changes from torchvision.datasets.kinetics.Kinetics400 to torch.utils.data.dataset.Subset.
I understand why. So how can I do it (hopefully not by using break inside the dataloader loop)?
Thanks.
It seems that torchvision.datasets.kinetics.Kinetics400 internally uses an object of class VideoClips to store the information about the clips. It is stored in the member variable Kinetics400().video_clips.
The VideoClips class has a method called subset that takes a list of indices and returns a new VideoClips object containing only the clips at the specified indices. You can then just replace the old VideoClips object in your dataset with the new one.
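A minimal sketch of that suggestion, keeping only the first 100 entries (check the VideoClips.subset docstring for your torchvision version to see whether the indices refer to whole videos or to individual clips):
tr_split_len = 100
# build a smaller VideoClips object and swap it into the dataset
dataset.video_clips = dataset.video_clips.subset(list(range(tr_split_len)))
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)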

How can I use Hyperopt with MLFlow within a pandas_udf?

I'm building multiple Prophet models where each model is passed to a pandas_udf function which trains the model and stores the results with MLflow.
@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast(data):
    ...
    with mlflow.start_run() as run:
        ...
Then I call this UDF which trains a model for each KPI.
df.groupBy('KPI').apply(forecast)
The idea is that, for each KPI, a model will be trained with multiple hyperparameter settings, and the best params for each model will be stored in MLflow. I would like to use Hyperopt to make the search more efficient.
In this case, where should I place the objective function? Since the data is passed to the UDF for each model I thought of creating an inner function within the UDF that uses the data for each run. Does this make sense?
If I remember correctly, you can't do it, because it would amount to something like nested Spark execution, and that doesn't work with Spark. You'll need to change the approach to something like:
for kpi in list_of_kpis:
    run_hyperopt_tuning(kpi)
if you need to tune parameters for every KPI model separately, because each KPI's parameters will then be optimized separately.
If the KPI is more like a hyperparameter of the model, then you can just include the list of KPIs in the search space and load the necessary data inside the function that does the training and evaluation.
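A hedged sketch of the first approach, looping over KPIs on the driver and running one Hyperopt search per KPI; train_and_evaluate, list_of_kpis and the search space are hypothetical placeholders for your own Prophet training code:
import mlflow
from hyperopt import fmin, tpe, hp, Trials

search_space = {
    "changepoint_prior_scale": hp.loguniform("changepoint_prior_scale", -5, 0),
    "seasonality_prior_scale": hp.loguniform("seasonality_prior_scale", -5, 2),
}

for kpi in list_of_kpis:
    # collect this KPI's slice to the driver so Hyperopt runs outside Spark tasks
    kpi_pdf = df.filter(df.KPI == kpi).toPandas()

    def objective(params):
        # fit Prophet on kpi_pdf with these params and return a loss such as RMSE
        return train_and_evaluate(kpi_pdf, params)

    with mlflow.start_run(run_name=f"prophet_{kpi}"):
        best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                    max_evals=50, trials=Trials())
        mlflow.log_params(best)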

sklearn pipeline + keras sequential model - how to get history?

Keras models, when .fit is called, return a history object. Is it possible to retrieve it if I use this model as one step of a sklearn pipeline?
By the way, I'm using Python 3.6.
Thanks in advance!
The History callback records training metrics for each epoch. This includes the loss and the accuracy (for classification problems) as well as the loss and accuracy for the validation dataset, if one is set.
The history object is returned from calls to the fit() function used to train the model. Metrics are stored in a dictionary in the history member of the object returned.
This also means that the History object lives in the scope of the fit() call on the Sequential model; when that model sits inside a sklearn pipeline, the pipeline does not capture or expose the returned value, so it cannot store or return what it never sees.
As of right now I am not aware of a history callback in sklearn, so the only option I see for you is to manually record the metrics you want to track. One way to do so would be to have the pipeline return the (transformed) data and then simply fit your model on it, as sketched below. If you are not able to figure that out, leave a comment.
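A minimal sketch of that manual approach, assuming the sklearn pipeline is used only for preprocessing and the Keras model is fitted separately so its History object stays in your hands (the layer sizes and the preprocessing step are placeholders):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# preprocessing-only pipeline; swap in your own steps
preprocess = Pipeline([("scale", StandardScaler())])
X_train_t = preprocess.fit_transform(X_train)

model = Sequential([
    Dense(32, activation="relu", input_shape=(X_train_t.shape[1],)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() returns the History object, with per-epoch metrics in history.history
history = model.fit(X_train_t, y_train, validation_split=0.2, epochs=10)
print(history.history["loss"])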

Pyspark: save transformers

I am using some PySpark transformers, such as StringIndexer, StandardScaler, and more. I first apply them to the training set, and later I want to use the same transformation objects (the same fitted StringIndexerModel and StandardScalerModel parameters) to apply to the test set. Therefore, I am looking for a way to save those transformers to a file. However, I cannot find any related method; there is one only for ML estimators such as LogisticRegression. Do you know any possible way to do that? Thanks.
I found an easy solution.
Save the indexer model to a file (on HDFS).
writer = indexerModel._call_java("write")
writer.save("indexerModel")
Load the indexer model from a file (saved on HDFS).
indexer = StringIndexerModel._new_java_obj("org.apache.spark.ml.feature.StringIndexerModel.load", "indexerModel")
indexerModel = StringIndexerModel(indexer)
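For what it's worth, on more recent Spark versions (2.0 and later) the fitted feature models expose a public writer and loader, so the same round trip should work without touching the internal Java objects; a sketch, assuming such a version:
indexerModel.write().overwrite().save("indexerModel")
loadedIndexerModel = StringIndexerModel.load("indexerModel")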
The outputs of StringIndexer and StandardScaler are both RDDs, so you can either save those results directly to a file or, more likely what you want, persist them for later computation.
To save to a parquet file, call sqlContext.createDataFrame(string_indexed_rdd).write.parquet("indexer.parquet") (you might need to attach a schema as well). You would then need to load this result back from the file when you want it.
To persist, call string_indexed_rdd.persist(). This keeps the intermediate results in memory for later reuse; you can pass options to spill to disk as well if you are memory-limited.
If you want to persist the model itself, you're stuck on an existing bug/missing capability in the API (PR). Until the underlying issue is resolved and new methods are provided, you need to call some underlying methods manually to get and set the model parameters. Looking through the model code, you can see that the models inherit from a chain of classes, one of which is Params. This class has extractParamMap, which pulls out the parameters used in the model. You can then save this map in any manner you like for persisting Python dicts. Afterwards, you create an empty model object and call copy(saved_params) on it to pass the persisted parameters back into the object.
Something along these lines should work:
import shelve

def save_params(model, filename):
    # persist the model's parameter map in a shelve file
    d = shelve.open(filename)
    try:
        d.update(model.extractParamMap())
    finally:
        d.close()

def load_params(ModelClass, filename):
    # recreate an empty model and copy the saved parameters into it
    d = shelve.open(filename)
    try:
        return ModelClass().copy(dict(d))
    finally:
        d.close()
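Hypothetical usage of those helpers, assuming the fitted indexerModel from above and that its parameter map round-trips through shelve cleanly:
save_params(indexerModel, "indexer_params")
restored = load_params(StringIndexerModel, "indexer_params")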
