Same sklearn pipeline, different results - scikit-learn

I have created a pipeline based on:
A custom TfidfVectorizer to transform the TF-IDF vectors into a DataFrame (600 features)
A custom feature generator to create new features (5)
A FeatureUnion to join the two DataFrames. I checked: the output is an array, so no feature names. (605 features)
An XGBoost classifier model, with seed and random state included (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I don't understand it, because the input for testing is always the same, and so is the Python environment!
Could you help me with some suggestions?
Thanks!
Things I have tried:
Saving the pipeline with different libraries.
A DenseTransformer at some points.
A ColumnTransformer instead of the FeatureUnion.
I cannot use the PMML library due to some restrictions.
Etc.
The problem remains the same.
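One way to narrow this down is to check whether the serialization round trip itself changes the model's behavior, before blaming the environment. Below is a minimal stdlib-only sketch of that check; TinyModel is a hypothetical stand-in for the fitted pipeline, not the actual sklearn object:

```python
import pickle

class TinyModel:
    """Hypothetical stand-in for a fitted sklearn pipeline."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, rows):
        # Deterministic scoring: dot product of each row with the weights
        return [sum(w * x for w, x in zip(self.weights, row)) for row in rows]

model = TinyModel([0.5, -1.0, 2.0])
X_test = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5]]

# Serialize and reload, exactly as you would with the real pipeline
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# If this assertion fails, serialization is the problem;
# if it passes, look at environment or data-ordering differences instead
assert model.predict(X_test) == restored.predict(X_test)
```

With the real pipeline, run the same predict-before / predict-after comparison on a fixed test batch in the same process; if the two outputs already differ there, the issue is in what gets pickled, not in the second notebook.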

Related

Pytorch Geometric: How to create a small temporary dataset in a colab notebook from a list of pytorch geometric data objects

I am working in a Colab notebook and have generated a Python list of PyTorch Geometric data objects. I now want to turn them into a dataset for use just in this notebook. How can I do this? The existing documentation seems geared towards long-term datasets.
When I worked with standard PyTorch, I used a combination of torch.FloatTensor() and TensorDataset() to create my own dataset for use with random_split.

How to use functions from sklearn into pyspark

I have a training set with 201,917 rows, 3 features and 1 target. My aim is to calculate the strength of the relationship of the individual features with the target. My choice of method for this is sklearn.feature_selection.mutual_info_regression because it works for continuous variables and can detect non-linear relationships better than the counterpart sklearn.feature_selection.f_regression. This is the line I tried to run -
feature_selection.mutual_info_regression(trainPD[['feature_1']], trainPD['target'])
Now the problem is if I run sklearn.feature_selection.mutual_info_regression in Colab, the system crashes. Hence my idea was to shift to pyspark. But pyspark.ml does not have support for sklearn.feature_selection.mutual_info_regression. So what are my options to use sklearn.feature_selection.mutual_info_regression in pyspark?
I am not sure whether pandas_udf will help, because this is not the traditional pd.Series -> pd.Series conversion where PySpark parallelization works.
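If the sklearn call itself is the memory bottleneck, one fallback is to estimate mutual information with a simple histogram binning, which is cheap and easy to compute on samples or partitions. A stdlib-only sketch follows; note that equal-width binning with a fixed bin count is an assumption for illustration, not what mutual_info_regression does (it uses a k-nearest-neighbor estimator):

```python
import math

def binned_mutual_info(xs, ys, bins=10):
    """Rough mutual-information estimate (in nats) via equal-width binning."""
    def to_bins(vals):
        lo, hi = min(vals), max(vals)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns
        return [min(int((v - lo) / width), bins - 1) for v in vals]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    # Joint and marginal bin counts
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1

    # MI = sum over cells of p(a,b) * log( p(a,b) / (p(a) * p(b)) )
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

# A perfectly dependent pair scores high; a constant target scores zero
xs = [i / 100 for i in range(100)]
mi_dependent = binned_mutual_info(xs, xs)
mi_constant = binned_mutual_info(xs, [0.0] * 100)
```

Since this is just counting per bin, the counts can be accumulated per partition and merged, which sidesteps loading all 201,917 rows into one process. Another cheap option is simply running mutual_info_regression on a random sample small enough for Colab's memory.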

How do you implement a model built using sklearn pipeline in pyspark?

I would like to use a model built with an sklearn pipeline in PySpark. The pipeline takes care of imputation, scaling, one-hot encoding, and random-forest classification. I tried broadcasting the model and using a pandas UDF to predict; it did not work, and I got a Py4JJavaError.

Calling function on each row of a DataFrame that requires building a DataFrame

I'm trying to wrap some functionality of the lime Python library around Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as an input and build a functionality that calls the Spark model, applies the lime algorithm, and gives an explanation for each single row.
Some context
The lime algorithm consists of locally approximating a trained machine-learning model. In its implementation, lime basically just needs a function that, given a feature vector as input, evaluates the model's predictions. With this function, lime can slightly perturb the feature input, see how the model's predictions change, and then give an explanation. So, in theory, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Digging into the StringIndexer and similar, it's pretty easy to build such a vector from the "raw" values of the data. Then I built a function that, from such a vector (or a 2D array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel, and returns the model's predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, it builds and collects a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied on the above DataFrame, the results collected and brought back to the lime explainer, which will continue its work
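Stripped of Spark, the perturb-and-predict core of those steps looks like the sketch below. Here predict_fn is a hypothetical stand-in for "build a DataFrame, apply the PipelineModel, collect", and the Gaussian perturbation is only an illustration of what lime does internally:

```python
import random

def perturb(base_case, n_samples=5, scale=0.1, seed=0):
    """Build a 2D list of slightly perturbed copies of base_case."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in base_case]
            for _ in range(n_samples)]

def explain_row(base_case, predict_fn, n_samples=5):
    """lime-style loop: perturb the row, score all variants, return both."""
    cases = perturb(base_case, n_samples)
    preds = predict_fn(cases)  # in Spark, this is where the DataFrame round trip happens
    return cases, preds

# Hypothetical stand-in for the PipelineModel round trip
def predict_fn(cases):
    return [sum(row) for row in cases]

cases, preds = explain_row([1.0, 2.0, 3.0], predict_fn)
```

The Spark difficulty described below is exactly the predict_fn line: it is the only step that needs the cluster, which is why it cannot live inside a UDF.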
The problem
As you can see (if you have read this far!), for each row of the DataFrame you build another DataFrame. So you cannot define a UDF, since you are not allowed to call Spark functions inside a UDF.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use UDFs in this case, followed by explode() to retrieve all the results on different lines. You just have to make sure the input column is already the vector you want to feed to lime.
That way you don't even have to collect out of Spark, which is expensive. Maybe you can even use vectorized UDFs in your case to gain speed (not sure).
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, DoubleType

def function(base_case):
    list_similar_cases = limefunction(base_case)
    return list_similar_cases

# ArrayType requires an element type; DoubleType is assumed here
f_udf = udf(function, ArrayType(DoubleType()))
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))

Spark ML pipeline usage

I created an ML pipeline with several transformers, including a StringIndexer, which is used during training on the data labels.
I then store the resulting PipelineModel, which will later be used for data preparation and prediction on a dataset that doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: my goal is to have a single pipeline (i.e. I'd like to keep the various transformations and the ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official site.
