Why is the output of the model different on each run in PyTorch?

I have a simple model with just a single linear layer.
model = torch.nn.Linear(1,1).to(device)
x_train1 = torch.FloatTensor([[1], [2], [3]])
out = model(x_train1)
print(out)
But every time I run this code, the printed output is different.
I have also set these random seeds:
import random
import torch
import numpy as np
random_seed=76
torch.manual_seed(random_seed)
torch.cuda.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed) # if use multi-GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(random_seed)
random.seed(random_seed)
I want to know why the output keeps changing each time the code is run.

You must set the seed every time you run the code, before the model is created, if you want to get the same result.
import torch
def my_func(device: str, seed: int):
    torch.manual_seed(seed)
    model = torch.nn.Linear(1,1).to(device)
    x_train1 = torch.FloatTensor([[1], [2], [3]])
    out = model(x_train1)
    print(out)
# Whenever you run the function you'll get the same result!
my_func(device="cpu", seed=76)
# tensor([[0.3573],
# [0.5021],
# [0.6470]], grad_fn=<AddmmBackward>)
my_func(device="cpu", seed=76)
# tensor([[0.3573],
# [0.5021],
# [0.6470]], grad_fn=<AddmmBackward>)
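The effect can also be checked without reading printed tensors. The following is a minimal sketch, assuming a CPU-only setup; the helper name make_layer is hypothetical and not part of the original answer. It compares a layer created right after re-seeding with one created without re-seeding:

```python
import torch


def make_layer(seed: int) -> torch.nn.Linear:
    # Seed the global RNG right before the layer draws its initial weights.
    torch.manual_seed(seed)
    return torch.nn.Linear(1, 1)


first = make_layer(76)
second = make_layer(76)        # re-seeded, so it starts from the same RNG state
third = torch.nn.Linear(1, 1)  # not re-seeded, the RNG state has moved on

x = torch.tensor([[1.0], [2.0], [3.0]])

# Layers built from the same seed produce identical outputs.
same_output = torch.equal(first(x), second(x))
# A layer built without re-seeding gets different initial weights.
drifted_output = not torch.equal(second(x), third(x))
print(same_output, drifted_output)
```

This is why seeding once at the top of a notebook is not enough when the cell that builds the model is re-run on its own.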

Related

Issues with One Hot Encoding for model with values not in training data

I would like to use one-hot encoding for my simple model, yet it seems to trigger an error no matter how I set it up. First, OneHotEncoder is not converting strings to floats even though I have version 1.0.2 of sklearn. Second, the values in my training data do not cover all the values in my test data: training has only 2 categories, while testing has all three. How do I fix that? The exact error is "The truth value of a Series is ambiguous". With another approach I tried, the error instead asks me to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
transformers=[('ohc', ohc, [0])]
,remainder = 'passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)
])
params = {'model__learning_rate':[0.1]
,'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy'
,verbose=-1)
lgbm_gs.fit(X,y)
The issue is most likely that you're passing categories as a flat list rather than as a list of array-likes (e.g. a list of lists), as the documentation states. The following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # observe you were also misspelling categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
('model', model)])
params = {'model__learning_rate':[0.1], 'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
estimator = mymodel, param_grid=params, n_jobs = -1,
cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
Finally, you can observe that the attribute categories_ (which specifies the categories of each feature determined during fitting) is also a list of arrays (a single array here, as you're one-hot encoding one column only). Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]
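To make the handle_unknown='ignore' behaviour quoted above concrete, here is a small sketch on a single categorical column. The arrays train and test are illustrative stand-ins for the first column of the data in the question:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data only contains two categories.
train = np.array([["apple"], ["banana"], ["apple"], ["banana"]], dtype=object)
# Test data contains a category that was never seen during fit.
test = np.array([["pineapple"], ["banana"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# transform returns a sparse matrix by default; convert it for inspection.
encoded = enc.transform(test).toarray()
print(enc.categories_)
print(encoded)
```

The row for 'pineapple' comes out as all zeros, while 'banana' is encoded as usual, so no error is raised at prediction time.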

Mlflow log_model, not able to predict with spark_udf but with python works

I want to log a model with MLflow. Once I do, I'm able to predict probabilities with the Python-loaded model but not with spark_udf. The catch is that I still need a preprocessing function inside the model. Here is a toy reproducible example showing where it fails:
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_classes=2, shuffle=True, random_state=1995)
X, y = pd.DataFrame(X), pd.DataFrame(y,columns=["target"])
# generate column names
X.columns = [f"col_{idx}" for idx in range(len(X.columns))]
X["categorical_column"] = np.random.choice(["a","b","c"], size=len(X) )
def encode_catcolumn(X):
    X = X.copy()
    # replace cat values [a,b,c] for [-10,0,35] respectively
    X['categorical_column'] = np.select([X["categorical_column"] == "a", X["categorical_column"] == "b", X["categorical_column"] == "c"], [-10, 0,35] )
    return X
# with catcolumn encoded; i need to use custom encoding , we'll do this within mlflow later
X_encoded = encode_catcolumn(X)
Now let's create a wrapper for the model so that the encoding function lives inside the model. Note that the function encode_catcolumn within the class and the one defined outside the class above are the same.
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
    def encode_catcolumn(self,X):
        X = X.copy()
        # replace cat values [a,b,c] for [-10,0,35] respectively
        X['categorical_column'] = np.select([X["categorical_column"] == "a", X["categorical_column"] == "b", X["categorical_column"] == "c"], [-10, 0,35] )
        return X
    def predict(self, context, model_input):
        # encode catvariable
        model_input = self.encode_catcolumn(model_input)
        # predict probabilities
        predictions = self.model.predict_proba(model_input)[:,1]
        return predictions
Now let's log the model
with mlflow.start_run(run_name="reproductible_example") as run:
    clf = RandomForestClassifier()
    clf.fit(X_encoded,y)
    # wrap the model with pyfunc, does the encoding inside the class
    wrappedModel = SklearnModelWrapper(clf)
    # When the model is deployed, this signature will be used to validate inputs.
    mlflow.pyfunc.log_model("reproductible_example_model", python_model=wrappedModel)
model_uuid = run.info.run_uuid
model_path = f'runs:/{model_uuid}/reproductible_example_model'
Inference without Spark works perfectly:
model_uuid = run.info.run_uuid
model_path = f'runs:/{model_uuid}/reproductible_example_model'
# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(model_path)
# predictions without spark , encodes the variables INSIDE; this WORKS
loaded_model.predict(X)
Inference with spark_udf, however, raises an error:
# create spark dataframe to test it on spark
X_spark = spark.createDataFrame(X)
# Load model as a Spark UDF.
loaded_model_spark = mlflow.pyfunc.spark_udf(spark, model_uri=model_path)
# Predict on a Spark DataFrame.
columns = list(X_spark.columns)
# this does not work
X_spark.withColumn('predictions', loaded_model_spark(*columns)).collect()
The error is:
PythonException: An exception was thrown from a UDF: 'KeyError: 'categorical_column'', from <command-908038>, line 7. Full traceback below:
I need to somehow encode the variables and preprocess within the class. Is there any solution or workaround to make this code work with Spark?
What I've tried so far:
1) Incorporate encode_catcolumn within a sklearn Pipeline (with a custom sklearn encoder) -> fails.
2) Create a function within the sklearn wrapper class (this example) -> fails.
3) Use log_model and then create a pandas_udf in order to do it with Spark as well -> works, but that's not what I want. I would like to be able to run the model on Spark by just calling the .predict() method or something like that.
4) Remove the preprocessing function and do it outside the class -> this actually works, but this is not what I want.
I solved this by changing only the last chunk of my question, where I load the spark_udf model and perform inference. This is one possible answer to the problem: pass an F.struct() to the spark_udf instead of a list of columns, as in the chunk below:
import pyspark.sql.functions as F
# create spark dataframe to test it on spark
X_spark = spark.createDataFrame(X)
# Load model as a Spark UDF.
loaded_model_spark = mlflow.pyfunc.spark_udf(spark, model_uri=model_path)
# Predict on a Spark DataFrame.
# columns = list(X_spark.columns) --> delete this
columns = F.struct(X_spark.columns) # use struct
# this does work
X_spark.withColumn('predictions', loaded_model_spark(columns)).collect()
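A likely explanation, stated here as an assumption rather than verified MLflow behaviour: when columns are passed positionally and the logged model has no signature, the pandas DataFrame that reaches predict carries ordinal column names instead of the original ones, so the lookup of 'categorical_column' fails. A struct keeps the field names. The pandas-only sketch below imitates that situation without Spark or MLflow; the frames named and positional are made up for illustration:

```python
import numpy as np
import pandas as pd


def encode_catcolumn(X):
    X = X.copy()
    # Same encoding as in the question: a -> -10, b -> 0, c -> 35.
    X["categorical_column"] = np.select(
        [X["categorical_column"] == "a",
         X["categorical_column"] == "b",
         X["categorical_column"] == "c"],
        [-10, 0, 35],
    )
    return X


# Frame with the original column names (what a struct input would preserve).
named = pd.DataFrame({"col_0": [0.1, 0.2], "categorical_column": ["a", "c"]})

# Frame whose columns were renamed to ordinals (assumed positional behaviour).
positional = named.copy()
positional.columns = [str(i) for i in range(positional.shape[1])]

encoded = encode_catcolumn(named)

try:
    encode_catcolumn(positional)
    lookup_failed = False
except KeyError:
    # Mirrors the KeyError: 'categorical_column' seen inside the UDF.
    lookup_failed = True

print(encoded["categorical_column"].tolist(), lookup_failed)
```

If that assumption holds, logging the model with a signature would be another way to keep the column names, but this has not been tested here.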

'Subset' object is not an iterator when updating torch's legacy IMDB dataset

I'm updating a PyTorch network from legacy code to the current API, following documentation such as that here.
I used to have:
import torch
from torchtext import data
from torchtext import datasets
# setting the seed so our random output is actually deterministic
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# defining our input fields (text) and labels.
# We use the Spacy function because it provides strong support for tokenization in languages other than English
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
import random
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
example = next(iter(test_data))
example.text
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data,
max_size = MAX_VOCAB_SIZE,
vectors = "glove.6B.100d",
unk_init = torch.Tensor.normal_) # how to initialize unseen words not in glove
LABEL.build_vocab(train_data)
Now in the new code I am struggling to add the validation set. All goes well until here:
from torchtext.datasets import IMDB
train_data, test_data = IMDB(split=('train', 'test'))
I can print the outputs. While they look different (problems later on?), they have all the info. I can print test_data fine with next(train_data).
Then after I do:
test_size = int(len(train_dataset)/2)
train_data, valid_data = torch.utils.data.random_split(train_dataset, [test_size,test_size])
It tells me:
next(train_data)
TypeError: 'Subset' object is not an iterator
This makes me think I am not applying random_split correctly. How do I correctly create the validation set for this dataset without causing issues?
Try next(iter(train_data)). It seems one has to create an iterator over the dataset explicitly. Use a DataLoader when efficiency is required.
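The suggestion can be tried on a small stand-in dataset. The sketch below assumes a TensorDataset in place of IMDB (which would need a download), and the names full_train, features and labels are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Tiny stand-in for the real training set: 10 examples, binary labels.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10) % 2
full_train = TensorDataset(features, labels)

# A seeded generator makes the train/validation split reproducible.
generator = torch.Generator().manual_seed(1234)
train_data, valid_data = random_split(full_train, [8, 2], generator=generator)

# A Subset is indexable but is not itself an iterator, so wrap it in iter().
first_example = next(iter(train_data))

# For training loops, a DataLoader handles batching and shuffling.
loader = DataLoader(train_data, batch_size=4, shuffle=True)
batch_features, batch_labels = next(iter(loader))
print(len(train_data), len(valid_data), batch_features.shape)
```

Calling next(train_data) directly on the Subset reproduces the TypeError from the question, since random_split returns Subset objects rather than iterators.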

How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?

I'm trying to make a Pipeline with GridSearchCV to filter data (with iForest) and perform a regression with StandardScaler+MLPRegressor.
I made a FunctionTransformer to include my iForest filter in the pipeline. I also defined a parameter grid for the iForest filter (using kw_args).
All seems OK, but when I run the fit, nothing happens: no error message, nothing.
Afterwards, when I try to predict, I get the message: "This RandomizedSearchCV instance is not fitted yet"
from sklearn.preprocessing import FunctionTransformer
#Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    #iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)
    # The DF (dataframe in input) is filtered taking into account only the inlier observations
    data_filtered = DF[iforest.predict(DF) == 1]
    # Only few variables are kept for the next step (regression by MLPRegressor)
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP','BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)
#Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)), ('scaler', StandardScaler()), ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]
#GridSearch definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}], 'MLPR__hidden_layer_sizes':[(sp_randint.rvs(1, nb_features, 1),), (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))], 'MLPR__alpha':sp_rand.rvs(0, 1, 1)}
pipeline = Pipeline(pipeline_steps)
estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)
You can try running each step manually to find the problem:
auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)
If each of the steps works fine, build a pipeline. Check the pipeline. If it works fine, try to use RandomizedSearchCV.
The func parameter of FunctionTransformer should be a callable that accepts the
same arguments as transform method (array-like X of shape
(n_samples, n_features) and kwargs for func) and returns a transformed X of
the same shape. Your function auto_filter doesn't fit these requirements.
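For contrast, here is a minimal sketch of a function that does meet these requirements: it takes X plus keyword arguments and returns an array with the same shape. The function clip_values and its upper parameter are hypothetical examples, not something from the question:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def clip_values(X, upper=1.0):
    # Returns an array of the same shape as X, with values capped at `upper`.
    return np.minimum(X, upper)


# kw_args is forwarded to the function on every transform call.
transformer = FunctionTransformer(clip_values, kw_args={"upper": 2.0})

X = np.array([[1.0, 5.0], [3.0, 0.5]])
X_out = transformer.fit_transform(X)
print(X_out)
```

Unlike auto_filter, this function returns only X and never drops rows, so y stays aligned with X inside a pipeline.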
Additionally, anomaly/outlier detection techniques from scikit-learn cannot be
used as intermediate steps in scikit-learn pipelines since a pipeline assembles
one or more transformers and an optional final estimator. IsolationForest or,
say, OneClassSVM is not a transformer: it implements fit and predict.
Thus, a possible solution is to cut off possible outliers separately and build
a pipeline composed of transformers and a regressor:
>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> detect = IsolationForest(contamination=0.1)
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
... ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
... n_iter=2, cv=3)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])
The problem is that you won't be able to optimize the hyperparameters of
IsolationForest. One way to handle it is to define hyperparameter space
for the forest, sample hyperparameters with ParameterSampler or
ParameterGrid, predict inliers and fit randomized search:
>>> from sklearn.model_selection import ParameterGrid
>>> forest_param_dict = dict(contamination=[0.1, 0.15, 0.2])
>>> forest_param_grid = ParameterGrid(forest_param_dict)
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search.fit(X[inliers_mask], y[inliers_mask])
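The loop above refits the search for each contamination value but keeps only the last fit. One way to compare the candidates is to record the best cross-validation score per value, as in the sketch below. To keep it fast, it swaps MLPRegressor for Ridge, which is a substitution made here and not part of the original answer. The results dictionary is also an illustrative addition:

```python
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=60, n_features=2, n_informative=2,
                       noise=1.0, random_state=0)

# Ridge stands in for MLPRegressor so the example runs quickly.
pipe = Pipeline([("scale", StandardScaler()), ("estimate", Ridge())])
param_distributions = {"estimate__alpha": stats.uniform(0, 0.1)}

results = {}
for params in ParameterGrid({"contamination": [0.1, 0.15, 0.2]}):
    # Filter outliers outside the pipeline, once per contamination value.
    detect = IsolationForest(random_state=0, **params)
    inliers_mask = detect.fit_predict(X) == 1

    search = RandomizedSearchCV(pipe, param_distributions,
                                n_iter=2, cv=3, random_state=0)
    search.fit(X[inliers_mask], y[inliers_mask])
    results[params["contamination"]] = search.best_score_

best_contamination = max(results, key=results.get)
print(results, best_contamination)
```

Keep in mind that each score is computed on a different subset of the data, so the comparison is only a rough guide.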

Scikit-learn with GraphViz exports empty outputs

I would like to export a decision tree using sklearn.
First I trained a decision tree classifier:
self._selected_classifier = tree.DecisionTreeClassifier()
self._selected_classifier.fit(train_dataframe, train_class)
self._column_names = list(train_dataframe.columns.values)
After that I used the following method in order to export the decision tree:
def _create_graph_visualization(self):
    decision_tree_classifier = self._selected_classifier
    from sklearn.externals.six import StringIO
    dot_data = StringIO()
    tree.export_graphviz(decision_tree_classifier,
                         out_file=dot_data,
                         feature_names=self._column_names)
    import pydotplus
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("decision_tree_output.pdf")
After many errors regarding missing executables, the program now finishes successfully.
The file is created, but it is empty.
What am I doing wrong?
Here is an example with output which works for me, using pydotplus:
from io import StringIO
from sklearn import tree
import pydotplus
# Define training and target set for the classifier
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [10,20,30]
# Initialize Classifier. Random values are initialized with always the same random seed of value 0
# (allows reproducible results)
dectree = tree.DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)
# Test classifier with other, unknown feature vector
# (predict expects a 2-D input, so wrap the single sample in a list)
test = [2,2,3]
predicted = dectree.predict([test])
# Write the DOT description into an in-memory text buffer
dotfile = StringIO()
tree.export_graphviz(dectree, out_file=dotfile)
# Rendering below requires the Graphviz executables to be installed
graph=pydotplus.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
graph.write_pdf("dtree.pdf")
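When the rendered file comes out empty, it can help to check the DOT source itself before handing it to Graphviz. The sketch below assumes the same toy data as above and made-up feature names f0, f1, f2. It uses out_file=None, which makes export_graphviz return the DOT text as a string:

```python
from sklearn import tree

train = [[1, 2, 3], [2, 5, 1], [2, 1, 7]]
target = [10, 20, 30]

dectree = tree.DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)

# With out_file=None the DOT description is returned instead of written out.
dot_source = tree.export_graphviz(
    dectree, out_file=None, feature_names=["f0", "f1", "f2"]
)

# If this text looks fine, an empty PDF points at the rendering step
# (pydotplus or the Graphviz executables), not at scikit-learn.
print(dot_source[:60])
```

This needs no Graphviz installation, so it separates export problems from rendering problems.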
