What are common sources of randomness in Machine Learning projects with Keras?

Reproducibility is important. In a closed-source machine learning project I'm currently working on, it is hard to achieve. Which parts should I look at?

Setting seeds
Computers have pseudo-random number generators which are initialized with a value called the seed. For machine learning, you might need to do the following:
# I've heard the order here is important
import random
random.seed(0)
import numpy as np
np.random.seed(0)
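import tensorflow as tf
# TensorFlow 1.x API; in TensorFlow 2.x the equivalent call is tf.random.set_seed(0)
tf.set_random_seed(0)
# Single-threaded execution avoids nondeterministic ordering of parallel operations
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)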
from keras import backend as K
K.set_session(sess) # tell keras about the seeded session
# now import keras stuff
See also: Keras FAQ: How can I obtain reproducible results using Keras during development?
sklearn
sklearn.model_selection.train_test_split has a random_state parameter.
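As a small illustration of the random_state parameter, the sketch below splits the same data twice and checks that both splits agree. The toy arrays and variable names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration only
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> same shuffling -> same split on every call
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=3, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=3, random_state=0)

same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
```

Without random_state, each call draws from NumPy's global random state, so the split depends on whatever consumed random numbers before it.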
What to check
Am I loading the data in the same order every time?
Do I initialize the model the same way?
Do I use external data that might change?
Do I use external state that might change (e.g. datetime.now)?
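For the first item on this checklist, a minimal sketch using only the standard library: directory listings come back in arbitrary order, so sort them, and shuffle with a private seeded generator instead of the global one. The helper names and file names are hypothetical.

```python
import os
import random
import tempfile

def list_files_sorted(directory):
    """Return file names in a fixed (sorted) order; os.listdir order is arbitrary."""
    return sorted(os.listdir(directory))

def shuffled(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    result = list(items)
    rng.shuffle(result)
    return result

# Demo with a throwaway directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "c.csv"]:
        open(os.path.join(tmp, name), "w").close()
    files = list_files_sorted(tmp)

order_run_1 = shuffled(files, seed=0)
order_run_2 = shuffled(files, seed=0)
```

Using a local random.Random instance means other code that touches the global generator cannot change the data order.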

Related

How do I make ray.tune.run reproducible?

I'm using Tune class-based Trainable API. See code sample:
from ray import tune
import numpy as np
np.random.seed(42)
# first run
tune.run(tune.Trainable, ...)
# second run, expecting same result
np.random.seed(42)
tune.run(tune.Trainable, ...)
The problem is that the tune.run results are still different, the likely reason being that each Ray actor still has a different seed.
Question: how do I make ray.tune.run reproducible?
(This answer focuses on the class API and Ray version 0.8.7. The function API does not support reproducibility due to implementation specifics.)
There are two main sources of nondeterministic results.
1. Search algorithm
Every search algorithm supports a random seed, although the interface for setting it varies. The seed initializes the hyperparameter space sampling.
For example, if you're using AxSearch, it looks like this:
from ax.service.ax_client import AxClient
from ray.tune.suggest.ax import AxSearch
client = AxClient(..., random_seed=42)
client.create_experiment(...)
algo = AxSearch(client)
2. Trainable API
Training is distributed among worker processes, so seeding has to happen inside the tune.Trainable class. Depending on the tune.Trainable.train logic that you implement, you need to manually seed numpy, tf, or whatever other framework you use inside tune.Trainable.setup, passing the seed through the config argument of tune.run.
The following example is based on RLlib PR5197, which handled the same issue:
from ray import tune
import numpy as np
import random
class Tuner(tune.Trainable):
    def setup(self, config):
        seed = config['seed']
        np.random.seed(seed)
        random.seed(seed)
        ...
    ...
seed = 42
tune.run(Tuner, config={'seed': seed})
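If several trials should each get their own seed while the whole experiment stays reproducible, one option is to derive the per-trial seeds from a single base seed. The sketch below uses NumPy's SeedSequence for that. The trial_seeds helper is hypothetical and not part of Ray Tune; it only shows how such integers could be produced before being passed in through config.

```python
import numpy as np

def trial_seeds(base_seed, num_trials):
    """Derive one reproducible integer seed per trial from a single base seed.

    Hypothetical helper, not part of Ray Tune.
    """
    children = np.random.SeedSequence(base_seed).spawn(num_trials)
    # generate_state(1) yields one uint32 word, which np.random.seed accepts
    return [int(child.generate_state(1)[0]) for child in children]

seeds_first_run = trial_seeds(42, num_trials=3)
seeds_second_run = trial_seeds(42, num_trials=3)

# Each value could be passed as config['seed'] and consumed in setup()
np.random.seed(seeds_first_run[0])
draw_a = np.random.rand()
np.random.seed(seeds_second_run[0])
draw_b = np.random.rand()
```

The same base seed always yields the same list, so rerunning the experiment reuses the same per-trial seeds.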

How to reproduce RNN results on several runs?

I call the same model on the same input twice in a row and I don't get the same result. This model has nn.GRU layers, so I suspect that it has some internal state that should be released before the second run.
How do I reset the RNN hidden state so that it is the same as when the model was initially loaded?
UPDATE:
Some context:
I'm trying to run the model from here:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L93
I'm calling generate:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L148
It actually has some code that uses the PyTorch random generator:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L200
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py#L110
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py#L129
I have placed (I'm running code on CPU):
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)
in
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py
after all imports.
I have checked GRU weights between runs and they are the same:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L153
I have also checked the logits and the sample between runs: the logits are the same but the samples are not, so @Andrew Naguib seems to have been right about random seeding. However, I'm not sure where the code that fixes the random seed should be placed.
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L200
UPDATE 2:
I have placed seed init inside generate and now results are consistent:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L148
I believe this is related to random seeding. To ensure reproducible results (as stated in the PyTorch documentation), you have to seed torch like this:
import torch
torch.manual_seed(0)
And also configure the cuDNN backend:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
If you're using numpy, you could also do:
import numpy as np
np.random.seed(0)
However, they warn you:
Deterministic mode can have a performance impact, depending on your model.
A script I regularly use, which has been working very well for reproducing results, is:
# imports
import numpy as np
import random
import torch
import torch.backends.cudnn as cudnn
# ...
# Set random seed (args comes from the script's argument parser)
if args.random_seed is not None:
    # The following lines seed the pseudorandom number generators
    # involved (Python, NumPy, PyTorch) to ensure reproducible results
    random.seed(args.random_seed)
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)
    # https://pytorch.org/docs/master/notes/randomness.html#cudnn
    if not args.cpu_only:
        torch.cuda.manual_seed(args.random_seed)
        cudnn.deterministic = True
        cudnn.benchmark = False
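The question's second update found that the seed has to be set right before each sampling run, not only once at import time. An alternative to reseeding the global generator is to give the sampling step its own torch.Generator. The following is a generic sketch of that idea, not the WaveRNN code; the function name and the toy logits are made up for illustration.

```python
import torch

def sample_from_logits(logits, seed):
    """Draw one class index per row using a private, seeded generator."""
    generator = torch.Generator()  # CPU generator, separate from the global RNG
    generator.manual_seed(seed)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator)

# Toy logits, invented for this example
logits = torch.tensor([[0.1, 0.2, 0.7],
                       [1.0, 1.0, 1.0]])

first = sample_from_logits(logits, seed=0)
second = sample_from_logits(logits, seed=0)
```

Because the generator is created and seeded inside the function, earlier calls that consumed random numbers elsewhere do not affect the samples.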
If your model class defines an init_hidden() method, you can use it to reset the RNN hidden state. For example:
def init_hidden(self):
    # Initialize the hidden state with zeros (an LSTM also needs a cell state).
    # Assumes the model stores num_layers, batch_size and hidden_size as attributes.
    return torch.zeros(self.num_layers, self.batch_size, self.hidden_size)
So, before calling the same model on the same data again, call model.init_hidden() and pass the result in as the initial hidden state.
This clears out the history, in other words, the hidden state accumulated while running on the data the first time. The learned weights are not affected.
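For reference, torch.nn.GRU itself does not remember its hidden state between calls: when no initial state is passed, it starts from zeros every time. State only carries over if the surrounding model stores it and feeds it back in. A minimal sketch with arbitrary layer sizes chosen for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixes the random weight initialisation
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
gru.eval()

x = torch.randn(2, 5, 4)  # (batch, time, features)

with torch.no_grad():
    out_a, h_a = gru(x)     # no initial state given: starts from zeros
    out_b, h_b = gru(x)     # second call also starts from zeros
    out_c, _ = gru(x, h_a)  # explicitly carrying state over changes the result

same_without_state = torch.allclose(out_a, out_b)
differs_with_state = not torch.allclose(out_a, out_c)
```

If two calls still differ, the cause is more likely random sampling or dropout than leftover GRU state.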

Is it necessary to normalize data before using MLPRegressor?

I have inputs with different scales and I am using MLPRegressor from scikit-learn in Python.
Here is my code:
smlp = MLPRegressor(hidden_layer_sizes=(committee,),
                    activation='relu',
                    solver='adam',
                    learning_rate='adaptive',
                    max_iter=3000,
                    learning_rate_init=0.01,
                    alpha=0.01)
It is better to standardize the data in order to improve convergence, for example with StandardScaler:
from sklearn.preprocessing import StandardScaler
Regarding the output values: you might want to standardize them too, since it might help convergence. However, it will be harder to interpret the results afterwards.
That said, if you are aiming to work with neural networks, it might be worth looking into the keras library, which offers much more up-to-date functionality, GPU training, etc.
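The scaling advice above can be sketched as a scikit-learn pipeline. StandardScaler handles the inputs, and TransformedTargetRegressor standardizes the targets during fitting and converts predictions back to the original units, which helps with interpretability. The synthetic data and the hyperparameters are placeholders chosen for this example, not recommendations.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(200, 2) * [1.0, 1000.0]
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.01, size=200)

model = TransformedTargetRegressor(
    # Inputs are standardized before reaching the network
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
    # Targets are standardized for training and mapped back on predict
    transformer=StandardScaler(),
)
model.fit(X, y)
predictions = model.predict(X[:5])  # already in the original units of y
```

Fitting the scalers inside the pipeline also ensures they are learned from the training data only when the model is used with cross-validation.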

Tensorflow causes errors in scikit-learn

When I import scikit-learn before importing tensorflow I don't have any issues. Running this block of code produces an output of 1.7766212763101197e-12.
import numpy as np
np.random.seed(123)
import numpy.random as rand
from sklearn.decomposition import PCA
import tensorflow as tf
X = rand.randn(100,15)
X = X - X.mean(axis=0)
mod = PCA()
w = mod.fit_transform(X)
h = mod.components_
print(np.sum(np.abs(X-np.dot(w,h))))
However, if I import tensorflow before importing scikit-learn my code no longer functions. When I run this code-block
import tensorflow as tf
import numpy as np
np.random.seed(123)
import numpy.random as rand
from sklearn.decomposition import PCA
X = rand.randn(100,15)
X = X - X.mean(axis=0)
mod = PCA()
w = mod.fit_transform(X)
h = mod.components_
print(np.sum(np.abs(X-np.dot(w,h))))
I get an output of 130091393261440.25.
Why is that? My versions for the packages are:
numpy - 1.13.1
sklearn - 0.19.0
tensorflow - 1.3.0
Import order should not affect the output, as Python modules are self-contained, except in the case of dependencies.
I was unable to reproduce your error, and get an output of 1.7951539777252834e-12 for both code blocks.
This is an interesting problem and I am curious to see if others can provide a better response for why you are seeing this issue.
Note: this answer addresses the title, for those looking to use TensorFlow within Scikit-Learn, rather than the specific import-order error you've had.
You can use TensorFlow within Scikit-Learn pipelines using Neuraxle.
Neuraxle is an extension of Scikit-Learn to make it more compatible with all deep learning libraries.
Problem: You can’t Parallelize nor Save Pipelines Using Steps that Can’t be Serialized “as-is” by Joblib (e.g.: a TensorFlow step)
Here, a step is a transformer or estimator in a scikit-learn Pipeline.
This problem will only surface past some point of using Scikit-Learn. This is the point of no-return: you’ve coded your entire production pipeline, but once you trained it and selected the best model, you realize that what you’ve just coded can’t be serialized.
This means once trained, your pipeline can’t be saved to disks because one of its steps imports things from a weird python library coded in another language and/or uses GPU resources. Your code smells weird and you start panicking over what was a full year of research development.
Solution with Code Examples:
Here is a full project example from A to Z where TensorFlow is used with Neuraxle as if it was used with Scikit-Learn.
Here is another practical example where TensorFlow is used within a scikit-learn-like pipeline
The trick is performed by using Neuraxle-TensorFlow.
This is to make use of Neuraxle's savers.
Read also: https://stackoverflow.com/a/60557192/2476920

Show model layout / design (with all connections) in Keras

I see major differences when testing a Keras LSTM model right after training it compared to when I load that trained model from a .h5 file (accuracy of the former is always > 0.85, but that of the latter is always < 0.2, i.e. a random guess).
However, I checked the weights and they are identical, and the sparse layout Keras gives me via plot_model is also the same. But since this only provides a rough overview:
Is there a way to show the full layout of a Keras model (especially node connections)?
If you're using tensorflow backend, apart from plot_model, you can also use keras.callbacks.TensorBoard callback to visualize the whole graph in tensorboard. Example:
callback = keras.callbacks.TensorBoard(log_dir='./graph',
                                       histogram_freq=0,
                                       write_graph=True,
                                       write_images=True)
model.fit(..., callbacks=[callback])
Then run tensorboard --logdir ./graph from the same directory.
This is a quick shortcut, but you can go even further with that.
For example, add TensorFlow code to define (or load) the model within a custom tf.Graph instance, like this:
from keras.layers import LSTM
import tensorflow as tf
my_graph = tf.Graph()
with my_graph.as_default():
    # All ops / variables in the LSTM layer are created as part of our graph
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)
... after which you can list all graph nodes with dependencies, evaluate any variable, display the graph topology and so on, to compare the models.
I personally think the simplest way is to set up your own session. It works in all cases with minimal patching:
import tensorflow as tf
from keras import backend as K
sess = tf.Session()
K.set_session(sess)
...
# Now can evaluate / access any node in this session, e.g. `sess.graph`

Resources