I'm using the Tune class-based Trainable API. See the code sample:
from ray import tune
import numpy as np
np.random.seed(42)
# first run
tune.run(tune.Trainable, ...)
# second run, expecting same result
np.random.seed(42)
tune.run(tune.Trainable, ...)
The problem is that the tune.run results are still different, likely because each Ray actor still has a different seed.
Question: how do I make ray.tune.run reproducible?
(This answer focuses on the class API and Ray version 0.8.7. The function API does not support reproducibility due to implementation specifics.)
There are two main sources of nondeterministic results.
1. Search algorithm
Every search algorithm supports a random seed, although the interface to it may vary. This seed initializes the hyperparameter-space sampling.
For example, if you're using AxSearch, it looks like this:
from ax.service.ax_client import AxClient
from ray.tune.suggest.ax import AxSearch
client = AxClient(..., random_seed=42)
client.create_experiment(...)
algo = AxSearch(client)
2. Trainable API
Training is distributed among worker processes, which requires seeding within the tune.Trainable class. Depending on the tune.Trainable.train logic that you implement, you need to manually seed numpy, tf, or whatever other framework you use, inside tune.Trainable.setup, passing the seed through the config argument of tune.run.
The following code is based on RLlib PR 5197, which handled the same issue:
See the example:
from ray import tune
import numpy as np
import random
class Tuner(tune.Trainable):
    def setup(self, config):
        seed = config['seed']
        np.random.seed(seed)
        random.seed(seed)
        ...

...

seed = 42
tune.run(Tuner, config={'seed': seed})
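If your Trainable uses PyTorch, the same pattern extends naturally. A minimal sketch, assuming a 'seed' key in config as above (the torch seeding line is my addition and not part of the referenced PR):
import random
import numpy as np
import torch
from ray import tune

class TorchTuner(tune.Trainable):
    def setup(self, config):
        seed = config['seed']
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)  # also seed PyTorch's CPU RNG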
I am new to JAX.
I am implementing a variational autoencoder (VAE) using JAX and Flax. During training, I sample a latent code (from the distribution inferred by the encoder, which I implement using compositions of flax.linen.nn modules). Crucially, in addition to passing this code through the decoder (as is standard for a VAE), I also pass the code to an external function (the MuJoCo physics engine), which tries to assign it to a NumPy array. This unsurprisingly leads to the following error:
TracerArrayConversionError: The numpy.ndarray conversion method __array__() was called on the JAX Tracer object...
Fundamentally, I need to pass a concrete numpy array to MuJoCo. How can I make my variable a NumPy array while still allowing my model to be implemented in a computationally efficient manner using abstract tracers wherever possible?
Here is a minimal working example of the problem I am facing; gym and mujoco (https://mujoco.org/) will need to be installed to run it, I believe:
import jax
import jax.numpy as np
import numpy as onp
import gym
from jax import jit
# create an instance of an OpenAI Gym environment
env = gym.make('Humanoid-v3')
env.reset()
def this_fails(env, x):
    # this gives a TracerArrayConversionError
    env.sim.data.qpos[:] = x
    return env, x

x = np.arange(len(env.sim.data.qpos))
jit_this_fails = jax.jit(this_fails, static_argnums=0)
env, x = jit_this_fails(env, x)
Edit: there is now a JAX FAQ entry on this topic: https://jax.readthedocs.io/en/latest/faq.html#how-can-i-convert-a-jax-tracer-to-a-numpy-array
Note: this is the answer to the OP's question as originally written. The question has been edited multiple times and no longer asks what it originally asked.
In the past this sort of thing has not been supported, but you can do this with the new jax.pure_callback feature that is part of JAX version 0.3.17, which is not yet released at the time I am writing this.
For example, say you want to call a numpy-based function from within a JAX jit-compiled function; we'll use np.sin for simplicity. You might first try something like this:
import jax
import jax.numpy as jnp
import numpy as np
@jax.jit
def this_fails(x):
    # Call a numpy function...
    return np.sin(x)

x = jnp.arange(5.0)
this_fails(x)
jax._src.errors.TracerArrayConversionError: The numpy.ndarray conversion method __array__() was called on the JAX Tracer object Traced<ShapedArray(float32[5])>with<DynamicJaxprTrace(level=0/1)>
The error occurred while tracing the function this_fails at tmp.py:7 for jit. This concrete value was not available in Python because it depends on the value of the argument 'x'.
See https://jax.readthedocs.io/en/latest/errors.html#jax.errors.TracerArrayConversionError
The result is a TracerArrayConversionError, because you're attempting to pass a traced JAX value into a function that expects a numpy array (side note: see How To Think In JAX for an introduction to JAX Tracers and related topics).
In JAX version 0.3.17 or newer, you can get around this issue using jax.pure_callback:
@jax.jit
def numpy_callback(x):
    # Need to forward-declare the shape & dtype of the expected output.
    result_shape = jax.core.ShapedArray(x.shape, x.dtype)
    return jax.pure_callback(np.sin, result_shape, x)

x = jnp.arange(5.0)
print(numpy_callback(x))
[ 0. 0.841471 0.9092974 0.14112 -0.7568025]
A few caveats to keep in mind:
the resulting execution will rely on a callback to the host, so it will be quite slow on accelerators like GPU/TPU, particularly in distributed/multi-host settings. In the case of local CPU execution, though, it avoids buffer copies and can be quite performant.
if you vmap the function, it will result in a for loop of multiple callbacks (you can specify vectorized=True if the callback function handles batches natively).
autodiff transformations like grad and jacobian will not work with this function, because JAX has no way of reasoning about the computations being done. If you would like to use it with autodiff transformations, you could define custom gradients as in Custom Derivative Rules, though this would require having access to a function that computes the gradient of your callback function (a minimal sketch follows this list).
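For illustration, here is a minimal sketch of that last point, wrapping the np.sin callback in jax.custom_jvp; using np.cos as the hand-written derivative is an assumption made just for this example:
import jax
import jax.numpy as jnp
import numpy as np

@jax.custom_jvp
def numpy_sin(x):
    result_shape = jax.core.ShapedArray(x.shape, x.dtype)
    return jax.pure_callback(np.sin, result_shape, x)

@numpy_sin.defjvp
def numpy_sin_jvp(primals, tangents):
    x, = primals
    x_dot, = tangents
    result_shape = jax.core.ShapedArray(x.shape, x.dtype)
    # A second callback (np.cos) supplies the hand-written tangent.
    return numpy_sin(x), jax.pure_callback(np.cos, result_shape, x) * x_dot

print(jax.grad(lambda x: numpy_sin(x).sum())(jnp.arange(5.0)))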
None of this is documented yet on the JAX website, but we hope to write docs for pure_callback soon!
I don't understand why I get three different behaviors depending on the classifier I use, even though they should go hand in hand.
Here is the code, so we can go deeper into the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X, which=1, times=1):
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that adding the same information (like a repeated feature) leads to a decrease in the score (which happens, as expected, for the random forest).
The question is: why does this not happen with the other two classifiers?
Why do their scores remain stable?
Am I missing something from a theoretical point of view?
Thanks all.
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
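One way to check this explanation using the question's own code: let the forest consider every feature at each split by setting max_features=None, after which duplicated features should no longer degrade the score. This experiment is my suggestion, reusing X, y, repeat_feature, and do_the_job from the question:
from sklearn.ensemble import RandomForestClassifier

clf_all = RandomForestClassifier(n_estimators=5, max_features=None,
                                 random_state=42)
# With all features considered per split, duplicated copies should
# no longer reduce the diversity of the trees in the forest.
scores = [do_the_job(repeat_feature(X, times=i), y, clf_all)
          for i in range(1, 50)]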
I call the same model on the same input twice in a row and I don't get the same result. This model has nn.GRU layers, so I suspect that it has some internal state that should be released before the second run?
How do I reset the RNN hidden state to make it the same as if the model was initially loaded?
UPDATE:
Some context:
I'm trying to run the model from here:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L93
I'm calling generate:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L148
Here it actually has some code using the random generator in PyTorch:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L200
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py#L110
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py#L129
I have placed (I'm running code on CPU):
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)
in
https://github.com/erogol/WaveRNN/blob/master/utils/distribution.py
after all imports.
I have checked GRU weights between runs and they are the same:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L153
Also, I have checked the logits and samples between runs: the logits are the same but the samples are not, so @Andrew Naguib seems to have been right about random seeding. But I'm not sure where the code that fixes the random seed should be placed?
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L200
UPDATE 2:
I have placed the seed init inside generate and now the results are consistent:
https://github.com/erogol/WaveRNN/blob/master/models/wavernn.py#L148
I believe this may be highly related to random seeding. To ensure reproducible results (as stated by the PyTorch docs), you have to seed torch as in this:
import torch
torch.manual_seed(0)
And also, the CuDNN module.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
If you're using numpy, you could also do:
import numpy as np
np.random.seed(0)
However, they warn you:
Deterministic mode can have a performance impact, depending on your model.
A suggested script I regularly use, which has been working very well to reproduce results, is:
# imports
import numpy as np
import random
import torch
from torch.backends import cudnn  # needed for the cudnn flags below

# ...

""" Set Random Seed """
if args.random_seed is not None:
    """Following seeding lines of code are to ensure reproducible results
    Seeding the two pseudorandom number generators involved in PyTorch"""
    random.seed(args.random_seed)
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)
    # https://pytorch.org/docs/master/notes/randomness.html#cudnn
    if not args.cpu_only:
        torch.cuda.manual_seed(args.random_seed)
        cudnn.deterministic = True
        cudnn.benchmark = False
You can use model.init_hidden() to reset the RNN hidden state.
def init_hidden(self):
    # Initialize hidden and cell states
    return Variable(torch.zeros(num_layers, batch_size, hidden_size))
So, before calling the same model on the same data next time, you can call model.init_hidden() to reset the hidden and cell states to their initial values.
This clears out the history, in other words the hidden state the model accumulated after running on the data the first time (the learned weights themselves are unchanged).
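As a minimal sketch of the idea (the layer sizes here are illustrative, not taken from WaveRNN), explicitly passing a fresh zero hidden state guarantees every call starts from the same point:
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, num_layers=1)
x = torch.randn(5, 1, 8)    # (seq_len, batch, input_size)
h0 = torch.zeros(1, 1, 16)  # (num_layers, batch, hidden_size)

out1, _ = gru(x, h0)
out2, _ = gru(x, h0)        # same weights, same input, same h0
assert torch.allclose(out1, out2)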
Sklearn provides different data generation functions, such as make_blobs and make_regression, in sklearn.datasets.
However, I am not aware of any functions that can generate sequential data. Are there any existing libraries that can generate artificial sequential data?
It really depends on what kind of series you want. Check out this repository for generating different kinds of simulated series; it's called TimeSynth.
But if you just want something you can easily modify yourself, try writing a function similar to this:
import numpy as np

def SynthSeries(start, end, stepSize, coefficients):
    # Sum sine waves at the given base frequencies over a sample grid.
    samples = np.arange(start, end, stepSize)
    array = np.zeros(np.shape(samples))
    for coeff in coefficients:
        array = np.add(array, np.sin(coeff * samples))
    return array, samples
This is sort of a reverse of a Fourier transform: if you know the base frequencies of the series you want to create, you can pass them into this function to recreate the signal.
You can use it like this:
import matplotlib.pyplot as plt
(SeqData,samples) = SynthSeries(0,20,0.1,[12,3,1,22])
plt.plot(samples, SeqData)
plt.show()
Reproducibility is important. In a closed-source machine learning project I'm currently working on, it is hard to achieve. What are the parts to look at?
Setting seeds
Computers have pseudo-random number generators which are initialized with a value called the seed. For machine learning, you might need to do the following:
# I've heard the order here is important
import random
random.seed(0)
import numpy as np
np.random.seed(0)
import tensorflow as tf
tf.set_random_seed(0)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
from keras import backend as K
K.set_session(sess) # tell keras about the seeded session
# now import keras stuff
See also: Keras FAQ: How can I obtain reproducible results using Keras during development?
sklearn
sklearn.model_selection.train_test_split has a random_state parameter.
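A minimal sketch (toy data, just to show the parameter):
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
# Fixing random_state makes the split identical across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)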
What to check
Am I loading the data in the same order every time?
Do I initialize the model the same way?
Do I use external data that might change?
Do I use external state that might change (e.g. datetime.now)?
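One concrete way to check the data-related items above is to fingerprint your inputs between runs; a small sketch (the fingerprint helper is hypothetical, not from any library):
import hashlib
import numpy as np

def fingerprint(arr):
    # Hash the raw bytes of an array so two runs can be compared.
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

# e.g. log fingerprint(X_train) at the start of every run and diff the logs.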