I am learning distributed TensorFlow, and I implemented a simple version of in-graph replication as below (task_parallel.py):
import argparse
import logging
import tensorflow as tf
log = logging.getLogger(__name__)
# Job Names
PARAMETER_SERVER = "ps"
WORKER_SERVER = "worker"
# Cluster Details
CLUSTER_SPEC = {
PARAMETER_SERVER: ["localhost:2222"],
WORKER_SERVER: ["localhost:1111", "localhost:1112", "localhost:1113"]}
def parse_command_arguments():
""" Set up and parse the command line arguments passed for experiment. """
parser = argparse.ArgumentParser(
description="Parameters and Arguments for the Test.")
parser.add_argument(
"--ps_hosts",
type=str,
default="",
help="Comma-separated list of hostname:port pairs"
)
parser.add_argument(
"--worker_hosts",
type=str,
default="",
help="Comma-separated list of hostname:port pairs"
)
parser.add_argument(
"--job_name",
type=str,
default="",
help="One of 'ps', 'worker'"
)
# Flags for defining the tf.train.Server
parser.add_argument(
"--task_index",
type=int,
default=0,
help="Index of task within the job"
)
return parser.parse_args()
def start_server(
job_name, ps_hosts, task_index, worker_hosts):
""" Create a server based on a cluster spec. """
cluster_spec = {
PARAMETER_SERVER: ps_hosts,
WORKER_SERVER: worker_hosts}
cluster = tf.train.ClusterSpec(cluster_spec)
server = tf.train.Server(
cluster, job_name=job_name, task_index=task_index)
return server
def model():
""" Build up a simple estimator model. """
with tf.device("/job:%s/task:0" % PARAMETER_SERVER):
log.info("111")
# Build a linear model and predict values
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)
global_step = tf.Variable(0)
with tf.device("/job:%s/task:0" % WORKER_SERVER):
# Loss sub-graph
loss = tf.reduce_sum(tf.square(linear_model - y))
log.info("222")
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
with tf.device("/job:%s/task:1" % WORKER_SERVER):
log.info("333")
train = optimizer.minimize(loss, global_step=global_step)
return W, b, loss, x, y, train, global_step
def main():
# Parse arguments from command line.
arguments = parse_command_arguments()
# Initializing logging with level "INFO".
logging.basicConfig(level=logging.INFO)
ps_hosts = arguments.ps_hosts.split(",")
worker_hosts = arguments.worker_hosts.split(",")
job_name = arguments.job_name
task_index = arguments.task_index
# Start a server.
server = start_server(
job_name, ps_hosts, task_index, worker_hosts)
W, b, loss, x, y, train, global_step = model()
# with sv.prepare_or_wait_for_session(server.target) as sess:
with tf.train.MonitoredTrainingSession(
master=server.target,
is_chief=(arguments.task_index == 0 and (
arguments.job_name == 'ps')),
config=tf.ConfigProto(log_device_placement=True)) as sess:
step = 0
# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
while not sess.should_stop() and step < 1000:
_, step = sess.run(
[train, global_step], {x: x_train, y: y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run(
[W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
if __name__ == "__main__":
main()
I ran the code as 3 separate processes on a single machine (a Mac Pro, CPU only):
PS: $python task_parallel.py --task_index 0 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name ps
Worker 1: $python task_parallel.py --task_index 0 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name worker
Worker 2: $python task_parallel.py --task_index 1 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name worker
I noticed that the results were not what I expected. Specifically, I expected the "PS" process to print only 111, "Worker 1" to print only 222, and "Worker 2" to print only 333, since I assigned a task to each process. However, all 3 processes printed exactly the same thing:
INFO:__main__:111
INFO:__main__:222
INFO:__main__:333
Isn't it true that the PS process should only execute the code inside the with tf.device("/job:%s/task:0" % PARAMETER_SERVER) block, and likewise for the workers? I wonder if I missed something in my code.
I also found that I had to start all the worker processes first and the ps process afterwards; otherwise, the worker processes could not exit gracefully after training was done. I would like to know the reason for this behaviour in my code. Any help is really appreciated :) Thanks!
Please note that, in your snippet, the code before MonitoredTrainingSession only describes and builds the computation graph; both the parameter server and the workers execute this code to construct the graph. The graph is frozen once the MonitoredTrainingSession is created.
If you want to see 111 only in the PS process, your code could branch on the job name like this:
FLAGS = tf.app.flags.FLAGS
if FLAGS.job_name == 'ps':
    print('111')
    server.join()
else:
    print('222')
If you want to set up model replicas on the workers, do something like this in the model() function:
with tf.device('/job:ps/task:0'):
    # define the variables on the parameter server
with tf.device('/job:worker/task:%d' % FLAGS.task_index):
    # define the model on the worker with this task_index
Additionally, replica_device_setter will automatically assign devices to Operation objects as they are constructed.
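For example, a minimal sketch of that pattern (assuming cluster is the tf.train.ClusterSpec built in start_server() and FLAGS.task_index identifies the current worker; this is an illustration, not your exact model) might look like this:
# Illustrative sketch: variables are placed on the ps job, ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    W = tf.Variable([.3], tf.float32)
    b = tf.Variable([-.3], tf.float32)
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    loss = tf.reduce_sum(tf.square(W * x + b - y))
    global_step = tf.Variable(0, trainable=False)
    train = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)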
TensorFlow provides some examples, such as:
hello distributed, a basic guide in the TensorFlow tutorials.
mnist_replica.py, a distributed MNIST training and validation, with model replicas.
cifar10_multi_gpu_train.py, a binary to train CIFAR-10 using multiple GPUs with synchronous updates.
I hope this helps.
This is my first time posting a question here. Please correct me if I am not providing the right information.
I am trying to implement DDPG for the cartpole problem from here: https://spinningup.openai.com/en/latest/user/algorithms.html
It's giving the error:
act_limit = env.action_space.high[0] #AD
AttributeError: 'Discrete' object has no attribute 'high'
Can you suggest how to fix this? I think I am getting this error because CartPole's action space is Discrete rather than continuous, so act_dim returns a discrete value.
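For reference, here is a minimal sketch (assuming a standard gym install) of how the two action-space types differ; CartPole-v0 exposes n but has no high:
import gym
from gym import spaces

env = gym.make('CartPole-v0')
if isinstance(env.action_space, spaces.Box):
    # Continuous actions: per-dimension bounds are available.
    act_dim = env.action_space.shape[0]
    act_limit = env.action_space.high[0]
else:
    # Discrete actions: only the number of actions is available; there is no .high.
    act_dim = env.action_space.n
    act_limit = None
print(type(env.action_space), act_dim, act_limit)
My full script is below: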
from copy import deepcopy
import numpy as np
import torch
from torch.optim import Adam
import gym
import time
import spinningup.spinup.algos.pytorch.ddpg.core as core
from spinningup.spinup.utils.logx import EpochLogger
class ReplayBuffer:
"""
A simple FIFO experience replay buffer for DDPG agents.
"""
def __init__(self, obs_dim, act_dim, size):
self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
self.obs2_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32) #AD action_memory
self.rew_buf = np.zeros(size, dtype=np.float32) #AD reward mem
self.done_buf = np.zeros(size, dtype=np.float32) #AD Terminal memory
self.ptr, self.size, self.max_size = 0, 0, size
def store(self, obs, act, rew, next_obs, done): #AD Store tranisiton
self.obs_buf[self.ptr] = obs
self.obs2_buf[self.ptr] = next_obs
self.act_buf[self.ptr] = act
self.rew_buf[self.ptr] = rew
self.done_buf[self.ptr] = done
self.ptr = (self.ptr+1) % self.max_size
self.size = min(self.size+1, self.max_size)
def sample_batch(self, batch_size=32):
idxs = np.random.randint(0, self.size, size=batch_size)
batch = dict(obs=self.obs_buf[idxs],
obs2=self.obs2_buf[idxs],
act=self.act_buf[idxs],
rew=self.rew_buf[idxs],
done=self.done_buf[idxs])
return {k: torch.as_tensor(v, dtype=torch.float32) for k,v in batch.items()}
def ddpg(env_fn, actor_critic=core.MLPActorCritic, ac_kwargs=dict(), seed=0,
steps_per_epoch=4000, epochs=100, replay_size=int(1e6), gamma=0.99,
polyak=0.995, pi_lr=1e-3, q_lr=1e-3, batch_size=100, start_steps=10000,
update_after=1000, update_every=50, act_noise=0.1, num_test_episodes=10,
max_ep_len=1000, logger_kwargs=dict(), save_freq=1):
"""
Deep Deterministic Policy Gradient (DDPG)
Args:
env_fn : A function which creates a copy of the environment.
The environment must satisfy the OpenAI Gym API.
actor_critic: The constructor method for a PyTorch Module with an ``act``
method, a ``pi`` module, and a ``q`` module. The ``act`` method and
``pi`` module should accept batches of observations as inputs,
and ``q`` should accept a batch of observations and a batch of
actions as inputs. When called, these should return:
=========== ================ ======================================
Call Output Shape Description
=========== ================ ======================================
``act`` (batch, act_dim) | Numpy array of actions for each
| observation.
``pi`` (batch, act_dim) | Tensor containing actions from policy
| given observations.
``q`` (batch,) | Tensor containing the current estimate
| of Q* for the provided observations
| and actions. (Critical: make sure to
| flatten this!)
=========== ================ ======================================
ac_kwargs (dict): Any kwargs appropriate for the ActorCritic object
you provided to DDPG.
seed (int): Seed for random number generators.
steps_per_epoch (int): Number of steps of interaction (state-action pairs)
for the agent and the environment in each epoch.
epochs (int): Number of epochs to run and train agent.
replay_size (int): Maximum length of replay buffer.
gamma (float): Discount factor. (Always between 0 and 1.)
polyak (float): Interpolation factor in polyak averaging for target
networks. Target networks are updated towards main networks
according to:
.. math:: \\theta_{\\text{targ}} \\leftarrow
\\rho \\theta_{\\text{targ}} + (1-\\rho) \\theta
where :math:`\\rho` is polyak. (Always between 0 and 1, usually
close to 1.)
pi_lr (float): Learning rate for policy.
q_lr (float): Learning rate for Q-networks.
batch_size (int): Minibatch size for SGD.
start_steps (int): Number of steps for uniform-random action selection,
before running real policy. Helps exploration.
update_after (int): Number of env interactions to collect before
starting to do gradient descent updates. Ensures replay buffer
is full enough for useful updates.
update_every (int): Number of env interactions that should elapse
between gradient descent updates. Note: Regardless of how long
you wait between updates, the ratio of env steps to gradient steps
is locked to 1.
act_noise (float): Stddev for Gaussian exploration noise added to
policy at training time. (At test time, no noise is added.)
num_test_episodes (int): Number of episodes to test the deterministic
policy at the end of each epoch.
max_ep_len (int): Maximum length of trajectory / episode / rollout.
logger_kwargs (dict): Keyword args for EpochLogger.
save_freq (int): How often (in terms of gap between epochs) to save
the current policy and value function.
"""
logger = EpochLogger(**logger_kwargs)
logger.save_config(locals())
torch.manual_seed(seed)
np.random.seed(seed)
env, test_env = env_fn(), env_fn()
obs_dim = env.observation_space.shape
# act_dim = env.action_space.shape[0] #AD
if len(env.action_space.shape) > 1:
    act_dim = env.action_space.shape[0]
else:
    act_dim = env.action_space.n
# Action limit for clamping: critically, assumes all dimensions share the same bound!
act_limit = env.action_space.high[0] #AD
# act_limit = env.action_space.high
# Create actor-critic module and target networks
ac = actor_critic(env.observation_space, env.action_space, **ac_kwargs)
ac_targ = deepcopy(ac)
# Freeze target networks with respect to optimizers (only update via polyak averaging)
for p in ac_targ.parameters():
p.requires_grad = False
# Experience buffer
replay_buffer = ReplayBuffer(obs_dim=obs_dim, act_dim=act_dim, size=replay_size)
# Count variables (protip: try to get a feel for how different size networks behave!)
var_counts = tuple(core.count_vars(module) for module in [ac.pi, ac.q])
logger.log('\nNumber of parameters: \t pi: %d, \t q: %d\n'%var_counts)
# Set up function for computing DDPG Q-loss
def compute_loss_q(data):
o, a, r, o2, d = data['obs'], data['act'], data['rew'], data['obs2'], data['done']
q = ac.q(o,a)
# Bellman backup for Q function
with torch.no_grad():
q_pi_targ = ac_targ.q(o2, ac_targ.pi(o2))
backup = r + gamma * (1 - d) * q_pi_targ
# MSE loss against Bellman backup
loss_q = ((q - backup)**2).mean()
# Useful info for logging
loss_info = dict(QVals=q.detach().numpy())
return loss_q, loss_info
# Set up function for computing DDPG pi loss
def compute_loss_pi(data):
o = data['obs']
q_pi = ac.q(o, ac.pi(o))
return -q_pi.mean()
# Set up optimizers for policy and q-function
pi_optimizer = Adam(ac.pi.parameters(), lr=pi_lr)
q_optimizer = Adam(ac.q.parameters(), lr=q_lr)
# Set up model saving
logger.setup_pytorch_saver(ac)
def update(data):
# First run one gradient descent step for Q.
q_optimizer.zero_grad()
loss_q, loss_info = compute_loss_q(data)
loss_q.backward()
q_optimizer.step()
# Freeze Q-network so you don't waste computational effort
# computing gradients for it during the policy learning step.
for p in ac.q.parameters():
p.requires_grad = False
# Next run one gradient descent step for pi.
pi_optimizer.zero_grad()
loss_pi = compute_loss_pi(data)
loss_pi.backward()
pi_optimizer.step()
# Unfreeze Q-network so you can optimize it at next DDPG step.
for p in ac.q.parameters():
p.requires_grad = True
# Record things
logger.store(LossQ=loss_q.item(), LossPi=loss_pi.item(), **loss_info)
# Finally, update target networks by polyak averaging.
with torch.no_grad():
for p, p_targ in zip(ac.parameters(), ac_targ.parameters()):
# NB: We use an in-place operations "mul_", "add_" to update target
# params, as opposed to "mul" and "add", which would make new tensors.
p_targ.data.mul_(polyak)
p_targ.data.add_((1 - polyak) * p.data)
def get_action(o, noise_scale):
a = ac.act(torch.as_tensor(o, dtype=torch.float32))
a += noise_scale * np.random.randn(act_dim)
return np.clip(a, -act_limit, act_limit)
def test_agent():
for j in range(num_test_episodes):
o, d, ep_ret, ep_len = test_env.reset(), False, 0, 0
while not(d or (ep_len == max_ep_len)):
# Take deterministic actions at test time (noise_scale=0)
o, r, d, _ = test_env.step(get_action(o, 0))
ep_ret += r
ep_len += 1
logger.store(TestEpRet=ep_ret, TestEpLen=ep_len)
# Prepare for interaction with environment
total_steps = steps_per_epoch * epochs
start_time = time.time()
o, ep_ret, ep_len = env.reset(), 0, 0
# Main loop: collect experience in env and update/log each epoch
for t in range(total_steps):
# Until start_steps have elapsed, randomly sample actions
# from a uniform distribution for better exploration. Afterwards,
# use the learned policy (with some noise, via act_noise).
if t > start_steps:
a = get_action(o, act_noise)
else:
a = env.action_space.sample()
# Step the env
o2, r, d, _ = env.step(a)
ep_ret += r
ep_len += 1
# Ignore the "done" signal if it comes from hitting the time
# horizon (that is, when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len==max_ep_len else d
# Store experience to replay buffer
replay_buffer.store(o, a, r, o2, d)
# Super critical, easy to overlook step: make sure to update
# most recent observation!
o = o2
# End of trajectory handling
if d or (ep_len == max_ep_len):
logger.store(EpRet=ep_ret, EpLen=ep_len)
o, ep_ret, ep_len = env.reset(), 0, 0
# Update handling
if t >= update_after and t % update_every == 0:
for _ in range(update_every):
batch = replay_buffer.sample_batch(batch_size)
update(data=batch)
# End of epoch handling
if (t+1) % steps_per_epoch == 0:
epoch = (t+1) // steps_per_epoch
# Save model
if (epoch % save_freq == 0) or (epoch == epochs):
logger.save_state({'env': env}, None)
# Test the performance of the deterministic version of the agent.
test_agent()
# Log info about epoch
logger.log_tabular('Epoch', epoch)
logger.log_tabular('EpRet', with_min_and_max=True)
logger.log_tabular('TestEpRet', with_min_and_max=True)
logger.log_tabular('EpLen', average_only=True)
logger.log_tabular('TestEpLen', average_only=True)
logger.log_tabular('TotalEnvInteracts', t)
logger.log_tabular('QVals', with_min_and_max=True)
logger.log_tabular('LossPi', average_only=True)
logger.log_tabular('LossQ', average_only=True)
logger.log_tabular('Time', time.time()-start_time)
logger.dump_tabular()
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--env', type=str, default='CartPole-v0')
parser.add_argument('--hid', type=int, default=256)
parser.add_argument('--l', type=int, default=2)
parser.add_argument('--gamma', type=float, default=0.99) #change this
parser.add_argument('--seed', '-s', type=int, default=0) #change this
parser.add_argument('--epochs', type=int, default=50)
parser.add_argument('--exp_name', type=str, default='ddpg')
args = parser.parse_args()
from spinningup.spinup.utils.run_utils import setup_logger_kwargs
logger_kwargs = setup_logger_kwargs(args.exp_name, args.seed)
ddpg(lambda : gym.make(args.env), actor_critic=core.MLPActorCritic,
ac_kwargs=dict(hidden_sizes=[args.hid]*args.l),
gamma=args.gamma, seed=args.seed, epochs=args.epochs,
logger_kwargs=logger_kwargs)
I am trying to find the best c parameter, following the instructions for a task that asks me to 'Define a function, fit_generative_model, that takes as input a training set (train_data, train_labels) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, where
pi[j]: the frequency of that label
mu[j]: the 784-dimensional mean vector
sigma[j]: the 784x784 covariance matrix
It is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. c is now a parameter, and by setting it appropriately, we can improve the performance of the model.'
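For concreteness, here is a minimal sketch of what such a fit could look like (these are my own illustrative helpers, not the graded solution; it assumes x has shape (N, 784) and y holds labels 0-9):
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(x, y, c, k=10):
    # pi[j]: label frequency, mu[j]: 784-dim mean, sigma[j]: 784x784 covariance + c*I
    d = x.shape[1]
    pi = np.zeros(k)
    mu = np.zeros((k, d))
    sigma = np.zeros((k, d, d))
    for j in range(k):
        idx = (y == j)
        pi[j] = float(np.sum(idx)) / len(y)
        mu[j] = np.mean(x[idx, :], axis=0)
        sigma[j] = np.cov(x[idx, :], rowvar=0, bias=1) + c * np.identity(d)
    return pi, mu, sigma

def predict(x, pi, mu, sigma, k=10):
    # Pick the label with the highest log pi[j] + log N(x | mu[j], sigma[j]).
    scores = np.zeros((len(x), k))
    for j in range(k):
        scores[:, j] = np.log(pi[j]) + multivariate_normal.logpdf(x, mean=mu[j], cov=sigma[j])
    return np.argmax(scores, axis=1)
The cI term just keeps each sigma[j] well-conditioned so the density evaluation does not fail on a singular covariance matrix. The data-loading code I am using is below.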
%matplotlib inline
import sys
import matplotlib.pyplot as plt
import gzip, os
import numpy as np
from scipy.stats import multivariate_normal
if sys.version_info[0] == 2:
from urllib import urlretrieve
else:
from urllib.request import urlretrieve
# Downloads the dataset
def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
print("Downloading %s" % filename)
urlretrieve(source + filename, filename)
# Invokes download() if necessary, then reads in images
def load_mnist_images(filename):
if not os.path.exists(filename):
download(filename)
with gzip.open(filename, 'rb') as f:
data = np.frombuffer(f.read(), np.uint8, offset=16)
data = data.reshape(-1,784)
return data
def load_mnist_labels(filename):
if not os.path.exists(filename):
download(filename)
with gzip.open(filename, 'rb') as f:
data = np.frombuffer(f.read(), np.uint8, offset=8)
return data
## Load the training set
train_data = load_mnist_images('train-images-idx3-ubyte.gz')
train_labels = load_mnist_labels('train-labels-idx1-ubyte.gz')
## Load the testing set
test_data = load_mnist_images('t10k-images-idx3-ubyte.gz')
test_labels = load_mnist_labels('t10k-labels-idx1-ubyte.gz')
train_data.shape, train_labels.shape
So I have written this code for three different c values, and they each give me the same error count:
def fit_generative_model(x,y):
lst=[]
for c in [20,200, 4000]:
k = 10 # labels 0,1,...,k-1
d = (x.shape)[1] # number of features
mu = np.zeros((k,d))
sigma = np.zeros((k,d,d))
pi = np.zeros(k)
for label in range(0,k):
indices = (y == label)
mu[label] = np.mean(x[indices,:], axis=0)
sigma[label] = np.cov(x[indices,:], rowvar=0, bias=1) + c*np.identity(784) # I define the identity matrix
predictions = np.argmax(score, axis=1)
errors = np.sum(predictions != y)
lst.append(errors)
print(c,"Model makes " + str(errors) + " errors out of 10000", lst)
Then I fit it to the training data and get these same errors:
mu, sigma, pi = fit_generative_model(train_data, train_labels)
20 Model makes 1 errors out of 10000 [1]
200 Model makes 1 errors out of 10000 [1, 1]
4000 Model makes 1 errors out of 10000 [1, 1, 1]
and to the test data:
mu, sigma, pi = fit_generative_model(test_data, test_labels)
20 Model makes 9020 errors out of 10000 [9020]
200 Model makes 9020 errors out of 10000 [9020, 9020]
4000 Model makes 9020 errors out of 10000 [9020, 9020, 9020]
What is it I'm doing wrong? The correct answer is c=4000, which yields an error rate of ~4.3%.
I am submitting the training through a script file. Following is the content of the train.py script. Azure ML is treating all of these as one run (instead of one run per alpha value, as coded below) because Run.get_context() returns the same Run id.
train.py
from azureml.opendatasets import Diabetes
from azureml.core import Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib
import math
import os
import logging
# Load dataset
dataset = Diabetes.get_tabular_dataset()
print(dataset.take(1))
df = dataset.to_pandas_dataframe()
df.describe()
# Split X (independent variables) & Y (target variable)
x_df = df.dropna() # Remove rows that have missing values
y_df = x_df.pop("Y") # Y is the label/target variable
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)
print('Original dataset size:', df.size)
print("Size after dropping 'na':", x_df.size)
print("Training split size: ", x_train.size)
print("Test split size: ", x_test.size)
# Training
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
# Create and log interactive runs
output_dir = os.path.join(os.getcwd(), 'outputs')
for hyperparam_alpha in alphas:
# Get the experiment run context
run = Run.get_context()
print("Started run: ", run.id)
run.log("train_split_size", x_train.size)
run.log("test_split_size", x_train.size)
run.log("alpha_value", hyperparam_alpha)
# Train
print("Train ...")
model = Ridge(hyperparam_alpha)
model.fit(X = x_train, y = y_train)
# Predict
print("Predict ...")
y_pred = model.predict(X = x_test)
# Calculate & log error
rmse = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
run.log("rmse", rmse)
print("rmse", rmse)
# Serialize the model to local directory
if not os.path.isdir(output_dir):
os.makedirs(output_dir, exist_ok=True)
print("Save model ...")
model_name = "model_alpha_" + str(hyperparam_alpha) + ".pkl" # Pickle file
file_path = os.path.join(output_dir, model_name)
joblib.dump(value = model, filename = file_path)
# Upload the model
run.upload_file(name = model_name, path_or_stream = file_path)
# Complete the run
run.complete()
Experiments view
Authoring code (i.e. control plane)
import os
from azureml.core import Workspace, Experiment, RunConfiguration, ScriptRunConfig, VERSION, Run
ws = Workspace.from_config()
exp = Experiment(workspace = ws, name = "diabetes-local-script-file")
# Create new run config obj
run_local_config = RunConfiguration()
# This means that when we run locally, all dependencies are already provided.
run_local_config.environment.python.user_managed_dependencies = True
# Create new script config
script_run_cfg = ScriptRunConfig(
source_directory = os.path.join(os.getcwd(), 'code'),
script = 'train.py',
run_config = run_local_config)
run = exp.submit(script_run_cfg)
run.wait_for_completion(show_output=True)
Short Answer
Option 1: create child runs within run
run = Run.get_context() assigns the run object of the run that you're currently in to run. So in every iteration of the hyperparameter search, you're logging to the same run. To solve this, you need to create child (or sub-) runs for each hyperparameter value. You can do this with run.child_run(). Below is the template for making this happen.
run = Run.get_context()
for hyperparam_alpha in alphas:
    # Create a child (sub-) run for this hyperparameter value
    run_child = run.child_run()
    print("Started run: ", run_child.id)
    run_child.log("train_split_size", x_train.size)
    # ... train, log metrics, and upload files against run_child instead of run ...
    run_child.complete()  # mark the child run as finished
On the diabetes-local-script-file Experiment page, you can see that Run 9 was the parent run and Runs 10-19 were the child runs if you select "Include child runs". There is also a "Child runs" tab on the Run 9 details page.
Long Answer
I highly recommend abstracting the hyperparameter search away from the data plane (i.e. train.py) and into the control plane (i.e. the "authoring code"). This becomes especially valuable as training time increases; you can parallelize arbitrarily and also choose hyperparameters more intelligently by using Azure ML's Hyperdrive.
Option 2: create runs from the control plane
Remove the loop from your code, and add the argument-parsing code like below (full data plane and control plane code here):
import argparse
from pprint import pprint
parser = argparse.ArgumentParser()
parser.add_argument('--alpha', type=float, default=0.5)
args = parser.parse_args()
print("all args:")
pprint(vars(args))
# use the variable like this
model = Ridge(args.alpha)
Below is how to submit runs using a script argument. To submit multiple runs, just loop over the alpha values in the control plane:
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
list_rcs = [ScriptRunConfig(
source_directory = os.path.join(os.getcwd(), 'code'),
script = 'train.py',
arguments=['--alpha',a],
run_config = run_local_config) for a in alphas]
list_runs = [exp.submit(rc) for rc in list_rcs]
Option 3: Hyperdrive (IMHO the recommended approach)
In this way you outsource the hyperparameter search to Hyperdrive. The UI will also report results exactly how you want them, and via the API you can easily download the best model. Note that you can't run this locally anymore and must use AMLCompute, but to me it is a worthwhile trade-off. This is a great overview; an excerpt is below (full code here):
param_sampling = GridParameterSampling( {
"alpha": choice(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
}
)
estimator = Estimator(
source_directory = os.path.join(os.getcwd(), 'code'),
entry_script = 'train.py',
compute_target=cpu_cluster,
environment_definition=Environment.get(workspace=ws, name="AzureML-Tutorial")
)
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
hyperparameter_sampling=param_sampling,
policy=None,
primary_metric_name="rmse",
primary_metric_goal=PrimaryMetricGoal.MINIMIZE,  # rmse: lower is better
max_total_runs=10,
max_concurrent_runs=4)
run = exp.submit(hyperdrive_run_config)
run.wait_for_completion(show_output=True)
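As a follow-up, once the Hyperdrive run completes you can pull out the best child run via the SDK. A short sketch (it assumes each child logged the primary metric "rmse"):
# Retrieve the child run with the best primary metric and inspect its outputs.
best_run = run.get_best_run_by_primary_metric()
print("Best run:", best_run.id)
print("Metrics:", best_run.get_metrics())
print("Files:", best_run.get_file_names())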
I can't work out how to invoke my .tflite model, which does a matmul, on the Coral accelerator using the Python API.
The .tflite model is generated from some example code here. It works well using the tf.lite.Interpreter() class, but I don't know how to transform it to work with the edgetpu classes. I have tried edgetpu.basic.basic_engine.BasicEngine() after changing the model's datatype from numpy.float32 to numpy.uint8, but that did not help. I am a complete beginner with TensorFlow and just want to use my TPU for matmul.
import numpy
import tensorflow as tf
import edgetpu
from edgetpu.basic.basic_engine import BasicEngine
def export_tflite_from_session(session, input_nodes, output_nodes, tflite_filename):
print("Converting to tflite...")
converter = tf.lite.TFLiteConverter.from_session(session, input_nodes, output_nodes)
tflite_model = converter.convert()
with open(tflite_filename, "wb") as f:
f.write(tflite_model)
print("Converted %s." % tflite_filename)
#This does matmul just fine but does not use the TPU
def test_tflite_model(tflite_filename, examples):
print("Loading TFLite interpreter for %s..." % tflite_filename)
interpreter = tf.lite.Interpreter(model_path=tflite_filename)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("input details: %s" % input_details)
print("output details: %s" % output_details)
for i, input_tensor in enumerate(input_details):
interpreter.set_tensor(input_tensor['index'], examples[i])
interpreter.invoke()
model_output = []
for i, output_tensor in enumerate(output_details):
model_output.append(interpreter.get_tensor(output_tensor['index']))
return model_output
#this should use the TPU, but I don't know how to run the model or if it needs
#further processing. One matrix can be constant for my use case
def test_tpu(tflite_filename,examples):
print("Loading TFLite interpreter for %s..." % tflite_filename)
#TODO edgetpu.basic
interpreter = BasicEngine(tflite_filename)
interpreter.allocate_tensors()#does not work...
def main():
tflite_filename = "model.tflite"
shape_a = (2, 2)
shape_b = (2, 2)
a = tf.placeholder(dtype=tf.float32, shape=shape_a, name="A")
b = tf.placeholder(dtype=tf.float32, shape=shape_b, name="B")
c = tf.matmul(a, b, name="output")
numpy.random.seed(1234)
a_ = numpy.random.rand(*shape_a).astype(numpy.float32)
b_ = numpy.random.rand(*shape_b).astype(numpy.float32)
with tf.Session() as session:
session_output = session.run(c, feed_dict={a: a_, b: b_})
export_tflite_from_session(session, [a, b], [c], tflite_filename)
tflite_output = test_tflite_model(tflite_filename, [a_, b_])
tflite_output = tflite_output[0]
#test the TPU
tflite_output = test_tpu(tflite_filename, [a_, b_])
print("Input example:")
print(a_)
print(a_.shape)
print(b_)
print(b_.shape)
print("Session output:")
print(session_output)
print(session_output.shape)
print("TFLite output:")
print(tflite_output)
print(tflite_output.shape)
print(numpy.allclose(session_output, tflite_output))
if __name__ == '__main__':
main()
You're only converting your model to .tflite; it is not compiled for the Edge TPU. From the docs:
At the first point in the model graph where an unsupported operation occurs, the compiler partitions the graph into two parts. The first part of the graph that contains only supported operations is compiled into a custom operation that executes on the Edge TPU, and everything else executes on the CPU
There are several specific requirements that the model must meet:
quantization-aware training
constant tensor sizes and model parameters at compile time
tensors are 3-dimensional or smaller
models only use operations supported by the Edge TPU
There is an online compiler as well as a CLI version that is useful for translating .tflite models into Edge TPU compatible .tflite models.
Your code is also incomplete. You've passed your model to the class here:
interpreter = BasicEngine(tflite_filename)
but you're missing the step of actually running the inference on an input tensor (BasicEngine exposes a run_inference() method for this):
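To make that concrete, here is a minimal sketch. It assumes the model has already been quantized and compiled with the Edge TPU compiler into model_edgetpu.tflite (the filename is my assumption) and takes a single flattened uint8 input:
import numpy as np
from edgetpu.basic.basic_engine import BasicEngine

# Load the Edge-TPU-compiled model (path is an assumption).
engine = BasicEngine("model_edgetpu.tflite")
# BasicEngine expects a flat 1-D uint8 array as input.
input_size = engine.required_input_array_size()
input_data = np.random.randint(0, 256, size=input_size, dtype=np.uint8)
# run_inference returns (latency in ms, flat output tensor).
latency_ms, output = engine.run_inference(input_data)
print(latency_ms, output)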
I'm currently trying to implement a TensorFlow pipeline. I want to load the data with my CPU and use my GPU to run the graph at the same time. In order to better understand what is happening, I've created a very simple convolutional network:
import os
import h5py
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
sess= tf.InteractiveSession()
from tensorflow.python.client import timeline
import time
t1 = time.time()
class generator:
def __init__(self, file):
self.file = file
def __call__(self):
with h5py.File(self.file, 'r') as hf:
for im in hf["data"]:
yield tuple(im)
dataset = tf.data.Dataset().from_generator(generator('file.h5'),
output_types= tf.float32,
output_shapes=(tf.TensorShape([None,4,128,128,3])))
dataset = dataset.batch(batch_size=1000)
dataset = dataset.prefetch(10)
iter = dataset.make_initializable_iterator()
e1 = iter.get_next()
e1 = tf.reshape(e1, (-1, 128, 128, 3))
with tf.device('gpu'):
output = tf.layers.conv2d(e1[:150],200,(5,5))
output = tf.layers.conv2d(output,50,(5,5))
output = tf.layers.conv2d(output, 50, (5, 5))
output = tf.layers.conv2d(output, 25, (5, 5))
with tf.Session() as sess:
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 2
tf.Session(config=config)
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(tf.global_variables_initializer())
sess.run(iter.initializer)
for i in range(10):
a = sess.run(output, options=options, run_metadata=run_metadata)
print('done')
fetched_timeline = timeline.Timeline(run_metadata.step_stats)
chrome_trace = fetched_timeline.generate_chrome_trace_format()
with open('timeline_01.json', 'w') as f:
f.write(chrome_trace)
t2= time.time()
print('TIME', t2-t1)
And I don't understand the results:
First, it seems that the number of threads has no effect on the time it takes to run the whole code (68 seconds). Indeed, when I comment out the following lines:
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 2
tf.Session(config=config)
it is still the same...
Second, why are the GPU and the CPU not used at the same time? Am I doing something wrong?
If someone can help me, it would be much appreciated, because I've already spent two days on this issue.
Thanks a lot for your help