Distributed TensorFlow: tracking timestamps for synchronization operations - multithreading

I am new to TensorFlow. Currently, I am trying to evaluate the performance of distributed TensorFlow using the Inception model provided by the TensorFlow team.
What I want is to generate timestamps for some critical operations in a Parameter Server - Worker architecture, so I can measure the bottleneck (the network lag due to parameter transfer/synchronization, or the parameter computation cost) on replicas for one iteration (batch).
I came up with the idea of adding a customized dummy py_func op, designed to print timestamps, inside inception_distributed_train.py, with some control dependencies. Here are the pieces of code that I added:
import threading
import time
from datetime import datetime
from os import getpid

def timer(s):
    # Print thread ID, process ID, a label, and a microsecond-resolution timestamp.
    print("-------- thread ID ", threading.current_thread().ident, ", ---- Process ID ----- ", getpid(), " ~~~~~~~~~~~~~~~ ", s, datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S.%f'))
    return False

dummy1 = tf.py_func(timer, ["got gradients, before dequeues token "], tf.bool)
dummy2 = tf.py_func(timer, ["finished dequeueing the token "], tf.bool)
I modified
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
with tf.control_dependencies([apply_gradients_op]):
    train_op = tf.identity(total_loss, name='train_op')
into
with tf.control_dependencies([dummy1]):
    apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
with tf.control_dependencies([apply_gradients_op]):
    with tf.control_dependencies([dummy2]):
        train_op = tf.identity(total_loss, name='train_op')
hoping to print one timestamp before apply_gradients_op starts evaluating and another after it finishes, by enforcing node dependencies.
I did similar things inside sync_replicas_optimizer.apply_gradients, adding two dummy print nodes before and after update_op:
dummy1 = tf.py_func(timer, ["---------- before update_op "], tf.bool)
dummy2 = tf.py_func(timer, ["---------- finished update_op "], tf.bool)

# sync_op will be assigned to the same device as the global step.
with ops.device(global_step.device), ops.name_scope(""):
    with ops.control_dependencies([dummy1]):
        update_op = self._opt.apply_gradients(aggregated_grads_and_vars, global_step)

# Clear all the gradients queues in case there are stale gradients.
clear_queue_ops = []
with ops.control_dependencies([update_op]):
    with ops.control_dependencies([dummy2]):
        for queue, dev in self._one_element_queue_list:
            with ops.device(dev):
                stale_grads = queue.dequeue_many(queue.size())
                clear_queue_ops.append(stale_grads)
I understand that apply_gradients_op is the train_op returned by sync_replicas_optimizer.apply_gradients, and that it is the op that dequeues a token (global_step) from the sync_queue managed by the chief worker's chief_queue_runner, so that a replica can exit the current batch and start a new one.
In theory, apply_gradients_op should take some time, since a replica has to wait before it can dequeue the token (global_step) from the sync_queue. But in the printed results for one replica, the time difference for executing apply_gradients_op is quite short (~1/1000 sec), and sometimes the print output is nondeterministic (especially for the chief worker). Here is a snippet of the output on the workers (I am running 2 workers and 1 PS):
chief worker (worker 0) output
worker 1 output
My questions are:
1) How can I record the time TensorFlow takes to execute an op (such as train_op, apply_gradients_op, compute_gradients_op, etc.)?
2) Is this the right direction to go, given that my ultimate goal is to record the elapsed time for executing certain operations (such as the difference between the time a replica finishes computing gradients and the time it gets the global_step from the sync queue)?
3) If this is not the way to go, please guide me with some insights about possible ways I could achieve my ultimate goal.
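For reference, regarding question 3: a minimal sketch of TensorFlow's built-in step tracing, which records a start timestamp and duration for every op in a step, including the Send/Recv ops that carry parameters between the PS and workers. This assumes a standard sess.run training loop where sess and train_op already exist; it is one possible route, not a verified fix:
from tensorflow.python.client import timeline

# Trace a single training step; step_stats will hold per-op timestamps.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Dump a Chrome-trace JSON (inspect it in chrome://tracing).
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())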
Thank you so much for reading my long post; I have spent weeks working on this!

Related

How to Fully Utilize CPU cores for skopt.forest_minimize

So I have the following code for running skopt.forest_minimize(), but the biggest challenge I am facing right now is that it takes upwards of days to finish running even just 2 iterations.
import json
import skopt

SPACE = [skopt.space.Integer(4, max_neighbour, name='n_neighbors', prior='log-uniform'),
         skopt.space.Integer(6, 10, name='nr_cubes', prior='uniform'),
         skopt.space.Categorical(overlap_cat, name='overlap_perc')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    score, scomp = tune_clustering(X_cont=X_cont, df=df, pl_brewer=pl_brewer, **params)
    if score == 0:
        print('saving new scomp')
        with open(scomp_file, 'w') as filehandle:
            json.dump(scomp, filehandle, default=json_default)
    return score

results = skopt.forest_minimize(objective, SPACE, n_calls=1, n_initial_points=1, callback=[scoring])
Is it possible to optimize this code so that it computes faster? I noticed that it barely makes use of my CPU; the highest CPU utilization is about 30% (it's a 9th-gen i7 with 8 cores).
Also, a question while I'm at it: is it possible to utilize a GPU for these computational tasks? I have a 3050 that I can use.
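For what it's worth, forest_minimize accepts an n_jobs argument that is forwarded to the underlying tree ensemble, so the surrogate-model fitting can use all cores. This does not parallelize the objective itself, so if most of the time is spent inside tune_clustering it will not help much. A minimal sketch, reusing the objective and SPACE defined above:
results = skopt.forest_minimize(
    objective, SPACE,
    n_calls=20,          # give the optimizer more than 1-2 evaluations to learn from
    n_initial_points=5,
    n_jobs=-1,           # fit the surrogate forest on all available cores
    callback=[scoring],
)
As for the GPU question: scikit-optimize is built on scikit-learn's CPU-only estimators, so forest_minimize itself will not use the 3050.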

(Wald)Test for comparing two nested feols/plm models

I want to test if
model_1 <- feols(auth ~ dummy_past1 + dummy_past2 + dummy_past3
                 | region + date_f,   # fixed effects
                 data = df)
is just as good as
model_2 <- feols(auth ~ i(dummy_past1, specific_regions) + i(dummy_past2, specific_regions) + i(dummy_past3, specific_regions)   # interactions (dummies * specific regions)
                 | region + date_f,   # fixed effects
                 data = df)
If this were a normal linear regression I would conduct an lrtest (likelihood ratio test). As I have heard that a Wald test can also be applied to panel data, I ran waldtest(model_1, model_2). Yet when doing that I get the error: "Error in solve.default(vc[ovar, ovar]) : system is computationally singular: reciprocal condition number = 2.10749e-44".
Does someone know if doing a Wald test is the correct approach here, or if there are any other tests that I could do on plm or feols regressions? It would also be very helpful if you have ideas on how I could get the Wald test working.

Confusion About Implementing LeafSystem With Vector Output Port Correctly

I'm a student teaching myself Drake, specifically pydrake, with Dr. Russ Tedrake's excellent Underactuated Robotics course. I am trying to write a combined energy shaping and LQR controller for keeping a cartpole system balanced upright. I based the diagram on the cartpole example found in Chapter 3 of Underactuated Robotics [http://underactuated.mit.edu/acrobot.html], and the SwingUpAndBalanceController on Chapter 2 [http://underactuated.mit.edu/pend.html].
I have found that, due to my use of the cart_pole.sdf model, I have to create an abstract input port to receive a FramePoseVector from cart_pole.get_output_port(0). From there I know that I have to create a control signal output of type BasicVector to feed into a Saturation block before feeding into the cartpole's actuation port.
The problem I'm encountering right now is that I'm not sure how to get the system's current state data in the DeclareVectorOutputPort's callback function. I was under the assumption that I would use the LeafContext parameter in the callback function, OutputControlSignal, to obtain the BasicVector continuous state vector. However, the resulting vector, x_bar, is always NaN. Out of desperation (and to test that the rest of my program worked) I set x_bar from the controller's initialization cart_pole_context and found that the simulation runs with a control signal of 0.0 (as expected). I can also set the output to 100 and the cartpole simulation just flies off into endless space (as expected).
TL;DR: What is the proper way to obtain the continuous state vector in a custom controller extending LeafSystem with a DeclareVectorOutputPort?
Thank you for any help! I really appreciate it :) I've been teaching myself, so it's been a little arduous haha.
# Combined Energy Shaping (SwingUp) and LQR (Balance) Controller
# with a simple state machine
class SwingUpAndBalanceController(LeafSystem):

    def __init__(self, cart_pole, cart_pole_context, input_i, output_i, Q, R, x_star):
        LeafSystem.__init__(self)
        self.DeclareAbstractInputPort("state_input", AbstractValue.Make(FramePoseVector()))
        self.DeclareVectorOutputPort("control_signal", BasicVector(1),
                                     self.OutputControlSignal)
        (self.K, self.S) = BalancingLQRCtrlr(cart_pole, cart_pole_context,
                                             input_i, output_i, Q, R, x_star).get_LQR_matrices()
        (self.A, self.B, self.C, self.D) = BalancingLQRCtrlr(cart_pole, cart_pole_context,
                                                             input_i, output_i,
                                                             Q, R, x_star).get_lin_matrices()
        self.energy_shaping = EnergyShapingCtrlr(cart_pole, x_star)
        self.energy_shaping_context = self.energy_shaping.CreateDefaultContext()
        self.cart_pole_context = cart_pole_context

    def OutputControlSignal(self, context, output):
        #xbar = copy(self.cart_pole_context.get_continuous_state_vector())
        xbar = copy(context.get_continuous_state_vector())
        xbar_ = np.array([xbar[0], xbar[1], xbar[2], xbar[3]])
        xbar_[1] = wrap_to(xbar_[1], 0, 2.0*np.pi) - np.pi

        # If x'Sx <= 2, then use the LQR ctrlr. Cost-to-go J_star = x^T * S * x
        threshold = np.array([2.0])
        if (xbar_.dot(self.S.dot(xbar_)) < 2.0):
            #output[:] = -self.K.dot(xbar_)  # u = -Kx
            output.set_value(-self.K.dot(xbar_))
        else:
            self.energy_shaping.get_input_port(0).FixValue(self.energy_shaping_context,
                                                           self.cart_pole_context.get_continuous_state_vector())
            output_val = self.energy_shaping.get_output_port(0).Eval(self.energy_shaping_context)
            output.set_value(output_val)
        print(output)
Here are two things that might help:
If you want to get the state of the cart-pole from MultibodyPlant, you probably want to connect to the continuous_state output port, which gives you a normal vector instead of the abstract-typed FramePoseVector. In that case, your call to get_input_port().Eval(context) should work just fine.
If you really do want to read the FramePoseVector, then you have to evaluate the input port slightly differently. You can find an example of that here.
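To make the first suggestion concrete, a minimal sketch of a controller that takes the plant state through a plain vector input port (a hypothetical class, assuming a MultibodyPlant cart-pole with 4 states and 1 actuator; the names are illustrative, not from the original post):
from pydrake.systems.framework import BasicVector, LeafSystem

class VectorStateController(LeafSystem):
    def __init__(self, K):
        LeafSystem.__init__(self)
        self.K = K  # LQR gain, shape (1, 4) for the cart-pole
        # Plain vector input port sized to the plant state [x, theta, xdot, thetadot].
        self.state_port = self.DeclareVectorInputPort("state", BasicVector(4))
        self.DeclareVectorOutputPort("control", BasicVector(1), self.CalcControl)

    def CalcControl(self, context, output):
        x = self.state_port.Eval(context)  # current plant state as a numpy array
        output.set_value(-self.K.dot(x))   # u = -Kx

# Wiring (builder/plant names are assumptions):
# builder.Connect(plant.get_state_output_port(), controller.get_input_port(0))
# builder.Connect(controller.get_output_port(0), plant.get_actuation_input_port())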

PuLP solvers do not respond to options fed to them

So I've got a fairly large optimization problem and I'm trying to solve it within a sensible amount of time.
I've set it up as:
import pulp as pl

my_problem = pl.LpProblem("My problem", pl.LpMinimize)
# write to problem file
my_problem.writeLP("MyProblem.lp")
And then, alternatively, each of:
solver = pl.CPLEX_CMD(timeLimit=1, gapRel=0.1)
status = my_problem.solve(solver)

solver = pl.apis.CPLEX_CMD(timeLimit=1, gapRel=0.1)
status = my_problem.solve(solver)

path_to_cplex = r'C:\Program Files\IBM\ILOG\CPLEX_Studio1210\cplex\bin\x64_win64\cplex.exe'  # and yes this is the actual path on my machine
solver = pl.apis.cplex_api.CPLEX_CMD(timeLimit=1, gapRel=0.1, path=path_to_cplex)
status = my_problem.solve(solver)
It runs in each case.
However, the solver does not respond to the timeLimit or gapRel instructions.
If I use timelimit it does warn that this is deprecated in favor of timeLimit. Same for fracgap: it tells me I should use gapRel. So somehow I am talking to the solver.
However, no matter what values I pick for timeLimit and gapRel, it always returns the exact same answer and takes the exact same amount of time (several minutes).
Also, I have tried alternative solvers, and I cannot get any of them to accept their variants of time limits or optimization gaps.
In each case, the problem solves and returns a status: optimal message. But it just ignores the time limit and gap instructions.
Any ideas?
Out of the zoo example:
import pulp
import cplex

bus_problem = pulp.LpProblem("bus", pulp.LpMinimize)
nbBus40 = pulp.LpVariable('nbBus40', lowBound=0, cat='Integer')
nbBus30 = pulp.LpVariable('nbBus30', lowBound=0, cat='Integer')

# Objective function
bus_problem += 500 * nbBus40 + 400 * nbBus30, "cost"
# Constraints
bus_problem += 40 * nbBus40 + 30 * nbBus30 >= 300

solver = pulp.CPLEX_CMD(options=['set timelimit 40'])
bus_problem.solve(solver)

print(pulp.LpStatus[bus_problem.status])
for variable in bus_problem.variables():
    print("{} = {}".format(variable.name, variable.varValue))
Correct way to pass a solver option as a dictionary:
pulp.CPLEX_CMD(options={'timelimit': 40})
@AlexFleischer has it correct with pulp.CPLEX_CMD(options=['set timelimit 40']). This also works for CBC using the following syntax:
prob.solve(COIN_CMD(options=['sec 60', 'Presolve More', 'Multiple 15', 'Node DownFewest', 'HEUR on', 'Round On', 'PreProcess Aggregate', 'PassP 10', 'PassF 40', 'Strong 10', 'Cuts On', 'Gomory On', 'CutD -1', 'Branch On', 'Idiot -1', 'sprint -1', 'Reduce On', 'Two On'], msg=True))
It is important to understand that the parameters, and the options associated with them, are specific to each solver. PuLP seems to call CBC via the command line, so an investigation of CBC's command-line options is required. Hope that helps.
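For reference, on recent PuLP versions (2.x) the constructor keywords are translated into the matching command-line options for you. A small self-contained sketch to check that the limits actually reach the solver, using the bundled CBC solver since it needs no separate install:
import pulp as pl

prob = pl.LpProblem("bus", pl.LpMinimize)
nbBus40 = pl.LpVariable('nbBus40', lowBound=0, cat='Integer')
nbBus30 = pl.LpVariable('nbBus30', lowBound=0, cat='Integer')
prob += 500 * nbBus40 + 400 * nbBus30, "cost"
prob += 40 * nbBus40 + 30 * nbBus30 >= 300

# On PuLP 2.x these keywords are forwarded to CBC as its native time/gap
# options; msg=True prints the solver log so you can confirm they arrived.
solver = pl.PULP_CBC_CMD(timeLimit=60, gapRel=0.1, msg=True)
prob.solve(solver)
print(pl.LpStatus[prob.status])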

Solving multiple CPLEX models in parallel

I have some docplex models for which I need to populate solution pools at the same time. All the models have a lazy constraint callback. My problem is that when I start solving these models at the same time, by running them in different consoles, their runtime increases: a single model can populate in 200 seconds, but when I start solving 3 different models at the same time, the runtime for that model becomes 2000 seconds. Assuming that I have enough CPU and memory, why is this happening, and how can I avoid it and get the lower runtime?
You could try to limit each solve to 1 thread by using the threads parameter.
As an example
from docplex.mp.model import Model

mdl = Model(name='buses')
mdl.parameters.threads = 1

nbbus40 = mdl.integer_var(name='nbBus40')
nbbus30 = mdl.integer_var(name='nbBus30')
mdl.add_constraint(nbbus40*40 + nbbus30*30 >= 300, 'kids')
mdl.minimize(nbbus40*500 + nbbus30*400)
mdl.solve()

print("threads = ", mdl.parameters.threads.get())
for v in mdl.iter_integer_vars():
    print(v, " = ", v.solution_value)
where I changed
https://github.com/AlexFleischerParis/zoodocplex/blob/master/zoosettings.py
to show how to set the threads parameter.
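To make the suggestion concrete, a minimal sketch of running several independent single-threaded solves in parallel from one script (the toy model stands in for your real models, and the process-based parallelism is an assumption, not part of the original answer):
from multiprocessing import Pool
from docplex.mp.model import Model

def solve_one(tag):
    mdl = Model(name='buses_%s' % tag)
    mdl.parameters.threads = 1  # pin each solve to a single CPLEX thread
    nbbus40 = mdl.integer_var(name='nbBus40')
    nbbus30 = mdl.integer_var(name='nbBus30')
    mdl.add_constraint(nbbus40*40 + nbbus30*30 >= 300, 'kids')
    mdl.minimize(nbbus40*500 + nbbus30*400)
    mdl.solve()
    return tag, mdl.objective_value

if __name__ == '__main__':
    # Three models solved concurrently, each limited to one thread,
    # so they do not compete for the same cores.
    with Pool(processes=3) as pool:
        for tag, obj in pool.map(solve_one, range(3)):
            print('model', tag, 'objective =', obj)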
