I'm rewriting my to code to use tf.estimator.Estimator as an encapsulating object for my models.
The problem is :
I don't see how typical input pipeline fits into the picture.
My input pipeline use queues which are coordianted by tf.train.Coordinator.
To satisify tf.estimator.Estimator requirements i create all the "input graph" in init_fn function that is passed to estimator when calling:
Estimator.train(...)
It looks like this
input_fn(f):
...create input graph...
qr = tf.train.QueueRunner(queue, [operations...])
tf.train.add_queue_runner(qr)
The problem is: in such scenario how can I start and stop queue runners, respectivly at the start and beginning of the Estimator.train(...)?
Starting
I figured out for starting the queues I can pass and init_fn that does it to scaffold object passed to Estimator.
However how to join threads and close them gracefully - this I do not know.
Is there reference architecture for proper threaded input pipeline when using tf.estimator.?
Is Estimator class even ready to work with queues?
Estimator uses tf.train.MonitoredTrainingSession which handles starting and joining threads. You can check a couple example input-fns, such as
tf.estimator.inputs.*, tf.contrib.learn.io.read*
Related
I'm writing some platform for run trainigs on the server and I need some option to halt the training process via API. It's mean when I receive somr request to REST controller, the main threads need to stop the training that can take some days time long.
I see that Tensorflow have Coordinator class and EarlyStopping callback, but I don't see nothing that can stop thetraining by demand.
Something like: model.stop()
Yes you can but it is a bit different.
You can't tell model to stop but the model can ask you if it should stop. That is done using a callback.
Here is a simple thing that you could do:
Implement your callback
documentation here. Your callback could check for, let's say, a file in a folder. Just an example "../.../stop_training.txt"
Add your callback to your model with the event you want to use.
Create an API called "https://..../stop-training" which just creates that stop_training.txt file.
When I launch my main script on the cluster with ddp mode (2 GPU's), Pytorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training logic, which I would like to handle myself. E.g. do something (once!) after Trainer.fit(). But with the duplication of the main script, this doesn't work as I intend. I also tried to wrap it in if __name__ == "__main__", but it doesn't change behavior. How could one solve this problem? Or, how can I use some logic around my Trainer object, without the duplicates?
I have since moved on to use the native "ddp" with multiprocessing in PyTorch. As far as I understand, PytorchLightning (PTL) is just running your main script multiple times on multiple GPU's. This is fine if you only want to fit your model in one call of your script. However, a huge drawback in my opinion is the lost flexibility during the training process. The only way of interacting with your experiment is through these (badly documented) callbacks. Honestly, it is much more flexible and convenient to use native multiprocessing in PyTorch. In the end it was so much faster and easier to implement, plus you don't have to search for ages through PTL documentation to achieve simple things.
I think PTL is going in a good direction with removing much of the boiler plate, however, in my opinion, the Trainer concept needs some serious rework. It is too closed in my opinion and violates PTL's own concept of "reorganizing PyTorch code, keep native PyTorch code".
If you want to use PTL for easy multi GPU training, I personally would strongly suggest to refrain from using it, for me it was a waste of time, better learn native PyTorch multiprocessing.
Asked this at the GitHub repo: https://github.com/PyTorchLightning/pytorch-lightning/issues/8563
There are different accelerators for training, and while DDP (DistributedDataParallel) runs the script once per GPU, ddp_spawn and dp doesn't.
However, certain plugins like DeepSpeedPlugin are built on DDP, so changing the accelerator doesn't stop the main script from running multiple times.
You could quit the duplicated sub-processes by putting the following code after Trainer.fit:
import sys
if model.global_rank != 0:
sys.exit(0)
where model is inherited from LightningModule, which has a property global_rank specifying the rank of the machine. We could roughly understand it as the gpu id or the process id. Everything after this code will only be executed in the main process, i.e., process with global_rank = 0.
For more information, please refer the documentation https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#global_rank
Use global variables:
if __name__ == "__main__":
is_primary = os.environ.get(IS_PTL_PRIMARY) is None
os.environ[IS_PTL_PRIMARY] = "yes"
## code to run on each GPU
if is_primary:
## code to run only once
From Pytorch Lightning Official Document on DDP, we know that PL intendedly call the main script multiple times to spin off the child processes that take charge of GPUs:
They used the environment variable "LOCAL_RANK" and "NODE_RANK" to denote GPUs. So we can add conditions to bypass the code blocks that we don't want to get executed repeatedly. For example:
import os
if __name__ == "__main__":
if 'LOCAL_RANK' not in os.environ.keys() and 'NODE_RANK' not in os.environ.keys():
# code you only want to run once
I am using direct-Kafka-input-stream in my Spark app. When I use window(...) function in the chain it will cause the processing pipeline to stop - when I open the Spark-UI I can see that the streaming batches are being queued and the pipeline reports to process one of the first batches.
Derivations of window(..) function - like reduceByKeyAndWindow(..), etc. works as expected - pipeline doesn't stop. The same applies when using different type of stream.
Is it some known limitation of window(..) function when used with direct-Kafka-input-stream ?
Thanks
Martin
Java pseudo code:
org.apache.spark.streaming.kafka.DirectKafkaInputDStream s;
s.window(Durations.seconds(10)).print(); // the pipeline will stop
To be more correct: the issue happens only when the windows overlap (if sliding_interval < window_length). Otherwise the system behaves as expected.
Greetings SO denizens!
I'm trying to architect an overhaul of an existing NodeJS application that has outgrown its original design. The solutions I'm working towards are well beyond my experience.
The system has ~50 unique async tasks defined as various finite state machines which it knows how to perform. Each task has a required set of parameters to begin execution which may be supplied by interactive prompts, a database or from the results of a previously completed async task.
I have a UI where the user may define a directed graph ("the flow"), specifying which tasks they want to run and the order they want to execute them in with additional properties associated with both the vertices and edges such as extra conditionals to evaluate before calling a child task(s). This information is stored in a third normal form PostgreSQL database as a "parent + child + property value" configuration which seems to work fairly well.
Because of the sheer number of permutations, conditionals and absurd number of possible points of failure I'm leaning towards expressing "the flow" as a state machine. I merely have just enough knowledge of graph theory and state machines to implement them but practically zero background.
I think what I'm trying to accomplish is at the flow run time after user input for the root services have been received, is somehow compile the database representation of the graph + properties into a state machine of some variety.
To further complicate the matter in the near future I would like to be able to "pause" a flow, save its state to memory, load it on another worker some time in the future and resume execution.
I think I'm close to a viable solution but if one of you kind souls would take mercy on a blind fool and point me in the right direction I'd be forever in your debt.
I solved similar problem few years ago as my bachelor and diploma thesis. I designed a Cascade, an executable structure which forms growing acyclic oriented graph. You can read about it in my paper "Self-generating Programs – Cascade of the Blocks".
The basic idea is, that each block has inputs and outputs. Initially some blocks are inserted into the cascade and inputs are connected to outputs of other blocks to form an acyclic graph. When a block is executed, it reads its inputs (cascade will pass values from connected outputs) and then the block sets its outputs. It can also insert additional blocks into the cascade and connect its inputs to outputs of already present blocks. This should be equal to your task starting another task and passing some parameters to it. Alternative to setting output to an value is forwarding a value from another output (in your case waiting for a result of some other task, so it is possible to launch helper sub-tasks).
I have written a nice parallel job processor that accepts jobs (functions, their arguments, timeout information etc.) and submits then to a Python multiprocessing pool. I can provide the full (long) code if requested, but the key step (as I see it) is the asynchronous application to the pool:
job.resultGetter = self.pool.apply_async(
func = job.workFunction,
kwds = job.workFunctionKeywordArguments
)
I am trying to use this parallel job processor with a large body of legacy code and, perhaps naturally, have run into pickling problems:
PicklingError: Can’t pickle <type ’instancemethod’>: attribute lookup builtin .instancemethod failed
This type of problem is observable when I try to submit a problematic object as an argument for a work function. The real problem is that this is legacy code and I am advised that I can make only very minor changes to it. So... is there some clever trick or simple modification I can make somewhere that could allow my parallel job processor code to cope with these traditionally unpicklable objects? I have total control over the parallel job processor code, so I am open to, say, wrapping every submitted function in another function. For the legacy code, I should be able to add the occasional small method to objects, but that's about it. Is there some clever approach to this type of problem?
use dill and pathos.multiprocessing instead of pickle and multiprocessing.
see here:
What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
How to pickle functions/classes defined in __main__ (python)