Memory Leak in TensorFlow Google Cloud ML Training

I've been trying the TensorFlow tutorial scripts on Google Cloud ML.
In particular I've used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.
When I run this training script in Google Cloud ML, there is a memory leak of around 0.5% per hour.
I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.
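For reference, the packaging is just the standard setuptools layout described in that guide; a minimal sketch of the setup.py is below (the version string and description are illustrative, not anything special to my setup):
# setup.py -- minimal sketch of the trainer package layout the packaging guide
# describes; the version string and description are placeholders.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    description="CIFAR-10 multi-GPU training package for Cloud ML Engine",
)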
If I run locally (i.e. not in Google Cloud) and use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved.
However, I do not have this option with Google Cloud ML.
What could be causing the leak, and what can I do to fix this? Why aren't other users noticing the same problem?
Although the leak is small, it is big enough to cause my training sessions to run out of memory and fail when I run against my own data for several days.
The leak happens regardless of the number of GPUs I use.
The gcloud command I used is:
gcloud ml-engine jobs submit training cifar10_job --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.cifar10_multi_gpu_train --region europe-west1 --staging-bucket gs://tfoutput --scale-tier CUSTOM --config config.yml --runtime-version 1.0 -- --num_gpus=4
The config file (config.yml) is:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
Any help appreciated,
thanks.

We recommend using this version of the code:
github.com/tensorflow/models/pull/1538
It has performance benefits (by running for less time, you're less prone to OOMs).
That, of course, may not be the permanent fix; however, according to our testing, TensorFlow 1.2 appears to address the issue. TensorFlow 1.2 will be available soon on Cloud ML Engine. If you continue to have problems, please let us know.

Related

Debugging an Out of Memory Error in Node.js

I'm currently working on a Node.js project and my server keeps running out of memory. It has happened 4 times in the last 2 weeks, usually after about 10,000 requests. This project is live and has real users.
I am using:
Node.js 16
Google Cloud Platform's App Engine (instances have 2048 MB of memory)
Express as my server framework
TypeORM as the database ORM (the database is Postgres hosted on a separate GCP Cloud SQL instance)
I have installed the GCP profiling tools and have captured the app running out of memory, but I'm not quite sure how to use the results. It almost looks like there is a memory leak in the _handleDataRow function within the pg client library. I am currently using version 8.8.0 of the library (8.9.0 was just released a few weeks ago and doesn't mention fixing any memory leaks in the release notes).
I'm a bit stuck with what I should do at this point.
Any suggestions or advice would be greatly appreciated! Thanks.
Update: I have also cross-posted to Reddit, and someone there helped me determine that the issue is related to large queries with many joins. I was able to reproduce the issue, and will report back here once I am able to solve it.
When using App Engine, a great place to start looking for why a problem occurred in your app is the Logs Explorer, particularly if you know the time frame in which the issues started escalating or when the crash occurred.
Based on your Memory Usage graph, though, it's a slow leak, so a top-to-bottom review of your back end is really necessary to pin down the culprit. I would go through the whole stack and look for things like globals that are set and never cleaned up, promises that are not being returned, and large result sets from the database that are bottlenecking the server, perhaps from a scheduled task.
Looking at the 2pm - 2:45pm range on the right-hand side of the graph, I would narrow the Logs Explorer down to that exact time frame. Then I would look for the processes or endpoints that are used most frequently in that window, as well as the ones taking the most memory, to get a good starting point.

Why does Azure ML Studio (classic) take additional time to execute Python Scripts?

I have been working with ML Studio (classic) and am facing a problem with "Execute Python Script" modules. I have noticed that each module spends additional time on some internal tasks before it starts executing the actual Python code. This adds 40-60 seconds per module, which aggregates to a delay of 400-500 seconds per execution when the experiment is consumed through the Batch Execution Service or run manually. (I have multiple "Execute Python Script" modules.)
For instance, if I run some code on my local system and it takes 2-3 seconds, the same code takes 50-60 seconds in Azure ML Studio.
Can you please help me understand the reason behind this, or suggest any optimization that can be done?
Regards,
Anant
The known limitations of Machine Learning Studio (classic) are:
The Python runtime is sandboxed and does not allow access to the network or to the local file system in a persistent manner.
All files saved locally are isolated and deleted once the module finishes. The Python code cannot access most directories on the machine it runs on, the exception being the current directory and its subdirectories.
When you provide a zipped file as a resource, the files are copied from your workspace to the experiment execution space, unpacked, and then used. Copying and unpacking resources can consume memory.
The module can output a single data frame. It's not possible to return arbitrary Python objects such as trained models directly back to the Studio (classic) runtime. However, you can write objects to storage or to the workspace. Another option is to use pickle to serialize multiple objects into a byte array and then return the array inside a data frame.
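A rough sketch of that last pickle-based approach follows. azureml_main is the entry point that the Execute Python Script module calls; the base64 step, the sklearn model, and the column names are my own illustrative choices, not anything prescribed by the docs.
# Minimal sketch of the pickle-into-a-dataframe trick for an "Execute Python
# Script" module. The base64 encoding is an extra precaution I'm assuming helps
# the bytes survive the table serialization between modules.
import base64
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression


def azureml_main(dataframe1=None, dataframe2=None):
    # Train something (placeholder model and placeholder column names).
    model = LogisticRegression().fit(dataframe1[["x1", "x2"]], dataframe1["y"])

    # Serialize one or more objects into a single byte string, then encode it
    # as text so it fits cleanly into a one-row, one-column data frame.
    payload = pickle.dumps({"model": model, "columns": ["x1", "x2"]})
    encoded = base64.b64encode(payload).decode("ascii")

    return pd.DataFrame({"serialized": [encoded]}),


# In a downstream module, the objects can be restored like this:
# payload = base64.b64decode(dataframe1["serialized"][0])
# objects = pickle.loads(payload)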
Hope this helps!

Azure Machine Learning Studio : Deployed Service gives OutOfMemoryLimit exception

Using Azure Machine Learning Studio I have published a few models that are having OutOfMemoryLimit issues in the deployed services (not during training).
The model type I am using is the "multiclass decision forest", and I do create some decent-sized forests: stored to a blob they take up to around 150 MB. Closing in on 150 MB I get OOM every time; at around 140 MB maybe 1 request in 10 fails, and even at 120 MB it happens now and then.
The thing is, it runs fine in the studio, and when deployed as a service it is not very consistent about when it throws exceptions. I can run requests against a service and get a reply in 9 out of 10 cases, but in the remaining 10% of cases I will get an exception that looks like this:
{
  "error": {
    "code": "MemoryQuotaViolation",
    "message": "The model had exceeded the memory quota assigned to it.",
    "details": [
      {
        "code": "OutOfMemoryLimit",
        "message": "The model consumed more memory than was appropriated for it. Maximum allowed memory for the model is 2560 MB. Please check your model for issues."
      }
    ]
  }
}
Now I do run this as a request-response, as opposed to a batch job, and I suspect it might do just fine as a batch job. The reason for R-R is that I do need these data in real time and batch jobs are simply too slow.
I suspect the "right" approach is to further handicap my forest by reducing tree count or increasing leaf node sizes, but obviously this will reduce the model accuracy (further). Before I do so I am looking for some advice around:
Is it possible to pay for more than the 2.5 GB limit for the Azure ML SaaS? (If not, when is that coming?)
Is there any way to test whether a deployed model will break this limit before actually deploying it? We are trying to run retraining automatically, and this reduces our reliability drastically.
Any other advice on what to try/test/think of?
Thanks in advance!

Why does Keras write multiple tensorboard logs for a single .fit run?

I was running a convnet model using Keras with the TensorFlow backend on Google Cloud, using the TensorBoard callback to save a tfevents log of the training history. While monitoring the learning curve I noticed that halfway through training (the learning curve was on a plateau), a new tfevents log was saved to disk, and TensorBoard's learning curve graph showed that training had been reset to epoch #1 with val_loss also starting over from scratch.
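The callback setup was nothing unusual, roughly the following (model construction and data loading omitted; the log directory is a placeholder):
# Roughly the training setup in question (model and data construction omitted;
# the log directory below is a placeholder). The TensorBoard callback writes a
# tfevents file to log_dir for each run of fit().
from keras.callbacks import TensorBoard

tensorboard_cb = TensorBoard(log_dir="gs://my-bucket/logs/run1")
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[tensorboard_cb])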
This is really weird. Does anyone know what is going on here? Under what circumstances would Keras automatically restart the training and save a new tfevents log?
It turned out this issue only happened when I ran my code on Google Cloud, not on my local machine. The actual cause, as confirmed by Google engineers, was Google's cloud maintenance, not Keras! Google Compute Engine (GCE) instances would occasionally be shut down for maintenance without any warning or prior notification (this was also not documented at the time of this answer). The maintenance would cause the training instance to restart from scratch, generating a new tfevents log and resetting all previous progress.
The solution is to save checkpoints frequently, load the previous model if it exists, and resume training after the restart. Note that when using GCE the checkpoints have to be saved to Google Cloud Storage (GCS), for example with a custom Lambda callback in Keras; otherwise your checkpoints will be gone with the shutdown.
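A minimal sketch of that checkpoint-and-resume pattern, assuming tf.keras and tf.io.gfile are available (TF 1.x users would reach for tf.gfile / file_io instead); the bucket paths, build_model(), and the training data are placeholders:
# Resume-from-checkpoint sketch: save locally each epoch, copy to GCS with a
# LambdaCallback, and reload from GCS if an interrupted run left a checkpoint.
import tensorflow as tf
from tensorflow import keras

GCS_CKPT = "gs://my-bucket/checkpoints/model.h5"   # placeholder bucket path
LOCAL_CKPT = "/tmp/model.h5"

# Resume from the last checkpoint if an earlier (possibly interrupted) run left one.
if tf.io.gfile.exists(GCS_CKPT):
    tf.io.gfile.copy(GCS_CKPT, LOCAL_CKPT, overwrite=True)
    model = keras.models.load_model(LOCAL_CKPT)
else:
    model = build_model()  # placeholder for your own model-construction code

def copy_ckpt_to_gcs(epoch, logs):
    # Push the freshly written local checkpoint to GCS so it survives a VM restart.
    tf.io.gfile.copy(LOCAL_CKPT, GCS_CKPT, overwrite=True)

callbacks = [
    keras.callbacks.ModelCheckpoint(LOCAL_CKPT),
    keras.callbacks.LambdaCallback(on_epoch_end=copy_ckpt_to_gcs),
    keras.callbacks.TensorBoard(log_dir="gs://my-bucket/logs"),
]

model.fit(x_train, y_train, epochs=100, callbacks=callbacks)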

Is Apache Zeppelin stable enough to be used in production?

I am using an AWS EMR cluster. I have been experimenting with Spark drivers and the Apache Zeppelin REST APIs to run jobs. I have run several hundred ad hoc jobs with Zeppelin and didn't have any concerns. Given that, I am considering using the Zeppelin REST APIs in production; jobs would be submitted through the REST API, roughly as sketched below.
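Submission itself would be a simple HTTP call, along these lines (the host, note ID, and response handling are placeholders; the exact endpoints should be checked against the REST API docs for your Zeppelin version):
# Rough sketch of submitting a notebook job through Zeppelin's REST API.
import requests

ZEPPELIN = "http://zeppelin-host:8080"   # placeholder Zeppelin endpoint
NOTE_ID = "2ABCDEFGH"                    # placeholder note ID

# Run all paragraphs in the note asynchronously.
resp = requests.post(f"{ZEPPELIN}/api/notebook/job/{NOTE_ID}")
resp.raise_for_status()

# Check paragraph statuses (poll until everything is FINISHED or ERROR).
status = requests.get(f"{ZEPPELIN}/api/notebook/job/{NOTE_ID}").json()
for paragraph in status.get("body", []):
    print(paragraph.get("id"), paragraph.get("status"))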
Has anyone experienced stability issues with Zeppelin in Production?
I have Zeppelin running in production in a multi-user environment (+/- 15 users) and it hasn't been very stable. To make it more stable I now run Zeppelin on its own node, no longer on the master node.
Anyway, I found the following problems:
In releases before 0.7.2, Zeppelin created a lot of zombie processes, which caused memory problems after heavy usage.
User libraries can break Zeppelin; this was the case in versions prior to 0.7.0. For example, Jackson libraries made Zeppelin unable to communicate with the Spark interpreter. In 0.7.0 and up this problem has been mitigated.
There are random freezes when there are a lot of users. The only way to fix this is to restart the service. (All versions)
Sometimes when a user starts their interpreter and the local repo is empty, Zeppelin doesn't download all the libraries specified in the interpreter config. It then won't download them again; the only way to mitigate this is to delete the contents of the interpreter's local repo. (All versions)
Sometimes changes to notebooks don't get saved, which causes users to lose code.
In version 0.6.0 Spark interpreters shared a context, which caused users to overwrite each other's variables.
Problems are difficult to debug; the logging is not that great yet. Some bugs seem to break the logging, and sometimes running an interpreter in debug mode fixes the problem.
So, I wouldn't put it in a production setting yet, where people depend on it. But for testing and data discovery it would be fine. Zeppelin is clearly still in a beta stage.
Also, don't run it on the master node; set up your own instance and let it connect remotely to the cluster. This makes it much more stable. Put it on a beefy node and restart it overnight.
Most of the bugs I encountered are already in Jira and the developers are working hard to make things better. Stability gets better with every release and I see the maintenance load going down with every version, so it certainly has potential.
I have used Zeppelin for more than a year now. It gets you going quickly when you are just starting, but it is not a good candidate for production use cases, especially with more than 10 users, and much depends on your cluster resources. These were my overall concerns with Zeppelin:
By default you can't have more than one job running at a time; you need to change the configuration to make that happen.
If you are loading additional libraries from S3 or external environments, you can only do that at start-up, or you will have to restart Zeppelin.
The Spark context is pre-created and there are only a few settings you can change.
The editor itself doesn't resize well when your output is large.
I am moving on to Jupyter for my use cases, which looks much stronger in my initial assessment.
As of the time of this answer (end of February 2019), my answer would be: no, plain and simple. Zeppelin keeps crashing, hanging, and becoming unresponsive; notebooks tend to become unloadable due to size errors; execution is very slow compared to Jupyter; and there are many limitations around integrating third-party display engines (although much effort has been made towards this).
I experienced these issues on a decently sized and well-resourced cluster, with a single user. I would never, ever, advise it as a production tool, at least not as it is today, unless you have an admin at hand who can restart the whole thing regularly, track down and fix errors, and take charge of integration.
After struggling to stabilize Zeppelin for weeks, we moved back to Jupyter, and everything worked smoothly out of the box from day one.
