Segmentation fault: 11 after back-porting TensorFlow script from Python 3 to Python 2 - python-3.x

After having created a TensorFlow 1.4 model for Python 3, I have now found that Google Cloud ML Engine currently only has support for Python 2.7.
Back-porting my Python 3 code at first seemed simple enough: Some scripts still work as expected when I replace their shebang #!/usr/bin/env python3 with #!/usr/bin/env python. python -V reports 2.7.10 in my (macOS) environment.
Yet one script does not react so gracefully. When I run it now, it produces a Segmentation fault: 11 without any previous warnings or other diagnostic output.
How can I find out the root cause, so that I know what else to change to make that script palatable to Python 2 as well?
UPDATE The segmentation fault apparently occurs during a call to session.run(get_next), where get_next is obtained from a tf.data.Iterator as follows:
iterator = dataset.make_initializable_iterator()
get_next = iterator.get_next()
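For context, here is a minimal, self-contained sketch of that pattern under TF 1.x (the dataset contents are placeholders, not from the original script):
import tensorflow as tf

# Toy dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
iterator = dataset.make_initializable_iterator()
get_next = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            value = sess.run(get_next)  # the reported segfault occurs here
        except tf.errors.OutOfRangeError:
            break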

There are two issues here: one is about Python 3 support and the other is about the segfault.
Python 3 Support
CloudML Engine now supports Python 3, via the 'pythonVersion' field when submitting jobs (see the API reference docs).
If you are using gcloud you will need to create a config file like this (let's name it config.yaml):
trainingInput:
  pythonVersion: "3.5"
When you submit your job, point gcloud to that file, e.g.
gcloud ml-engine jobs submit training --config=config.yaml ...
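For reference, a fuller invocation might look like this (the job name, module path, region, and bucket are placeholders, not from the original answer):
gcloud ml-engine jobs submit training my_job \
    --config=config.yaml \
    --module-name=trainer.task \
    --package-path=trainer/ \
    --region=us-central1 \
    --staging-bucket=gs://my-bucket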
Segfault
This may be caused by running out of memory. Please check the memory usage in the console for that job. That said, if the job dies abruptly, memory usage at the time of failure may not be accurately reflected for that job.

Related

Running a Python script in Colab is very slow compared to running the same code directly in a Colab notebook

Recently I was trying to test a model that I had already trained. Initially I was using a Google Colab notebook to write the code because of its interactive features. Once I was done writing the code and was getting satisfactory results, it took around 2.5 hours to produce the final output. After that I wanted to transfer the notebook code to a .py script, which I did with a little bit of modification, saved it in Google Drive, and then ran it with the command !python test.py. Now it took more than 4.5 hours to get the final output. Can anyone explain why Colab takes so much more time when running the Python script from Google Drive compared to the same code run directly in the notebook?
I would add a time measurement around every step that I suspect is slow, and see which step in your whole program takes the time.
import time

a1 = time.time()
# your code step
print(time.time() - a1)
This will give you the time for each step and you can see which one is taking a long time.
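If there are many steps to time, a small helper avoids repeating that boilerplate. This is only a sketch; the timed name and labels are made up for illustration:
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print how long the wrapped block of code took.
    start = time.time()
    yield
    print(f"{label}: {time.time() - start:.2f}s")

# Usage:
with timed("read input from Drive"):
    pass  # replace with the step you suspect is slow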
Operations to check:
1. Object creation in loops
2. Read/write operations to Google Drive
Once you find the problem-causing piece of code, you may change it.
Well, it can be because Colab is retrieving the data from Google Drive and then might be writing back to Google Drive again, which will of course take time, I guess.
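If Drive I/O turns out to be the bottleneck, one common workaround (a sketch; the paths are assumptions) is to copy the data to the Colab VM's local disk once, work locally, and copy results back at the end:
import shutil

# One-time copy from the mounted Drive to the VM's fast local disk.
shutil.copy('/content/drive/MyDrive/data/input.bin', '/content/input.bin')

# ... run the heavy processing against /content/input.bin ...

# Copy the finished output back to Drive in a single write.
shutil.copy('/content/output.bin', '/content/drive/MyDrive/data/output.bin')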

Python too many subprocesses?

I'm trying to start a lot of Python processes on a single machine.
Here is a code snippet:
import subprocess

fout = open(path, 'w')
p = subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1)
After about 100 processes I'm getting the error below:
Running on Windows 10 64-bit, Python 3.5. I already tried splitting the start across two scripts, as well as adding a sleep command; after a certain number of processes, the error shows up. Any idea what that might be? Thanks a lot for any hint!
PS:
Some background. Each process opens database connections as well as does some requests using the requests package. Then some calculations are done using numpy, scipy etc.
PPS: I just discovered this error message (raised when calling scipy):
DLL load failed: The paging file is too small for this operation to complete.
The issue was solved by reinstalling numpy and scipy and installing MKL.
What was strange about this error is that it only appeared after a certain number of processes. I would love to hear if anybody knows why this happened!
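For what it's worth, a common way to keep the number of live processes bounded, in case resource exhaustion is a factor, is to launch in a sliding window. This is only a sketch; the names, counts, and paths are assumptions, not from the thread:
import subprocess

python_path = 'python'      # assumed interpreter path
module_name = 'worker.py'   # assumed worker script
MAX_LIVE = 50               # cap chosen for illustration

running = []
for i in range(200):
    fout = open('out_%d.txt' % i, 'w')
    p = subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1)
    running.append((p, fout))
    if len(running) >= MAX_LIVE:
        oldest, fh = running.pop(0)
        oldest.wait()  # block until the oldest process finishes
        fh.close()

# Drain the remaining processes.
for proc, fh in running:
    proc.wait()
    fh.close()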

Can we run TensorFlow Lite on Linux? Or is it for Android and iOS only?

Hi, is there any possibility to run TensorFlow Lite on a Linux platform? If yes, then how can we write code in Java/C++/Python to load and run models on Linux? I am familiar with Bazel and have successfully made Android and iOS applications using TensorFlow Lite.
I think the other answers are quite wrong.
Look, I'll tell you my experience... I've been working with Django for many years, and I've been using normal tensorflow, but there was a problem with having 4 or 5 or more models in the same project.
I don't know if you know Gunicorn + Nginx. Gunicorn spawns workers, and each worker preloads its own copy of every model: if you have 4 machine learning models and 3 workers, you will have 12 models preloaded in RAM. This is not efficient at all, because if the RAM overflows your project will go down, or at best the service responses get slower.
So this is where TensorFlow Lite comes in. Switching from a TensorFlow model to TensorFlow Lite makes things much more efficient; inference times are reduced dramatically.
Also, Django and Gunicorn can be configured so that the model is preloaded and compiled at startup. Then each API call only runs the prediction, which helps keep every call down to a fraction of a second.
Currently I have a project in production with 14 models and 9 workers; you can understand the magnitude of that in terms of RAM.
And even doing thousands of extra calculations outside of machine learning, the API call does not take more than 2 seconds.
Now, if I used normal TensorFlow, it would take at least 4 or 5 seconds.
In summary: if you can, use TensorFlow Lite. I use it daily on Windows, macOS, and Linux, and it is not necessary to use Docker at all. Just a Python file and that's it. If you have any doubt, you can ask me without any problem.
Here is an example project:
Django + Tensorflow Lite
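As a rough illustration of that setup, here is a sketch of preloading a TFLite interpreter once at module import time, so each Gunicorn worker loads the model a single time (the model path and function name are assumptions):
from tflite_runtime.interpreter import Interpreter  # or tf.lite.Interpreter

# Created once at import: each worker pays the model-loading cost a single time.
interpreter = Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict(input_data):
    # input_data: a numpy array matching input_details[0]['shape'] and dtype.
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]['index'])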
It's possible to run it (but it will work slower than the original TF).
Example:
import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors (graph_file is the .tflite path).
interpreter = tf.lite.Interpreter(model_path=graph_file)
interpreter.allocate_tensors()
# Get input and output tensor details.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Get quantization info to know the input type.
quantization = None
using_type = input_details[0]['dtype']
if using_type is np.uint8:
    quantization = input_details[0]['quantization']
# Get the input shape.
input_shape = input_details[0]['shape']
# Build a placeholder input tensor of zeros.
input_data = np.zeros(dtype=using_type, shape=input_shape)
# Set the input tensor, run, and get the output tensor.
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
I agree with Nouvellie. It is possible and worth the time to implement. I developed a model on my Ubuntu 18.04, 32-processor server and exported the model to tflite. The model ran in 178 secs on my Ubuntu server. On my Raspberry Pi 4 with 4 GB of memory, the tflite implementation ran in 85 secs, less than half the time of my server. When I installed tflite on my server, the run time went down to 22 secs, an 8-fold performance boost and now almost 4 times faster than the rpi4.
To install for python, I did not have to build the package but was able to use one of the prebuilt interpreters here:
https://www.tensorflow.org/lite/guide/python
I have Ubuntu 18.04 with Python 3.7.7, so I ran pip install with the Linux Python 3.7 package:
pip3 install https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-linux_x86_64.whl
Then import the package with:
from tflite_runtime.interpreter import Interpreter
Previous posts show how to use tflite.
From TensorFlow Lite:
TensorFlow Lite is TensorFlow's lightweight solution for mobile and embedded devices.
TensorFlow Lite is a fork of TensorFlow for embedded devices. For PC, just use the original TensorFlow.
From the TensorFlow GitHub:
TensorFlow is an open source software library
TensorFlow provides stable Python and C APIs, as well as APIs without backwards-compatibility guarantees for C++, Go, Java, JavaScript and Swift.
We support CPU and GPU packages on Linux, Mac, and Windows.
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2)
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
'Hello, TensorFlow!'
Yes, you can compile TensorFlow Lite to run on Linux platforms, even within a Docker container. See the demo: https://sconedocs.github.io/tensorflowlite/

Openai universe-starter-agent not training

I've been trying to run OpenAI's universe-starter-agent example found here. However, using an m4.16xlarge instance on AWS with 32 workers, the agent's training result doesn't improve after 0.6 hours (over 30 minutes), while the GitHub page states that "the agent is able to solve the same environment in 10 minutes".
The progress was monitored through TensorBoard. Note that the example given on the GitHub page uses 16 workers and converges to an episode reward of 21 within 30 minutes, while in this case, with double the number of workers and the same amount of training time, the reward doesn't improve. I also took a look at the log and there seems to be no compilation error. The command I used to run the script is:
python train.py --num-workers 32 --env-id PongDeterministic-v3 --log-dir /tmp/pong
The only thing I find a little dubious is that when running the script, the following error was shown but did not abort the run. The error reads: "failed to connect to server".
Has anyone else run the starter agent, and/or run into similar issue? If so, how did you solve it?
Thanks!
Problem solved: I downgraded TensorFlow from 1.0.0 to 0.11.0 and it trained as expected!

Running python scripts and obtaining variables in linux command line

I have a laptop with low RAM, but I need to work with whole-genome data of more than 1 GB. To this end, I connect to a supercomputer.
On a Windows machine, I run the code in IDLE or PyScripter, and when there are errors they are easy to identify because all the variables up to the error point are available and accessible. For example, with code like this:
genome_dict = {}
with open('genome.fa') as file:
    chromosome = parse(file)
    sequence = parse(file)
    genome_dict[chromosome[z]] = sequence[n][m]
if there is an error in parsing the chromosome and sequence variables, their values are accessible in IDLE. But on a Linux supercomputer, when an error occurs I cannot get at the variables to find out what the problem is, and I cannot simply print a variable because it is far too big to be printable.
My question is: is there any way to run a Python script on the Linux command line such that the variables generated while the script runs remain accessible after it has finished, with or without errors?
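One approach worth knowing (not mentioned in the thread): CPython's -i flag drops into an interactive prompt after the script finishes or crashes, with all module-level variables still alive, and pdb can then run a post-mortem on the failing frame. A sketch, with the script name assumed:
$ python -i genome_script.py
>>> len(genome_dict)          # module-level variables are still accessible
>>> import pdb; pdb.pm()      # after an exception: inspect locals at the crash site
(Pdb) p chromosome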
