pipe.to("cuda") not working with Stable Diffusion - PyTorch

Stable Diffusion inside a Jupyter notebook with CUDA 12.
Nvidia Studio driver on the host, Windows 11.
PyTorch in my other projects runs just fine, no problems with CUDA.
My JupyterLab sits inside a WSL Ubuntu.
My problem: I cannot run pipe.to("cuda") with Stable Diffusion, the image generator,
which I would like to run locally for faster generation.
# inside jupyterlab cell:
from huggingface_hub import notebook_login
notebook_login()
Although I enter my key hf_asfasfd... I cannot verify that the login is accepted,
but I guess that's normal? Kind of weird.
# inside the next cell:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
--> prints cuda
# in another cell:
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
I can see it downloaded the model, so the login was OK, I guess.
pipe = pipe.to("cuda")
The kernel times out. I have a 3080TX with 12 GB; GPU memory stays low and there is no indication the model loaded.
The fans of the card also don't spin up, which they should if it were working.
I also raised the timeout for JupyterLab to 10 minutes, with no effect.
The timeout doesn't show any error in the console or on the web page.
# cell that doesn't get executed:
prompt = "a photo of a cat riding a horse on mars"
image = pipe(prompt).images[0]
image.show()
The last cell never gets executed.
I have altered the config to restart the kernel after 10 minutes in case it just takes longer, but this has no effect. When the kernel eventually dies, I get AsyncIOLoopKernelRestarter: restarting kernel on the Linux prompt, but no other errors on the screen or on the notebook's web page. That message only means the (increased) 10-minute timeout has passed, so the call is simply hanging.
I have installed the latest versions of transformers, diffusers, and scipy.
Any ideas what this could be? This code runs fine on Google Colab.
I know PyTorch CUDA works on my machine, but it isn't loading the pipeline here.
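For what it's worth, here is a minimal diagnostic sketch (my own addition, not from the question) I would run in a fresh cell. It reuses the same model id; the small matmul confirms a CUDA context can actually be created inside this WSL kernel before gigabytes of weights are moved, and enable_sequential_cpu_offload (which requires the accelerate package) is an alternative to pipe.to("cuda") worth trying, not a guaranteed fix:

import torch
from diffusers import StableDiffusionPipeline

# Sanity check: if this already hangs, the problem is the WSL/driver setup, not diffusers
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

# Alternative to pipe.to("cuda"): stream weights to the GPU on demand (requires accelerate)
pipe.enable_sequential_cpu_offload()

image = pipe("a photo of a cat riding a horse on mars").images[0]
image.save("cat.png")

Running the same lines as a plain Python script inside WSL (outside Jupyter) is also worth a try, since any CUDA error then shows up in the terminal instead of silently stalling the kernel.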

Related

Jupyter Lab shutting down and system being logged off when running Python code

I am using Jupyter Lab with Python for data analysis of cosmological data sets.
I use a Dell Vostro 5515 laptop with 16 GB RAM and a Ryzen 7 processor. My OS is Fedora 36 with KDE and Xfce environments.
The problem is that after running my .ipynb notebook for some time, it closes down abruptly if I am in KDE. If I am in Xfce, it also closes all applications and logs out my session.
The crash happens mostly while running a function called compute_full_master from the pymaster library in Python, but it has also happened, rarely, while running some other functions.
I have tried to get error messages by running Jupyter Lab in --debug mode, but when the crash happens, the terminal is also closed. I do not know how else to get the crash details.
I have tried to run the code in Firefox, Chrome, and VSCode.
I am sorry if I have not provided all the necessary details; I am happy to provide more if anyone can help!
EDIT:
A simple example:
import numpy as np
import matplotlib.pyplot as plt

arr_len = 8394753
x = np.arange(arr_len)
# y_1 ... y_4 are complex arrays of length arr_len computed in earlier cells
plt.figure(figsize=(25, 15))
plt.plot(x, y_1 - y_2)
plt.plot(x, y_1 - y_3)
plt.plot(x, y_1 - y_4)
plt.ylim((-1e-6, 1e-6))
The arrays y_1, y_2, y_3 and y_4 have the length arr_len and are complex; the imaginary part does not matter. The notebook has already run some code in previous cells, but running this plotting cell a few times has caused the shutdown many times.
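Not from the original post, but since this cell plots differences of four ~8.4-million-point complex arrays, a quick way to test whether memory pressure is the culprit is to downsample before plotting (names follow the example above; the step factor is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

step = 100                                    # plot every 100th point; tune as needed
x = np.arange(arr_len)[::step]

plt.figure(figsize=(25, 15))
for y in (y_2, y_3, y_4):
    # slice first, then subtract, so no full-length temporary array is created;
    # .real drops the irrelevant imaginary part
    plt.plot(x, (y_1[::step] - y[::step]).real)
plt.ylim((-1e-6, 1e-6))
plt.show()

If the session no longer gets killed with the downsampled plot, the abrupt logout most likely points to memory exhaustion (e.g., the OOM killer taking down the desktop session).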

Python 3.8 RAM overflow and loading issues

First, I want to mention that this is our first project on a bigger scale, so we don't know everything yet, but we learn fast.
We developed code for image recognition. We tried it with a Raspberry Pi 4B but quickly found that this is way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model is loaded for the first time. Via an API the image recognition can be triggered, and the metadata from the AI model is the response. We use FastAPI for this.
But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (loaded on import) and use it within a thread, I need to use mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the call above before starting my thread. Indeed this works, but another challenge occurs at the same time. After setting the start method to 'spawn', the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the thread starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is obvious.
I managed to fix that as well just by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, here I have a problem as well: loading the model onto the GPU takes around 15-20 s, and this happens every time a recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without the whole thing dying after two image recognitions. Our goal is to be under 5 sec with this.
# classify.py
import time

import torch
import multiprocessing as mp   # not shown in the original post; plain multiprocessing assumed
import torchvision.transforms as transforms
from skimage import io
from torch.utils.data import Dataset

from .loader import *
from .ResNet import *

# If this part is moved inside the classify() function, no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)


def classify(imgp="", return_dict=None):
    # do some classification with the net and store the result
    pass


if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
    p.start()
    p.join()
    print(return_dict.values())
Any help here will be much appreciated. Thank you.
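One pattern that usually sidesteps both problems (re-initializing CUDA in a forked child and re-loading the model per request) is a single long-lived spawned worker that loads the model once and serves requests over a queue. This is only a sketch under the same assumptions as the code above (ResNet152, ./model.pt, the project's own .ResNet module); it is not from the original post:

import multiprocessing as mp

import torch

from .ResNet import ResNet152


def worker(task_q, result_q):
    # Load the model exactly once, inside the spawned process,
    # so the parent process never creates a CUDA context.
    net = ResNet152(num_classes=25).to('cuda')
    net.load_state_dict(torch.load("./model.pt", map_location='cuda'))
    net.eval()
    while True:
        img_path = task_q.get()
        if img_path is None:              # poison pill to shut the worker down
            break
        with torch.no_grad():
            # ... preprocess img_path and run net(...) here ...
            result_q.put({"image": img_path, "label": "stub"})


if __name__ == '__main__':
    mp.set_start_method('spawn')
    task_q, result_q = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(task_q, result_q), daemon=True)
    p.start()

    task_q.put('./bild.jpg')              # each API call only enqueues work
    print(result_q.get())                 # ~inference time, no 15-20 s model load
    task_q.put(None)
    p.join()

The FastAPI process then only talks to the queues, so the 15-20 s load cost is paid once at startup rather than on every request.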

10 s delay time for time.sleep in ipython after matplotlib 3.x.x import

I realize that time.sleep() calls the OS timer and can vary a little (as this answer points out), but I am seeing a very large delay in IPython. This only seems to happen after I have imported matplotlib.pyplot; then, after waiting about 30 seconds, it starts to lag. For a working example, try this in IPython:
>>import matplotlib.pyplot as plt
# after 30 seconds
>>%time time.sleep(1)
CPU times: user 5.27 ms, sys: 3.58 ms, total: 8.85 ms
Wall time: 11 s
Using slightly longer times in sleep appears to have an additive effect:
>>%time time.sleep(3)
CPU times: user 4.75 ms, sys: 3.7 ms, total: 8.45 ms
Wall time: 13 s
Very occasionally the wall time is appropriate, but only about 1/10 of the time. I also tried wrapping the sleep in a function as follows:
>>def test():
      start = time.time()
      for i in range(4):
          time.sleep(1)
          print(f'{time.time() - start}')
>>test()
11.000279188156128
22.000601053237915
33.000962018966675
44.001291036605835
This also occasionally shows smaller time steps, but this is the usual output. I also put the same function in a separate file and used %run script.py in iPython, with the same result. Thus, it happens anytime time.sleep is called.
The only things that seem to work are (a) not importing matplotlib.pyplot,
or (b) defining a function based on a simple all-Python timer:
>>def dosleep(t):
      start = time.time()
      while time.time() - start < t:
          continue
>>%time dosleep(2)
CPU times: user 1.99 s, sys: 8.4 ms, total: 2 s
Wall time: 2 s
The last example seems like a good solution, but I have a decent amount of code that relies on time.sleep() already, and I would like to still use Jupyter with an Ipython kernel. Is there any way to determine what is holding it up, or are there any tips on how to decrease the lag time? I'm just wondering what sort of thing could cause this.
I'm on Mac OS X 10.14.3, running Python 3.6.8 (Anaconda). My IPython version is 7.3.0; it works the same for IPython 7.4.0. The matplotlib version is 3.0.3. The problem does not occur until the interactive GUI system is engaged, which happens immediately at import time with matplotlib 3.x, and at figure creation (plt.figure()) with matplotlib 2.x. It occurs when an icon called "Python 3.6" appears in the dock.
I realized this time.sleep delay behavior happens only when using the matplotlib backends Qt5Agg and Qt4Agg. This happens whether I use iPython or the regular python console. It doesn't occur when running a file by entering "python filename.py" into the Terminal, though it does hold up files that are run through the iPython or python consoles. When the plotting GUI starts, the time.sleep behavior kicks in after around 30 seconds or so.
I was able to fix the problem by switching the backend to TkAgg, which works similarly and appears to work well in interactive mode. I just made a file called "matplotlibrc" in my base user folder (~) and added the line
backend : TkAgg
to make the default backend TkAgg. This doesn't entirely answer why it happens, but it fixes the problem.
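As an aside (not from the original answer), the same backend switch can also be made per-script rather than via matplotlibrc, which is handy for checking whether the Qt backend really is the culprit:

import matplotlib
matplotlib.use("TkAgg")            # must run before pyplot is imported
import matplotlib.pyplot as plt

import time
start = time.time()
time.sleep(1)
print(time.time() - start)         # should be ~1 s with the TkAgg backend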

Can we run TensorFlow Lite on Linux? Or is it for Android and iOS only?

Hi, is there any possibility to run TensorFlow Lite on a Linux platform? If yes, then how can we write code in Java/C++/Python to load and run models on Linux? I am familiar with Bazel and have successfully made Android and iOS applications using TensorFlow Lite.
I think the other answers are quite wrong.
Look, I'll tell you my experience... I've been working with Django for many years, and I've been using normal TensorFlow, but there was a problem with having 4 or 5 or more models in the same project.
I don't know if you know Gunicorn + Nginx. This spawns workers, so if you have 4 machine learning models, they are multiplied per worker: with 3 workers you end up with 12 models preloaded in RAM. This is not efficient at all, because if the RAM overflows your project goes down, or the service responses get slower.
So this is where TensorFlow Lite comes in. Switching from a TensorFlow model to TensorFlow Lite improves things and makes them much more efficient. Times are reduced absurdly.
Also, Django and Gunicorn can be configured so that the model is pre-loaded and compiled at startup. Then every time the API is hit, it only generates the prediction, which helps you keep each API call to a fraction of a second.
Currently I have a project in production with 14 models and 9 workers; you can understand the magnitude of that in terms of RAM.
And besides doing thousands of extra calculations outside of machine learning, the API call does not take more than 2 seconds.
Now, if I used normal TensorFlow, it would take at least 4 or 5 seconds.
In summary, if you can use TensorFlow Lite, do it; I use it daily on Windows, macOS, and Linux, and it is not necessary to use Docker at all. Just a Python file and that's it. If you have any doubt you can ask me without any problem.
Here is an example project:
Django + Tensorflow Lite
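To make the "pre-loaded per worker" idea concrete, here is a minimal sketch (not taken from the linked project; module name and model path are placeholders) of keeping one TFLite interpreter alive per Gunicorn worker:

# predictor.py -- imported once per Gunicorn worker, so the interpreter
# is created a single time and reused by every request
import numpy as np
import tflite_runtime.interpreter as tflite

_interpreter = tflite.Interpreter(model_path="model.tflite")   # placeholder path
_interpreter.allocate_tensors()
_input = _interpreter.get_input_details()[0]
_output = _interpreter.get_output_details()[0]


def predict(batch: np.ndarray) -> np.ndarray:
    # Run one inference; `batch` must match the model's input shape and dtype
    _interpreter.set_tensor(_input['index'], batch.astype(_input['dtype']))
    _interpreter.invoke()
    return _interpreter.get_tensor(_output['index'])

A Django view would then just call predict() on the request payload; the interpreter itself is never reloaded between calls.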
It's possible to run it (but it will work slower than the original TF).
Example:
import numpy as np
import tensorflow as tf

# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path=graph_file)
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Get quantization info to know the input type
quantization = None
using_type = input_details[0]['dtype']
if using_type == np.uint8:
    quantization = input_details[0]['quantization']

# Get input shape
input_shape = input_details[0]['shape']

# Input tensor
input_data = np.zeros(dtype=using_type, shape=input_shape)

# Set input tensor, run and get output tensor
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
I agree with Nouvellie. It is possible and worth the time implementing. I developed a model on my Ubuntu 18.04 32 processor server and exported the model to tflite. The model ran in 178 secs on my ubuntu server. On my raspberry pi4 with 4GB memory, the tflite implementation ran in 85 secs, less than half the time of my server. When I installed tflite on my server the run time went down to 22 secs, an 8 fold increase in performance and now almost 4 times faster than the rpi4.
To install for python, I did not have to build the package but was able to use one of the prebuilt interpreters here:
https://www.tensorflow.org/lite/guide/python
I have Ubuntu 18.04 with python 3.7.7. So I ran pip install with the Linux python 3.7 package:
pip3 install
https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-linux_x86_64.whl
Then import the package with:
from tflite_runtime.interpreter import Interpreter
Previous posts show how to use tflite.
From Tensorflow lite
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices.
Tensorflow lite is a fork of tensorflow for embedded devices. For PC just use the original tensorflow.
From github tensorflow:
TensorFlow is an open source software library
TensorFlow provides stable Python and C APIs, as well as APIs without backwards compatibility guarantees, such as C++, Go, Java, JavaScript and Swift.
We support CPU and GPU packages on Linux, Mac, and Windows.
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2)
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
'Hello, TensorFlow!'
Yes, you can compile Tensorflow Lite to run on Linux platforms even with a Docker container. See the demo: https://sconedocs.github.io/tensorflowlite/

pyglet vertex list not rendered (AMD driver?)

My machine apparently won't draw vertex lists in pyglet. The following code renders two identical shapes at different positions in the window, one using a vertex list and the other using a straight draw(). The one that's drawn directly renders fine, while the vertex list doesn't render at all.
import pyglet
window = pyglet.window.Window()
w, h = window.get_size()
vl = pyglet.graphics.vertex_list( 4,
('v2i', (100,0, 100,h, 200,h, 200,0)),
('c3B', (255,255,255, 255,0,0,
0,255,0, 0,0,255)) )
@window.event
def on_draw():
    window.clear()
    vl.draw(pyglet.gl.GL_QUADS)
    pyglet.graphics.draw(4, pyglet.gl.GL_QUADS,
        ('v2i', (300,0, 300,h, 400,h, 400,0)),
        ('c3B', (255,255,255, 255,0,0,
                 0,255,0, 0,0,255)))
pyglet.app.run()
This is pyglet 1.1.2 in Ubuntu Lucid, using an AMD Radeon HD 6450 card with the newest Catalyst 12.1 driver. I imagine it must be something to do with the drivers, etc., because this code worked three years ago on several NVIDIA cards, and it's almost direct from the pyglet documentation. Anybody know what setting I need to futz with, or if a particular driver version works right?
I seem to have the same problem running Catalyst 12.2 on Windows 7 with a Radeon HD 4870. Some earlier code of mine also stopped partially working after I moved to this card from my older GeForce 8800 GTX: specifically, the fps_counter and label drawing still worked, but drawing a batch didn't.
After I downgraded the video driver to Catalyst 11.5 the problems went away (both with your snippet above and with my earlier code).
Later versions of Catalyst might work. I tried this one first because it is mentioned as working somewhat properly over here: http://groups.google.com/group/pyglet-users/msg/ae317c37ce54c107
Update: Tested Catalyst 11.12 (the latest 11.x release, video driver version 8.920.0.0000) and the problem has returned.
Update 2: After some more testing, it appears this issue started occurring with Catalyst 11.9 (video driver 8.892.0.0000). Catalyst 11.8 (video driver 8.881.0.0000) worked as expected.
A work-around is to use v2f instead of v2i as per this comment on the pyglet issue tracker.
Last update: This problem seems to be fixed with Catalyst 12.4 (video driver 8.961.0.0).
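To make the v2f workaround mentioned above concrete, this is roughly what the vertex list would look like with float coordinates (a sketch only, same geometry as the original snippet, using the old pyglet 1.x API from the question):

import pyglet

window = pyglet.window.Window()
w, h = window.get_size()

# Same quad as before, but with 'v2f' (floats) instead of 'v2i' (ints),
# which sidesteps the Catalyst integer-vertex issue
vl = pyglet.graphics.vertex_list(4,
    ('v2f', (100.0, 0.0, 100.0, float(h), 200.0, float(h), 200.0, 0.0)),
    ('c3B', (255, 255, 255, 255, 0, 0,
             0, 255, 0, 0, 0, 255)))

@window.event
def on_draw():
    window.clear()
    vl.draw(pyglet.gl.GL_QUADS)

pyglet.app.run()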
