I have a relatively simple linear regression lambda in AWS. Each instance the function is called the logs display the following:
/opt/python/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: [Errno 38] Function not implemented. joblib will operate in serial mode
warnings.warn('%s. joblib will operate in serial mode' % (e,))
I suspect this is due to sklearn running on a lambda (i.e. 'serverless') and trying to determine it's multi-processing capabilities as per this question and this GH issue.
I am also understanding from the GH that this is not a 'fixable' issue, it will always happen when deploying with these dependencies on this hardware. I am getting back my expected results (even though I am currently maxing out the default, minimum lambda memory of 128mb).
I aim to control the warnings and would know if there is a way to either:
stop sklearn looking for multiprocessing, so preventing the warning from issuing
capture this specific warning and prevent it from being passed from my function into the cloudwatch logs
if both are possible, which would be preferable from a aws architecture/pythonic opinion?
To capture the warning and prevent it from being passed into the cloudwatch logs, you can filter the warning as follows.
import json
import warnings
warnings.filterwarnings('error')
try:
import sklearn
except Warning:
pass
def lambda_handler(event, context):
# TODO implement
return {
'statusCode': 200,
'body': json.dumps('Hello from Lambda!')
}
The article here, particularly towards the end, recreates and filters the warning.
None of the suggested solutions worked for me. Digging into the source code of joblib here: https://github.com/joblib/joblib/blob/master/joblib/_multiprocessing_helpers.py, I discovered the environment variable JOBLIB_MULTIPROCESSING which seems to control whether joblib attempts to use multiprocessing.
Setting this to 0 solved the problem for me.
Now, we can use larger memory for AWS Lambda, up to about 10GB. I faced the same problem, and I set up 10GB of memory and then fixed it. (Actually, my program used 248MB memory.) I don't know why small memory caused the joblib problem by importing sklearn though.
Related
I'm using PyTorch DDP on SageMaker PyTorch Training DLC 1.8.1 The code seems properly DDP-formatted. I'm using instance_count = 2, and launching torch.distributed.launch and I believe the ranks and world size are properly set however the dist.init_process_group waits and times out
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
What could go wrong? machines not networked together?
This is usually something to do with the way local_rank is retrieved and used during initialization. Please refer to the below example and see if you can figure out the difference.
https://github.com/aruncs2005/pytorch-ddp-sagemaker-example
The torch.distributed.launch is the helper utility within the torch.distributed package which can be used to launch multiple processes per node for distributed training. It tells all workers which IP address is of rank 0 which is set by MASTER_ADDR,
Each rank needs to be able to communicate to the MASTER_ADDR on the port MASTER_PORT. If those are set but the workers cannot reach the MASTER_ADDR, then it can be the root cause of hang and timeoutfor the job.
Besides, it will also wait until all nodes report in from --nodes defined in the launch as well.
I am experiencing the following error while training a generative network via Pytorch 1.9.0+cu102:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
While using a Google Colaboratory GPU session. This segment was triggered on either one of these two lines:
running_loss += loss.item()
or
target = target.to(device)
It produces the error on the first line when I am first running the notebook, and the second line each subsequent time I try to run the block. The first error occurs after training for 3 batches. The second error happens on the first batch. I can confirm that the device is cuda0, that device is available, and target is a pytorch tensor. Naturally, I tried to take the advice of the error and run:
!CUDA_LAUNCH_BLOCKING=1
and
os.system('CUDA_LAUNCH_BLOCKING=1')
However, neither of these lines changes the error message. According to a different post, this is because colab is running these lines in a subshell. The error does not occur when running on CPU, and I do not have access to a GPU device besides the GPU on Colab. While this question has been asked in many different forms, no answers are particularly helpful to me because they either recommend passing the aforementioned line, are about a situation fundamentally different from my own (such as training a classifier with an inappropriate number of classes), or recommend a solution which I have already tried, such as resetting the runtime or switching to CPU.
I am hoping to gain insight into the following questions:
Is there a way for me to get a more specific error message? Efforts to set the launch blocking variable have been unsuccessful.
How could it be that I am getting this error on two seemingly very different lines? How could it be that my network trains for 3 batches (it is always 3), but fails on the fourth?
Does this situation remind anyone of an error that they have encountered previously, and have a possible route for ameliorating it given the limited information I can extract?
I was successfully able to get more information about the error by executing:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
BEFORE importing torch. This allowed me to get a more detailed traceback and ultimately diagnose the problem as an inappropriate loss function.
This can be mainly due to 2 reasons:
Inconsistency in the number of classes
Wrong input for the loss function
If it's the first one, then see you should get the same error when you change the runtime back to CPU.
In my case, it was the second one. I had used BCE loss, and its input should be between 0 and 1. If it's any other value, this error might appear. So I fixed this by using:
criterion=nn.BCEWithLogitsLoss()
instead of:
criterion=nn.BCELoss()
Oh yeah, and I also used:
CUDA_LAUNCH_BLOCKING = "1"
at the beginning of the code.
I wonder but could not found any information why this appears all the time if I try to tune hyperparameter from sklearn with TuneSearchCV:
Note that the important part is the Log sync warning and as a result that the logging in combination with tensorflow and search_optimization such as optuna does not work:
Backend is sklearn
Concatenating h5 datasets of the following files:
('output/example_train_1.h5', 'output/example_train_2.h5')
based on the following keys:
('x', 'y')
Concatenation successful, resulting shapes for the given dsets:
Key: x, shape: (20000, 25)
Key: y, shape: (20000,)
Log sync requires rsync to be installed.
Process finished with exit code 0
The tuning processes seem to be working, as long as I do not use search-optimization such as optional.
I use it within a docker container. I got through the ray-documentation, but I could find the source where I think the error drops. However, I could not find any settings or additional options on how to prevent it.
Furthermore, it seems that rsync is just necessary if I use a cluster. But actually, I don't do that right now.
The warning (Log sync requires rsync to be installed.) does not stop the script from executing. If rsync is not installed, it will just not synchronize logs between nodes, which seems to be unnecessary in your case anyway. You shouldn't run into any problem there.
It's hard to say what the problem here is, as we're missing crucial information: Which version of Ray are you running, which version of tune-sklearn, and how does your training script look like?
If you're running into problems and you suspect it is a bug, please consider opening an issue in the tune-sklearn repository, and make sure to include the above information and preferably a minimal reproducible script so the maintainers can look into this.
First, I want to mention, that this is our first project in a bigger scale and therefore we don't know everything but we learn fast.
We developed a code for image recognition. We tried it with a raspberry pi 4b but quickly faced that this is way to slow overall. Currently we are using a NVIDIA Jetson Nano. The first recognition was ok (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model will be loaded for the first time. Via an API the image recognition can be triggered and the meta data from the AI model will be the response. We use fast-API for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the method above before starting my thread. Indeed this works but another challenge occurs at the same time. After setting the start method to 'spawn' the ERROR disappears but the Jetson starts to allocate way to much memory.
Because of the overhead and preloaded CNN model, the RAM is around 2.5Gig before the thread starts. After the start it doesn’t stop allocating RAM, it consumes all 4Gig of the RAM and also the whole 6Gig Swap. Right after this, the whole API process kill with this error: "cannot allocate memory" which is obvious.
I managed to fix that as well just by loading the CNN Model in the classification function. (Not preloading it on the GPU as in the two cases before). However, here I got problem as well. The process of loading the model to the GPU takes around 15s - 20s and this every time the recognition starts. This is not suitable for us and we are wondering why we cannot pre-load the model without killing the whole thing after two image-recognitions. Our goal is to be under 5 sec with this.
#clasify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function than no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
#do some classification with the net
pass
if __name__ == '__main__':
mp.set_start_method('spawn') #if commented out the first error ocours
manager = mp.Manager()
return_dict = manager.dict()
p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
p.start()
p.join()
print(return_dict.values())
Any help here will be much appreciated. Thank you.
The following statement,
bn1 = tf.contrib.layers.batch_normalization(inputs=conv1, axis=1, training = is_training)
Throws an error: InternalError: The CPU implementation of FusedBatchNorm only supports NHWC tensor format for now.
on my CPU using tensorflow v:1.4
However, I've ensured that the code uses data in NHWC format. The same piece of code works on my friend's CPU, the only difference being he's using Tensorflow v.1.0 and the code runs smoothly without issues.
I tried to look up tensorflow documentation,
https://www.tensorflow.org/performance/performance_guide
It suggests feeding in two extra arguments: fused=True, data_format='NHWC'.
However, as per
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
there is no such provision for the 2 above mentioned arguments. And in fact, the code throws an error saying batch_normalization received an unexpected argument.
Any responses on the potential reason behind the issue and how I could get around it without rolling back my Tensorflow version (because that would be absurd) are most welcome.
Thank you so uch for your time and effort.