SageMaker PyTorch inference stops at model call on GPU - pytorch
I deployed a PyTorch model using SageMaker and can successfully query it on a CPU. Deploying it on a GPU, however, leads to an InternalServerError on the client side. The CloudWatch logs show that the request is received, preprocessing finishes, and the call to the model starts. I can also see a log entry from the metric collector reporting the prediction time. After that there are no further logs; the print statement I put right after the model call is never reached.
It is possible that an error is happening that never makes it to CloudWatch. I have noticed that SageMaker seems to not show stack traces fully. Unfortunately, I have already set the log level to DEBUG without success.
I'm running the SageMaker Docker container pytorch-inference:1.10-gpu-py38 on an ml.g4dn.xlarge instance. The model itself is compiled with torch.jit.trace (TorchScript). I am using a custom transform function, which you can see below along with the CloudWatch logs (the log continues as the client retries 4x).
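For completeness, this is roughly how the endpoint is deployed (a minimal sketch, not the exact code; the model data path, role, and entry point are placeholders, and I am assuming container_log_level is the right argument for passing the DEBUG log level):
import logging
from sagemaker.pytorch import PyTorchModel

# Placeholders: model artifact location, execution role and entry point
pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="my-sagemaker-role",
    entry_point="inference.py",
    source_dir="code",
    framework_version="1.10",
    py_version="py38",
    container_log_level=logging.DEBUG,
)
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)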
If anyone has any idea what is happening here it would be very much appreciated!
import io
import base64  # needed for the base64-encoded image branch in load_fn
import torch
import os, sys
import json
import logging
from PIL import Image
from sagemaker_inference import (
    content_types,
    decoder,
    encoder,
    errors,
    utils,
)
from MyDetrFeatureExtractor import MyDetrFeatureExtractor
INFERENCE_ACCELERATOR_PRESENT_ENV = "SAGEMAKER_INFERENCE_ACCELERATOR_PRESENT"
IMG_WIDTH = 800
IMG_HEIGHT = 1131
MODEL_FILE = "model.pt"
THRESHOLD = 0.2
feature_extractor = MyDetrFeatureExtractor.from_pretrained(
    "facebook/detr-resnet-50", size=(IMG_WIDTH, IMG_HEIGHT))
index_to_name = json.load(open('/opt/ml/model/code/id2label.json', 'r'))
logger = logging.getLogger("sagemaker-inference")
# logger.addHandler(logging.StreamHandler(sys.stdout))
def model_fn(model_dir):
    logger.info(f"Trying to load model from {model_dir}/{MODEL_FILE}.")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.jit.load(f"{model_dir}/{MODEL_FILE}", map_location=torch.device(device))
    model = model.to(device)
    return model
def preprocess(images):
    logger.info("Preprocessing image...")
    try:
        encoding = feature_extractor(images=images, return_tensors="pt")
        pixel_values = encoding["pixel_values"]
    except Exception as e:
        logger.error("Preprocessing Failed.")
        logger.error(e)
    return pixel_values
def load_fn(input_data, content_type):
    """An input_fn that deserializes an "application/x-image" request payload.

    Args:
        input_data: the request payload serialized in the content_type format
        content_type: the request content_type
    Returns: the preprocessed image as torch.FloatTensor or torch.cuda.FloatTensor,
        depending on whether cuda is available, plus the original image size.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if content_type == "application/x-image":
        if isinstance(input_data, str):
            # If the image is sent as a base64-encoded string
            print("Found string of bytesarray. Translating to Image.")
            image = Image.open(io.BytesIO(base64.b64decode(input_data)))
        elif isinstance(input_data, (bytearray, bytes)):
            # If the image is sent as a bytearray
            print("Found bytesarray. Translating to Image.")
            image = Image.open(io.BytesIO(input_data))
    else:
        err_msg = f"Type [{content_type}] not support this type yet"
        logger.error(err_msg)
        raise ValueError(err_msg)
    # image = Image.from_array(np_array)
    size = image.size
    image_sizes_orig = [[size[1], size[0]]]
    logger.info(f"Image of size {size} loaded. Start Preprocessing.")
    tensor = preprocess(image)
    return tensor.to(device), torch.tensor(image_sizes_orig)
def inference_fn(data, model):
    """A default predict_fn for PyTorch. Calls a model on data deserialized in input_fn.
    Runs prediction on GPU if cuda is available.

    Args:
        data: input data (torch.Tensor) for prediction deserialized by input_fn
        model: PyTorch model loaded in memory by model_fn
    Returns: a prediction
    """
    with torch.no_grad():
        if os.getenv(INFERENCE_ACCELERATOR_PRESENT_ENV) == "true":
            device = torch.device("cpu")
            model = model.to(device)
            input_data = data.to(device)
            model.eval()
            with torch.jit.optimized_execution(True, {"target_device": "eia:0"}):
                output = model(input_data)
        else:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            logger.info(f"Running predictions on {device}.")
            model = model.to(device)
            input_data = data.to(device)
            model.eval()
            logger.info("Compute predictions.")
            output = model(input_data)
            logger.info("Finished actual inference")
    return output
def postprocess(output, img_sizes_orig):
    logger.info("Postprocessing image...")
    try:
        results_all = feature_extractor.post_process(output, img_sizes_orig, use_dict=False)
        results = []
        for res_per_img in results_all:
            scores_per_img = res_per_img['scores'].detach().numpy()
            # keep only predictions with confidence >= threshold
            keep = scores_per_img > THRESHOLD
            labels_per_img = list(map(
                index_to_name.get,
                res_per_img['labels'][keep].detach().numpy().astype(str)
            ))
            bboxes_per_img = res_per_img['boxes'][keep].detach().numpy()
            scores_per_img = scores_per_img[keep]
            out = [{
                'bbox': list(map(int, bbox)),
                'score': score.astype(float),
                'label': label
            } for score, label, bbox in
                zip(scores_per_img, labels_per_img, bboxes_per_img)]
            logger.info(f"Appending {out}.")
            results.append(out)
    except Exception as e:
        logger.error("Postprocessing Failed.")
        logger.error(e)
    return results
def create_output(prediction, accept):
    """A default output_fn for PyTorch. Serializes predictions from predict_fn to JSON, CSV or NPY format.

    Args:
        prediction: a prediction result from predict_fn
        accept: type which the output data needs to be serialized
    Returns: output data serialized
    """
    if type(prediction) == torch.Tensor:
        prediction = prediction.detach().cpu().numpy().tolist()
    for content_type in utils.parse_accept(accept):
        if content_type in encoder.SUPPORTED_CONTENT_TYPES:
            encoded_prediction = encoder.encode(prediction, content_type)
            if content_type == content_types.CSV:
                encoded_prediction = encoded_prediction.encode("utf-8")
            if content_type == content_types.JSON:
                encoded_prediction = encoded_prediction.encode("utf-8")
            return encoded_prediction, accept
    raise errors.UnsupportedFormatError(accept)
def transform_fn(model, request_body, content_type, accept_type):
    logger.info("Received Request.")
    images, image_sizes = load_fn(request_body, content_type)
    logger.info("Starting Inference.")
    output = inference_fn(images, model)
    logger.info("Postprocessing.")
    results = postprocess(output, image_sizes)
    logger.info(results)
    return create_output(results, accept_type)
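The endpoint is then queried with the raw image bytes and the application/x-image content type, roughly like this (a sketch; the endpoint name and image file are placeholders):
import boto3

runtime = boto3.client("sagemaker-runtime")
with open("test_page.png", "rb") as f:   # placeholder image file
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="my-detr-endpoint",     # placeholder endpoint name
    ContentType="application/x-image",
    Accept="application/json",
    Body=payload,
)
print(response["Body"].read())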
and the logs...
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from -r /opt/ml/model/code/requirements.txt (line 1)) (1.22.2)
Requirement already satisfied: Pillow in /opt/conda/lib/python3.8/site-packages (from -r /opt/ml/model/code/requirements.txt (line 2)) (9.1.1)
Collecting nvgpu
Downloading nvgpu-0.9.0-py2.py3-none-any.whl (9.4 kB)
Collecting transformers==4.17
Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 50.7 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (20.4)
Collecting regex!=2019.12.17
Downloading regex-2022.7.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (765 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 765.0/765.0 kB 23.8 MB/s eta 0:00:00
Collecting filelock
Downloading filelock-3.7.1-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.1.0
Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.5/101.5 kB 25.5 MB/s eta 0:00:00
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (4.64.0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (5.4.1)
Collecting tokenizers!=0.11.3,>=0.11.1
Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 111.5 MB/s eta 0:00:00
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (2.27.1)
Collecting sacremoses
Downloading sacremoses-0.0.53.tar.gz (880 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 880.6/880.6 kB 90.0 MB/s eta 0:00:00
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Collecting pynvml
Downloading pynvml-11.4.1-py3-none-any.whl (46 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 kB 15.6 MB/s eta 0:00:00
Requirement already satisfied: psutil in /opt/conda/lib/python3.8/site-packages (from nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (5.9.0)
Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (from nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (1.4.2)
Collecting flask-restful
Downloading Flask_RESTful-0.3.9-py2.py3-none-any.whl (25 kB)
Collecting tabulate
Downloading tabulate-0.8.10-py3-none-any.whl (29 kB)
Collecting termcolor
Downloading termcolor-1.1.0.tar.gz (3.9 kB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Collecting arrow
Downloading arrow-1.2.2-py3-none-any.whl (64 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.0/64.0 kB 19.0 MB/s eta 0:00:00
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (1.16.0)
Collecting flask
Downloading Flask-2.1.3-py3-none-any.whl (95 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95.6/95.6 kB 29.0 MB/s eta 0:00:00
Collecting ansi2html
Downloading ansi2html-1.8.0-py3-none-any.whl (16 kB)
Collecting packaging>=20.0
Downloading packaging-21.3-py3-none-any.whl (40 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 13.5 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.1.0->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (4.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7.0 in /opt/conda/lib/python3.8/site-packages (from arrow->nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (2.8.2)
Collecting itsdangerous>=2.0
Downloading itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Requirement already satisfied: click>=8.0 in /opt/conda/lib/python3.8/site-packages (from flask->nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (8.1.3)
Collecting Jinja2>=3.0
Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.1/133.1 kB 37.2 MB/s eta 0:00:00
Collecting Werkzeug>=2.0
Downloading Werkzeug-2.1.2-py3-none-any.whl (224 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.9/224.9 kB 50.7 MB/s eta 0:00:00
Collecting importlib-metadata>=3.6.0
Downloading importlib_metadata-4.12.0-py3-none-any.whl (21 kB)
Requirement already satisfied: pytz in /opt/conda/lib/python3.8/site-packages (from flask-restful->nvgpu->-r /opt/ml/model/code/requirements.txt (line 3)) (2022.1)
Collecting aniso8601>=0.82
Downloading aniso8601-9.0.1-py2.py3-none-any.whl (52 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 kB 17.9 MB/s eta 0:00:00
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (2022.5.18.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (2.0.12)
Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.17->-r /opt/ml/model/code/requirements.txt (line 4)) (1.1.0)
Collecting zipp>=0.5
Downloading zipp-3.8.1-py3-none-any.whl (5.6 kB)
Collecting MarkupSafe>=2.0
Downloading MarkupSafe-2.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Building wheels for collected packages: sacremoses, termcolor
Building wheel for sacremoses (setup.py): started
Building wheel for sacremoses (setup.py): finished with status 'done'
Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895241 sha256=a3bb167ffae5506dddf61987611fcdfc0b8204913917be57bf7567f41240501c
Stored in directory: /root/.cache/pip/wheels/82/ab/9b/c15899bf659ba74f623ac776e861cf2eb8608c1825ddec66a4
Building wheel for termcolor (setup.py): started
Building wheel for termcolor (setup.py): finished with status 'done'
Created wheel for termcolor: filename=termcolor-1.1.0-py3-none-any.whl size=4832 sha256=f2b732eca48c5b5b44b0b23a29ba7130b890cb8b7df31955e7d7f34c7caeeb16
Stored in directory: /root/.cache/pip/wheels/a0/16/9c/5473df82468f958445479c59e784896fa24f4a5fc024b0f501
Successfully built sacremoses termcolor
Installing collected packages: tokenizers, termcolor, aniso8601, zipp, Werkzeug, tabulate, regex, pynvml, packaging, MarkupSafe, itsdangerous, filelock, ansi2html, sacremoses, Jinja2, importlib-metadata, huggingface-hub, arrow, transformers, flask, flask-restful, nvgpu
Attempting uninstall: packaging
Found existing installation: packaging 20.4
Uninstalling packaging-20.4:
Successfully uninstalled packaging-20.4
Successfully installed Jinja2-3.1.2 MarkupSafe-2.1.1 Werkzeug-2.1.2 aniso8601-9.0.1 ansi2html-1.8.0 arrow-1.2.2 filelock-3.7.1 flask-2.1.3 flask-restful-0.3.9 huggingface-hub-0.8.1 importlib-metadata-4.12.0 itsdangerous-2.1.2 nvgpu-0.9.0 packaging-21.3 pynvml-11.4.1 regex-2022.7.9 sacremoses-0.0.53 tabulate-0.8.10 termcolor-1.1.0 tokenizers-0.12.1 transformers-4.17.0 zipp-3.8.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-07-22T11:10:02,627 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-07-22T11:10:02,696 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.5.3
TS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 3234 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-ts.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Metrics address: http://127.0.0.1:8082
Model Store: /.sagemaker/ts/models
Initial Models: model=/opt/ml/model
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /.sagemaker/ts/models
Model config:
{
"model": {
"1.0": {
"defaultVersion": true,
"marName": "model.mar",
"minWorkers": 1,
"maxWorkers": 1,
"batchSize": 1,
"maxBatchDelay": 10000,
"responseTimeout": 60
}
}
}
2022-07-22T11:10:02,703 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2022-07-22T11:10:02,706 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: /opt/ml/model
2022-07-22T11:10:02,709 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive version is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2022-07-22T11:10:02,710 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive createdOn is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher
2022-07-22T11:10:02,712 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2022-07-22T11:10:02,722 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-07-22T11:10:02,797 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2022-07-22T11:10:02,797 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-07-22T11:10:02,800 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-07-22T11:10:03,018 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2022-07-22T11:10:03,544 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,545 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:26.050277709960938|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,545 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:25.937984466552734|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,545 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:49.9|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,546 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,546 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,546 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,547 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:13904.71875|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,547 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1511.390625|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,547 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:11.7|#Level:Host|#hostname:container-0.local,timestamp:1658488203
2022-07-22T11:10:03,814 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-07-22T11:10:03,815 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]60
2022-07-22T11:10:03,815 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-07-22T11:10:03,815 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2022-07-22T11:10:03,821 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-07-22T11:10:03,830 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-07-22T11:10:03,832 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1658488203832
2022-07-22T11:10:03,902 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2022-07-22T11:10:04,735 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2022-07-22T11:10:04,736 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 0.00/274 [00:00<?, ?B/s]
2022-07-22T11:10:04,737 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Trying to load model from /opt/ml/model/model.pt.
2022-07-22T11:10:05,938 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40592 "GET /ping HTTP/1.1" 200 6
2022-07-22T11:10:05,939 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:10:08,126 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 4223
2022-07-22T11:10:08,127 [INFO ] W-9000-model_1.0 TS_METRICS - W-9000-model_1.0.ms:5410|#Level:Host|#hostname:container-0.local,timestamp:1658488208
2022-07-22T11:10:08,127 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:72|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:10:10,861 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40592 "GET /ping HTTP/1.1" 200 1
2022-07-22T11:10:10,861 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:03,445 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,447 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:26.09253692626953|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,447 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:25.89572525024414|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,448 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:49.8|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,449 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUtilization.Percent:5.731683102786419|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,449 [INFO ] pool-3-thread-2 TS_METRICS - GPUMemoryUsed.Megabytes:866|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,449 [INFO ] pool-3-thread-2 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,449 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:12352.65625|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,449 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:3051.94140625|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:03,450 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:21.5|#Level:Host|#hostname:container-0.local,timestamp:1658488323
2022-07-22T11:12:05,859 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40592 "GET /ping HTTP/1.1" 200 0
2022-07-22T11:12:05,860 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:10,860 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40592 "GET /ping HTTP/1.1" 200 0
2022-07-22T11:12:10,860 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:15,860 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40592 "GET /ping HTTP/1.1" 200 1
2022-07-22T11:12:15,860 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:20,193 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1658488340193
2022-07-22T11:12:20,195 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1658488340
2022-07-22T11:12:20,196 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Received Request.
2022-07-22T11:12:20,205 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Found bytesarray. Translating to Image.
2022-07-22T11:12:20,206 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Image of size (1654, 2339) loaded. Start Preprocessing.
2022-07-22T11:12:20,206 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Preprocessing image...
2022-07-22T11:12:20,342 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Starting Inference.
2022-07-22T11:12:20,343 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running predictions on cuda.
2022-07-22T11:12:20,349 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Compute predictions.
2022-07-22T11:12:20,869 [INFO ] pool-2-thread-2 ACCESS_LOG - /169.254.178.2:40608 "GET /ping HTTP/1.1" 200 0
2022-07-22T11:12:20,870 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:22,119 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:1923.3|#ModelName:model,Level:Model|#hostname:container-0.local,requestID:f49f15ab-aed4-4ecf-80e2-22910f5d578e,timestamp:1658488342
2022-07-22T11:12:22,120 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1925
2022-07-22T11:12:22,121 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.178.2:40592 "POST /invocations HTTP/1.1" 500 1940
2022-07-22T11:12:22,122 [INFO ] W-9000-model_1.0 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:22,122 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:22,122 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:4|#Level:Host|#hostname:container-0.local,timestamp:null
2022-07-22T11:12:22,172 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1658488342171
2022-07-22T11:12:22,177 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1658488342
It turns out that wrapping the model call in a try-except block and logging the error message manually makes the error come through to CloudWatch!
I hope that insight is useful for anyone else stuck without an error message in the future.
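Concretely, something along these lines around the model call in inference_fn (a minimal sketch; re-raising keeps the 500 response, but the message now shows up in the logs):
try:
    output = model(input_data)
except Exception as e:
    # Without this, the underlying error never made it to CloudWatch
    logger.error("Model call failed.")
    logger.error(e, exc_info=True)
    raise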
Related
I can't run "hello world" with Kivy and virtualenv
I'm installed kivy with python3 and virtualenv following the official documentation The 'showcase' example run perfectly but if I create a simple app and running it I don't get any error and the display not showing anything. I think the error is related to the use of virtualenv, but I don't understand why the showcase example works. This is the kivy log: [INFO ] [Logger ] Record log in /home/pi/.kivy/logs/kivy_23-02-17_55.txt [INFO ] [Kivy ] v2.1.0 [INFO ] [Kivy ] Installed at "/home/pi/kivy_venv/lib/python3.7/site-packages/kivy/__init__.py" [INFO ] [Python ] v3.7.3 (default, Oct 31 2022, 14:04:00) [GCC 8.3.0] [INFO ] [Python ] Interpreter at "/home/pi/kivy_venv/bin/python" [INFO ] [Logger ] Purge log fired. Processing... [INFO ] [Logger ] Purge finished! [INFO ] [Factory ] 189 symbols loaded [INFO ] [Image ] Providers: img_tex, img_dds, img_sdl2, img_pil (img_ffpyplayer ignored) [INFO ] [Window ] Provider: sdl2 [INFO ] [GL ] Using the "OpenGL" graphics system [INFO ] [GL ] Backend used <sdl2> [INFO ] [GL ] OpenGL version <b'OpenGL ES 3.1 Mesa 19.3.2'> [INFO ] [GL ] OpenGL vendor <b'VMware, Inc.'> [INFO ] [GL ] OpenGL renderer <b'llvmpipe (LLVM 9.0.1, 128 bits)'> [INFO ] [GL ] OpenGL parsed version: 3, 1 [INFO ] [GL ] Shading version <b'OpenGL ES GLSL ES 3.10'> [INFO ] [GL ] Texture max size <8192> [INFO ] [GL ] Texture max units <32> [INFO ] [Window ] auto add sdl2 input provider [INFO ] [Window ] virtual keyboard not allowed, single mode, not docked [INFO ] [Text ] Provider: sdl2(['text_pango'] ignored) [INFO ] [ProbeSysfs ] device match: /dev/input/event0 [INFO ] [MTD ] Read event from </dev/input/event0> [INFO ] [MTD ] Set custom rotation to 180 [INFO ] [MTD ] Set custom invert_y to 0 [INFO ] [ProbeSysfs ] device match: /dev/input/event0 [INFO ] [ProbeSysfs ] Unable to find provider hdinput [INFO ] [ProbeSysfs ] fallback on hidinput [INFO ] [HIDInput ] Read event from </dev/input/event0> [INFO ] [HIDInput ] Set custom rotation to 180 [INFO ] [HIDInput ] Set custom invert_y to 0 [INFO ] [ProbeSysfs ] device match: /dev/input/event0 [INFO ] [HIDInput ] Read event from </dev/input/event0> [INFO ] [HIDInput ] Set custom rotation to 180 [INFO ] [Base ] Start application main loop [INFO ] [HIDMotionEvent] using <raspberrypi-ts> [INFO ] [HIDMotionEvent] using <raspberrypi-ts> [INFO ] [MTD ] </dev/input/event0> range position X is 0 - 799 [INFO ] [GL ] NPOT texture support is available [INFO ] [HIDMotionEvent] <raspberrypi-ts> range ABS X position is 0 - 799 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range ABS X position is 0 - 799 [INFO ] [MTD ] </dev/input/event0> range position Y is 0 - 479 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range ABS Y position is 0 - 479 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range ABS Y position is 0 - 479 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range position X is 0 - 799 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range position X is 0 - 799 [INFO ] [MTD ] </dev/input/event0> range touch major is 0 - 0 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range position Y is 0 - 479 [INFO ] [HIDMotionEvent] <raspberrypi-ts> range position Y is 0 - 479 [INFO ] [MTD ] </dev/input/event0> range touch minor is 0 - 0 [INFO ] [MTD ] </dev/input/event0> range pressure is 0 - 255 [INFO ] [MTD ] </dev/input/event0> axes invertion: X is 0, Y is 0 [INFO ] [MTD ] </dev/input/event0> rotation set to 180 And this are the code files: main.py from kivy.app import App class HelloApp(App): pass if __name__ == '__main__': HelloApp().run() hello.kv Label: text: "Hello World"
Email classifier using spaCy throwing an error due to a version issue when trying to implement BOW
I'm trying to Create the TextCategorizer with exclusive classes and "bow" architecture but its throwing the below error due to version issue and my python version is 3.8 ,also my spacy version is 3.2.3 , please some one help me in resolving this ######## Main method ######## def main(): # Load dataset data = pd.read_csv(data_path, sep='\t') observations = len(data.index) # print("Dataset Size: {}".format(observations)) # Create an empty spacy model nlp = spacy.blank("en") # Create the TextCategorizer with exclusive classes and "bow" architecture text_cat = nlp.create_pipe( "textcat", config={ "exclusive_classes": True, "architecture": "bow"}) # Adding the TextCategorizer to the created empty model nlp.add_pipe(text_cat) # Add labels to text classifier text_cat.add_label("ham") text_cat.add_label("spam") # Split data into train and test datasets x_train, x_test, y_train, y_test = train_test_split( data['text'], data['label'], test_size=0.33, random_state=7) # Create the train and test data for the spacy model train_lables = [{'cats': {'ham': label == 'ham', 'spam': label == 'spam'}} for label in y_train] test_lables = [{'cats': {'ham': label == 'ham', 'spam': label == 'spam'}} for label in y_test] # Spacy model data train_data = list(zip(x_train, train_lables)) test_data = list(zip(x_test, test_lables)) # Model configurations optimizer = nlp.begin_training() batch_size = 5 epochs = 10 # Training the model train_model(nlp, train_data, optimizer, batch_size, epochs) # Sample predictions # print(train_data[0]) # sample_test = nlp(train_data[0][0]) # print(sample_test.cats) # Train and test accuracy train_predictions = get_predictions(nlp, x_train) test_predictions = get_predictions(nlp, x_test) train_accuracy = accuracy_score(y_train, train_predictions) test_accuracy = accuracy_score(y_test, test_predictions) print("Train accuracy: {}".format(train_accuracy)) print("Test accuracy: {}".format(test_accuracy)) # Creating the confusion matrix graphs cf_train_matrix = confusion_matrix(y_train, train_predictions) plt.figure(figsize=(10,8)) sns.heatmap(cf_train_matrix, annot=True, fmt='d') cf_test_matrix = confusion_matrix(y_test, test_predictions) plt.figure(figsize=(10,8)) sns.heatmap(cf_test_matrix, annot=True, fmt='d') if __name__ == "__main__": main() Below is the error --------------------------------------------------------------------------- ConfigValidationError Traceback (most recent call last) <ipython-input-6-a77bb5692b25> in <module> 72 73 if __name__ == "__main__": ---> 74 main() <ipython-input-6-a77bb5692b25> in main() 12 13 # Create the TextCategorizer with exclusive classes and "bow" architecture ---> 14 text_cat = nlp.add_pipe( 15 "textcat", 16 config={ ~\anaconda3\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate) 790 lang_code=self.lang, 791 ) --> 792 pipe_component = self.create_pipe( 793 factory_name, 794 name=name, ~\anaconda3\lib\site-packages\spacy\language.py in create_pipe(self, factory_name, name, config, raw_config, validate) 672 # We're calling the internal _fill here to avoid constructing the 673 # registered functions twice --> 674 resolved = registry.resolve(cfg, validate=validate) 675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"] 676 filled = Config(filled) ~\anaconda3\lib\site-packages\thinc\config.py in resolve(cls, config, schema, overrides, validate) 727 validate: bool = True, 728 ) -> Dict[str, Any]: --> 729 resolved, _ = cls._make( 730 config, 
schema=schema, overrides=overrides, validate=validate, resolve=True 731 ) ~\anaconda3\lib\site-packages\thinc\config.py in _make(cls, config, schema, overrides, resolve, validate) 776 if not is_interpolated: 777 config = Config(orig_config).interpolate() --> 778 filled, _, resolved = cls._fill( 779 config, schema, validate=validate, overrides=overrides, resolve=resolve 780 ) ~\anaconda3\lib\site-packages\thinc\config.py in _fill(cls, config, schema, validate, resolve, parent, overrides) 831 schema.__fields__[key] = copy_model_field(field, Any) 832 promise_schema = cls.make_promise_schema(value, resolve=resolve) --> 833 filled[key], validation[v_key], final[key] = cls._fill( 834 value, 835 promise_schema, ~\anaconda3\lib\site-packages\thinc\config.py in _fill(cls, config, schema, validate, resolve, parent, overrides) 897 result = schema.parse_obj(validation) 898 except ValidationError as e: --> 899 raise ConfigValidationError( 900 config=config, errors=e.errors(), parent=parent 901 ) from None ConfigValidationError: Config validation error textcat -> architecture extra fields not permitted textcat -> exclusive_classes extra fields not permitted {'nlp': <spacy.lang.en.English object at 0x000001B90CD4BF70>, 'name': 'textcat', 'architecture': 'bow', 'exclusive_classes': True, 'model': {'#architectures': 'spacy.TextCatEnsemble.v2', 'linear_model': {'#architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': True, 'ngram_size': 1, 'no_output_layer': False}, 'tok2vec': {'#architectures': 'spacy.Tok2Vec.v2', 'embed': {'#architectures': 'spacy.MultiHashEmbed.v2', 'width': 64, 'rows': [2000, 2000, 1000, 1000, 1000, 1000], 'attrs': ['ORTH', 'LOWER', 'PREFIX', 'SUFFIX', 'SHAPE', 'ID'], 'include_static_vectors': False}, 'encode': {'#architectures': 'spacy.MaxoutWindowEncoder.v2', 'width': 64, 'window_size': 1, 'maxout_pieces': 3, 'depth': 2}}}, 'scorer': {'#scorers': 'spacy.textcat_scorer.v1'}, 'threshold': 0.5, '#factories': 'textcat'} My Spacy-Version print(spacy.__version__) 3.2.3 My Python Version import sys print(sys.version) 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] Tring to downgrade the Spacy-Version !conda install -c conda-forge spacy = 2.1.8 Collecting package metadata (current_repodata.json): ...working... done Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve. Collecting package metadata (repodata.json): ...working... done Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve. Solving environment: ...working... Building graph of deps: 0%| | 0/5 [00:00<?, ?it/s] Examining spacy=2.1.8: 0%| | 0/5 [00:00<?, ?it/s] Examining python=3.8: 20%|## | 1/5 [00:00<00:00, 4.80it/s] Examining python=3.8: 40%|#### | 2/5 [00:00<00:00, 9.60it/s] Examining #/win-64::__cuda==11.6=0: 40%|#### | 2/5 [00:01<00:00, 9.60it/s] Examining #/win-64::__cuda==11.6=0: 60%|###### | 3/5 [00:01<00:01, 1.97it/s] Examining #/win-64::__win==0=0: 60%|###### | 3/5 [00:01<00:01, 1.97it/s] Examining #/win-64::__archspec==1=x86_64: 80%|######## | 4/5 [00:01<00:00, 1.97it/s] Determining conflicts: 0%| | 0/5 [00:00<?, ?it/s] Examining conflict for spacy python: 0%| | 0/5 [00:00<?, ?it/s] UnsatisfiableError: The following specifications were found to be incompatible with the existing python installation in your environment: Specifications: - spacy=2.1.8 -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0'] Your python: python=3.8 Found conflicts! Looking for incompatible packages. This can take several minutes. 
Press CTRL-C to abort. failed If python is on the left-most side of the chain, that's the version you've asked for. When python appears to the right, that indicates that the thing on the left is somehow not available for the python version you are constrained to. Note that conda will not change your python version to a different minor version unless you explicitly specify that. Please feel free to comment or ask . Thank you
Just from the way I understand that error message, it tells you that the spaCy version you want to install (2.1.8) is incompatible with the Python version you have (3.8.8); it needs Python 3.6 or 3.7. So either create an environment with Python 3.6 or 3.7 (it's quite easy to specify the Python version when creating a new conda environment) or use a higher version of spaCy. Did you already check whether the code works with the newest version of spaCy? Is there a specific reason why you are using this spaCy version? If you are using methods that are no longer supported, it might make more sense to update your code to the newer spaCy API. Especially if you are doing this to learn about spaCy, it is counterproductive to learn methods that are not supported anymore. Sadly, a lot of tutorials fail to either update their code or at least specify which versions they are using, and then leave their code online for years.
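For illustration, in spaCy 3.x the BOW text categorizer is selected through the pipe's model config rather than the old exclusive_classes/architecture keyword arguments. A rough sketch (based on the spacy.TextCatBOW.v2 architecture that appears in your error output; not tested against your exact setup):
import spacy

nlp = spacy.blank("en")
# In spaCy 3.x the architecture goes into the "model" sub-config of the pipe
config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v2",
        "exclusive_classes": True,
        "ngram_size": 1,
        "no_output_layer": False,
    }
}
text_cat = nlp.add_pipe("textcat", config=config)
text_cat.add_label("ham")
text_cat.add_label("spam")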
BERT model loading not working with pytorch 1.3.1-eia container
I have a custom python file for inference in which I have implemented the functions model_fn, input_fn, predict_fn and output_fn. I have saved the model as a torchscript using torch.jit.trace, torch.jit.save and loading it using torch.jit.load. The model_fn implementation is as follows: import torch import os import logging logger = logging.getLogger() is_ei = os.getenv("SAGEMAKER_INFERENCE_ACCELERATOR_PRESENT") == "true" logger.warn(f"Elastic Inference enabled: {is_ei}") def model_fn(model_dir): model_path = os.path.join(model_dir, "model_best.pt") try: loaded_model = torch.jit.load(model_path, map_location=torch.device('cpu')) loaded_model.eval() return loaded_model except Exception as e: logger.exception(f"Exception in model fn {e}") return None This implementation works perfectly for the container with pytorch 1.5. But for container with torch 1.3.1 it exits abruptly when loading the pretrained model without any logs. The only line I see in the logs is algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died. algo-1-nvqf7_1 | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds. The worker dies and tries to restart, and the process repeats till I stop the container. The model I am using is trained with pytorch 1.5. But since EI support is only supported till 1.3.1, I am using this container. Things I have tried: The same code with same model works outside the container with pytorch version 1.3.1. So, I don't think pytorch version compatibility is the issue. Tried using debug and notset levels for logs. Didn't get any more info as to why model loading fails Tried loading the original model instead of the traced one. Again this works in 1.5 but not in 1.3.1. Fails at the same point, while loading the BERT pretrained model. Tried this setup on sagemaker notebook instance with gpu accelerator and sagemaker PytorchModel's deploy() function with framework_version as 1.3.1. Also tried it using the 1.3.1 container without eia. Has same behaviour everywhere. Am I doing something wrong or missing something crucial from the documentation? Any help would be much appreciated. 
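For reference, the trace-and-save part of the workflow described above looks roughly like this (a sketch with a stand-in model; the real code traces the BERT-based model and writes model_best.pt):
import torch
import torch.nn as nn

# Stand-in for the real BERT-based model (placeholder)
model = nn.Linear(8, 2).eval()
example_input = torch.randn(1, 8)

# Trace with a representative example input and save as TorchScript
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_best.pt")

# Later, in model_fn, the TorchScript archive is loaded again
loaded = torch.jit.load("model_best.pt", map_location=torch.device("cpu"))
loaded.eval()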
**Logs for container with torch 1.3.1-eia ** algo-1-nvqf7_1 | 2020-11-30 07:17:14,333 [INFO ] main com.amazonaws.ml.mms.ModelServer - algo-1-nvqf7_1 | MMS Home: /opt/conda/lib/python3.6/site-packages algo-1-nvqf7_1 | Current directory: / algo-1-nvqf7_1 | Temp directory: /home/model-server/tmp algo-1-nvqf7_1 | Number of GPUs: 0 algo-1-nvqf7_1 | Number of CPUs: 8 algo-1-nvqf7_1 | Max heap size: 6972 M algo-1-nvqf7_1 | Python executable: /opt/conda/bin/python algo-1-nvqf7_1 | Config file: /etc/sagemaker-mms.properties algo-1-nvqf7_1 | Inference address: http://0.0.0.0:8080 algo-1-nvqf7_1 | Management address: http://0.0.0.0:8080 algo-1-nvqf7_1 | Model Store: /.sagemaker/mms/models algo-1-nvqf7_1 | Initial Models: ALL algo-1-nvqf7_1 | Log dir: /logs algo-1-nvqf7_1 | Metrics dir: /logs algo-1-nvqf7_1 | Netty threads: 0 algo-1-nvqf7_1 | Netty client threads: 0 algo-1-nvqf7_1 | Default workers per model: 1 algo-1-nvqf7_1 | Blacklist Regex: N/A algo-1-nvqf7_1 | Maximum Response Size: 6553500 algo-1-nvqf7_1 | Maximum Request Size: 6553500 algo-1-nvqf7_1 | Preload model: false algo-1-nvqf7_1 | Prefer direct buffer: false algo-1-nvqf7_1 | 2020-11-30 07:17:14,391 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model algo-1-nvqf7_1 | 2020-11-30 07:17:14,481 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_pytorch_serving_container.handler_service --model-path /.sagemaker/mms/models/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000 algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 51 algo-1-nvqf7_1 | 2020-11-30 07:17:14,482 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started. algo-1-nvqf7_1 | 2020-11-30 07:17:14,483 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6 algo-1-nvqf7_1 | 2020-11-30 07:17:14,483 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded. algo-1-nvqf7_1 | 2020-11-30 07:17:14,487 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel. algo-1-nvqf7_1 | 2020-11-30 07:17:14,496 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000 algo-1-nvqf7_1 | 2020-11-30 07:17:14,544 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080 algo-1-nvqf7_1 | 2020-11-30 07:17:14,545 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000. algo-1-nvqf7_1 | Model server started. algo-1-nvqf7_1 | 2020-11-30 07:17:14,547 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet. algo-1-nvqf7_1 | 2020-11-30 07:17:14,962 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.3.1 available. 
algo-1-nvqf7_1 | 2020-11-30 07:17:15,314 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock algo-1-nvqf7_1 | 2020-11-30 07:17:15,315 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpcln39mxo algo-1-nvqf7_1 | 2020-11-30 07:17:15,344 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading: 0%| | 0.00/232k [00:00<?, ?B/s] algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt in cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084 algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - creating metadata file for /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084 algo-1-nvqf7_1 | 2020-11-30 07:17:15,349 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Lock 140580224398952 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock algo-1-nvqf7_1 | 2020-11-30 07:17:15,350 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084 algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Created tokenizer algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Elastic Inference enabled: True algo-1-nvqf7_1 | 2020-11-30 07:17:15,378 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - inside model fn algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /.sagemaker/mms/models/model/model.pt algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ['model.pt', 'model.tar.gz', 'code', 'model_tn_best.pth', 'MAR-INF'] algo-1-nvqf7_1 | 2020-11-30 07:17:15,379 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Loading torch script algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-44f1cd64 Worker disconnected. 
WORKER_STARTED algo-1-nvqf7_1 | 2020-11-30 07:17:15,392 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died. algo-1-nvqf7_1 | 2020-11-30 07:17:15,393 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-44f1cd64 in 1 seconds. algo-1-nvqf7_1 | 2020-11-30 07:17:16,065 [INFO ] W-9000-model ACCESS_LOG - /172.18.0.1:45110 "GET /ping HTTP/1.1" 200 8
Fit convergence failure in pyhf for small signal model
(This is a question that we (the pyhf dev team) recently got and thought was good and worth sharing. So we're posting a modified version of it here.) I am trying to do a simple hypothesis test with pyhf v0.4.0. The model I am using has a small signal and so I need to scan signal strengths almost all the way out to mu=100. However, I am consistently getting a convergence problem. Why is the fit failing to converge? The following is my environment, the code I'm using, and my error. Environment $ "$(which python3)" --version Python 3.7.5 $ python3 -m venv "${HOME}/.venvs/example" $ . "${HOME}/.venvs/example/bin/activate" (example) $ python -m pip install --upgrade pip setuptools wheel (example) $ cat requirements.txt pyhf~=0.4.0 black (example) $ python -m pip install -r requirements.txt (example) $ pip list Package Version ------------------ -------- appdirs 1.4.3 attrs 19.3.0 black 19.10b0 Click 7.0 importlib-metadata 1.5.0 jsonpatch 1.25 jsonpointer 2.0 jsonschema 3.2.0 numpy 1.18.1 pathspec 0.7.0 pip 20.0.2 pkg-resources 0.0.0 pyhf 0.4.0 pyrsistent 0.15.7 PyYAML 5.3 regex 2020.1.8 scipy 1.4.1 setuptools 45.1.0 six 1.14.0 toml 0.10.0 tqdm 4.42.1 typed-ast 1.4.1 wheel 0.34.2 zipp 2.1.0 Code # example.py import pyhf from pyhf import Model, infer def main(): signal=[0.00000000e+00,2.16147594e-04,4.26391320e-04,8.53157029e-04, 7.95947245e-04,1.85458682e-03,3.15515589e-03,4.22895664e-03, 4.65887617e-03,7.35380863e-03,8.71947686e-03,7.94697901e-03, 1.02721341e-02,9.24346489e-03,9.38926633e-03,9.68742497e-03, 8.11072856e-03,7.71003446e-03,6.80873211e-03,5.43234586e-03, 4.98376829e-03,4.72218222e-03,3.40645378e-03,3.44950579e-03, 2.61473009e-03,2.18345641e-03,2.00960464e-03,1.33786215e-03, 1.18440675e-03,8.36366201e-04,5.99855228e-04,4.27406780e-04, 2.71607026e-04,1.81370902e-04,1.03710513e-04,4.42737056e-05, 2.25835175e-05,1.04470885e-05,4.08162922e-06,3.20004812e-06, 3.37990384e-07,6.72843977e-07,0.00000000e+00,9.08675772e-08, 0.00000000e+00] bkgrd=[1.47142981e+03,9.07095061e+02,9.11188195e+02,7.06123452e+02, 6.08054685e+02,5.23577562e+02,4.41672633e+02,4.00423307e+02, 3.59576067e+02,3.26368076e+02,2.88077216e+02,2.48887339e+02, 2.20355981e+02,1.91623853e+02,1.57733823e+02,1.32733279e+02, 1.12789438e+02,9.53141118e+01,8.15735557e+01,6.89604141e+01, 5.64245978e+01,4.49094779e+01,3.95547919e+01,3.13005748e+01, 2.55212288e+01,1.93057913e+01,1.48268648e+01,1.13639821e+01, 8.64408136e+00,5.81608649e+00,3.98839138e+00,2.61636610e+00, 1.55906281e+00,1.08550560e+00,5.57450828e-01,2.25258250e-01, 2.05230728e-01,1.28735312e-01,6.13798028e-02,2.00805073e-02, 5.91436617e-02,0.00000000e+00,0.00000000e+00,0.00000000e+00, 0.00000000e+00] spec = { "channels": [ { "name": "singlechannel", "samples": [ { "name": "signal", "data": signal, "modifiers": [ {"name": "mu", "type": "normfactor", "data": None} ], }, {"name": "background", "data": bkgrd, "modifiers": [],}, ], } ] } model = pyhf.Model(spec) hypo_tests = pyhf.infer.hypotest( 1.0, model.expected_data([0]), model, 0.5, [(0, 80)], return_expected_set=True, return_test_statistics=True, qtilde=True, ) print(hypo_tests) if __name__ == "__main__": main() Error (example) $ python example.py /home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/tensor/numpy_backend.py:253: RuntimeWarning: divide by zero encountered in log return n * np.log(lam) - lam - gammaln(n + 1.0) /home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/tensor/numpy_backend.py:253: RuntimeWarning: invalid value encountered in multiply return n * np.log(lam) - lam - gammaln(n + 1.0) 
ERROR:pyhf.optimize.opt_scipy: fun: nan jac: array([nan]) message: 'Iteration limit exceeded' nfev: 1300003 nit: 100001 njev: 100001 status: 9 success: False x: array([0.499995]) Traceback (most recent call last): File "example.py", line 65, in <module> main() File "example.py", line 59, in main qtilde=True, File "/home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/infer/__init__.py", line 82, in hypotest asimov_data = generate_asimov_data(asimov_mu, data, pdf, init_pars, par_bounds) File "/home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/infer/utils.py", line 8, in generate_asimov_data bestfit_nuisance_asimov = fixed_poi_fit(asimov_mu, data, pdf, init_pars, par_bounds) File "/home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/infer/mle.py", line 62, in fixed_poi_fit **kwargs, File "/home/jovyan/.venvs/example/lib/python3.7/site-packages/pyhf/optimize/opt_scipy.py", line 47, in minimize assert result.success AssertionError
Looking at the model, the background estimate shouldn't be zero, so add an epsilon of 1e-7 to it and then an 1% background uncertainty. Though the issue here is that reasonable intervals for signal strength are between μ ∈ [0,10]. If your model is such that you aren't sensitive to a signal strength in this range then you should test a new signal model which is the original signal scaled by some scale factor. Environment For visualization purposes let's extend the environment a bit (example) $ cat requirements.txt pyhf~=0.4.0 black matplotlib~=3.1 altair~=4.0 Code # answer.py import pyhf from pyhf import Model, infer import numpy as np import matplotlib.pyplot as plt import pyhf.contrib.viz.brazil def invert_interval(test_mus, hypo_tests, test_size=0.05): cls_obs = np.array([test[0] for test in hypo_tests]).flatten() cls_exp = [ np.array([test[1][i] for test in hypo_tests]).flatten() for i in range(5) ] crossing_test_stats = {"exp": [], "obs": None} for cls_exp_sigma in cls_exp: crossing_test_stats["exp"].append( np.interp( test_size, list(reversed(cls_exp_sigma)), list(reversed(test_mus)) ) ) crossing_test_stats["obs"] = np.interp( test_size, list(reversed(cls_obs)), list(reversed(test_mus)) ) return crossing_test_stats def main(): unscaled_signal=[0.00000000e+00,2.16147594e-04,4.26391320e-04,8.53157029e-04, 7.95947245e-04,1.85458682e-03,3.15515589e-03,4.22895664e-03, 4.65887617e-03,7.35380863e-03,8.71947686e-03,7.94697901e-03, 1.02721341e-02,9.24346489e-03,9.38926633e-03,9.68742497e-03, 8.11072856e-03,7.71003446e-03,6.80873211e-03,5.43234586e-03, 4.98376829e-03,4.72218222e-03,3.40645378e-03,3.44950579e-03, 2.61473009e-03,2.18345641e-03,2.00960464e-03,1.33786215e-03, 1.18440675e-03,8.36366201e-04,5.99855228e-04,4.27406780e-04, 2.71607026e-04,1.81370902e-04,1.03710513e-04,4.42737056e-05, 2.25835175e-05,1.04470885e-05,4.08162922e-06,3.20004812e-06, 3.37990384e-07,6.72843977e-07,0.00000000e+00,9.08675772e-08, 0.00000000e+00] bkgrd=[1.47142981e+03,9.07095061e+02,9.11188195e+02,7.06123452e+02, 6.08054685e+02,5.23577562e+02,4.41672633e+02,4.00423307e+02, 3.59576067e+02,3.26368076e+02,2.88077216e+02,2.48887339e+02, 2.20355981e+02,1.91623853e+02,1.57733823e+02,1.32733279e+02, 1.12789438e+02,9.53141118e+01,8.15735557e+01,6.89604141e+01, 5.64245978e+01,4.49094779e+01,3.95547919e+01,3.13005748e+01, 2.55212288e+01,1.93057913e+01,1.48268648e+01,1.13639821e+01, 8.64408136e+00,5.81608649e+00,3.98839138e+00,2.61636610e+00, 1.55906281e+00,1.08550560e+00,5.57450828e-01,2.25258250e-01, 2.05230728e-01,1.28735312e-01,6.13798028e-02,2.00805073e-02, 5.91436617e-02,0.00000000e+00,0.00000000e+00,0.00000000e+00, 0.00000000e+00] scale_factor = 500 signal = np.asarray(unscaled_signal) * scale_factor epsilon = 1e-7 background = np.asarray(bkgrd) + epsilon spec = { "channels": [ { "name": "singlechannel", "samples": [ { "name": "signal", "data": signal.tolist(), "modifiers": [ {"name": "mu", "type": "normfactor", "data": None} ], }, { "name": "background", "data": background.tolist(), "modifiers": [ { "name": "uncert", "type": "shapesys", "data": (0.01 * background).tolist(), }, ], }, ], } ] } model = pyhf.Model(spec) init_pars = model.config.suggested_init() par_bounds = model.config.suggested_bounds() data = model.expected_data(init_pars) cls_obs, cls_exp = pyhf.infer.hypotest( 1.0, data, model, init_pars, par_bounds, return_expected_set=True, return_test_statistics=True, qtilde=True, ) # Show that the scale factor chosen gives reasonable values print(f"Observed CLs for µ=1: {cls_obs[0]:.2f}") print("-----") for idx, 
n_sigma in enumerate(np.arange(-2, 3)): print( "Expected {}CLs for µ=1: {:.3f}".format( " " if n_sigma == 0 else "({} σ) ".format(n_sigma), cls_exp[idx][0], ) ) # Perform hypothesis test scan _start = 0.1 _stop = 4 _step = 0.1 poi_tests = np.arange(_start, _stop + _step, _step) print("\nPerforming hypothesis tests\n") hypo_tests = [ pyhf.infer.hypotest( mu_test, data, model, init_pars, par_bounds, return_expected_set=True, return_test_statistics=True, qtilde=True, ) for mu_test in poi_tests ] # This is all you need. Below is just to demonstrate. # Upper limits on signal strength results = invert_interval(poi_tests, hypo_tests) print(f"Observed Limit on µ: {results['obs']:.2f}") print("-----") for idx, n_sigma in enumerate(np.arange(-2, 3)): print( "Expected {}Limit on µ: {:.3f}".format( " " if n_sigma == 0 else "({} σ) ".format(n_sigma), results["exp"][idx], ) ) # Visualize the "Brazil band" fig, ax = plt.subplots() fig.set_size_inches(7, 5) ax.set_title("Hypothesis Tests") ax.set_ylabel("CLs") ax.set_xlabel(f"µ (for Signal x {scale_factor})") pyhf.contrib.viz.brazil.plot_results(ax, poi_tests, hypo_tests) fig.savefig("brazil_band.pdf") if __name__ == "__main__": main() Output The value that the signal needs to be scaled by can be determined by just trying a few scale factor values until the CLs values for a signal strength of mu=1 begin to look reasonable (something larger than 1e-3 or so). In this particular example, a scale factor of 500 seems okay. The upper limit on the unscaled signal strength is then just the observed limit divided by the scale factor, which in this case there is obviously no sensitivity. (example) $ python answer.py Observed CLs for µ=1: 0.54 ----- Expected (-2 σ) CLs for µ=1: 0.014 Expected (-1 σ) CLs for µ=1: 0.049 Expected CLs for µ=1: 0.157 Expected (1 σ) CLs for µ=1: 0.403 Expected (2 σ) CLs for µ=1: 0.737 Performing hypothesis tests Observed Limit on µ: 2.22 ----- Expected (-2 σ) Limit on µ: 0.746 Expected (-1 σ) Limit on µ: 0.998 Expected Limit on µ: 1.392 Expected (1 σ) Limit on µ: 1.953 Expected (2 σ) Limit on µ: 2.638
Spark long delay before submitting jobs to the executors
I'm using the Spark-Cassandra connector through spark-sql to query my Cassandra cluster. Each Cassandra node has a Spark worker (co-located).

Problem: There is a long delay before tasks are submitted to the executors (based on the timestamps in the web UI and in the driver logs). The query is a simple select which specifies all Cassandra partition keys and consists of two stages and two tasks. Previously, the query took 300 ms on another server with the driver and master co-located. But I had to move my application and the Spark master to another server (same setup as before, just a different physical server), and now the query takes 40 seconds. Although the task duration is about 7 seconds, the job takes 40 seconds, and I cannot figure out what the extra delay is for. I've also checked Spark with a job that has no connection to Cassandra, and it took 200 ms, so I think it's more related to spark-cassandra than to Spark itself.

Here are the Spark logs during execution of the job:

[INFO ] 2019-03-04 06:59:07.067 [qtp1151421920-470] SparkSqlParser 54 - Parsing command: select * from ...
[INFO ] 2019-03-04 06:59:07.276 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.279 [qtp1151421920-470] ClockFactory 52 - Using native clock to generate timestamps.
[INFO ] 2019-03-04 06:59:07.439 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.201:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.202:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.203:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.204:9042 added
[INFO ] 2019-03-04 06:59:07.446 [qtp1151421920-470] CassandraConnector 35 - Connected to Cassandra cluster: Digger Cluster
[INFO ] 2019-03-04 06:59:07.526 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.848 [qtp1151421920-470] CodeGenerator 54 - Code generated in 120.31952 ms
[INFO ] 2019-03-04 06:59:08.264 [qtp1151421920-470] CodeGenerator 54 - Code generated in 15.084165 ms
[INFO ] 2019-03-04 06:59:08.289 [qtp1151421920-470] CodeGenerator 54 - Code generated in 17.893182 ms
[INFO ] 2019-03-04 06:59:08.379 [qtp1151421920-470] SparkContext 54 - Starting job: collectAsList at MyClass.java:5
[INFO ] 2019-03-04 06:59:08.394 [dag-scheduler-event-loop] DAGScheduler 54 - Registering RDD 12 (toJSON at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.397 [dag-scheduler-event-loop] DAGScheduler 54 - Got job 0 (collectAsList at MyClass.java.java:5) with 1 output partitions
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Final stage: ResultStage 1 (collectAsList at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Parents of final stage: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.400 [dag-scheduler-event-loop] DAGScheduler 54 - Missing parents: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.405 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:15.703 [pool-44-thread-1] CassandraConnector 35 - Disconnected from Cassandra cluster: Digger Cluster
----------------- long delay here -----------------
[INFO ] 2019-03-04 06:59:43.547 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0 stored as values in memory (estimated size 20.6 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.579 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.5 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.581 [dispatcher-event-loop-1] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.94:38311 (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:43.584 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:43.597 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:43.598 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 0.0 with 1 tasks
[INFO ] 2019-03-04 06:59:43.619 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_0.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:43.652 [dispatcher-event-loop-35] TaskSetManager 54 - Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.210, executor 11, partition 0, NODE_LOCAL, 6357 bytes)
[INFO ] 2019-03-04 06:59:43.920 [dispatcher-event-loop-36] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.210:42612 (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:46.591 [task-result-getter-0] TaskSetManager 54 - Finished task 0.0 in stage 0.0 (TID 0) in 2963 ms on 192.168.1.210 (executor 11) (1/1)
[INFO ] 2019-03-04 06:59:46.594 [task-result-getter-0] TaskSchedulerImpl 54 - Removed TaskSet 0.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:46.601 [dag-scheduler-event-loop] DAGScheduler 54 - ShuffleMapStage 0 (toJSON at MyClass.java.java:5) finished in 2.981 s
[INFO ] 2019-03-04 06:59:46.602 [dag-scheduler-event-loop] DAGScheduler 54 - looking for newly runnable stages
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - running: Set()
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - waiting: Set(ResultStage 1)
[INFO ] 2019-03-04 06:59:46.604 [dag-scheduler-event-loop] DAGScheduler 54 - failed: Set()
[INFO ] 2019-03-04 06:59:46.608 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:46.615 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1 stored as values in memory (estimated size 20.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.618 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.619 [dispatcher-event-loop-21] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.94:38311 (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:46.620 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 1.0 with 1 tasks
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_1.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:46.627 [dispatcher-event-loop-25] TaskSetManager 54 - Starting task 0.0 in stage 1.0 (TID 1, 192.168.1.212, executor 9, partition 0, PROCESS_LOCAL, 4730 bytes)
[INFO ] 2019-03-04 06:59:46.851 [dispatcher-event-loop-9] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.212:43471 (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:47.257 [dispatcher-event-loop-38] MapOutputTrackerMasterEndpoint 54 - Asked to send map output locations for shuffle 0 to 192.168.1.212:46794
[INFO ] 2019-03-04 06:59:47.262 [map-output-dispatcher-0] MapOutputTrackerMaster 54 - Size of output statuses for shuffle 0 is 141 bytes
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSetManager 54 - Finished task 0.0 in stage 1.0 (TID 1) in 1140 ms on 192.168.1.212 (executor 9) (1/1)
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSchedulerImpl 54 - Removed TaskSet 1.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:47.765 [dag-scheduler-event-loop] DAGScheduler 54 - ResultStage 1 (collectAsList at MyClass.java.java:5) finished in 1.142 s
[INFO ] 2019-03-04 06:59:47.771 [qtp1151421920-470] DAGScheduler 54 - Job 0 finished: collectAsList at MyClass.java.java:5, took 39.391066 s
[INFO ] 2019-03-04 07:00:09.014 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 4
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 0
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 3
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 1
[INFO ] 2019-03-04 07:00:09.028 [dispatcher-event-loop-10] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.94:38311 in memory (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.045 [dispatcher-event-loop-0] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.212:43471 in memory (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.063 [Spark Context Cleaner] ContextCleaner 54 - Cleaned shuffle 0
[INFO ] 2019-03-04 07:00:09.065 [dispatcher-event-loop-16] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.94:38311 in memory (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.071 [dispatcher-event-loop-37] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.210:42612 in memory (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.074 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 2

Also attached are screenshots of the Spark web UI for the job and its tasks. (The logs and the screenshots are not from the same job run.)

P.S.: Does the spark-cassandra connector create a new session each time I run a query? I see a connect/disconnect to the Cassandra cluster every time. I run many queries in parallel; isn't that going to be much slower than using Cassandra directly?
Checking with jvisualvm, the executors had no activity during the time gap, but the driver (my application) had a thread called "dag-scheduler..." that was running only during that gap. The thread dump showed it was stuck in InetAddress.getHostName(). Then, in debug mode, I put a breakpoint there and found out that it was trying to do a reverse lookup (IP to hostname) for every node of my Cassandra cluster, so I simply added "IP HOSTNAME" entries for all of my Cassandra nodes to the end of /etc/hosts and the problem was solved!
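For anyone hitting the same issue, here is a sketch of the fix; the hostnames below are made up, so substitute the actual hostnames of your Cassandra nodes (the IPs are the ones from the logs above):

# /etc/hosts on the driver/master machine (hostnames are hypothetical)
192.168.1.201   cassandra-node-1
192.168.1.202   cassandra-node-2
192.168.1.203   cassandra-node-3
192.168.1.204   cassandra-node-4

You can also confirm that slow reverse DNS is the culprit before editing /etc/hosts. The snippet below is just an illustration: it times a reverse (PTR) lookup for each node IP, which is essentially what InetAddress.getHostName() was blocking on; if each lookup hangs for several seconds, the delay adds up across the cluster.

import socket
import time

# Cassandra node IPs taken from the logs above
for ip in ["192.168.1.201", "192.168.1.202", "192.168.1.203", "192.168.1.204"]:
    start = time.time()
    try:
        hostname = socket.gethostbyaddr(ip)[0]  # reverse (PTR) lookup
    except socket.herror:
        hostname = "<no PTR record>"
    print(f"{ip} -> {hostname} ({time.time() - start:.1f} s)")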