Dask - How to cancel and resubmit stalled tasks? - python-3.x

Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no issues; it then hangs indefinitely on a loop that otherwise takes a few seconds to complete.
In this case, I see that there are just a handful of stalled processes, all on one particular node (192.168.0.228).
Each worker on this node is stalled on a couple of read_parquet tasks.
This was called using the following code and is using fastparquet:
ddf = dd.read_parquet(file_path, columns=['col1', 'col2'], index=False, gather_statistics=False)
My cluster is running Ubuntu 19.04 with the latest versions (as of 11/12) of Dask, Distributed, and the required packages (e.g., tornado, fsspec, fastparquet, etc.).
The data that the .228 node is trying to access is located on another node in my cluster. The .228 node accesses the data through CIFS file sharing. I run the Dask scheduler on the same node on which I'm running the script (different from both the .228 node and the data storage node). The script connects the workers to the scheduler via SSH using Paramiko:
import paramiko

ssh_client = paramiko.SSHClient()
# (the client is connected to the worker node before this call; connect details not shown)
stdin, stdout, stderr = ssh_client.exec_command(
    'sudo dask-worker' +
    ' --name ' + comp_name_decode +
    ' --nprocs ' + str(nproc_int) +
    ' --nthreads 10 ' +
    self.dask_scheduler_ip,
    get_pty=True)
The connectivity of the .228 node to the scheduler and to the data-storing node both look healthy. It is possible that the .228 node experienced some sort of brief connectivity issue while trying to process the read_parquet task, but if that occurred, the connectivity of the .228 node to the scheduler and to the CIFS shares was not impacted beyond that brief moment. In any case, the logs do not show any issues. This is the whole log from the .228 node:
distributed.worker - INFO - Start worker at: tcp://192.168.0.228:42445
distributed.worker - INFO - Listening to: tcp://192.168.0.228:42445
distributed.worker - INFO - dashboard at: 192.168.0.228:37751
distributed.worker - INFO - Waiting to connect to: tcp://192.168.0.167:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 14.53 GB
distributed.worker - INFO - Local Directory: /home/dan/worker-50_838ig
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.167:8786
distributed.worker - INFO - -------------------------------------------------
Putting aside whether this is a bug in Dask or in my code/network, is it possible to set a general timeout for all tasks handled by the scheduler? Alternatively, is it possible to:
identify stalled tasks,
copy a stalled task and move it to another worker, and
cancel the stalled task?

is it possible to set a general timeout for all tasks handled by the scheduler?
As of 2019-11-13 unfortunately the answer is no.
If a task has properly failed then you can retry that task with client.retry(...) but there is no automatic way to have a task fail itself after a certain time. This is something that you would have to write into your Python functions yourself. Unfortunately it is hard to interrupt a Python function in another thread, which is partially why this is not implemented.
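One way to write such a timeout into your own function (a minimal sketch, with made-up names like read_with_timeout and TIMEOUT_S, not anything from Dask) is to push the blocking read into a child process that can be terminated. Whether a worker is allowed to spawn child processes depends on how it was started (workers launched under a nanny may run as daemonic processes and refuse to fork), so treat this only as a starting point:
# Sketch only: a read that gives up after TIMEOUT_S seconds by running the
# blocking call in a child process that can be killed if it hangs.
import multiprocessing as mp
from queue import Empty

TIMEOUT_S = 300  # illustrative value

def _read(q, path, columns):
    import pandas as pd
    q.put(pd.read_parquet(path, columns=columns))

def read_with_timeout(path, columns):
    q = mp.Queue()
    proc = mp.Process(target=_read, args=(q, path, columns))
    proc.start()
    try:
        result = q.get(timeout=TIMEOUT_S)  # raises Empty if the read stalls
    except Empty:
        proc.terminate()                   # kill the stalled reader
        raise TimeoutError(f'read of {path} exceeded {TIMEOUT_S}s')
    proc.join()
    return result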
If the worker goes down then the work will be retried elsewhere. However, from what you say it sounds like everything is healthy; it's just that the tasks themselves are likely to take forever. It's hard to identify this as a failure case, unfortunately.
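For manually spotting and clearing stuck work, the distributed Client does expose a few calls that can help. Below is a minimal sketch, under the assumption that the collection has been persisted so that futures exist; the addresses are the ones from the question, and the way "stalled" futures are picked here is deliberately simplistic:
# Hedged sketch: inspecting, cancelling and retrying work via the dask.distributed Client.
import dask.dataframe as dd
from dask.distributed import Client, futures_of

client = Client('tcp://192.168.0.167:8786')
ddf = dd.read_parquet(file_path, columns=['col1', 'col2'],
                      index=False, gather_statistics=False).persist()

print(client.call_stack())                                # what each worker is executing right now
print(client.processing(['tcp://192.168.0.228:42445']))   # task keys on the suspect worker

futures = futures_of(ddf)                                  # one future per output partition
client.cancel([f for f in futures if not f.done()])        # release tasks that never finish
client.retry([f for f in futures if f.status == 'error'])  # rerun tasks that actually failed
# Note: cancelling releases the keys but cannot interrupt a task already running
# in a worker thread; resubmit by persisting the collection again.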

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (pyspark). When I run it with a small dataset it works fine, but once the dataset is large, I get an error. I'm wondering (1) how to find logs from the scheduler process that appears to be crashing, and (2) more generally, what might be going on and how to debug the problem. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job: I'm using PySpark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
    log:
      dfsDir: <my/logs/dir>
      persistToDfs:
        enabled: true
  loglevel: "WARN"
  logConf: true
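For reference, here is a hedged sketch of how a config like the above maps onto stock Spark property names when the session is built from PySpark; the key names are standard, but the YAML-to-key mapping is my reading of the question's config (the custom loglevel entry has no direct Spark property, and spark.driver.cores / spark.driver.supervise only apply when submitting to a cluster manager):
# Sketch: the question's config expressed as standard Spark properties.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[72]")
    .config("spark.files.maxPartitionBytes", 67108864)
    .config("spark.sql.files.maxPartitionBytes", 67108864)
    .config("spark.driver.memory", "50g")
    .config("spark.driver.maxResultSize", "2g")
    .config("spark.driver.log.dfsDir", "<my/logs/dir>")           # driver log persistence
    .config("spark.driver.log.persistToDfs.enabled", "true")
    .config("spark.logConf", "true")
    .getOrCreate()
)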
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to attempt to see scheduler logs, but I still only see driver (worker?) logs as far as I can tell, with the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Lots of "Uncaught signal: 6" errors in Cloud Run

I have a Python (3.x) web service deployed in GCP. Every time Cloud Run is shutting down instances, most noticeably after a big load spike, I get many logs like "Uncaught signal: 6, pid=6, tid=6, fault_addr=0." together with "[CRITICAL] WORKER TIMEOUT (pid:6)". They are always signal 6.
The service uses FastAPI and Gunicorn, running in a Docker container with this start command:
CMD gunicorn -w 2 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8080 app.__main__:app
The service is deployed using Terraform with 1 GiB of RAM, 2 CPUs, and the timeout set to 2 minutes:
resource "google_cloud_run_service" <ressource-name> {
name = <name>
location = <location>
template {
spec {
service_account_name = <sa-email>
timeout_seconds = 120
containers {
image = var.image
env {
name = "GCP_PROJECT"
value = var.project
}
env {
name = "BRANCH_NAME"
value = var.branch
}
resources {
limits = {
cpu = "2000m"
memory = "1Gi"
}
}
}
}
}
autogenerate_revision_name = true
}
I have already tried tweaking the resources and timeout in Cloud Run, and using the --timeout and --preload flags for gunicorn, as that is what people always seem to recommend when googling the problem, but all without success. I also don't know exactly why the workers are timing out.
Extending on the top answer, which is correct: you are using Gunicorn, which is a process manager that manages the Uvicorn workers that run the actual app.
When Cloud Run wants to shut down the instance (probably due to a lack of requests) it will send a signal 6 to process 1. However, Gunicorn occupies this process as the manager and will not pass it on to the Uvicorn workers for handling - thus you receive the uncaught signal 6.
The simplest solution is to run Uvicorn directly instead of through Gunicorn (possibly with a smaller instance) and let Cloud Run handle the scaling part.
CMD ["uvicorn", "app.__main__:app", "--host", "0.0.0.0", "--port", "8080"]
Unless you have enabled "CPU is always allocated", background threads and processes might stop receiving CPU time after all HTTP requests return. This means background threads and processes can fail, connections can time out, etc. I cannot think of any benefits to running background workers with Cloud Run except when setting the --no-cpu-throttling flag. Cloud Run instances that are not processing requests can be terminated.
Signal 6 means abort, which terminates processes. This probably means your container is being terminated due to a lack of requests to process.
Run more workloads on Cloud Run with new CPU allocation controls
What if my application is doing background work outside of request processing?
This error happens when a background process is aborted. There are some advantages to running background threads on Cloud Run, just as for other applications. Luckily, you can still use them on Cloud Run without processes getting aborted. To do so, when deploying, choose the option "CPU always allocated" instead of "CPU only allocated during request processing".
For more details, check https://cloud.google.com/run/docs/configuring/cpu-allocation

Cassandra replicas on single docker-swarm node

I have a single docker-swarm manager node (18.09.6) running and I'm playing with spinning up a Cassandra cluster. I'm using the following definition, and it works in that the seed/master and slave spin up and communicate/replicate their data/schema changes fine:
services:
  cassandra-masters:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-masters
    deploy:
      mode: replicated
      replicas: 1
  cassandra-slaves:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-masters
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-slaves
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - cassandra-masters
When I change the replica count from 1 to 2, either on deployment of the stack or via a post-deploy scale, the second task for the cassandra slave is created but constantly fails with an error indicating it cannot gossip with the seed node:
INFO 10:51:03 Loading persisted ring state
INFO 10:51:03 Starting Messaging Service on /10.10.0.200:7000 (eth0)
INFO 10:51:03 Handshaking version with cassandra-masters/10.10.0.142
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657)
ERROR 10:51:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657) [apache-cassandra-2.2.14.jar:2.2.14]
I'd like to understand what is causing the issue and whether there is a way to work around it. I'm just investigating what roadblocks there are to getting to production, where we'd obviously be spinning the Cassandra tasks/replicas up on different nodes rather than on a single node.
EDIT: I've spun the same stack up on a two node swarm and I'm seeing the same behaviour, i.e. when I scale to a second "slave" task, it fails with the same error, so it's not an issue particular to trying to run two tasks on the same node.
I never got to the bottom of why the gossiping fails, but ultimately we agreed on a production deployment strategy that doesn't require auto-scaling; instead we do capacity planning based on the system's behaviour and expected traffic. This answer also points out the additional strain that auto-scaling can add to an already stretched system: AWS and auto scaling cassandra

Cassandra decommission loss of data

We are running a Cassandra cluster. Initially, the cluster had only one node, and we decided that since that node was running out of space, we would add another node to the cluster.
Info on the cluster:
Keyspace with replication factor 1 using the SimpleStrategy class on a single datacenter
Node 1 - 256 tokens, almost no space available (1TB occupied by Cassandra data)
Node 2 - connected with 256 tokens, had 13TB available
First we added node 2 to the cluster and then realized that to stream the data to node 2, we'd have to decommission node 1.
So we decided to decommission, empty and reconfigure node 1 (we wanted node 1 to hold only 32 tokens) and re-add node 1 to the cluster datacenter.
When we launched the decommission process, it created a stream of 29 files totalling almost 600GB. That stream copied successfully (we checked the logs and used nodetool netstats), and we were expecting a second stream to follow since we had 1TB on node 1. But nothing else happened: the node reported itself as decommissioned and node 2 reported the data stream as complete.
The log from node 2 related to the copy stream:
INFO [STREAM-INIT-/10.131.155.200:48267] 2018-10-08 16:05:55,636 StreamResultFuture.java:116 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a ID#0] Creating new streaming plan for Unbootstrap
INFO [STREAM-INIT-/10.131.155.200:48267] 2018-10-08 16:05:55,648 StreamResultFuture.java:123 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a, ID#0] Received streaming plan for Unbootstrap
INFO [STREAM-INIT-/10.131.155.200:57298] 2018-10-08 16:05:55,648 StreamResultFuture.java:123 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a, ID#0] Received streaming plan for Unbootstrap
INFO [STREAM-IN-/10.131.155.200:57298] 2018-10-08 16:05:55,663 StreamResultFuture.java:173 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a ID#0] Prepare completed. Receiving 29 files(584.444GiB), sending 0 files(0.000KiB)
INFO [StreamReceiveTask:2] 2018-10-09 16:55:33,646 StreamResultFuture.java:187 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a] Session with /10.131.155.200 is complete
INFO [StreamReceiveTask:2] 2018-10-09 16:55:33,709 StreamResultFuture.java:219 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a] All sessions completed
After clearing the Cassandra data folder (we should've backed it up, we know), we started Cassandra again on node 1 and it successfully joined the cluster.
The cluster is functional with:
Node 1 - 32 tokens
Node 2 - 256 tokens
But we seem to have lost a lot of data. We were doing this as instructed in the Cassandra documentation.
We tried doing nodetool repair on both nodes, but to no avail (both reported no data to be recovered).
What did we miss here? Is there a way to recover this lost data?
Thank you all!

Spark Java Application no cores and waiting

I'm new to Spark and I'm trying to develop my first application. I'm only trying to count the lines in a file, but I get this error:
2015-11-28 10:21:34 WARN TaskSchedulerImpl:71 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have enough cores and enough memory. I read that it can be a firewall problem, but I'm getting this error both on my server and on my MacBook, and the MacBook certainly has no firewall. If I open the UI, it says that the application is WAITING and apparently it is getting no cores at all:
Application ID Name Cores Memory per Node State
app-20151128102116-0002 (kill) New app 0 1024.0 MB WAITING
My code is very simple:
SparkConf sparkConf = new SparkConf().setAppName("New app");
sparkConf.setMaster("spark://MacBook-Air.local:7077");
JavaSparkContext sc = new JavaSparkContext(sparkConf);  // create the context from the configuration
JavaRDD<String> textFile = sc.textFile("/Users/mattiazeni/Desktop/test.csv.bz2");
if (logger.isInfoEnabled()) {
    logger.info(textFile.count());
}
If I run the same program from the shell in Scala, it works great.
Any suggestions?
Check that the workers are running - there should be at least one worker listed on the Spark master UI at http://<master-host>:8080.
If none are running, start them with sbin/start-slaves.sh (or sbin/start-all.sh) from your Spark installation directory.
