My Spark job accesses AWS DynamoDB. Under high load, my Spark stdout shows this error:
ERROR: Failed to refresh endpoint : RequestCanceled: request context canceled
caused by: context deadline exceeded
Any idea about this error?
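"context deadline exceeded" means a per-request deadline expired before DynamoDB responded, which typically shows up when the client or table is saturated. Purely as an illustration (it assumes the job reaches DynamoDB through boto3; a JVM connector exposes different but analogous knobs), widening the timeouts and retry budget looks like this:

    import boto3
    from botocore.config import Config

    # Sketch: give each DynamoDB request more headroom under load.
    # All values are illustrative, not tuned recommendations.
    cfg = Config(
        connect_timeout=5,   # seconds to establish the connection
        read_timeout=30,     # seconds to wait for a response
        retries={"max_attempts": 10, "mode": "adaptive"},  # client-side backoff
    )
    dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=cfg)  # placeholder region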
I have some scheduled Spark jobs running on k8s.
Sometimes the driver pod gets stuck in "Running" status because of failures in the sparkSession.getOrCreate call, each time for a different reason.
Failure examples:
Exception in thread "main"
io.fabric8.kubernetes.client.KubernetesClientException: Failed to
start websocket
Exception in thread "main"
io.fabric8.kubernetes.client.KubernetesClientException: Failure
executing: POST at:
https://kubernetes.default.svc/api/v1/namespaces/xxxxx/configmaps.
Message: Internal error occurred: failed calling webhook
"objects.hnc.x-k8s.io": Post
"https://webhook-service.system.svc/validate-objects?timeout=2s":
context deadline exceeded.
For now I don't care why those exceptions occur (unless you know, which would be a great bonus here); I care about how to make the driver exit and shut down.
I am afraid that during context creation some resources get created but are never freed.
Thanks.
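A minimal sketch of the fail-fast side (assuming a PySpark driver; the app name is a placeholder): catch the failure from getOrCreate and exit non-zero so k8s stops the pod instead of leaving it in "Running":

    import sys
    from pyspark.sql import SparkSession

    try:
        spark = SparkSession.builder.appName("scheduled-job").getOrCreate()
    except Exception as e:
        # Creation failed partway; log and exit non-zero so the driver
        # process dies and the pod can be reported as Failed.
        print(f"SparkSession creation failed: {e}", file=sys.stderr)
        sys.exit(1)

Note this only forces the driver to exit; any k8s objects already created during the failed startup (configmaps, executor pods) may still need cleanup via owner references or a TTL.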
I wrote an application based on Spark Streaming (DStream) to pull messages coming from PubSub, and I am using a cluster of 4 nodes to execute the Spark job.
After 10 minutes of the job running without any specific error, I permanently get the following error:
ERROR org.apache.spark.streaming.CheckpointWriter:
Could not submit checkpoint task to the thread pool executor java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@68395dc9 rejected
from java.util.concurrent.ThreadPoolExecutor@1a1acc25
[Running, pool size = 1, active threads = 1, queued tasks = 1000, completed tasks = 412]
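The trace shows the CheckpointWriter's single-threaded pool with its 1000-slot queue full, i.e. checkpoints are being produced faster than they can be written. One mitigation to try (sketched below for a PySpark driver; the directory, stand-in source, and intervals are placeholders) is to checkpoint the stream less often than every batch:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="pubsub-stream")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second batches
    ssc.checkpoint("hdfs:///tmp/checkpoints")      # placeholder directory

    # Stand-in source; the real job reads from PubSub instead.
    lines = ssc.socketTextStream("localhost", 9999)

    # Checkpoint every 50 s (5x the batch interval) rather than every batch,
    # giving the CheckpointWriter queue time to drain.
    lines.checkpoint(50)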
I am playing around with the Azure Spark-CosmosDB connector, which lets you access CosmosDB nodes directly from a Spark cluster for analytics, using Jupyter on HDInsight.
I have been following the steps described here, including uploading the required jars to Azure storage and executing the %%configure magic to prepare the environment.
But the job always seems to terminate due to an I/O exception when trying to open the jar (see the YARN log below):
17/10/09 20:10:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
17/10/09 20:10:35 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/10/09 20:10:35 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
I'm not sure whether this is related to the jar not being copied to the worker nodes.
Any ideas? Thanks, Nick
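One way to rule out the jar-distribution question (assuming PySpark and a cluster-visible storage path; the wasbs:// path below is a placeholder): reference the jar through the spark.jars setting, which makes Spark ship it to the driver and every executor container rather than expecting it at a node-local path:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cosmosdb-smoke-test")
        # spark.jars distributes the listed jars to all containers.
        .config("spark.jars",
                "wasbs:///example/jars/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar")
        .getOrCreate()
    )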
I'm using an S3 bucket in region eu-central-1 as the checkpoint directory for my Spark Streaming job.
It writes data to that directory, but every 10th batch fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4040.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4040.0 (TID 0, 127.0.0.1, executor 0): com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: null, AWS Error Message: Bad Request
When this happens, the batch data is lost. How can I fix this?
It ended up being an authentication problem with the bucket in eu-central-1, because that S3 region only supports V4 request signing.
The V4 setting was configured on the driver but not on the workers, which is why some requests succeeded and some failed.
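A sketch of the corresponding configuration (assuming the s3a connector and an AWS SDK old enough to need the flag; property names can differ by Hadoop version): enable the V4 signer on both the driver and the executor JVMs, and point the filesystem at the regional endpoint:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("streaming-with-s3-checkpoint")
        # The signer must be enabled on the driver AND every executor;
        # setting it only on the driver reproduces the partial failures above.
        .config("spark.driver.extraJavaOptions",
                "-Dcom.amazonaws.services.s3.enableV4=true")
        .config("spark.executor.extraJavaOptions",
                "-Dcom.amazonaws.services.s3.enableV4=true")
        # V4 signing requires the region-specific endpoint.
        .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
        .getOrCreate()
    )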
When attempting to call H2OContext.getOrCreate with a valid SparkContext, we randomly keep seeing deployment failures:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The H2OContext.getOrCreate call causes the error:
from pyspark.sql import SparkSession
from pysparkling import H2OContext

Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?
This is a known behaviour of the Sparkling Water internal backend at the moment. To avoid it, the external Sparkling Water backend can be used. More information about this can be found here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
I'm currently working on a JIRA which should eliminate the behaviour above as well. It's a work in progress; https://0xdata.atlassian.net/browse/SW-369 can be tracked for the status of the task.
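For reference, a rough sketch of switching PySparkling to the external backend; the method names follow the Sparkling Water documentation of this era and should be verified against backends.md for your version (the cloud name is a placeholder, and the H2O cluster is assumed to be started manually beforehand):

    from pyspark.sql import SparkSession
    from pysparkling import H2OConf, H2OContext

    spark = SparkSession.builder.getOrCreate()

    # Run H2O outside the Spark executors, so executor churn
    # can no longer kill the H2O cloud.
    conf = (
        H2OConf(spark)
        .set_external_cluster_mode()
        .use_manual_cluster_start()   # H2O nodes started separately
        .set_cloud_name("spark-h2o")  # placeholder cloud name
    )
    hc = H2OContext.getOrCreate(spark, conf)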