Spark Client Pod in Kubernetes Getting 401 Error After One Hour - apache-spark

Kubernetes Version: 1.21
Spark Version: 3.0.0
I am using a container in a Kubernetes pod (the client pod) to invoke spark-submit, which then starts a Driver pod. The client pod that did the spark-submit then watches the Driver pod via LoggingPodStatusWatcherImpl. After approximately 1 hour, the client pod starts getting a 401 error:
22/11/03 13:05:44 WARN WatchConnectionManager: Exec Failure: HTTP 401, Status: 401 - Unauthorized
java.net.ProtocolException: Expected HTTP 101 response but was '401 Unauthorized'
at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
22/11/03 13:05:46 INFO LoggingPodStatusWatcherImpl: Application status for spark-blahblahblah (phase: Running)
I think Spark on Kubernetes usually looks for the service account token in /var/run/secrets/kubernetes.io/serviceaccount/token, which is why I get the warning below when starting the client pod:
22/11/03 13:13:13 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
However, since I provide a different OAuth token file via the conf below in the spark-submit command, the client pod is able to connect to the Kubernetes API and start the Driver pod:
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/mytokendir/token
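For context, my full submission looks roughly like this (a sketch; the master URL, jar path, and deploy mode below are placeholders rather than my exact values):
spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name spark-blahblahblah \
  --conf spark.kubernetes.namespace=my-namespace \
  --conf spark.kubernetes.container.image=some-image \
  --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/mytokendir/token \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar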
The token is provided to the client pod via a projected volume (new in Kubernetes versions 1.20+); the token expiration duration can be specified in the YAML manifest as shown below.
See this doc for reference on how this is implemented:
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-bound-service-account-tokens
spec:
  serviceAccountName: my-serviceaccount
  volumes:
  - name: token-vol
    projected:
      sources:
      - serviceAccountToken:
          expirationSeconds: 7200
          path: token
  containers:
  - name: my-container
    image: some-image
    volumeMounts:
    - name: token-vol
      mountPath: /mytokendir
I then exec'd into the client pod to get the JWT token in /mytokendir and decoded it.
It showed the token as valid for 2 hours; however, coming back to the original question, my client pod still gets a 401 error after 1 hour.
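For reference, this is roughly how I checked the token's validity window (a sketch; the pod name and the use of base64/jq on my workstation are placeholders, not part of the setup itself):
# grab the projected token from the client pod (pod name is a placeholder)
kubectl exec my-client-pod -- cat /mytokendir/token > token.jwt
# the JWT payload is base64url-encoded without padding, so translate and re-pad before decoding
PAYLOAD=$(cut -d. -f2 token.jwt | tr '_-' '/+')
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
echo "$PAYLOAD" | base64 -d | jq '{iat, exp}'   # exp - iat should equal expirationSeconds (7200)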
Sometimes I would get this error:
22/11/03 14:10:57 INFO LoggingPodStatusWatcherImpl: Application my-application with submission ID my-namespace:my-driver finished
22/11/03 14:10:57 INFO ShutdownHookManager: Shutdown hook called
22/11/03 14:10:57 INFO ShutdownHookManager: Deleting directory /tmp/spark-blahblah
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Related

Apache Pulsar Zookeeper: Unable to access datadir, exiting abnormally

I am using these steps to run Apache Pulsar on Docker: https://github.com/streamnative/tgip/blob/master/episodes/001/demo.md
I was able to use these steps before to install and use Pulsar, but for some reason, now when I create the data directory it becomes write-protected, and the Pulsar ZooKeeper container exits with the following logs as soon as it is created:
ERROR org.apache.zookeeper.server.ZooKeeperServerMain - Unable to access datadir, exiting abnormally
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Unable to create data directory data/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:136) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:137) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:112) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:67) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:140) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
Unable to access datadir, exiting abnormally
23:36:15.223 [main] INFO org.apache.zookeeper.audit.ZKAuditProvider - ZooKeeper audit is disabled.
23:36:15.226 [main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 3
23:36:15.196 [PurgeTask] ERROR org.apache.zookeeper.server.DatadirCleanupManager - Error occurred while purging.
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Unable to create data directory data/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:136) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:80) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:141) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at java.util.TimerThread.mainLoop(Timer.java:556) [?:?]
at java.util.TimerThread.run(Timer.java:506) [?:?]
23:36:15.229 [PurgeTask] INFO org.apache.zookeeper.server.DatadirCleanupManager - Purge task completed
I have made sure that SELinux is disabled and have tried changing permissions with chmod 777 data/, along with every other step I could find to resolve this, but I am still unable to fix it. Please help me with a possible resolution.
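For reference, the permission changes I tried looked roughly like this (a sketch; the directory names follow the data/zookeeper/version-2 path from the error above):
# pre-create the directory ZooKeeper is failing to write to and open up its permissions
mkdir -p data/zookeeper
chmod -R 777 data/
# confirm the mode and ownership actually took effect (the container user must be able to write here)
ls -ldn data/ data/zookeeper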

AWS elastic beanstalk deploy always fails (uploading a zipfile)

I upload a new version of my app as a zipfile and click Deploy. The environment health then changes to Severe.
This is the error trace:
WARN
Environment health has transitioned from Info to Degraded. Command failed on all instances. Incorrect application version found on all instances. Expected version "Sample" (deployment 2). Application update failed 10 seconds ago and took 4 minutes.
ERROR
During an aborted deployment, some instances may have deployed the new application version. To ensure all instances are running the same version, re-deploy the appropriate application version.
ERROR
Failed to deploy application.
ERROR
Unsuccessful command execution on instance id(s) 'i------'. Aborting the operation.
ERROR
[Instance: i-002326d7ceeba0ea9] Command failed on instance. Return code: 1 Output: nginx: [emerg] no host in upstream ":80" in /etc/nginx/conf.d/elasticbeanstalk-nginx-docker-upstream.conf:2
nginx: configuration file /etc/nginx/nginx.conf test failed Failed to start nginx, abort deployment.
Hook /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.sh failed.
For more detail, check /var/log/eb-activity.log using console or EB CLI.
ERROR
Failed to start nginx, abort deployment
/var/log/eb-activity.log
Here are the errors in this log:
Installing dependencies from Pipfile.lock (5e00f3)…
Failed to load paths: /bin/sh: 1: /root/.local/share/virtualenvs/app-lp47FrbD/bin/python: not found
...
[2020-05-29T01:51:24.746Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1/AppDeployEnactHook/00run.sh] : Completed activity. Result:
jq: error (at <stdin>:1): Cannot iterate over null (null)
a2f568b1c255eb9e0fdc6ceebdd29b9ec64b9ab4481a3e1c5bcb11828b0ac526
[2020-05-29T01:51:24.747Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1/AppDeployEnactHook/01flip.sh] : Starting activity...
[2020-05-29T01:51:26.099Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1/AppDeployEnactHook/01flip.sh] : Activity execution failed, because: nginx: [emerg] no host in upstream ":80" in /etc/nginx/conf.d/elasticbeanstalk-nginx-docker-upstream.conf:2
nginx: configuration file /etc/nginx/nginx.conf test failed
Failed to start nginx, abort deployment (ElasticBeanstalk::ExternalInvocationError)
caused by: nginx: [emerg] no host in upstream ":80" in /etc/nginx/conf.d/elasticbeanstalk-nginx-docker-upstream.conf:2
nginx: configuration file /etc/nginx/nginx.conf test failed
Failed to start nginx, abort deployment (Executor::NonZeroExitStatus)
...
[2020-05-29T01:51:26.099Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1/AppDeployEnactHook/01flip.sh] : Activity failed.
[2020-05-29T01:51:26.099Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1/AppDeployEnactHook] : Activity failed.
[2020-05-29T01:51:26.099Z] INFO [11395] - [Application update v1.3.3-1#3/AppDeployStage1] : Activity failed.
[2020-05-29T01:51:26.100Z] INFO [11395] - [Application update v1.3.3-1#3] : Completed activity. Result:
Application update - Command CMD-AppDeploy failed
The deployment failures have been consistent for this environment across several attempts, even after reverting to an older version.
I eventually resolved this by isolating the code and error messages using a local Docker image built from the same zipfile. Running the code on my machine outside of Docker did NOT reveal any problems; the failure only showed up in the container, because the pip/pipenv environment there was missing a dependency.
Steps for local Docker testing:
To build the image:
docker system prune
(go to the folder containing the Dockerfile)
docker image build -t <app_name>:<version_number> .
To run it locally:
(docker rm <app_name> first, if you've already got a stopped container with the same name from prior testing)
docker container run --publish 80:80 --name <app_name> <app_name>:<version_number>
NOTE:
This won't let you test AWS functions that require credentials, such as those in ~/.aws, or other environment variables, because they're not inside the image (but you could add them via your Dockerfile, or pass them at run time; see the sketch below).
Once the Docker container is running, you'll see (I saw) error messages that were not there when testing locally, because they were caused by a missing package dependency and a pipenv error.
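As a hedged sketch of the NOTE above: credentials and environment variables can also be supplied at run time rather than baked into the image (the variable names are placeholders, and mounting to /root/.aws assumes the container runs as root):
# pass environment variables explicitly at run time
docker container run --publish 80:80 --name <app_name> \
  -e AWS_DEFAULT_REGION=us-east-1 \
  -e MY_APP_SETTING=some-value \
  <app_name>:<version_number>
# or mount local AWS credentials read-only, just for local testing
docker container run --publish 80:80 --name <app_name> \
  -v "$HOME/.aws:/root/.aws:ro" \
  <app_name>:<version_number>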

How to fix this fatal error while running Spark jobs on an HDInsight cluster? Session 681 unexpectedly reached final status 'dead'. See logs:

I am running PySpark code on an HDInsight cluster and getting this error:
The code failed because of a fatal error: Session 681 unexpectedly
reached final status 'dead'. See logs:
I don't have experience with YARN or Hadoop. I tried a few links provided on Stack Overflow, but none of them helped. One strange thing is that I was able to run the same code yesterday without that error.
I just ran this import:
from pyspark.sql import SparkSession
This is the error I am getting:
19/06/21 20:35:35 INFO Client:
client token: N/A
diagnostics: [Fri Jun 21 20:35:35 +0000 2019] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:819200, vCores:240> ; Queue's Absolute capacity = 50.0 % ; Queue's Absolute used capacity = 99.1875 % ; Queue's Absolute max capacity = 100.0 % ;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1561149335158
final status: UNDEFINED
tracking URL: https://mmsorderpredhdi.azurehdinsight.net/yarnui/hn/proxy/application_1560840076505_0062/
user: livy
19/06/21 20:35:35 INFO ShutdownHookManager: Shutdown hook called
19/06/21 20:35:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-bb63c5f0-7579-4456-b32a-0e643ca97ecc
YARN Diagnostics:
Application killed by user..
Question: Does this have something to do with the queue's absolute used capacity?
Could you please check the logs to find the exact issue?
Where do I find the log file?
On an Azure HDInsight cluster, you may find the Livy logs by connecting to one of the head nodes with SSH and looking at the files under this path:
hdfs dfs -ls /app-logs/livy/logs-ifile
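You may also pull the aggregated logs for the failing application with the YARN CLI (a sketch; the application ID below is taken from the tracking URL in your output):
# fetch the aggregated YARN container logs for the dead session's application
yarn logs -applicationId application_1560840076505_0062 > app_logs.txt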
For more details, refer to "Access Apache Hadoop YARN application logs on Linux-based HDInsight".
You may also refer to "How to start a SparkSession in PySpark".
Hope this helps.

I am unable to claim cStor volume using the custom cStor storage pool

I was creating a manual cStor pool for the disks attached to my nodes, but I am unable to claim a volume using the custom storage pool.
The cStor storage pool pods are running in a healthy state:
gem-cstor-disk-9wzj-6c9c8f75c5-mq6gp 2/2 Running 0 154m
gem-cstor-disk-iru2-68c85445cf-pqg6b 2/2 Running 0 132m
gem-cstor-disk-m3bx-6fcddc7dcd-f9dg2 2/2 Running 0 154m
From the maya-apiserver logs:
2019/01/16 12:47:17.907038 [ERR] http: Request GET /latest/volumes/pvc-b8dacfcc-198c-11e9-9141-503eaa028845, error: error converting YAML to JSON: yaml: line 4: mapping values are not allowed in this context
2019/01/16 12:47:17.907070 [DEBUG] http: Request /latest/volumes/pvc-b8dacfcc-198c-11e9-9141-503eaa028845 (596.053376ms)
2019/01/16 12:47:43.536149 [DEBUG] http: Request /latest/volumes/pvc-b9a40023-198c-11e9-9141-503eaa028845 (GET)
I0116 12:47:43.536184 8 volume_endpoint_v1alpha1.go:52] cas template based volume request was received: method 'GET'
I0116 12:47:43.536206 8 volume_endpoint_v1alpha1.go:137] cas template based volume read request was received
E0116 12:47:44.705380 8 volume_endpoint_v1alpha1.go:175] failed to read cas template based volume: error 'error converting YAML to JSON: yaml: line 4: mapping values are not allowed in this context'
2019/01/16 12:47:44.705424 [ERR] http: Request GET /latest/volumes/pvc-b9a40023-198c-11e9-9141-503eaa028845, error: error converting YAML to JSON: yaml: line 4: mapping values are not allowed in this context
2019/01/16 12:47:44.705458 [DEBUG] http: Request /latest/volumes/pvc-b9a40023-198c-11e9-9141-503eaa028845 (1.169318391s)
Going through the PVC and SC YAMLs, I found that the StorageClass YAML was not correct: it contained an extra space. I just removed it and re-applied, and everything went fine.
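For illustration, a correctly indented cStor StorageClass typically looks something like the sketch below (the StorageClass name, pool claim name, and replica count are assumptions based on the pool pods above); the cas.openebs.io/config block is exactly where a stray space tends to break the YAML-to-JSON conversion:
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gem-cstor-sc
  annotations:
    openebs.io/cas-type: cstor
    cas.openebs.io/config: |
      - name: StoragePoolClaim
        value: "gem-cstor-disk"
      - name: ReplicaCount
        value: "3"
provisioner: openebs.io/provisioner-iscsi
EOF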

SPARK YARN: cannot send job from client (org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032)

I'm trying to send a Spark job to YARN (without HDFS) in HA mode.
For submitting, I'm using org.apache.spark.deploy.SparkSubmit.
When I send the request from the machine with the active Resource Manager, it works well. But if I try to send it from the machine with the standby Resource Manager, the job fails with this error:
DEBUG org.apache.hadoop.ipc.Client - Connecting to spark2-node-dev/10.10.10.167:8032
DEBUG org.apache.hadoop.ipc.Client - Connecting to /0.0.0.0:8032
org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
However, when I send the request via the command line (spark-submit), it works well from both the active and the standby machine.
What could cause the problem?
P.S. I use the same parameters for both ways of submitting the job: org.apache.spark.deploy.SparkSubmit and the spark-submit command line. And the yarn.resourcemanager.hostname.rm_id properties are defined for all RM hosts.
The problem was the absence of yarn-site.xml from the classpath of the jar doing the submission. The submitter jar does not take the YARN_CONF_DIR or HADOOP_CONF_DIR environment variables into account, so it cannot see yarn-site.xml.
One solution I found was to put yarn-site.xml onto the classpath of that jar.
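A rough sketch of that workaround when launching SparkSubmit directly from the JVM (all paths, class, and jar names are placeholders; the key point is that the directory containing yarn-site.xml is on the classpath):
# put the Hadoop/YARN config directory (containing yarn-site.xml) on the classpath explicitly
java -cp "/etc/hadoop/conf:$SPARK_HOME/jars/*:/path/to/my-submitter.jar" \
  org.apache.spark.deploy.SparkSubmit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar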
