Kubernetes Version: 1.21
Spark Version: 3.0.0
I am using a container in a Kubernetes pod (the client pod) to invoke Spark Submit, which then starts a Driver pod. The client pod that did the Spark Submit then watches the Driver pod via LoggingPodStatusWatcherImpl. After approximately 1 hour, the client pod gets a 401 error:
22/11/03 13:05:44 WARN WatchConnectionManager: Exec Failure: HTTP 401, Status: 401 - Unauthorized
java.net.ProtocolException: Expected HTTP 101 response but was '401 Unauthorized'
at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
22/11/03 13:05:46 INFO LoggingPodStatusWatcherImpl: Application status for spark-blahblahblah (phase: Running)
I think Spark on Kubernetes normally looks for the token in /var/run/secrets/kubernetes.io/serviceaccount/token, so I get the warning below when starting the client pod:
22/11/03 13:13:13 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
However, since I provide another OAuth token file via the conf below in the Spark Submit command, the client pod is able to connect to the Kubernetes API and start the Driver pod.
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/mytokendir/token
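For context, the full submission looks roughly like this (a minimal sketch; the main class, container image, and jar path are placeholders rather than my real values, and only the oauthTokenFile conf is the part in question):

spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name my-application \
  --class com.example.MyApp \
  --conf spark.kubernetes.namespace=my-namespace \
  --conf spark.kubernetes.container.image=my-spark-image:3.0.0 \
  --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/mytokendir/token \
  local:///opt/spark/jars/my-app.jar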
The token is provided to the client pod via a projected volume (new in Kubernetes versions 1.20+); the token expiration duration can be specified in the YAML manifest as shown below.
See this doc for reference on how this is implemented:
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-bound-service-account-tokens
spec:
  serviceAccountName: my-serviceaccount
  volumes:
  - name: token-vol
    projected:
      sources:
      - serviceAccountToken:
          expirationSeconds: 7200
          path: token
  containers:
  - name: my-container
    image: some-image
    volumeMounts:
    - name: token-vol
      mountPath: /mytokendir
I then exec'd into the client pod, retrieved the JWT token from /mytokendir, and decoded it.
It showed a validity of 2 hours; however, coming back to the original question, my client pod still gets a 401 error after 1 hour.
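For reference, this is roughly how I checked the expiry (a sketch; <client-pod> is a placeholder for the client pod's name, and python3 runs on the machine where kubectl is invoked):

# Decode the payload of the projected service account token and look at the
# "iat" and "exp" claims (their difference should match expirationSeconds).
kubectl exec <client-pod> -- cat /mytokendir/token | python3 -c '
import base64, json, sys
payload = sys.stdin.read().split(".")[1]      # JWT payload is the 2nd segment
payload += "=" * (-len(payload) % 4)          # restore stripped base64 padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(payload)), indent=2))
'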
Sometimes I would get this error:
22/11/03 14:10:57 INFO LoggingPodStatusWatcherImpl: Application my-application with submission ID my-namespace:my-driver finished
22/11/03 14:10:57 INFO ShutdownHookManager: Shutdown hook called
22/11/03 14:10:57 INFO ShutdownHookManager: Deleting directory /tmp/spark-blahblah
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I am following these steps to run Apache Pulsar on Docker: https://github.com/streamnative/tgip/blob/master/episodes/001/demo.md
I was able to use these steps before to install and use Pulsar, but now, for some reason, when I create the data directory it becomes write-protected, and the Pulsar ZooKeeper container exits with the following logs as soon as it is created:
ERROR org.apache.zookeeper.server.ZooKeeperServerMain - Unable to access datadir, exiting abnormally
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Unable to create data directory data/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:136) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:137) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:112) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:67) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:140) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
Unable to access datadir, exiting abnormally
23:36:15.223 [main] INFO org.apache.zookeeper.audit.ZKAuditProvider - ZooKeeper audit is disabled.
23:36:15.226 [main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 3
23:36:15.196 [PurgeTask] ERROR org.apache.zookeeper.server.DatadirCleanupManager - Error occurred while purging.
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Unable to create data directory data/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:136) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:80) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:141) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at java.util.TimerThread.mainLoop(Timer.java:556) [?:?]
at java.util.TimerThread.run(Timer.java:506) [?:?]
23:36:15.229 [PurgeTask] INFO org.apache.zookeeper.server.DatadirCleanupManager - Purge task completed
I have made sure that SELinux is disabled and have tried changing permissions with chmod 777 data/, plus every other step I could find, but I am still unable to resolve this. Please help me with a possible resolution.
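For reference, this is the kind of preparation the bind-mounted data directory seems to need (a sketch; the non-root UID, e.g. 10000, is my assumption about the Pulsar image and should be replaced with whatever id -u reports, and the image tag is whichever one the demo pins):

# Check which UID the Pulsar image actually runs as (assumed to be non-root,
# in which case it cannot write into a root-owned bind mount)
docker run --rm apachepulsar/pulsar:latest id -u

# Create the host directory and hand it over to that UID (10000 as an example)
mkdir -p data/zookeeper
sudo chown -R 10000:0 data
# chmod -R 777 data   # the blunt alternative I already tried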
I am running PySpark code on an HDInsight cluster and getting this error:
The code failed because of a fatal error: Session 681 unexpectedly
reached final status 'dead'. See logs:
I don't have experience with YARN or Hadoop. I tried a few links from Stack Overflow, but none of them helped. One strange thing is that I was able to run the same code yesterday without this error.
I just ran this import:
from pyspark.sql import SparkSession
This is the error I am getting:
19/06/21 20:35:35 INFO Client:
client token: N/A
diagnostics: [Fri Jun 21 20:35:35 +0000 2019] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:819200, vCores:240> ; Queue's Absolute capacity = 50.0 % ; Queue's Absolute used capacity = 99.1875 % ; Queue's Absolute max capacity = 100.0 % ;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1561149335158
final status: UNDEFINED
tracking URL: https://mmsorderpredhdi.azurehdinsight.net/yarnui/hn/proxy/application_1560840076505_0062/
user: livy
19/06/21 20:35:35 INFO ShutdownHookManager: Shutdown hook called
19/06/21 20:35:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-bb63c5f0-7579-4456-b32a-0e643ca97ecc
YARN Diagnostics:
Application killed by user..
Question: Does this have something to do with the queue's absolute used capacity?
Could you please check the logs to find the exact issue?
Where do I find the log file?
On an Azure HDInsight cluster, you can find the Livy logs by connecting to one of the head nodes over SSH and listing the files at this path:
hdfs dfs -ls /app-logs/livy/logs-ifile
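Once you know the application ID (it is part of the tracking URL in the output above), you can also dump the aggregated YARN logs directly, for example:

# -appOwner is needed when the application was submitted by another user (livy here)
yarn logs -applicationId application_1560840076505_0062 -appOwner livy | less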
For more details, refer to "Access Apache Hadoop YARN application logs on Linux-based HDInsight".
You may also refer to "How to start SparkSession in PySpark".
Hope this helps.
Today one of our web application servers crashed (not Hazelcast related),
and that crash took down all the other Hazelcast cluster members for several minutes as well.
We are using Hazelcast for session replication, caching and distributed statistics.
How can we minimize the impact when one cluster member crashes?
Our stack:
Hazelcast 3.7.4
Spring Boot 1.5.1.RELEASE
Spring Framework 4.3.6.RELEASE
Spring Websocket 4.3.6.RELEASE
Apache Tomcat 8.5.11
Java 1.8.0_112
Windows Server 2012 R2
Logs of a Hazelcast cluster member:
2017-07-07 14:41:20.177 ERROR 1600 --- [https-jsse-nio-8443-exec-20] c.p.p.r.GlobalControllerExceptionHandler : Unhandled Exception, null
com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-07-07 14:41:20.177. Total elapsed time: 121392 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-07-07 14:39:09.296. Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=790688984, partitionId=11, replicaIndex=0, callId=0, invocationTime=1499438359178 (2017-07-07 14:39:19.178), waitTimeout=-1, callTimeout=60000, name=subscription-type-by-subscription-level}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1499438358785, firstInvocationTime='2017-07-07 14:39:18.785', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.0.0.4]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=6, /10.0.0.7:5702->/10.0.0.4:49193, endpoint=[10.0.0.4]:5702, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:376)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:307)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
...
2017-07-07 14:41:40.181 WARN 1600 --- [https-jsse-nio-8443-exec-32] c.h.map.impl.query.MapQueryEngineImpl : [10.0.0.7]:5702 [app-v19] [3.7.4] Could not get results
java.util.concurrent.ExecutionException: QueryOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-07-07 14:41:40.181. Total elapsed time: 120624 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-07-07 14:39:09.296. Invocation{op=com.hazelcast.map.impl.query.QueryOperation{serviceName='hz:impl:mapService', identityHash=2056544024, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1499438379950 (2017-07-07 14:39:39.950), waitTimeout=-1, callTimeout=60000, name=online-messaging-session-containers}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1499438379557, firstInvocationTime='2017-07-07 14:39:39.557', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.0.0.4]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=6, /10.0.0.7:5702->/10.0.0.4:49193, endpoint=[10.0.0.4]:5702, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
...
2017-07-07 14:42:58.385 WARN 1600 --- [https-jsse-nio-8443-exec-191] c.h.map.impl.query.MapQueryEngineImpl : [10.0.0.7]:5702 [app-v19] [3.7.4] Could not get results
com.hazelcast.core.MemberLeftException: Member [10.0.0.4]:5702 - 14a2452c-45bc-40c9-bf77-ce4d73bf6f7e has left cluster!
at com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor$OnMemberLeftTask.run0(InvocationMonitor.java:379)
at com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor$MonitorTask.run(InvocationMonitor.java:221)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
I am running Linux in VirtualBox and am having an issue that I did not encounter on my machine with Linux as the primary OS.
When launching the Neo4j service with sudo ./neo4j start in /opt/neo4j-community-2.3.1/bin, I get a timeout with the message: Failed to start within 120 seconds. Neo4j Server may have failed to start, please check the logs
My log from /opt/neo4j-community-2.3.1/data/graph.db/messages.log says:
http://pastebin.com/wUA715QQ
and data/log/console.log says:
2016-01-06 02:07:03.404+0100 INFO Successfully started database
2016-01-06 02:07:03.603+0100 INFO Successfully stopped database
2016-01-06 02:07:03.604+0100 INFO Successfully shutdown Neo4j Server
2016-01-06 02:07:03.608+0100 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:67)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:234)
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:97)
at org.neo4j.server.CommunityBootstrapper.start(CommunityBootstrapper.java:48)
at org.neo4j.server.CommunityBootstrapper.main(CommunityBootstrapper.java:35)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:462)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:194)
... 3 more
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.neo4j.server.security.auth.FileUserRepository.loadUsersFromFile(FileUserRepository.java:208)
at org.neo4j.server.security.auth.FileUserRepository.start(FileUserRepository.java:73)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 5 more
Any idea why the server won't start?
Check the permissions on /opt/neo4j-community-2.3.1/data/dbms/auth
See the line that says:
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
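A quick way to confirm and fix it (a sketch; it assumes you want the data directory owned by the account you start the server with, represented here by $USER, so adjust if you run Neo4j under a dedicated user):

# Inspect the current owner and permissions of the auth file
ls -l /opt/neo4j-community-2.3.1/data/dbms/auth

# Hand the data directory to the account that runs the server, then restart
# without sudo so the process accesses the files as that same account
sudo chown -R "$USER" /opt/neo4j-community-2.3.1/data
./neo4j start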