I am trying to deploy a Flink job to a Kubernetes cluster (Azure AKS). The job cluster aborts just after starting, while the task manager keeps running fine.
The docker image builds successfully without any exception. I am able to run the docker image and SSH into the container.
I have followed the steps mentioned in the link below:
https://github.com/apache/flink/tree/release-1.9/flink-container/kubernetes
While creating the image I provided the job jar, and it was copied to "/opt/artifacts" inside the image. But I still don't understand why I get the exception below in the job cluster pod log.
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
I am new to Kubernetes. Could you please give me some clue to debug this issue?
Please find the complete logs below:
A. flink-job-cluster Pod Log
develk@ACIDLAELKV01:~/cntx_eng$ kubectl logs flink-job-cluster-kszwf
Starting the job-cluster
Starting standalonejob as a console application on host flink-job-cluster-kszwf.
2019-12-12 10:37:17,170 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-12-12 10:37:17,172 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneJobClusterEntryPoint (Version: 1.8.0, Rev:4caec0d, Date:03.04.2019 @ 13:25:54 PDT)
2019-12-12 10:37:17,172 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - IcedTea - 1.8/25.212-b04
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 989 MiBytes
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-1.8-openjdk/jre
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - No Hadoop Dependency available
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms1024m
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx1024m
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink-1.8.0/conf/log4j-console.properties
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink-1.8.0/conf/logback-console.xml
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink-1.8.0/conf
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=flink-job-cluster
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dparallelism.default=1
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dblob.server.port=6124
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dqueryable-state.server.ports=6125
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: /opt/flink-1.8.0/lib/log4j-1.2.17.jar:/opt/flink-1.8.0/lib/slf4j-log4j12-1.7.15.jar:/opt/flink-1.8.0/lib/flink-dist_2.11-1.8.0.jar:::
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-12-12 10:37:17,178 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-12-12 10:37:17,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2019-12-12 10:37:17,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-12-12 10:37:17,336 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneJobClusterEntryPoint.
2019-12-12 10:37:17,336 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2019-12-12 10:37:17,343 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-12-12 10:37:17,352 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install security context.
2019-12-12 10:37:17,362 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-12-12 10:37:17,381 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-12-12 10:37:17,382 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2019-12-12 10:37:17,638 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at flink-job-cluster:6123
2019-12-12 10:37:18,163 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-12-12 10:37:18,237 INFO akka.remote.Remoting - Starting remoting
2019-12-12 10:37:18,366 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@flink-job-cluster:6123]
2019-12-12 10:37:18,375 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@flink-job-cluster:6123
2019-12-12 10:37:18,398 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-12-12 10:37:18,407 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-63338044-67c1-4872-a3d9-c94563b3a7c3
2019-12-12 10:37:18,412 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2019-12-12 10:37:18,428 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2019-12-12 10:37:18,430 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at flink-job-cluster:0
2019-12-12 10:37:18,464 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-12-12 10:37:18,472 INFO akka.remote.Remoting - Starting remoting
2019-12-12 10:37:18,480 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics@flink-job-cluster:33529]
2019-12-12 10:37:18,482 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink-metrics@flink-job-cluster:33529
2019-12-12 10:37:18,490 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-ba64dcdb-5095-41fc-9c98-0f1528d95c40
2019-12-12 10:37:18,514 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-12-12 10:37:18,515 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Upload directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2019-12-12 10:37:18,516 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Created directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-upload for file uploads.
2019-12-12 10:37:18,603 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Starting rest endpoint.
2019-12-12 10:37:18,872 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2019-12-12 10:37:18,872 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (fallback keys: [{key=jobmanager.web.log.path, isDeprecated=true}])'.
2019-12-12 10:37:19,115 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Rest endpoint listening at flink-job-cluster:8081
2019-12-12 10:37:19,116 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - http://flink-job-cluster:8081 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2019-12-12 10:37:19,116 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Web frontend listening at http://flink-job-cluster:8081.
2019-12-12 10:37:19,239 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-12-12 10:37:19,262 INFO org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever - Scanning class path for job JAR
2019-12-12 10:37:19,270 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-12-12 10:37:19,295 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Removing cache directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-ui
2019-12-12 10:37:19,299 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - http://flink-job-cluster:8081 lost leadership
2019-12-12 10:37:19,299 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shut down complete.
2019-12-12 10:37:19,302 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:131)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:114)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.util.NoSuchElementException: No JAR with manifest attribute for entry class
at org.apache.flink.container.entrypoint.JarManifestParser.findOnlyEntryClass(JarManifestParser.java:80)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.scanClassPathForJobJar(ClassPathJobGraphRetriever.java:137)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:129)
... 11 more
.
2019-12-12 10:37:19,305 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
2019-12-12 10:37:19,305 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-12-12 10:37:19,315 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-12-12 10:37:19,320 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-12 10:37:19,321 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-12 10:37:19,323 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-12 10:37:19,325 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-12 10:37:19,354 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-12 10:37:19,356 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-12 10:37:19,378 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-12-12 10:37:19,382 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneJobClusterEntryPoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneJobClusterEntryPoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
... 2 more
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:131)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:114)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.util.NoSuchElementException: No JAR with manifest attribute for entry class
at org.apache.flink.container.entrypoint.JarManifestParser.findOnlyEntryClass(JarManifestParser.java:80)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.scanClassPathForJobJar(ClassPathJobGraphRetriever.java:137)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:129)
... 11 more
develk@ACIDLAELKV01:~/cntx_eng$
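For context, the "Failed to find job JAR on class path" error comes from the entrypoint scanning the classpath jars (only /opt/flink-1.8.0/lib/* in the log above) for an entry-class attribute in each jar's manifest; a jar sitting in /opt/artifacts is never scanned. A minimal, self-contained sketch of that manifest check (the class name and file paths are illustrative, not taken from the actual image):

```shell
# A job jar discoverable by the classpath scan must carry an entry-class
# attribute in META-INF/MANIFEST.MF. This writes such a manifest to a temp
# file and greps for the attribute. On a real jar you could run
#   unzip -p your-job.jar META-INF/MANIFEST.MF
# and look for the same line.
manifest=$(mktemp)
printf 'Manifest-Version: 1.0\nMain-Class: com.flink.wordCountSimple\n' > "$manifest"
grep '^Main-Class:' "$manifest"
```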
Now I have added the job class name in the args section of the "job-cluster-job.yaml.template" file, like below:
args: ["job-cluster",
"--job-classname", "com.flink.wordCountSimple",
"-Djobmanager.rpc.address=flink-job-cluster",
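For reference, a fuller sketch of that args block (the -D flags mirror the Program Arguments printed in the pod log above; treat this as illustrative, not the exact template contents):

```yaml
args: ["job-cluster",
       "--job-classname", "com.flink.wordCountSimple",
       "-Djobmanager.rpc.address=flink-job-cluster",
       "-Dparallelism.default=1",
       "-Dblob.server.port=6124",
       "-Dqueryable-state.server.ports=6125"]
```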
But after that I am getting the exception below:
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
Please see the detailed log below.
2019-12-13 19:08:34,323 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shut down complete.
2019-12-13 19:08:34,329 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:119)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.lang.ClassNotFoundException: com.flink.wordCountSimple
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:116)
... 10 more
.
2019-12-13 19:08:34,337 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
2019-12-13 19:08:34,338 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-12-13 19:08:34,364 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-12-13 19:08:34,368 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-13 19:08:34,372 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-13 19:08:34,392 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-13 19:08:34,392 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-13 19:08:34,406 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-13 19:08:34,410 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-13 19:08:34,434 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-12-13 19:08:34,443 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneJobClusterEntryPoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneJobClusterEntryPoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
... 2 more
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:119)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.lang.ClassNotFoundException: com.flink.wordCountSimple
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:116)
... 10 more
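One detail worth noting: the ClassNotFoundException means the class still is not on the classpath, which in the log contains only jars from Flink's lib directory. So one way to make the job visible is to copy the job jar into $FLINK_HOME/lib when building the image, rather than /opt/artifacts. A hedged Dockerfile sketch (base image tag and jar name are illustrative):

```dockerfile
FROM flink:1.8
# Copy the job jar into Flink's lib directory so it lands on the classpath
# that the job-cluster entrypoint scans. Path and jar name are illustrative.
COPY target/wordcount-job.jar $FLINK_HOME/lib/wordcount-job.jar
```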
There's a complete, working example of creating and running a Flink job cluster on Kubernetes at https://github.com/alpinegizmo/flink-containers-example. Maybe that will help. See also https://www.youtube.com/watch?v=ceZtUDgh2TE.
version: "2.1"
services:
  jobmanager:
    build:
      context: ./
      args:
        JAR_FILE: flink-event-tracker-bundled-1.6.0.jar
    image: test/flink-event-tracker
    expose:
      - "6123"
    ports:
      - "8081:8081"
      - "6123:6123"
    command: job-cluster --job-classname com.company.test.flink.pipelines.KafkaPipelineConsumer -Djobmanager.rpc.address=jobmanager --runner=FlinkRunner --streaming=true --checkpointingInterval=30000
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
      - JOB_MANAGER=jobmanager
    volumes:
      - data-volume:/docker/volumes
  taskmanager:
    image: test/flink-event-tracker
    expose:
      - "6121"
      - "6122"
    depends_on:
      - jobmanager
    command: task-manager -Djobmanager.rpc.address=jobmanager
    links:
      - "jobmanager:jobmanager"
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
      - JOB_MANAGER=jobmanager
    volumes:
      - data-volume:/docker/volumes
volumes:
  data-volume:
    driver: local
    driver_opts:
      o: bind
      type: none
      device: /Users/home/Development/docker/volumes/flink
Dockerfile
FROM flink:1.9
ARG JAR_FILE=""
ENV APP_OPTS ""
ENV JAVA_OPTS ""
ENV JOB_MANAGER=""
# Build arg allows passing the version at runtime
ARG VERSION=unset-version
COPY flink-conf.yml $FLINK_HOME/conf/flink-conf.yaml
COPY target/$JAR_FILE $FLINK_HOME/lib/event-tracker.jar
COPY docker-cluster-entrypoint.sh /docker-cluster-entrypoint.sh
RUN apt-get update && apt-get install procps -y && apt-get install curl -y
RUN echo "root:root" | chpasswd
RUN chmod 777 /docker-cluster-entrypoint.sh
RUN chmod 777 $FLINK_HOME/lib/event-tracker.jar
ENTRYPOINT [ "bash","/docker-cluster-entrypoint.sh" ]
docker-cluster-entrypoint.sh
#!/bin/sh
# Note: the script appends /bin below, so the default must be /opt/flink,
# not /opt/flink/bin.
FLINK_HOME=${FLINK_HOME:-"/opt/flink"}
JOB_CLUSTER="job-cluster"
TASK_MANAGER="task-manager"
CMD="$1"
shift

if [ "${CMD}" = "--help" -o "${CMD}" = "-h" ]; then
  echo "Usage: $(basename "$0") (${JOB_CLUSTER}|${TASK_MANAGER})"
  exit 0
elif [ "${CMD}" = "${JOB_CLUSTER}" -o "${CMD}" = "${TASK_MANAGER}" ]; then
  echo "Starting the ${CMD}"
  if [ "${CMD}" = "${TASK_MANAGER}" ]; then
    exec "$FLINK_HOME/bin/taskmanager.sh" start-foreground "$@"
  else
    exec "$FLINK_HOME/bin/standalone-job.sh" start-foreground "$@"
  fi
fi
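The dispatch in the entrypoint boils down to: the first argument selects the launcher, and everything after it is forwarded untouched via "$@". A self-contained sketch of that logic (it only echoes the command it would exec, so it is safe to run anywhere):

```shell
# Minimal model of the entrypoint dispatch: $1 selects the launcher script,
# the remaining arguments are forwarded verbatim.
dispatch() {
  mode="$1"
  shift
  case "$mode" in
    job-cluster)  echo "standalone-job.sh start-foreground $*" ;;
    task-manager) echo "taskmanager.sh start-foreground $*" ;;
    *)            echo "usage: (job-cluster|task-manager)" ;;
  esac
}
dispatch job-cluster --job-classname com.company.test.flink.pipelines.KafkaPipelineConsumer
```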
How to run:
mvn clean install
docker-compose -f docker-compose.local.yml build
docker-compose -f docker-compose.local.yml up --scale taskmanager=2 > exceptionlog.log
This is the entire configuration that runs your Docker setup. If you want to run it in Kubernetes, just convert the docker-compose file to its corresponding Kubernetes manifests; the rest can stay the same. Consider packaging it as a Helm chart, which makes Kubernetes maintenance easier.
Note: we are using Apache Beam to code the job.
Related
My main objective is to experiment with a Cassandra cluster. I have a laptop, so my options are, as I understand them: (1) use virtualization software (e.g. Hyper-V) to create multiple VMs and run a Cassandra instance in each, (2) use Docker to create multiple Cassandra instances, or (3) run multiple instances directly.
I thought (3) would be faster and give me more insight, so I tried it (by following https://stackoverflow.com/a/25348301/1029599). But I ran into a strange situation: I am not able to change the JMX port. More details below:
I've created two copies of Cassandra 3.11.7: one on the C drive and one on the D drive.
For the C drive copy, I edited cassandra.yaml to replace 'listen_address: localhost' with 'listen_address: 127.0.0.1' and 'rpc_address: localhost' with 'rpc_address: 127.0.0.1', and set the seeds to point to the D-drive instance with '- seeds: "127.0.0.2"'. I did NOT edit cassandra-env.sh, leaving JMX_PORT at the default 7199.
For the D drive copy, I edited cassandra.yaml to use '127.0.0.2' in place of localhost and '- seeds: "127.0.0.1"', and edited cassandra-env.sh to set JMX_PORT=7200.
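In other words, the D-drive instance's conf/cassandra.yaml edits described above amount to this sketch:

```yaml
# D-drive instance: bind to the second loopback address, seed from the first
listen_address: 127.0.0.2
rpc_address: 127.0.0.2
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "127.0.0.1"
```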
Surprisingly, when I start the D drive's Cassandra instance, it always picks up JMX_PORT 7199, not 7200.
Log (relevant portion from the start):
D:\apache-cassandra-3.11.7\bin>.\cassandra
WARNING! Powershell script execution unavailable. Please use 'powershell Set-ExecutionPolicy Unrestricted' on this user-account to run cassandra with fully featured functionality on this platform.
Starting with legacy startup options
Starting Cassandra Server
INFO [main] 2020-08-17 17:31:21,632 YamlConfigurationLoader.java:89 - Configuration location: file:/D:/apache-cassandra-3.11.7/conf/cassandra.yaml
INFO [main] 2020-08-17 17:31:22,359 Config.java:534 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_bootstrap=true; auto_snapshot=true; back_pressure_enabled=false; back_pressure_strategy=org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_address=null; broadcast_rpc_address=null; buffer_pool_use_heap_if_exhausted=true; cas_contention_timeout_in_ms=1000; cdc_enabled=false; cdc_free_space_check_interval_ms=250; cdc_raw_directory=null; cdc_total_space_in_mb=0; check_for_duplicate_rows_during_compaction=true; check_for_duplicate_rows_during_reads=true; client_encryption_options=<REDACTED>; cluster_name=Test Cluster; column_index_cache_size_in_kb=2; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_compression=null; commitlog_directory=null; commitlog_max_compression_buffers_in_pool=3; commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=NaN; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=null; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_compactors=null; concurrent_counter_writes=32; concurrent_materialized_view_writes=32; concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32; counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200;
counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; credentials_cache_max_entries=1000; credentials_update_interval_in_ms=-1; credentials_validity_in_ms=2000; cross_node_timeout=false; data_file_directories=[Ljava.lang.String;@235834f2; disk_access_mode=auto; disk_failure_policy=stop; disk_optimization_estimate_percentile=0.95; disk_optimization_page_cross_chance=0.1; disk_optimization_strategy=ssd; dynamic_snitch=true; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_materialized_views=true; enable_sasi_indexes=true; enable_scripted_user_defined_functions=false; enable_user_defined_functions=false; enable_user_defined_functions_threads=true; encryption_options=null; endpoint_snitch=SimpleSnitch; file_cache_round_up=null; file_cache_size_in_mb=null; gc_log_threshold_in_ms=200; gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[]; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; hints_compression=null; hints_directory=null; hints_flush_period_in_ms=10000; incremental_backups=false; index_interval=null; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; initial_token=null; inter_dc_stream_throughput_outbound_megabits_per_sec=200; inter_dc_tcp_nodelay=false; internode_authenticator=null; internode_compression=dc; internode_recv_buff_size_in_bytes=0; internode_send_buff_size_in_bytes=0; key_cache_keys_to_save=2147483647; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=127.0.0.2; listen_interface=null; listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null; max_streaming_retries=3; max_value_size_in_mb=256; memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=null; memtable_flush_writers=0; memtable_heap_space_in_mb=null;
memtable_offheap_space_in_mb=null; min_free_space_per_drive_in_mb=50; native_transport_flush_in_batches_legacy=true; native_transport_max_concurrent_connections=-1; native_transport_max_concurrent_connections_per_ip=-1; native_transport_max_concurrent_requests_in_bytes=-1; native_transport_max_concurrent_requests_in_bytes_per_ip=-1; native_transport_max_frame_size_in_mb=256; native_transport_max_negotiable_protocol_version=-2147483648; native_transport_max_threads=128; native_transport_port=9042; native_transport_port_ssl=null; num_tokens=256; otc_backlog_expiration_interval_ms=200; otc_coalescing_enough_coalesced_messages=8; otc_coalescing_strategy=DISABLED; otc_coalescing_window_us=200; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_cache_max_entries=1000; permissions_update_interval_in_ms=-1; permissions_validity_in_ms=2000; phi_convict_threshold=8.0; prepared_statements_cache_size_mb=null; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; repair_session_max_tree_depth=18; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_scheduler_id=null; request_scheduler_options=null; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_cache_max_entries=1000; roles_update_interval_in_ms=-1; roles_validity_in_ms=2000; row_cache_class_name=org.apache.cassandra.cache.OHCProvider; row_cache_keys_to_save=2147483647; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=127.0.0.2; rpc_interface=null; rpc_interface_prefer_ipv6=false; rpc_keepalive=true; rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16; rpc_port=9160; rpc_recv_buff_size_in_bytes=null; rpc_send_buff_size_in_bytes=null; rpc_server_type=sync; saved_caches_directory=null; seed_provider=org.apache.cassandra.locator.SimpleSeedProvider{seeds=127.0.0.1}; server_encryption_options=<REDACTED>; slow_query_log_timeout_in_ms=500; snapshot_before_compaction=false; snapshot_on_duplicate_row_detection=false; 
ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; stream_throughput_outbound_megabits_per_sec=200; streaming_keep_alive_period_in_secs=300; streaming_socket_timeout_in_ms=86400000; thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16; thrift_prepared_statements_cache_size_mb=null; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; transparent_data_encryption_options=org.apache.cassandra.config.TransparentDataEncryptionOptions#5656be13; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; unlogged_batch_across_partitions_warn_threshold=10; user_defined_function_fail_timeout=1500; user_defined_function_warn_timeout=500; user_function_timeout_policy=die; windows_timer_interval=1; write_request_timeout_in_ms=2000]
INFO [main] 2020-08-17 17:31:22,361 DatabaseDescriptor.java:381 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2020-08-17 17:31:22,366 DatabaseDescriptor.java:439 - Global memtable on-heap threshold is enabled at 503MB
INFO [main] 2020-08-17 17:31:22,367 DatabaseDescriptor.java:443 - Global memtable off-heap threshold is enabled at 503MB
INFO [main] 2020-08-17 17:31:22,538 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 2000.
INFO [main] 2020-08-17 17:31:22,538 DatabaseDescriptor.java:773 - Back-pressure is disabled with strategy org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}.
INFO [main] 2020-08-17 17:31:22,686 JMXServerUtils.java:252 - Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7199/jmxrmi
INFO [main] 2020-08-17 17:31:22,700 CassandraDaemon.java:490 - Hostname: DESKTOP-NQ7673H
INFO [main] 2020-08-17 17:31:22,703 CassandraDaemon.java:497 - JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.8.0_261
INFO [main] 2020-08-17 17:31:22,709 CassandraDaemon.java:498 - Heap size: 1.968GiB/1.968GiB
INFO [main] 2020-08-17 17:31:22,712 CassandraDaemon.java:503 - Code Cache Non-heap memory: init = 2555904(2496K) used = 5181696(5060K) committed = 5242880(5120K) max = 251658240(245760K)
INFO [main] 2020-08-17 17:31:22,714 CassandraDaemon.java:503 - Metaspace Non-heap memory: init = 0(0K) used = 19412472(18957K) committed = 20054016(19584K) max = -1(-1K)
INFO [main] 2020-08-17 17:31:22,738 CassandraDaemon.java:503 - Compressed Class Space Non-heap memory: init = 0(0K) used = 2373616(2317K) committed = 2621440(2560K) max = 1073741824(1048576K)
INFO [main] 2020-08-17 17:31:22,739 CassandraDaemon.java:503 - Par Eden Space Heap memory: init = 279183360(272640K) used = 111694624(109076K) committed = 279183360(272640K) max = 279183360(272640K)
INFO [main] 2020-08-17 17:31:22,740 CassandraDaemon.java:503 - Par Survivor Space Heap memory: init = 34865152(34048K) used = 0(0K) committed = 34865152(34048K) max = 34865152(34048K)
INFO [main] 2020-08-17 17:31:22,743 CassandraDaemon.java:503 - CMS Old Gen Heap memory: init = 1798569984(1756416K) used = 0(0K) committed = 1798569984(1756416K) max = 1798569984(1756416K)
INFO [main] 2020-08-17 17:31:22,744 CassandraDaemon.java:505 - Classpath: D:\apache-cassandra-3.11.7\conf;D:\apache-cassandra-3.11.7\lib\airline-0.6.jar;D:\apache-cassandra-3.11.7\lib\antlr-runtime-3.5.2.jar;D:\apache-cassandra-3.11.7\lib\apache-cassandra-3.11.7.jar;D:\apache-cassandra-3.11.7\lib\apache-cassandra-thrift-3.11.7.jar;D:\apache-cassandra-3.11.7\lib\asm-5.0.4.jar;D:\apache-cassandra-3.11.7\lib\caffeine-2.2.6.jar;D:\apache-cassandra-3.11.7\lib\cassandra-driver-core-3.0.1-shaded.jar;D:\apache-cassandra-3.11.7\lib\commons-cli-1.1.jar;D:\apache-cassandra-3.11.7\lib\commons-codec-1.9.jar;D:\apache-cassandra-3.11.7\lib\commons-lang3-3.1.jar;D:\apache-cassandra-3.11.7\lib\commons-math3-3.2.jar;D:\apache-cassandra-3.11.7\lib\compress-lzf-0.8.4.jar;D:\apache-cassandra-3.11.7\lib\concurrent-trees-2.4.0.jar;D:\apache-cassandra-3.11.7\lib\concurrentlinkedhashmap-lru-1.4.jar;D:\apache-cassandra-3.11.7\lib\disruptor-3.0.1.jar;D:\apache-cassandra-3.11.7\lib\ecj-4.4.2.jar;D:\apache-cassandra-3.11.7\lib\guava-18.0.jar;D:\apache-cassandra-3.11.7\lib\HdrHistogram-2.1.9.jar;D:\apache-cassandra-3.11.7\lib\high-scale-lib-1.0.6.jar;D:\apache-cassandra-3.11.7\lib\hppc-0.5.4.jar;D:\apache-cassandra-3.11.7\lib\jackson-annotations-2.9.10.jar;D:\apache-cassandra-3.11.7\lib\jackson-core-2.9.10.jar;D:\apache-cassandra-3.11.7\lib\jackson-databind-2.9.10.4.jar;D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar;D:\apache-cassandra-3.11.7\lib\javax.inject.jar;D:\apache-cassandra-3.11.7\lib\jbcrypt-0.3m.jar;D:\apache-cassandra-3.11.7\lib\jcl-over-slf4j-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\jctools-core-1.2.1.jar;D:\apache-cassandra-3.11.7\lib\jflex-1.6.0.jar;D:\apache-cassandra-3.11.7\lib\jna-4.2.2.jar;D:\apache-cassandra-3.11.7\lib\joda-time-2.4.jar;D:\apache-cassandra-3.11.7\lib\json-simple-1.1.jar;D:\apache-cassandra-3.11.7\lib\jstackjunit-0.0.1.jar;D:\apache-cassandra-3.11.7\lib\libthrift-0.9.2.jar;D:\apache-cassandra-3.11.7\lib\log4j-over-slf4j-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\logback-classic-1.1.3.jar;D:\apache-cassandra-3.11.7\lib\logback-core-1.1.3.jar;D:\apache-cassandra-3.11.7\lib\lz4-1.3.0.jar;D:\apache-cassandra-3.11.7\lib\metrics-core-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\metrics-jvm-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\metrics-logback-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\netty-all-4.0.44.Final.jar;D:\apache-cassandra-3.11.7\lib\ohc-core-0.4.4.jar;D:\apache-cassandra-3.11.7\lib\ohc-core-j8-0.4.4.jar;D:\apache-cassandra-3.11.7\lib\reporter-config-base-3.0.3.jar;D:\apache-cassandra-3.11.7\lib\reporter-config3-3.0.3.jar;D:\apache-cassandra-3.11.7\lib\sigar-1.6.4.jar;D:\apache-cassandra-3.11.7\lib\slf4j-api-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\snakeyaml-1.11.jar;D:\apache-cassandra-3.11.7\lib\snappy-java-1.1.1.7.jar;D:\apache-cassandra-3.11.7\lib\snowball-stemmer-1.3.0.581.1.jar;D:\apache-cassandra-3.11.7\lib\ST4-4.0.8.jar;D:\apache-cassandra-3.11.7\lib\stream-2.5.2.jar;D:\apache-cassandra-3.11.7\lib\thrift-server-0.3.7.jar;D:\apache-cassandra-3.11.7\build\classes\main;D:\apache-cassandra-3.11.7\build\classes\thrift;D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar
INFO [main] 2020-08-17 17:31:22,747 CassandraDaemon.java:507 - JVM Arguments: [-ea, -javaagent:D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar, -Xms2G, -Xmx2G, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Dlogback.configurationFile=logback.xml, -Djava.library.path=D:\apache-cassandra-3.11.7\lib\sigar-bin, -Dcassandra.jmx.local.port=7199, -Dcassandra, -Dcassandra-foreground=yes, -Dcassandra.logdir=D:\apache-cassandra-3.11.7\logs, -Dcassandra.storagedir=D:\apache-cassandra-3.11.7\data]
WARN [main] 2020-08-17 17:31:22,763 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
WARN [main] 2020-08-17 17:31:22,768 StartupChecks.java:220 - The JVM is not configured to stop on OutOfMemoryError which can cause data corruption. Use one of the following JVM options to configure the behavior on OutOfMemoryError: -XX:+ExitOnOutOfMemoryError, -XX:+CrashOnOutOfMemoryError, or -XX:OnOutOfMemoryError="<cmd args>;<cmd args>"
INFO [main] 2020-08-17 17:31:22,774 SigarLibrary.java:44 - Initializing SIGAR library
Can you please help me resolve the port issue that is stopping me from running two instances?
In addition, I would appreciate suggestions for other ways to get this done, e.g. options 1 or 2 as mentioned at the beginning of my post, or GCP or AWS with a free account.
Option 2 - using Docker - is very easy. This is described in detail here: cassandra - docker hub
First of all, you must increase the resources available to Docker, because the default configuration might be too small for running more than 1-2 nodes. I am using 8 GB RAM and 3 CPUs on my laptop, and I am able to start 3 Cassandra nodes simultaneously.
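Before starting the nodes, you can quickly check what Docker currently has available (a sketch; the actual limits are adjusted in Docker Desktop under Settings > Resources):

```shell
# Print the CPUs and memory available to the Docker daemon; three
# Cassandra nodes need roughly 3 CPUs and 8 GB of RAM.
docker info --format 'CPUs: {{.NCPU}}  Memory: {{.MemTotal}} bytes' \
  || echo "docker not available"
```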
The first step is to create a bridge network which will be used by the Cassandra cluster:
docker network create mynet
Then start the first node:
docker run --name cas1 --network mynet -d cassandra
Then wait about one minute until the node starts, and start a second node:
docker run --name cas2 --network mynet -d -e CASSANDRA_SEEDS=cas1 cassandra
And again, start a third node:
docker run --name cas3 --network mynet -d -e CASSANDRA_SEEDS=cas1 cassandra
Finally, start another container in interactive mode and run cqlsh in it, connecting to your cluster (to the first node, named cas1):
docker run -it --network mynet --rm cassandra cqlsh cas1
Connected to Test Cluster at cas1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
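Once all three containers are up, you can also confirm that the nodes joined the same ring with nodetool (a sketch, assuming the container names used above):

```shell
# Each node should eventually report status "UN" (Up/Normal).
docker exec cas1 nodetool status || echo "cluster not running"
```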
I followed the steps mentioned here: https://gist.github.com/codspire/7b0955b9e67fe73f6118dad9539cbaa2
When I enter "localhost:8080" in a browser, nothing happens.
Hadoop version -- 3.1.3
Spark version -- 3.0.0-preview pre-built for hadoop2.7
Zeppelin version -- 0.9.0-preview1
C:\Zeppelin>bin\zeppelin.cmd
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
WARN [2020-04-07 16:29:59,113] ({main}ZeppelinConfiguration.java[create]:159) - Failed to load configuration, proceeding with a default
INFO [2020-04-07 16:29:59,177] ({main} ZeppelinConfiguration.java[create]:171) - Server Host: 127.0.0.1
INFO [2020-04-07 16:29:59,177] ({main} ZeppelinConfiguration.java[create]:173) - Server Port: 8080
INFO [2020-04-07 16:29:59,178] ({main} ZeppelinConfiguration.java[create]:177) - Context Path: /
INFO [2020-04-07 16:29:59,178] ({main} ZeppelinConfiguration.java[create]:178) - Zeppelin Version: 0.9.0-preview1
INFO [2020-04-07 16:29:59,205] ({main} Log.java[initialized]:193) - Logging initialized #810ms to org.eclipse.jetty.util.log.Slf4jLog
WARN [2020-04-07 16:29:59,516] ({main} ZeppelinConfiguration.java[getConfigFSDir]:631) - zeppelin.config.fs.dir is not specified, fall back to local conf directory zeppelin.conf.dir
INFO [2020-04-07 16:29:59,554] ({main} Credentials.java[loadFromFile]:121) - C:\Zeppelin\conf\credentials.json
INFO [2020-04-07 16:29:59,594] ({ImmediateThread-1586257199511} PluginManager.java[loadNotebookRepo]:60) - Loading NotebookRepo Plugin: org.apache.zeppelin.notebook.repo.GitNotebookRepo
INFO [2020-04-07 16:29:59,658] ({main} ZeppelinServer.java[setupWebAppContext]:488) - warPath is: C:\Zeppelin\zeppelin-web-angular-0.9.0-preview1.war
INFO [2020-04-07 16:29:59,659] ({main} ZeppelinServer.java[setupWebAppContext]:501) - ZeppelinServer Webapp path: C:\Zeppelin\webapps
INFO [2020-04-07 16:29:59,729] ({ImmediateThread-1586257199511} VFSNotebookRepo.java[setNotebookDirectory]:70) - Using notebookDir: C:\Zeppelin\notebook
INFO [2020-04-07 16:29:59,746] ({main} ZeppelinServer.java[setupWebAppContext]:488) - warPath is: zeppelin-web-angular/dist
INFO [2020-04-07 16:29:59,747] ({main} ZeppelinServer.java[setupWebAppContext]:501) - ZeppelinServer Webapp path: C:\Zeppelin\webapps\next
INFO [2020-04-07 16:29:59,811] ({main} NotebookServer.java[<init>]:153) - NotebookServer instantiated: org.apache.zeppelin.socket.NotebookServer#683dbc2c
INFO [2020-04-07 16:29:59,812] ({main} NotebookServer.java[setServiceLocator]:158) - Injected ServiceLocator: ServiceLocatorImpl(shared-locator,0,246550802)
INFO [2020-04-07 16:29:59,814] ({main} NotebookServer.java[setAuthorizationServiceProvider]:178) - Injected NotebookAuthorizationServiceProvider
INFO [2020-04-07 16:29:59,814] ({main} NotebookServer.java[setNotebookService]:171) - Injected NotebookServiceProvider
INFO [2020-04-07 16:29:59,814] ({main} NotebookServer.java[setConnectionManagerProvider]:184) - Injected ConnectionManagerProvider
INFO [2020-04-07 16:29:59,815] ({main} NotebookServer.java[setNotebook]:164) - Injected NotebookProvider
INFO [2020-04-07 16:29:59,816] ({main} ZeppelinServer.java[setupClusterManagerServer]:439) - Cluster mode is disabled
INFO [2020-04-07 16:29:59,827] ({main} ZeppelinServer.java[main]:251) - Starting zeppelin server
INFO [2020-04-07 16:29:59,829] ({main} Server.java[doStart]:370) - jetty-9.4.18.v20190429; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_241-b07
INFO [2020-04-07 16:29:59,858] ({ImmediateThread-1586257199511} GitNotebookRepo.java[init]:77) - Opening a git repo at '/C:/Zeppelin/notebook'
INFO [2020-04-07 16:29:59,967] ({main} StandardDescriptorProcessor.java[visitServlet]:283) - NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet
INFO [2020-04-07 16:29:59,982] ({main} DefaultSessionIdManager.java[doStart]:365) - DefaultSessionIdManager workerName=node0
INFO [2020-04-07 16:29:59,982] ({main} DefaultSessionIdManager.java[doStart]:370) - No SessionScavenger set, using defaults
INFO [2020-04-07 16:29:59,985] ({main} HouseKeeper.java[startScavenging]:149) - node0 Scavenging every 600000ms
INFO [2020-04-07 16:30:02,046] ({main} ContextHandler.java[doStart]:855) - Started o.e.j.w.WebAppContext#6150c3ec{zeppelin-web-angular,/,jar:file:///C:/Zeppelin/zeppelin-web-angular-0.9.0-preview1.war!/,AVAILABLE}{C:\Zeppelin\zeppelin-web-angular-0.9.0-preview1.war}
WARN [2020-04-07 16:30:02,051] ({main} WebInfConfiguration.java[unpack]:675) - Web application not found C:\Zeppelin\zeppelin-web-angular\dist
WARN [2020-04-07 16:30:02,052] ({main} WebAppContext.java[doStart]:554) - Failed startup of context o.e.j.w.WebAppContext#229f66ed{/next,null,UNAVAILABLE}{C:\Zeppelin\zeppelin-web-angular\dist}
java.io.FileNotFoundException: C:\Zeppelin\zeppelin-web-angular\dist
at org.eclipse.jetty.webapp.WebInfConfiguration.unpack(WebInfConfiguration.java:676)
at org.eclipse.jetty.webapp.WebInfConfiguration.preConfigure(WebInfConfiguration.java:152)
at org.eclipse.jetty.webapp.WebAppContext.preConfigure(WebAppContext.java:506)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:544)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:167)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:119)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:167)
at org.eclipse.jetty.server.Server.start(Server.java:418)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:110)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.Server.doStart(Server.java:382)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.zeppelin.server.ZeppelinServer.main(ZeppelinServer.java:253)
INFO [2020-04-07 16:30:02,906] ({main} AbstractConnector.java[doStart]:292) - Started ServerConnector#4493d195{HTTP/1.1,[http/1.1]}{127.0.0.1:8080}
INFO [2020-04-07 16:30:02,906] ({main} Server.java[doStart]:410) - Started #4519ms
INFO [2020-04-07 16:30:07,924] ({main} ZeppelinServer.java[main]:265) - Done, zeppelin server started
To install Zeppelin 0.9.0 on Windows 10:
Unzip the war files (with WinRAR) into bin/zeppelin-web-angular/dist and bin/zeppelin-web/dist.
Then modify the config file shiro.ini to change the username and password (the defaults are admin and password1).
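For reference, the [users] section of conf/shiro.ini looks roughly like this (entries taken from the default template; replace them with your own credentials):

```ini
[users]
# username = password, role1, role2
admin = password1, admin
user1 = password2, role1, role2
```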
Possibly this workaround might help: the "zeppelin-web-angular-0.9.0-preview1.war" file comes along with the package as shown in the image below. Create the required folders and then unzip its content. After restarting the server, the web pages should be available.
The following links may also be useful for setting up Zeppelin on Windows 10:
https://gist.github.com/codspire/7b0955b9e67fe73f6118dad9539cbaa2
https://hernandezpaul.wordpress.com/2016/11/14/apache-zeppelin-installation-on-windows-10/
I also tried Zeppelin 0.9.0 on Windows 10; it is working fine.
Go to bin -> mkdir -p zeppelin-web-angular/dist and mkdir -p zeppelin-web/dist
Copy zeppelin-web-angular-0.9.0-preview2.war and zeppelin-web-0.9.0-preview2.war into the respective dist folders
Unzip them
Remove the war files from the dist folders
Start zeppelin.cmd
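The steps above can be sketched as a small shell function (the version string and directory names are assumptions from this answer; a .war file is an ordinary zip archive, so unzip works on it):

```shell
# Unpack the two web .war files shipped in the install root into
# bin/<name>/dist, then delete the copied archives.
unpack_wars() (
  cd "$1" || return 1   # $1 = Zeppelin install root
  for war in zeppelin-web-angular zeppelin-web; do
    f=$(ls "$war"-[0-9]*.war 2>/dev/null | head -n 1)
    [ -n "$f" ] || continue              # skip if the archive is absent
    mkdir -p "bin/$war/dist"
    cp "$f" "bin/$war/dist/"
    (cd "bin/$war/dist" && unzip -q "$f" && rm "$f")
  done
)
```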
I got the same issue. I posted the same query on the official Zeppelin issue tracker, where someone replied that Zeppelin is not supported on Windows 10.
https://issues.apache.org/jira/browse/ZEPPELIN-4749?filter=-2
I tried the 0.9, 0.8, and 0.7 versions of Zeppelin; none of them worked on Windows 10, all showing the same blank page on localhost:8080. But recently I downloaded version 0.6.2 and found that it runs fine. So I think we have to use a lower version of Zeppelin; if you are using Ubuntu, all versions will run fine.
I tried this version http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-netinst.tgz and it was successful.
Observed the same issue with v0.10.0.
Followed the steps below and it worked.
Create dist directories at the install path (D:\Zeppline\zeppelin-0.10.1-bin-all\zeppelin-0.10.1-bin-all) as below:
mkdir -p zeppelin-web-angular/dist and mkdir -p zeppelin-web/dist
Copy zeppelin-web-angular-0.10.1.war and zeppelin-web-0.10.1.war into the respective dist folders
Unzip them
Remove the war files from the dist folders
Start zeppelin.cmd
Try setting the ZEPPELIN_ANGULAR_WAR environment variable to the location of the war file, e.g.
c:\zeppelin-home\zeppelin-web-angular-0.10.1.war
The current answer from the developers is that Zeppelin 0.9 will not work on Windows 10.
On Windows 10, you can run Zeppelin 0.9 in a Docker container as a workaround.
Refer to the instructions on this page:
http://zeppelin.apache.org/download.html
If you have Docker installed, use this command to launch Apache Zeppelin in a container:
docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.9.0
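If you want notebooks to survive container restarts, the same image can be started detached with a host directory mounted for notebook storage (the /notebook mount and ZEPPELIN_NOTEBOOK_DIR pattern follow the image's documentation; the host folder is an assumption):

```shell
# Run detached; notebooks are stored under ./notebook on the host.
docker run -d -p 8080:8080 \
  -v "$PWD/notebook":/notebook \
  -e ZEPPELIN_NOTEBOOK_DIR=/notebook \
  --name zeppelin apache/zeppelin:0.9.0 \
  || echo "docker not available"
```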
Trying to catch up with the Spark 2.3 documentation on how to deploy jobs on a Kubernetes 1.9.3 cluster: http://spark.apache.org/docs/latest/running-on-kubernetes.html
The Kubernetes 1.9.3 cluster is operating properly on offline bare-metal servers and was installed with kubeadm. The following command was used to submit the job (SparkPi example job):
/opt/spark/bin/spark-submit \
  --master k8s://https://k8s-master:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=spark:v2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
Here is the stacktrace that we all love:
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=driver
+ '[' -z driver ']'
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_JAVA_OPTS
+ '[' -n /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
+ '[' -n '' ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS)
+ exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java -Dspark.kubernetes.driver.pod.name=spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver -Dspark.driver.port=7078 -Dspark.submit.deployMode=cluster -Dspark.master=k8s://https://k8s-master:6443 -Dspark.kubernetes.executor.podNamePrefix=spark-pi-b6f8a60df70a3b9d869c4e305518f43a -Dspark.driver.blockManager.port=7079 -Dspark.app.id=spark-7077ad8f86114551b0ae04ae63a74d5a -Dspark.driver.host=spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver-svc.default.svc -Dspark.app.name=spark-pi -Dspark.kubernetes.container.image=spark:v2.3.0 -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar -Dspark.executor.instances=2 -cp ':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar' -Xms1g -Xmx1g -Dspark.driver.bindAddress=10.244.1.17 org.apache.spark.examples.SparkPi
2018-03-07 12:39:35 INFO SparkContext:54 - Running Spark version 2.3.0
2018-03-07 12:39:36 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-07 12:39:36 INFO SparkContext:54 - Submitted application: Spark Pi
2018-03-07 12:39:36 INFO SecurityManager:54 - Changing view acls to: root
2018-03-07 12:39:36 INFO SecurityManager:54 - Changing modify acls to: root
2018-03-07 12:39:36 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-07 12:39:36 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-07 12:39:36 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-07 12:39:36 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2018-03-07 12:39:36 INFO SparkEnv:54 - Registering MapOutputTracker
2018-03-07 12:39:36 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-03-07 12:39:36 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-03-07 12:39:36 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-03-07 12:39:36 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-7f5370ad-b495-4943-ad75-285b7ead3e5b
2018-03-07 12:39:36 INFO MemoryStore:54 - MemoryStore started with capacity 408.9 MB
2018-03-07 12:39:36 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-03-07 12:39:36 INFO log:192 - Logging initialized #1936ms
2018-03-07 12:39:36 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2018-03-07 12:39:36 INFO Server:414 - Started #2019ms
2018-03-07 12:39:36 INFO AbstractConnector:278 - Started ServerConnector#4215838f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-03-07 12:39:36 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5b6813df{/jobs,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#495083a0{/jobs/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5fd62371{/jobs/job,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#2b62442c{/jobs/job/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#66629f63{/stages,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#841e575{/stages/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#27a5328c{/stages/stage,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6b5966e1{/stages/stage/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#65e61854{/stages/pool,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#1568159{/stages/pool/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4fcee388{/storage,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6f80fafe{/storage/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#3af17be2{/storage/rdd,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#f9879ac{/storage/rdd/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#37f21974{/environment,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5f4d427e{/environment/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6e521c1e{/executors,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#224b4d61{/executors/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5d5d9e5{/executors/threadDump,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#303e3593{/executors/threadDump/json,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4ef27d66{/static,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#62dae245{/,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4b6579e8{/api,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#3954d008{/jobs/job/kill,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#2f94c4db{/stages/stage/kill,null,AVAILABLE,#Spark}
2018-03-07 12:39:36 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver-svc.default.svc:4040
2018-03-07 12:39:36 INFO SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar at spark://spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.3.0.jar with timestamp 1520426376949
2018-03-07 12:39:37 WARN KubernetesClusterManager:66 - The executor's init-container config map is not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 12:39:37 WARN KubernetesClusterManager:66 - The executor's init-container config map key is not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 12:39:42 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2747)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:492)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:70)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:120)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
... 8 more
Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at okhttp3.Dns$1.lookup(Dns.java:39)
at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
at okhttp3.RealCall.execute(RealCall.java:69)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
... 12 more
2018-03-07 12:39:42 INFO AbstractConnector:318 - Stopped Spark#4215838f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-03-07 12:39:42 INFO SparkUI:54 - Stopped Spark web UI at http://spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver-svc.default.svc:4040
2018-03-07 12:39:42 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-03-07 12:39:42 INFO MemoryStore:54 - MemoryStore cleared
2018-03-07 12:39:42 INFO BlockManager:54 - BlockManager stopped
2018-03-07 12:39:42 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-03-07 12:39:42 WARN MetricsSystem:66 - Stopping a MetricsSystem that is not running
2018-03-07 12:39:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-03-07 12:39:42 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2747)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:492)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [spark-pi-b6f8a60df70a3b9d869c4e305518f43a-driver] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:70)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:120)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
... 8 more
Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at okhttp3.Dns$1.lookup(Dns.java:39)
at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
at okhttp3.RealCall.execute(RealCall.java:69)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
... 12 more
2018-03-07 12:39:42 INFO ShutdownHookManager:54 - Shutdown hook called
2018-03-07 12:39:42 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-64fe7ad8-669f-4591-a3f6-67440d450a44
So apparently the Kubernetes scheduler backend cannot reach the Kubernetes API server because it is unable to resolve kubernetes.default.svc. Hmm... why?
I also configured RBAC with a spark service account as mentioned in the documentation, but the same problem occurs (I also tried a different namespace; same problem).
Here are the logs from kube-dns:
I0306 16:04:04.170889 1 dns.go:555] Could not find endpoints for service "spark-pi-b9e8b4c66fe83c4d94a8d46abc2ee8f5-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
I0306 16:04:29.751201 1 dns.go:555] Could not find endpoints for service "spark-pi-0665ad323820371cb215063987a31e05-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
I0306 16:06:26.414146 1 dns.go:555] Could not find endpoints for service "spark-pi-2bf24282e8033fa9a59098616323e267-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
I0307 08:16:17.404971 1 dns.go:555] Could not find endpoints for service "spark-pi-3887031e031732108711154b2ec57d28-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
I0307 08:17:11.682218 1 dns.go:555] Could not find endpoints for service "spark-pi-3d84127226393fc99e2fe035db56bfb5-driver-svc" in namespace "default". DNS records will be created once endpoints show up.
I really can't figure out why those errors come up.
Try switching the pod network to a provider other than Calico, and check whether kube-dns is working properly.
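To see whether in-cluster DNS works at all, you can reproduce the exact lookup the fabric8 client performs (InetAddress.getAllByName on kubernetes.default.svc) from inside a pod. Here is a minimal Python sketch of that resolution step; `resolve` is a hypothetical helper (not part of Spark or fabric8), and it assumes a Python interpreter is available in the container image:

```python
import socket

def resolve(hostname):
    """Mimic the JVM's InetAddress.getAllByName: return all IPv4 addresses or raise."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        # Python analogue of the java.net.UnknownHostException
        # ("kubernetes.default.svc: Try again") seen in the driver log.
        raise RuntimeError("cannot resolve {}: {}".format(hostname, exc)) from exc

# Inside a pod, resolve("kubernetes.default.svc") should return the API
# server's ClusterIP; hitting the RuntimeError path reproduces the
# driver's failure and points at kube-dns rather than at Spark.
```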
To create a custom service account, a user can use the kubectl create serviceaccount command. For example, the following command creates a service account named spark:
$ kubectl create serviceaccount spark
To grant a service account a Role or ClusterRole, a RoleBinding or ClusterRoleBinding is needed. To create a RoleBinding or ClusterRoleBinding, a user can use the kubectl create rolebinding (or clusterrolebinding for a ClusterRoleBinding) command. For example, the following command creates a ClusterRoleBinding that grants the edit ClusterRole to the spark service account created above, in the default namespace:
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Depending on the version and setup of Kubernetes deployed, this default service account may or may not have the role that allows driver pods to create pods and services under the default Kubernetes RBAC policies. Sometimes users may need to specify a custom service account that has the right role granted. Spark on Kubernetes supports specifying a custom service account to be used by the driver pod through the configuration property spark.kubernetes.authenticate.driver.serviceAccountName. For example, to make the driver pod use the spark service account, a user simply adds the following option to the spark-submit command:
spark-submit \
  --master k8s://https://192.168.1.5:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=leeivan/spark:latest \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
I faced the same issue. If you're using minikube, try recreating the cluster with minikube delete followed by minikube start.
Then create the service account and cluster role binding again.
To add to openbrace's answer, and based on Ivan Lee's answer too: if you are using minikube, running the following command was enough for me:
kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default
That way, I didn't have to change spark.kubernetes.authenticate.driver.serviceAccountName when using spark-submit.
Test if pod with your spark image can resolve dns
Create a test-dns.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: testdns
  namespace: default
spec:
  containers:
  - name: testdns
    image: <your-spark-image>
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Apply it:
kubectl apply -f test-dns.yaml
Run nslookup for kubernetes.default.svc:
kubectl exec -ti testdns -- nslookup kubernetes.default.svc
Run it multiple times to check whether your DNS resolves consistently.
After running the nodetool repair command, the Cassandra node went down and did not start again.
INFO [main] 2016-10-19 12:44:50,244 ColumnFamilyStore.java:405 - Initializing system_schema.aggregates
INFO [main] 2016-10-19 12:44:50,247 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 12:44:50,248 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
Cassandra version 3.7
I turned the node back on and it's fine now, but it took very long to start (more than 30 minutes).
INFO [main] 2016-10-19 15:32:48,348 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 15:32:48,354 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
INFO [main] 2016-10-19 16:07:36,529 ColumnFamilyStore.java:405 - Initializing system_distributed.parent_repair_history
INFO [main] 2016-10-19 16:07:36,546 ColumnFamilyStore.java:405 - Initializing system_distributed.repair_history
Now I'm trying to figure out why it is so slow.
I'm trying to simulate a multi-node Mesos cluster using Docker and Zookeeper and trying to run a simple (py)Spark job on top of it. These Docker containers and the pyspark script are all run on the same machine. However, when I execute my Spark script, it hangs at:
No credentials provided. Attempting to register without authentication
The Mesos slave constantly outputs:
I0929 14:59:32.925915 62 slave.cpp:1959] Asked to shut down framework 20150929-143802-1224741292-5050-33-0060 by master@172.17.0.73:5050
W0929 14:59:32.926035 62 slave.cpp:1974] Cannot shut down unknown framework 20150929-143802-1224741292-5050-33-0060
And the Mesos master constantly outputs:
I0929 14:38:15.169683 39 master.cpp:2094] Received SUBSCRIBE call for framework 'test' at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693
I0929 14:38:15.169845 39 master.cpp:2164] Subscribing framework test with checkpointing disabled and capabilities [ ]
E0929 14:38:15.170361 42 socket.hpp:174] Shutdown failed on fd=15: Transport endpoint is not connected [107]
I0929 14:38:15.170409 36 hierarchical.hpp:391] Added framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170534 39 master.cpp:1051] Framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693 disconnected
I0929 14:38:15.170549 39 master.cpp:2370] Disconnecting framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693
I0929 14:38:15.170555 39 master.cpp:2394] Deactivating framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693
E0929 14:38:15.170560 42 socket.hpp:174] Shutdown failed on fd=16: Transport endpoint is not connected [107]
I0929 14:38:15.170593 39 master.cpp:1075] Giving framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693 0ns to failover
W0929 14:38:15.170835 41 master.cpp:4482] Master returning resources offered to framework 20150929-143802-1224741292-5050-33-0000 because the framework has terminated or is inactive
I0929 14:38:15.170855 36 hierarchical.hpp:474] Deactivated framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170990 37 hierarchical.hpp:814] Recovered cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000] (total: cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000], allocated: ) on slave 20150929-051336-1224741292-5050-19-S0 from framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.171820 41 master.cpp:4469] Framework failover timeout, removing framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693
I0929 14:38:15.171835 41 master.cpp:5112] Removing framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b@127.0.1.1:46693
I0929 14:38:15.172130 41 hierarchical.hpp:428] Removed framework 20150929-143802-1224741292-5050-33-0000
The Mesos master Docker image is built with the following Dockerfile:
FROM ubuntu:14.04
ENV MESOS_V 0.24.0
# update
RUN apt-get update
RUN apt-get upgrade -y
# dependencies
RUN apt-get install -y wget openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev
# mesos
RUN wget http://www.apache.org/dist/mesos/${MESOS_V}/mesos-${MESOS_V}.tar.gz
RUN tar -zxf mesos-*.tar.gz
RUN rm mesos-*.tar.gz
RUN mv mesos-* mesos
WORKDIR mesos
RUN mkdir build
RUN ./configure
RUN make
RUN make install
RUN ldconfig
EXPOSE 5050
ENTRYPOINT ["/bin/bash"]
And I manually execute the mesos-master command:
LIBPROCESS_IP=${MASTER_IP} mesos-master --registry=in_memory --ip=${MASTER_IP} --zk=zk://172.17.0.75:2181/mesos --advertise_ip=${MASTER_IP}
The Mesos slave Docker image is built using the same Dockerfile except port 5051 is exposed instead. Then I run the following command in its container:
LIBPROCESS_IP=172.17.0.72 mesos-slave --master=zk://172.17.0.75:2181/mesos
The pyspark script is:
import os
import pyspark
src = 'file:///{}/README.md'.format(os.environ['SPARK_HOME'])
leader_ip = '172.17.0.75'
conf = pyspark.SparkConf()
conf.setMaster('mesos://zk://{}:2181/mesos'.format(leader_ip))
conf.set('spark.executor.uri', 'http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz')
conf.setAppName('my_test_app')
sc = pyspark.SparkContext(conf=conf)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(' '))
word_count = (words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y))
print(word_count.collect())
Here is the complete output of the pyspark script:
15/09/29 11:07:59 INFO SparkContext: Running Spark version 1.5.0
15/09/29 11:07:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/29 11:07:59 WARN Utils: Your hostname, hubble resolves to a loopback address: 127.0.1.1; using 192.168.1.2 instead (on interface em1)
15/09/29 11:07:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/29 11:07:59 INFO SecurityManager: Changing view acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: Changing modify acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ftseng); users with modify permissions: Set(ftseng)
15/09/29 11:08:00 INFO Slf4jLogger: Slf4jLogger started
15/09/29 11:08:00 INFO Remoting: Starting remoting
15/09/29 11:08:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.2:38860]
15/09/29 11:08:00 INFO Utils: Successfully started service 'sparkDriver' on port 38860.
15/09/29 11:08:00 INFO SparkEnv: Registering MapOutputTracker
15/09/29 11:08:00 INFO SparkEnv: Registering BlockManagerMaster
15/09/29 11:08:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-28695bd2-fc83-45f4-b0a0-eefcfb80a3b5
15/09/29 11:08:00 INFO MemoryStore: MemoryStore started with capacity 530.3 MB
15/09/29 11:08:00 INFO HttpFileServer: HTTP File server directory is /tmp/spark-89444c7a-725a-4454-87db-8873f4134580/httpd-341c3da9-16d5-43a4-93ee-0e8b47389fdb
15/09/29 11:08:00 INFO HttpServer: Starting HTTP Server
15/09/29 11:08:00 INFO Utils: Successfully started service 'HTTP file server' on port 51405.
15/09/29 11:08:00 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/29 11:08:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/29 11:08:00 INFO SparkUI: Started SparkUI at http://192.168.1.2:4040
15/09/29 11:08:00 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@716: Client environment:host.name=hubble
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0-25-generic
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@725: Client environment:os.version=#26-Ubuntu SMP Fri Jul 24 21:17:31 UTC 2015
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@733: Client environment:user.name=ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@741: Client environment:user.home=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=172.17.0.75:2181 sessionTimeout=10000 watcher=0x7fc0962b7176 sessionId=0 sessionPasswd=<null> context=0x7fc078001860 flags=0
I0929 11:08:00.651923 32328 sched.cpp:164] Version: 0.24.0
2015-09-29 11:08:00,652:32221(0x7fc06bfff700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.75:2181]
2015-09-29 11:08:00,657:32221(0x7fc06bfff700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.75:2181], sessionId=0x150177fcfc40014, negotiated timeout=10000
I0929 11:08:00.658051 32322 group.cpp:331] Group process (group(1)@127.0.1.1:48692) connected to ZooKeeper
I0929 11:08:00.658083 32322 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0929 11:08:00.658100 32322 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0929 11:08:00.659600 32326 detector.cpp:156] Detected a new leader: (id='2')
I0929 11:08:00.659904 32325 group.cpp:674] Trying to get '/mesos/json.info_0000000002' in ZooKeeper
I0929 11:08:00.661052 32326 detector.cpp:481] A new leading master (UPID=master@172.17.0.73:5050) is detected
I0929 11:08:00.661201 32320 sched.cpp:262] New master detected at master@172.17.0.73:5050
I0929 11:08:00.661798 32320 sched.cpp:272] No credentials provided. Attempting to register without authentication
After a lot more experimentation, it looks like it was an issue with the IP address the driver was using: it picked the host machine's local network address (192.168.xx.xx) when it should have been using its Docker bridge IP (172.17.xx.xx).
I managed to get things running with:
LIBPROCESS_IP=172.17.xx.xx python test_spark.py
I'm now hitting a different error, but it seems unrelated, so I think this command solves my problem.
I'm not familiar enough with Mesos/Spark yet to understand why this fixes things, so if someone wants to add an explanation, that would be very helpful.
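One plausible reading, offered tentatively: LIBPROCESS_IP is the address the Mesos libprocess layer binds and advertises, and without it the driver registered at a loopback-derived address (note the scheduler-…@127.0.1.1 endpoints in the logs above), which the containers cannot route back to, while the 172.17.xx.xx bridge address is reachable from them. If that is right, one way to pick the value to export is to ask the OS which local address routes toward the master. This is a hypothetical stdlib-only sketch, not part of Mesos or Spark:

```python
import socket

def local_ip_towards(host, port=5050):
    """Return the local source address the OS would use to reach host:port.

    A UDP connect() sends no packets; the kernel only selects a route,
    so this works even if nothing is listening on the port.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((host, port))
        return s.getsockname()[0]
    finally:
        s.close()

# With the master container at 172.17.0.73 (as in the logs above), this
# would return the host's 172.17.xx.xx bridge address, which could then
# be exported as LIBPROCESS_IP before launching the pyspark script.
```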