I can connect to the driver just fine by adding the following:
spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=9178 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
But doing ...
spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=9178 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
... only yield a bunch of errors on the driver ...
Container id: container_1501548048292_0024_01_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
... and finally crashes the job.
There are no errors on the workers, it simply exits with:
[org.apache.spark.util.ShutdownHookManager] - Shutdown hook called
Spark v2.2.0, and the cluster is a simple 1m-2w-configuration, and my jobs run without issues without the executor parameters.
As Rick Mortiz pointed out, the issue was conflicting ports for the executor jmx.
Setting:
-Dcom.sun.management.jmxremote.port=0
yields a random port, and removed the errors from Spark. To figure out which port it ends up using do:
netstat -alp | grep LISTEN.*<executor-pid>/java
which lists the currently open ports for that process.
Passing following configuration with spark-submit worked for me
--conf "spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9178 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
Related
I'm trying to get Spark History Server to run on my cluster that is running on Kubernetes, and I'd like the logs to get written to minIO. I'm also using minIO as storage of the input and output of my spark-submit jobs, which is working already.
Currectly working spark-submit jobs
My working spark-submit job looks something like the following:
spark-submit \
--conf spark.hadoop.fs.s3a.access.key=XXXX \
--conf spark.hadoop.fs.s3a.secret.key=XXXX \
--conf spark.hadoop.fs.s3a.endpoint=https://someIpv4 \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=true \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.default.name="s3a:///" \
--conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX \
--conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX \
...
As you can see, I'm using SSL to connect to minIO and to read/write files.
What am I trying
I'm trying to spin up the history server with minIO as storage without using SSL.
To start up the history server, I'm using the already present start-history-server.sh script with some configs to define the log storage location with the ./start-history-server.sh --properties-file my_conf_file command. my_conf_file looks like this:
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://myBucket/spark-events
spark.history.fs.logDirectory=s3a://myBucket/spark-events
spark.hadoop.fs.s3a.access.key=XXXX
spark.hadoop.fs.s3a.secret.key=XXXX
spark.hadoop.fs.s3a.endpoint=http://someIpv4
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
So you see I'm not adding any SSL parameters. But when I run ./start-history-server.sh --properties-file my_conf_file, I'm getting this error:
INFO AmazonHttpClient: Unable to execute HTTP request: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:121)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:296)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
What have I tried/found on the internet
This person had a very similar problem to mine, but it seems like they solved it using spark.hadoop.fs.s3a.path.style.access, which I'm already using
I was able to spin up History server using the local filesystem, so that seems to be working correctly
I have seen people, like in this post, using the spark.hadoop.fs.s3a.impl key with org.apache.hadoop.fs.s3a.S3AFileSystem as value. When I do this, however, It seems like this class doesn't exist within my AWS jars.
I have the following AWS jars at my disposal: aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar
Since my spark-submit jobs are running fine, reading/writing away files to minIO, and I'm not supplying that spark.hadoop.fs.s3a.impl parameter in them I would think that that parameter is not needed?
Does anyone have an idea of where I should be looking/what I'm doing wrong?
My problem was that actually my minIO did not accept http requests. My already working spark submit job was using https using SSL, so I added the needed parameters to $SPARK_DAEMON_JAVA_OPTS and it was working.
I followed the Spark on Kubernetes blog but got to a point where it runs the job but fails inside the worker pods with an file access error.
2018-05-22 22:20:51 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 172.17.0.15, executor 3): java.nio.file.AccessDeniedException: ./spark-examples_2.11-2.3.0.jar
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:243)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:632)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:603)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:478)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The command i use to run the SparkPi example is :
$DIR/$SPARKVERSION/bin/spark-submit \
--master=k8s://https://192.168.99.101:8443 \
--deploy-mode=cluster \
--conf spark.executor.instances=3 \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=172.30.1.1:5000/myapp/spark-docker:latest \
--conf spark.kubernetes.namespace=$namespace \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-pi-driver \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
On working through the code it seems like the spark jar files are being copied to an internal location inside the container. But:
Should this happen since they are local and are already there
If the do need to be copied to another location in the container how do i make this part of the container writable since it is created by the master node.
RBAC has been setup as follows: (oc get rolebinding -n myapp)
NAME ROLE USERS GROUPS SERVICE ACCOUNTS SUBJECTS
admin /admin developer
spark-role /edit spark
And the service account (oc get sa -n myapp)
NAME SECRETS AGE
builder 2 18d
default 2 18d
deployer 2 18d
pusher 2 13d
spark 2 12d
Or am i doing something silly here?
My kubernetes system is running inside Docker Machine (via virtualbox on osx)
I am using:
openshift v3.9.0+d0f9aed-12
kubernetes v1.9.1+a0ce1bc657
Any hints on solving this greatly appreciated?
I know this is an 5m old post, but it looks that there's not enough information related to this issue around, so I'm posting my answer in case it can help someone.
It looks like you are not running the process inside the container as root, if that's the case you can take a look at this link (https://github.com/minishift/minishift/issues/2836).
Since it looks like you are also using openshift you can do:
oc adm policy add-scc-to-user anyuid -z spark-sa -n spark
In my case I'm using kubernetes and I need to use runAsUser:XX. Thus I gave group read/write access to /opt/spark inside the container and that solved the issue, just add the following line to resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.
RUN chmod g+rwx -R /opt/spark
Of course you have to re-build the docker images manually or using the provided script like shown below.
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG build
./bin/docker-image-tool.sh -r YOUR_REPO -t YOUR_TAG push
I have a cluster with Cloudera 5.10.
For profiling I'm running spark-submit with parameters:
--conf "spark.driver.extraJavaOptions= -agentpath:/root/yjp-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
--conf "spark.executor.extraJavaOptions= -agentpath:/root/yjp-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
And it is working good only for driver. When i'm using this option for executor i'm getting the exception
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
I couldn't find any useful logs and the same exception I've got on every node.
The same if I'am using this manual enter link description here
And if I leave only configuration for driver, everything is working fine and i can use YourKit to connect to the driver
What can be the problem?
May be an executor launches 32-bit JVM? So path to 32-bit YourKit agent should be specified?
I experianced the same issue. You have to install YourKit on all nodes in the cluster.
I am trying Spark Sql on a dataset ~16Tb with large number of files (~50K). Each file is roughly 400-500 Megs.
I am issuing a fairly simple hive query on the dataset with just filters (No groupBy's and Joins) and the job is very very slow. It runs for 7-8 hrs and processes about 80-100 Gigs on a 12 node cluster.
I have experimented with different values of spark.sql.shuffle.partitions from 20 to 4000 but havn't seen lot of difference.
From the logs I have the yarn error attached at end [1]. I have got the below spark configs [2] for the job.
Is there any other tuning I need to look into. Any tips would be appreciated,
Thanks
2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
1. Yarn Error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
I have explored the container logs but did not get lot of information from it.
I have seen this error log for few containers but not sure of the cause for it.
1. java.lang.NullPointerException at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
2. java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
I am using Horton Works Cluster (2 Node cluster) to run the spark and flume , So when I am running the job with --master "local[*]" , Flume is able to send the events and Spark is also able to receive and on checking at localhost:4040 I can see the events are being received from the flume. (We are pumping 100 Events/Sec from flume using flume-ng-sql source with an approx size of ~1KB each)
Where as when I run the same example with --master "yarn-client" , I am getting the below error in flume and spark is not getting any events as well.
2015-08-13 18:24:24,927 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to send events
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:403)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 55555 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:222)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:283)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:360)
... 3 more
Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:55555
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:168)
... 10 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:496)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:452)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:365)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
^
Also below observation has been observed in cluster:
-- Memory consumption using yarn is pretty much higher than compared to that being used in case of local.
-- Also when I am pumping 100 events per 30 second then Flume and spark are able to connect and process the same using yarn-client as well as local..
Below is the command which I am using for flume and spark.
Flume:
sudo -u hdfs flume-ng agent --conf conf/ -f conf/flume_mysql_spark.conf -n agent1 -Dflume.root.logger=INFO,console > flumelog.txt
Spark:
sudo -u hdfs spark-submit --master "yarn-client" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
sudo -u hdfs spark-submit --master "local[*]" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
Kindly l;et me know what could be wrong over here?
It got solves as below:
1 - If running as local give IP of local machine in Flume as well as spark.
2 - If running as cluster (yarn-client or yarn-cluster) give IP of the machine in cluster where you want to send the events (other than the one where you are executing the program so may be give IP of node which is not a master node) machine in Flume as well as spark.
Let me know if I am wrong and this could have worked for some other reason and any better solution is there for the same.