I have just rolled out a Hadoop/Spark cluster in an effort to kick-start a data science program at my company. I used Ambari as the manager and installed the Hortonworks distribution (HDFS 2.7.3, Hive 1.2.1, Spark 2.1.1, as well as the other required services). By the way, I am running RHEL 7. I have 2 name nodes, 10 data nodes, 1 Hive node and 1 management node (Ambari).
I built a list of firewall ports based on Apache and Ambari documentation and had my infrastructure guys push those rules. I ran into an issue with Spark wanting to pick random ports. When I attempted to run a Spark job (the traditional Pi example), it would fail, as I did not have the whole ephemeral port range open. Since we will probably be running multiple jobs, it makes sense to let Spark handle this and just choose from the ephemeral range of ports (1024 - 65535) rather than specifying a single port. I know I can pick a range, but to make it easy I just asked my guys to open the whole ephemeral range. At first my infrastructure guys balked at that, but when I told them the purpose, they went ahead and did so.
Based on that, I thought I had my issue fixed, but when I attempt to run a job, it still fails with:
Log Type: stderr
Log Upload Time: Thu Oct 12 11:31:01 -0700 2017
Log Length: 14617
Showing 4096 bytes of 14617 total.
Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:52 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:53 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:54 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:55 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:56 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:59 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:00 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:01 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:02 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:03 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:04 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:05 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:07 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:09 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:10 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:11 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:12 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:13 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:14 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:607)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:283)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:783)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:781)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:804)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
17/10/12 11:29:15 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
17/10/12 11:29:15 INFO ShutdownHookManager: Shutdown hook called
At first I thought maybe I had some sort of misconfiguration with Spark and the namenodes/datanodes. However, to test it, I simply stopped firewalld on every node and attempted the job again and it worked just fine.
So, my question: I have the entire 1024 - 65535 port range open, and I can see the Spark drivers are trying to connect on those high ports (as shown above, in the 30k - 40k range). However, for some reason, when the firewall is on the job fails, and when it's off it works. I checked the firewall rules and, sure enough, the ports are open. Those rules are working, as I can access the web services for Ambari, YARN and HDFS, which are specified in the same firewalld XML rules file.
I am new to Hadoop/Spark, so I am wondering is there something I am missing? Is there some lower port under 1024 I need to account for? Here is a list of the ports below 1024 I have open, in addition to the 1024 - 65535 port range:
88
111
443
1004
1006
1019
It's quite possible I missed a lower number port that I really need and just don't know it. Above that, everything else should be handled by the 1024 - 65535 port range.
OK, so working with some folks on the Hortonworks community, I was able to come up with a solution. Basically, you need to have at least one port defined, but you can extend that into a range by setting spark.port.maxRetries = xxxxx. By combining this setting with spark.driver.port = xxxxx, you get a range that starts at spark.driver.port and ends at spark.driver.port plus spark.port.maxRetries.
If you use Ambari as your manager, the settings are under the "Custom spark2-defaults" section (I assume that under a fully open-source stack install, this would just be a setting under the normal Spark config).
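As a rough illustration of what such entries might look like, here is a sketch of the three properties. The values are assumptions: 40000 and 40033 are the example ports mentioned in this answer, and 32 for spark.port.maxRetries is inferred from the advice to use blocks of 32. They are not copied from the original settings.

```properties
# Custom spark2-defaults (illustrative values only)
# First port the driver tries
spark.driver.port=40000
# First port the block manager tries (starts after the driver's block)
spark.blockManager.port=40033
# Each service may try its start port plus up to 32 higher ports
spark.port.maxRetries=32
```

With values like these, the firewall would need to allow TCP 40000 through 40065 between the nodes rather than the whole unprivileged range.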
I was advised to separate these ports into 32-count blocks; for example, if you start at 40000 for your driver, you should start spark.blockManager.port at 40033, and so on. See this post:
https://community.hortonworks.com/questions/141484/spark-jobs-failing-firewall-issue.html
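The arithmetic behind those blocks can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `SparkPortPlan` and its methods are hypothetical helper names (not part of Spark), and the retry count of 32 is inferred from the "32-count blocks" advice. It relies on the documented behaviour that each retry increments the previous port by 1, so a service can end up anywhere from its start port to start port + spark.port.maxRetries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: plans non-overlapping port blocks for Spark services. */
public class SparkPortPlan {

    /** Highest port a service may try: its start port plus spark.port.maxRetries. */
    public static int lastPort(int startPort, int maxRetries) {
        return startPort + maxRetries;
    }

    /** Start port for the next service, directly after the previous block ends. */
    public static int nextBlockStart(int startPort, int maxRetries) {
        return lastPort(startPort, maxRetries) + 1;
    }

    public static void main(String[] args) {
        int maxRetries = 32;     // assumed value for spark.port.maxRetries
        int driverPort = 40000;  // example spark.driver.port from the answer
        int blockManagerPort = nextBlockStart(driverPort, maxRetries);

        // Keep insertion order so the output lists the driver first.
        Map<String, Integer> services = new LinkedHashMap<>();
        services.put("spark.driver.port", driverPort);
        services.put("spark.blockManager.port", blockManagerPort);

        // Print the TCP range the firewall would need to allow for each service.
        for (Map.Entry<String, Integer> e : services.entrySet()) {
            System.out.printf("%s: open TCP %d-%d%n",
                    e.getKey(), e.getValue(), lastPort(e.getValue(), maxRetries));
        }
    }
}
```

Run as is, this reports 40000-40032 for the driver and 40033-40065 for the block manager, which matches the 40000/40033 spacing suggested above.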
Related
I am trying to set up a Spark standalone cluster with 3 nodes. The configurations of the Linux servers are below:
master node with 2 cores and 25GB memory
worker node 1 with 4 cores and 21GB memory
worker node 2 with 8 cores and 19GB memory
I have started the master node successfully and its URL is spark://IP:7077
When I start either worker node with the command ./sbin/start-worker.sh spark://IP:7077, I get the error message below:
22/11/25 09:13:58 INFO Worker: Connecting to master IP:7077...
22/11/25 09:14:54 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
22/11/25 09:14:54 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to /IP:7077 timed out (120000 ms)
22/11/25 09:14:54 WARN Worker: Failed to connect to master IP:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
OpenJDK 11.0.17 is the Java version installed on all 3 nodes.
Any suggestions for resolving this issue would be helpful.
I am on a Kerberized Hadoop cluster. If I do not impersonate the user on the Spark Thrift Server, it works well. But when I enable impersonation, I get an authentication error with the metastore.
I followed these documents:
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts-user-imp.html
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/ref-5422cb60-d1d5-425a-b719-ec7bd03ee5d3.1.html
Step 1:
Set hive.server2.enable.doAs = true in Advanced spark-hive-site-override
Add spark.jars = /usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar in Custom spark-thrift-sparkconf
Step 2: in Advanced hiveserver2-site
Set hive.security.authorization.enabled = true
Set hive.server2.enable.doAs = true
Set hive.metastore.pre.event.listeners = org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
Set hive.security.metastore.authorization.manager = org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
Step 3:
I created a user keytab and principal and ran kinit.
Run cli: beeline -u 'jdbc:hive2://<host>:<port>/default;principal=spark3/<HOST>#<REAM>;auth=KERBEROS;transportMode=binary'
Result:
Connecting to jdbc:hive2://<host>:<port>/default;principal=spark3/<HOST>#<REAM>;auth=KERBEROS;transportMode=binary
Connected to: Spark SQL (version 3.2.2)
Driver: Hive JDBC (version 3.1.0.3.1.4.0-315)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.0.3.1.4.0-315 by Apache Hive
Run cli: show databases;
And I'm facing an error like this
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
....
Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
....
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
....
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
....
Caused by: java.lang.reflect.InvocationTargetException
....
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed
I checked the Spark Thrift Server log and saw the following:
22/10/07 15:07:31 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V10
22/10/07 15:07:31 INFO HiveSessionImpl: Operation log session directory is created: /tmp/spark3/operation_logs/64eb19a6-1bdc-4ed8-81c9-8881c4251e75
22/10/07 15:07:31 INFO metastore: Trying to connect to metastore with URI thrift://<host>:<port>
22/10/07 15:07:32 INFO metastore: Opened a connection to metastore, current connections: 1
22/10/07 15:07:32 INFO metastore: Connected to metastore.
22/10/07 15:07:39 INFO SparkExecuteStatementOperation: Submitting query 'show databases' with fdcf90cb-74bb-4574-99b7-bfd981ce8010
22/10/07 15:07:39 INFO SparkExecuteStatementOperation: Running query with fdcf90cb-74bb-4574-99b7-bfd981ce8010
22/10/07 15:07:39 INFO metastore: Closed a connection to metastore, current connections: 0
22/10/07 15:07:39 INFO metastore: Trying to connect to metastore with URI thrift://<host>:<port>
22/10/07 15:07:39 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
22/10/07 15:07:39 WARN metastore: Failed to connect to the MetaStore Server...
22/10/07 15:07:39 INFO metastore: Waiting 5 seconds before next connection attempt.
22/10/07 15:07:44 INFO metastore: Trying to connect to metastore with URI thrift://<host>:<port>
22/10/07 15:07:44 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
The test connection to the Spark Thrift Server succeeds, but when I run a query I get the error above. Where am I going wrong?
Spark Thrift Server is built upon a single Spark application. Unfortunately, it does not support impersonation yet.
Maybe you can try Apache Kyuubi https://github.com/apache/incubator-kyuubi
I have a Kafka streaming consumer job which stores its data in a Hive table.
The issue is that the job sometimes fails because of Hive and has to be restarted.
This is happening with all the streaming consumer jobs that access Hive.
I have not been able to trace the reason.
I would appreciate your suggestions.
Below is the high-level error log:
INFO ApplicationMaster: Starting the user application in a separate Thread
INFO ApplicationMaster: Waiting for spark context initialization...
WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
INFO metastore: Trying to connect to metastore with URI thrift://<server2:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
INFO metastore: Trying to connect to metastore with URI thrift://<server2:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.reflect.InvocationTargetException
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: Peer indicated failure: DIGEST-MD5: IO error acquiring password
INFO metastore: Trying to connect to metastore with URI thrift://<server2:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
INFO metastore: Trying to connect to metastore with URI thrift://<server2:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
INFO metastore: Trying to connect to metastore with URI thrift://<server2:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Trying to connect to metastore with URI thrift://<server1:port1>
WARN metastore: Failed to connect to the MetaStore Server...
INFO metastore: Waiting 1 seconds before next connection attempt.
ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Caused by: java.lang.reflect.InvocationTargetException
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: Peer indicated failure: DIGEST-MD5: IO error acquiring password
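I am trying to connect to a Spark cluster from a remote system. My Java code is shown below.
Java code:
import org.apache.spark.SparkConf;

// Point the driver at the standalone master and pin the driver host and port
SparkConf conf = new SparkConf()
.setAppName("Java API demo")
.setMaster("spark://192.168.XX.XX:7077")
.set("spark.driver.host","192.168.XX.XX")
.set("spark.driver.port","9929");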
It gives me this error:
Error message:
16/04/14 16:08:42 ERROR NettyTransport: failed to bind to /192.168.XX.XX:9929, shutting down Netty transport
16/04/14 16:08:42 WARN Utils: Service 'sparkDriver' could not bind on port 9929. Attempting port 9930.
16/04/14 16:08:42 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
First of all, is this possible in Spark? I am using Spark version 1.4. Thanks in advance.
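One way to narrow this down is a small stand-alone check of whether the machine running the driver can listen on the configured address at all. This is a sketch, not a diagnosis: it assumes the bind failure means either that the spark.driver.host value is not an address of the local machine or that the port is already taken. `BindCheck` is a hypothetical helper written with plain java.net classes; it is not part of Spark.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper: tests whether this machine can listen on host:port. */
public class BindCheck {

    /** Returns true if a listening socket can be opened on host:port. */
    public static boolean canBind(String host, int port) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(InetAddress.getByName(host), port));
            return true;
        } catch (IOException e) {
            // Typical causes: the address is not assigned to a local interface,
            // or another process already holds the port.
            System.err.println("Cannot bind " + host + ":" + port + " -> " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass the real spark.driver.host and
        // spark.driver.port values as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9929;
        System.out.println(host + ":" + port + " bindable: " + canBind(host, port));
    }
}
```

Running it on the remote system with the same host and port used in the SparkConf shows whether the failure is reproducible outside Spark. If it is, the address or port setting is the place to look first.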
I am getting an error in launching the standalone Spark driver in cluster mode. As per the documentation, it is noted that cluster mode is supported in the Spark 1.2.1 release. However, it is currently not working properly for me. Please help me in fixing the issue(s) that are preventing the proper functioning of Spark.
I have a 3-node Spark cluster: node 1, node 2 and node 3.
I am running the command below on node 1 to deploy the driver:
/usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.fst.firststep.aggregator.FirstStepMessageProcessor --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 --deploy-mode cluster --supervise file:///home/xyz/sparkstreaming-0.0.1-SNAPSHOT.jar /home/xyz/config.properties
The driver gets launched on node 2 in the cluster, but I get an exception on node 2 saying that it is trying to bind to the node 1 IP.
2015-02-26 08:47:32 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie is: off
2015-02-26 08:47:32 INFO Slf4jLogger:80 - Slf4jLogger started
2015-02-26 08:47:33 ERROR NettyTransport:65 - failed to bind to ec2-xx.xx.xx.xx.compute-1.amazonaws.com/xx.xx.xx.xx:0, shutting down Netty transport
2015-02-26 08:47:33 WARN Utils:71 - Service 'Driver' could not bind on port 0. Attempting port 1.
2015-02-26 08:47:33 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie is: off
2015-02-26 08:47:33 ERROR Remoting:65 - Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136)
at akka.remote.Remoting.start(Remoting.scala:201)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1765)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1756)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:33)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: ec2-xx-xx-xx.compute-1.amazonaws.com/xx.xx.xx.xx:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
Kindly suggest a fix.
Thanks
It is not possible to bind to port 0. There are errors in your Spark configuration. Specifically, look at
spark.webui.port
It is probably set to 0.