How to find spark master URL on Amazon EMR

I am new to Spark and am trying to install Spark 1.3.1 on an Amazon cluster. When I do
SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("local[2]");
it works for me. However, I learned that local[2] is meant for testing only.
When I tried to use cluster mode, I changed it to
SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("spark://localhost:7077");
With this I get the error below:
Tried to associate with unreachable remote address [akka.tcp://sparkMaster@localhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused
15/06/10 15:22:21 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@localhost:7077/user/Master..
Could someone please let me know how to set the master URL?

If you are using the bootstrap action from https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, the configuration is set up for Spark on YARN, so just set the master to yarn-client or yarn-cluster. Be sure to define the number of executors along with their memory and cores. More details about Spark on YARN are at https://spark.apache.org/docs/latest/running-on-yarn.html
Addition regarding executor settings for memory and core sizing:
Take a look at the default YARN NodeManager configs for each instance type at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html, specifically yarn.scheduler.maximum-allocation-mb. You can determine the number of cores from the basic EC2 info URL (http://aws.amazon.com/ec2/instance-types/). The executor memory has to fit within the maximum allocation less Spark's overhead, and it is set in increments of 256MB. A good description of this calculation is at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Don't forget that a little over half of the executor memory can be used for the RDD cache.
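The sizing rule above can be sketched as a small calculation. This is only an illustration under stated assumptions: the overhead rule max(384 MB, 7% of executor memory) is the default described in the linked Cloudera article for Spark versions of that era, the 256MB step comes from the answer, and the node values in main (11520 MB maximum allocation, 4 cores, 6 executors) are hypothetical. The class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExecutorSizing {
    // Assumed overhead rule: max(384 MB, 7% of executor memory).
    static final int MIN_OVERHEAD_MB = 384;
    static final int OVERHEAD_PERCENT = 7;
    // Step size taken from the answer ("increments of 256MB").
    static final int STEP_MB = 256;

    /** Off-heap overhead YARN is asked for on top of the executor memory. */
    static int overheadMb(int executorMemoryMb) {
        int percentShare = (executorMemoryMb * OVERHEAD_PERCENT + 99) / 100; // rounded up
        return Math.max(MIN_OVERHEAD_MB, percentShare);
    }

    /** Largest multiple of 256 MB that, plus overhead, fits in the YARN maximum allocation. */
    static int maxExecutorMemoryMb(int yarnMaxAllocationMb) {
        int candidate = (yarnMaxAllocationMb / STEP_MB) * STEP_MB;
        while (candidate > 0 && candidate + overheadMb(candidate) > yarnMaxAllocationMb) {
            candidate -= STEP_MB;
        }
        return candidate;
    }

    /** Settings one might pass to SparkConf or spark-submit for a YARN client-mode run. */
    static Map<String, String> sparkSettings(int yarnMaxAllocationMb, int cores, int executors) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("spark.master", "yarn-client");
        settings.put("spark.executor.memory", maxExecutorMemoryMb(yarnMaxAllocationMb) + "m");
        settings.put("spark.executor.cores", Integer.toString(cores));
        settings.put("spark.executor.instances", Integer.toString(executors));
        return settings;
    }

    public static void main(String[] args) {
        // Hypothetical node: 11520 MB maximum allocation, 4 cores per executor, 6 executors.
        sparkSettings(11520, 4, 6).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each resulting key/value pair could then be applied with SparkConf.set(key, value) in the application from the question, in place of the hard-coded setMaster("spark://localhost:7077").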

Related

On kubernetes my spark worker pod is trying to access thrift pod by name

Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
So this setup is working pretty well for everything except the setup of the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the thrift server pod). By working well I mean that from outside my cluster I can create Spark jobs and submit them to the master, and then using the Web UI I can see that my test app ran to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh, it submits a job to the cluster with itself as the driver (I believe this is correct behavior). When I look at the related Spark job via the Web UI, I see it has workers, and they repeatedly get hatched and then exit shortly thereafter. When I look at the workers' stderr logs, I see that every worker launches and tries to connect back to the thrift server pod at the spark.driver.port. This is correct behavior, I believe. The gotcha is that the connection fails with an unknown host exception: it uses the raw Kubernetes pod name of the thrift server pod (not a service name, and with no IP in the name) and reports that it can't find the thrift server that initiated the connection. Kubernetes DNS stores service names, and pod names only when prefixed with their private IP. In other words, the raw name of the pod (without an IP) is never registered with the DNS. That is not how Kubernetes works.
So my question. I am struggling to figure out why the spark worker pod is using a raw pod name to try to find the thrift server. It seems it should never do this and that it should be impossible to ever satisfy that request. I have wondered if there is some spark config setting that would tell the workers that the (thrift) driver it needs to be searching for is actually spark-thriftserver.my-namespace.svc. But I can't find anything having done much searching.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a worker log that fails. Notice the raw pod name of the thrift server for driver-url. If anyone has any clue what steps to take to fix this please let me know. I'll edit this post and share settings etc as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker@172.17.0.6:37369"

Spark app fails after ACCEPTED state for a long time. Log says Socket timeout exception

I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes and 1 resource manager node. I tried to run the SparkPi example in cluster mode. The spark-submit is done from my local machine. YARN accepts the job, but the application UI says
this. Further, in the terminal where I submitted the job, it says
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until it fails. Upon failure it prints
I tried increasing spark.executor.heartbeatInterval to 3600 seconds, with no luck. I also tried running the code from the name node, thinking there might be a connection issue with my local machine, but I am still unable to run it.

Found the answer, although I don't know why it works! Adding the private IP address to the security group in AWS did the trick.

How to control the number of Hadoop IPC retry attempts for a Spark job submission?

Suppose I attempt to submit a Spark (2.4.x) job to a Kerberized cluster, without having valid Kerberos credentials. In this case, the Spark launcher tries repeatedly to initiate a Hadoop IPC call, but fails:
20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
This will repeat a number of times (30, in my case), until eventually the launcher gives up and the job submission is considered failed.
Various other similar questions mention these properties (which are actually YARN properties, prefixed with spark. as per the standard mechanism for passing them with a Spark application):
spark.yarn.maxAppAttempts
spark.yarn.resourcemanager.am.max-attempts
However, neither of these properties affects the behavior I'm describing. How can I control the number of IPC retries in a Spark job submission?

After a good deal of debugging, I figured out the properties involved here.
yarn.client.failover-max-attempts (controls the max attempts)
Without specifying this, the number of attempts appears to come from the ratio of these two properties (numerator first, denominator second).
yarn.resourcemanager.connect.max-wait.ms
yarn.client.failover-sleep-base-ms
Of course as with any YARN properties, these must be prefixed with spark.hadoop. in the context of a Spark job submission.
The relevant class (which resolves all these properties) is RMProxy, within the Hadoop YARN project (source here). All these, and related, properties are documented here.
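The relationship described above can be sketched in a few lines. This is a hedged illustration, not the RMProxy code itself: the class and method names are invented for the example, and the values 900000 ms and 30000 ms are assumed inputs chosen because their ratio matches the 30 attempts seen in the question. Only the property names and the spark.hadoop. prefix come from the answer; --conf is the usual spark-submit flag for passing such properties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IpcRetrySettings {
    // Prefix that forwards a Hadoop/YARN property through a Spark job submission.
    static final String SPARK_HADOOP_PREFIX = "spark.hadoop.";

    /**
     * Attempts used when yarn.client.failover-max-attempts is not set:
     * yarn.resourcemanager.connect.max-wait.ms divided by yarn.client.failover-sleep-base-ms.
     */
    static long derivedAttempts(long connectMaxWaitMs, long failoverSleepBaseMs) {
        return connectMaxWaitMs / failoverSleepBaseMs;
    }

    /** Spark-side property that caps the attempts explicitly. */
    static Map<String, String> cappedAttempts(int maxAttempts) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(SPARK_HADOOP_PREFIX + "yarn.client.failover-max-attempts",
                 Integer.toString(maxAttempts));
        return conf;
    }

    /** Renders the properties as spark-submit arguments ("--conf key=value"). */
    static List<String> toSubmitArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            args.add("--conf");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // Assumed example values for the two ratio properties.
        System.out.println("derived attempts: " + derivedAttempts(900_000L, 30_000L));
        System.out.println(String.join(" ", toSubmitArgs(cappedAttempts(3))));
    }
}
```

Capping the attempts at a small number, such as the hypothetical 3 used here, makes a submission without valid Kerberos credentials fail quickly instead of sleeping between many retries.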

Increasing network load in HDFS traffic with stream jobs and Kafka

We are seeing unexplained behaviour with our new EMR setup, which includes:
EMR 5.16 (3 nodes - c4.8xlarge and 1 master - c4.8xlarge)
Kafka Cluster based on ECS
We are running a simple streaming job that reads from a Kafka topic, applies some logic, and uses writeStream to write back to a Kafka topic (with checkpointLocation set to an HDFS path).
The "problem" is that in Ganglia I can see increasing network traffic going from the driver (which runs on one of the slaves) to the master server.
From a simple pcap file I can see that the traffic belongs to port 50010 (Hadoop Data Transfer), and here I am at a dead end.
Some help needed, thanks!

After some investigation and viewing the payload of the traffic, it turned out to be the logs being sent to the master! They were delivered to the Spark history server and stored in HDFS.
I just needed to add this config to my spark-submit: --conf spark.eventLog.enabled=false

Apache Spark behavior when a node in a cluster fails.

What's the behavior when a partition is sent to a node and the node crashes right before executing a job? If a new node is introduced into the cluster, what's the entity that detects the addition of this new machine? Does the new machine get assigned the partition that didn't get processed?

The master considers a worker to have failed if it has not received a heartbeat message from it for the past 60 seconds (according to spark.worker.timeout). In that case the partition is assigned to another worker (remember that a partitioned RDD can be reconstructed even if it is lost).
As for whether a new node introduced into the cluster is detected: the Spark master does not pick up new nodes from its configuration once the slaves are started. Before an application is submitted, sbin/start-master.sh starts the master, and sbin/start-slaves.sh reads the conf/slaves file (which contains the IP addresses of all slaves) on the master machine and starts a slave instance on each machine listed. The master does not re-read this file after startup, so a node added to conf/slaves later is not started automatically. In standalone mode, a worker that is started by hand on the new machine and pointed at the master URL does register itself with the master.
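The heartbeat timeout described above can be illustrated with a small bookkeeping sketch. This is not Spark's implementation: the class, its methods, and the worker ids are hypothetical, and only the 60 second value mirrors the spark.worker.timeout figure quoted in the answer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HeartbeatTracker {
    private final long timeoutMs;
    // Last heartbeat time per worker id; TreeMap keeps the iteration order stable.
    private final Map<String, Long> lastHeartbeatMs = new TreeMap<>();

    HeartbeatTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** A worker's first heartbeat also acts as its registration in this sketch. */
    void recordHeartbeat(String workerId, long nowMs) {
        lastHeartbeatMs.put(workerId, nowMs);
    }

    boolean isAlive(String workerId) {
        return lastHeartbeatMs.containsKey(workerId);
    }

    /** Drops every worker whose last heartbeat is older than the timeout and returns their ids. */
    List<String> removeTimedOut(long nowMs) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeatMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > timeoutMs) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        HeartbeatTracker tracker = new HeartbeatTracker(60_000L); // 60 s, as in the answer
        tracker.recordHeartbeat("worker-1", 0L);
        tracker.recordHeartbeat("worker-2", 50_000L);
        // At t = 70 s only worker-1 has been silent for more than 60 s.
        System.out.println("timed out: " + tracker.removeTimedOut(70_000L));
    }
}
```

Work that belonged to a removed worker would then be handed to one of the workers still present in the map, which is the reassignment the answer refers to.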
