I was wondering if there is any way of checking the number of HDFS bytes read or written by a Spark application launched through YARN. For example, if I check the jobs completed in YARN:
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):2
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1451137492021_0002 com.abrandon.upm.GenerateRandomText SPARK alvarobrandon default FINISHED SUCCEEDED 100% N/A
The idea is to be able to monitor the number of bytes that application_1451137492021_0002 has read or written. I have checked the datanode logs, but I can only find traces of non-MapReduce jobs and, of course, no trace of this particular application.
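For reference, the Spark monitoring REST API exposes per-stage and per-executor I/O totals for a finished application (a sketch, assuming a Spark History Server is running on its default port 18080; the host name is a placeholder):
# per-stage inputBytes / outputBytes for this application
curl http://<history-server-host>:18080/api/v1/applications/application_1451137492021_0002/stages
# per-executor totals, including totalInputBytes
curl http://<history-server-host>:18080/api/v1/applications/application_1451137492021_0002/executors
As far as I understand, these figures count bytes read from and written to Hadoop-supported filesystems, so for data stored on HDFS they approximate the application's HDFS traffic.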
I'm running YARN on an EMR cluster.
mapred queue -list returns:
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 100.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0
How do I clear this queue or add a new one? I've been looking for a while and can't find CLI commands to do so; I only have access to the CLI. Any Spark application I submit hangs in the ACCEPTED state, and I've already killed all submitted applications via yarn app --kill [app_id].
CurrentCapacity: 0.0 means that the queue is fully unused.
Your jobs, if that's your concern, are NOT hung due to unavailability of resources.
Not sure whether EMR allows YARN CLI commands such as schedulerconf:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#queue:~:text=ResourceManager%20admin%20client-,schedulerconf,-Usage%3A%20yarn
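If that command is available, adding or resizing Capacity Scheduler queues from the CLI would look roughly like this (a sketch, not verified on EMR; it assumes the scheduler is backed by a mutable configuration store via yarn.scheduler.configuration.store.class, and root.newqueue is a placeholder name):
# add a new leaf queue and rebalance capacity (values are illustrative)
yarn schedulerconf -add "root.newqueue:capacity=20"
yarn schedulerconf -update "root.default:capacity=80"
Otherwise, queues are changed by editing capacity-scheduler.xml and running yarn rmadmin -refreshQueues.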
Suppose I attempt to submit a Spark (2.4.x) job to a Kerberized cluster, without having valid Kerberos credentials. In this case, the Spark launcher tries repeatedly to initiate a Hadoop IPC call, but fails:
20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
This will repeat a number of times (30, in my case), until eventually the launcher gives up and the job submission is considered failed.
Various other similar questions mention these properties (which are actually YARN properties, but prefixed with spark. as per the standard mechanism for passing them with a Spark application):
spark.yarn.maxAppAttempts
spark.yarn.resourcemanager.am.max-attempts
However, neither of these properties affects the behavior I'm describing. How can I control the number of IPC retries in a Spark job submission?
After a good deal of debugging, I figured out the properties involved here.
yarn.client.failover-max-attempts (controls the max attempts)
Without specifying this, the number of attempts appears to come from the ratio of these two properties (numerator first, denominator second).
yarn.resourcemanager.connect.max-wait.ms
yarn.client.failover-sleep-base-ms
Of course, as with any YARN properties, these must be prefixed with spark.hadoop. in the context of a Spark job submission.
The relevant class (which resolves all these properties) is RMProxy, within the Hadoop YARN project (source here). All these, and related, properties are documented here.
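So, for example, to make the submission fail fast when credentials are missing, something along these lines can be passed (a sketch; the values, class name, and jar are placeholders):
spark-submit \
  --master yarn \
  --conf spark.hadoop.yarn.client.failover-max-attempts=1 \
  --conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=10000 \
  --conf spark.hadoop.yarn.client.failover-sleep-base-ms=1000 \
  --class com.example.MyApp my-app.jar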
My Environment:
I'm connecting to Cassandra through the Spark Thrift Server. I create a meta-table in the Hive metastore which holds the Cassandra table data, and in a web application I connect to that meta-table through the JDBC driver. I have enabled fair scheduling for the Spark Thrift Server.
Issue:
When I run a concurrency load test through JMeter with 100 users for a duration of 300 seconds, I get sub-second response times for the initial requests (say, the first 30 seconds). Then the response time gradually increases (to around 2 to 3 seconds). When I check the Spark UI, all the jobs executed in less than 100 milliseconds. I also notice that jobs and tasks are in the pending state when requests are received, so I assume that even though the tasks take sub-seconds to execute, they are submitted with some latency by the scheduler. How do I fix this latency in job submission?
Following are my configuration details:
Number of Workers - 2
Number of Executors per Worker - 1
Number of Cores per Executor - 14
Total Cores across Workers - 30
Memory per Executor - 20 GB
Total Memory across Workers - 106 GB
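For reference, a Thrift Server sized like this would be launched with something along these lines (a sketch only; the master host and allocation file path are placeholders):
start-thriftserver.sh \
  --master spark://<master-host>:7077 \
  --executor-memory 20G \
  --conf spark.executor.cores=14 \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml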
Configuration in the Fair Scheduler XML:
<pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>15</minShare>
</pool>
<pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
</pool>
I'm executing in Spark Standalone mode.
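For completeness, each JDBC session is assigned to one of these pools by issuing the Thrift Server's scheduler-pool variable on that session (pool name taken from the XML above), e.g.:
SET spark.sql.thriftserver.scheduler.pool=test;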
Isn't it the case that queries are pending in the queue while others are running? Try reducing spark.locality.wait to, say, 1s.
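A minimal way to try that (assuming the standard start script; the 1s value is just the suggestion above) is to pass it when launching the Thrift Server, or put it in spark-defaults.conf:
start-thriftserver.sh --conf spark.locality.wait=1s
# equivalently, in spark-defaults.conf:
spark.locality.wait 1s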
We are trying to set up HA on the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are using for Spark HA as well.
We configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
Started both masters.
Started the shell; the status of the job is RUNNING.
master1 is in ALIVE status and master2 is in STANDBY status.
Killed master1; master2 took over and all the workers appeared alive in master2.
The shell that was already running moved to the new master. However, the application is in WAITING status and the executors are in LOADING status.
There are no errors in the worker or executor logs, except a notification that they connected to the new master.
I can see that the worker re-registered, but the executors do not seem to be started. Is there anything I am missing?
My Spark version is 1.5.0.
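For reference, in this kind of setup the shell is normally started against both masters so that it can follow a failover, e.g. (assuming the default standalone port 7077):
spark-shell --master spark://master1:7077,master2:7077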
When I submit a job using qsub, the job gets rejected if not enough nodes are available. Is there any config that tells it to QUEUE the job (without running it) instead of rejecting it?