Checking yarn application logs - apache-spark

I am new to Spark. I have a 10-node Hadoop cluster with one edge node. I submit my Spark application from the edge node and redirect the spark-submit command output to a local file on the edge node.
So when the Spark application fails, I can check the edge node log file and take action.
When I read about YARN application logs, it says that the NodeManagers running that application will log to some location (yarn.nodemanager.log-dir).
How is this NodeManager log different from the edge node log? Can anyone explain YARN application logs in detail?

"Edge node logs" would be the Spark driver application logs, which would likely say something like URL to track the Job: <link to YARN UI>
If you want the actual Spark runtime logs, you need to look at the individual Spark executors via the Spark UI (which redirects to the YARN UI, if that is how you run Spark).
The NodeManager (and ResourceManager) is a YARN process with its own logs, unrelated to your Spark code.
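To make the distinction concrete, here is a sketch of where each kind of log ends up; the application ID and paths below are hypothetical placeholders:

```shell
# Driver log: just the spark-submit output you redirect yourself on the edge node
spark-submit --master yarn my_app.py > /home/me/driver.log 2>&1

# Executor/container logs: written by each NodeManager under yarn.nodemanager.log-dir,
# and retrievable in one place after the app finishes (if YARN log aggregation is enabled)
yarn logs -applicationId application_1234567890_0001
```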

Related

Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster

I've been playing around with HDFS and Spark. I've set up a five node cluster on my network running HDFS, Spark, and managed by Yarn. Workers are running in client mode.
From the master node, I can launch the PySpark shell just fine. Running example jars, the job is split up to the worker nodes and executes nicely.
I have a few questions on whether and how to run python/Pyspark files against this cluster.
If I have a Python file with PySpark calls somewhere else, like on my local dev laptop or in a docker container, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? The approach I'm wondering about involves running spark-submit in the local/docker environment, with SparkSession.builder.master() in the file configured to point at the remote cluster.
Relatedly, I see a --master option for spark-submit, but the only YARN value seems to be the literal string "yarn", which appears to queue only locally? Is there a way to specify a remote YARN?
If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the URL just the hdfs:// URL on port 9000, or do I submit it to one of the YARN ports?
TIA!
way to run or submit this file locally and have it executed on the remote Spark cluster
Yes, well "YARN", not "remote Spark cluster". You set --master=yarn when running spark-submit, and it will run against the yarn-site.xml found via the HADOOP_CONF_DIR environment variable. You can define that variable at the OS level, or in spark-env.sh.
You can also use SparkSession.builder.master('yarn') in code. If both are supplied, the value set in code takes precedence over the spark-submit flag.
To run fully "in the cluster", also set --deploy-mode=cluster
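Putting the pieces above together, a minimal submission might look like this; the config path and file name are assumptions, not values from the question:

```shell
# Point Spark at the cluster's Hadoop/YARN config (the directory containing yarn-site.xml)
export HADOOP_CONF_DIR=/etc/hadoop/conf

# --master yarn locates the ResourceManager via yarn-site.xml;
# --deploy-mode cluster runs the driver inside the cluster as well
spark-submit --master yarn --deploy-mode cluster my_job.py
```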
Is there a way to specify remote yarn?
As mentioned, this is configured in yarn-site.xml, which provides the ResourceManager location(s).
how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000
No - the YARN ResourceManager has its own RPC protocol, not hdfs:// ... You can still read HDFS files with something like spark.read.text("hdfs://namenode:port/path"), though. As mentioned, .master('yarn') or --master yarn is the only Spark-specific config you need.
If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to set up, and in Kubernetes you can use Hadoop Ozone or MinIO rather than HDFS.

spark standalone running on docker cleanup not running

I'm running Spark in standalone mode as a docker service, with one master node and one Spark worker. I followed the Spark documentation instructions:
https://spark.apache.org/docs/latest/spark-standalone.html
to add the properties with which the Spark cluster cleans up after itself, and I set those in my docker_entrypoint:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=900 -Dspark.worker.cleanup.appDataTtl=900"
and verified that it was enabled by following the logs of the worker node service.
My question is: should we expect all directories located under the SPARK_WORKER_DIR directory to be cleaned, or does it only clean the application files?
Because I still see some empty directories sitting there.

How to get full worker output in Apache Spark

How do I view / download the complete stderr output from a worker in Apache Spark, deployed in cluster mode?
I've deployed a program with spark-submit --deploy-mode cluster foo.jar, and a worker crashed. To investigate, I go to localhost:8081 and access the worker's log (stderr in particular), but it shows me only the bottom of the file, and I have to click the "Load More" button a hundred times to scroll up to the first error -- clearly, I shouldn't have to do that. Is there a way to download the whole stderr output, or to redirect it to a known location? Which part of Spark's documentation covers this?
Get the application ID of your Spark job from the YARN UI, or note the application ID printed after you submit the job.
Then use the command below from the YARN CLI on your edge/gateway node to view the YARN logs. For more details, refer to the YARN CLI documentation.
yarn logs -applicationId <Application ID> -log_files stderr
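Since the command prints to stdout, you can redirect it to get the whole file at once instead of paging through the UI. The application ID below is a hypothetical placeholder, and this requires YARN log aggregation to be enabled:

```shell
# Save the complete stderr of every container to a single local file
yarn logs -applicationId application_1618203232_0042 -log_files stderr > stderr_full.log
```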

How does a MasterNode fit into a Spark cluster?

I'm getting a little confused with how to setup my Spark configuration for workloads using YARN as the resource manager. I've got a small cluster spun up right now with 1 master node and 2 core nodes.
Do I include the master node when calculating the number of executors or no?
Do I leave out 1 core for every node to account for Yarn management?
Am I supposed to designate the master node for anything in particular in Spark configurations?
The master node shouldn't be taken into account when calculating the number of executors.
Each node is actually an EC2 instance with an operating system, so you have to leave one or more cores per node for system tasks and YARN agents.
The master node can be used to run the Spark driver. For this, start the EMR job in client mode from the master node by adding the arguments --master yarn --deploy-mode client to the spark-submit command. Keep in mind the following:
Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster's master node.
To do all the preparation work (copy libs, scripts, etc. to the master node) you can set up a separate step, and then run the spark-submit --master yarn --deploy-mode client command as the next step.
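As a rough sketch of the sizing rule above (the vCPU count and cores-per-executor value here are assumed for illustration, not taken from the question):

```shell
# 2 core nodes; assume 8 vCPUs each, reserving 1 core per node for the OS and YARN agents
NODES=2
VCPUS_PER_NODE=8
RESERVED_PER_NODE=1
CORES_PER_EXECUTOR=5   # a common rule of thumb, not a hard requirement

# Master node is excluded entirely from the calculation
USABLE=$(( NODES * (VCPUS_PER_NODE - RESERVED_PER_NODE) ))
NUM_EXECUTORS=$(( USABLE / CORES_PER_EXECUTOR ))
echo "${NUM_EXECUTORS} executors x ${CORES_PER_EXECUTOR} cores (master node excluded)"
```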

How to setup yarn client in code?

I want to run my Spark application on my Hortonworks Data Platform. As in this setup I don't have a standalone Spark master, I want to run as a YARN client.
I am trying to create the SparkSession like this:
SparkSession
  .builder()
  .master("yarn-client")
  .appName("my-app")
  .getOrCreate()
I know I am missing some properties to tell the Spark client where my YARN server is running, but I can't seem to find those properties.
Currently the app just hangs on init with no error or exception.
Any ideas what I am missing?
It looks like you're trying to run your app locally while your Hortonworks HDP cluster is somewhere else.
Unlike Spark standalone and Mesos modes, in which the master’s address
is specified in the --master parameter, in YARN mode the
ResourceManager’s address is picked up from the Hadoop configuration.
So your app should be run from the Hortonworks cluster itself, which has all the Hadoop configuration in place.
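In practice that means either running spark-submit on a cluster node, or copying the cluster's Hadoop config files to your machine and pointing Spark at them. A sketch, where the local config path and jar name are assumptions:

```shell
# Copy core-site.xml and yarn-site.xml from the HDP cluster into this directory first
export HADOOP_CONF_DIR=$HOME/hdp-conf

# "yarn-client" as a master string is deprecated; use --master yarn with client deploy mode
spark-submit --master yarn --deploy-mode client my-app.jar
```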
