How to get full worker output in Apache Spark - apache-spark

How do I view / download the complete stderr output from a worker in Apache Spark, deployed in cluster mode?
I've deployed a program with spark-submit --deploy-mode cluster foo.jar, and a worker crashed. To investigate, I go to localhost:8081 and access the worker's log (stderr in particular), but it shows me only the bottom of the file, and I have to click the "Load More" button a hundred times to scroll up to the first error -- clearly, I shouldn't have to do that. Is there a way to download the whole stderr output, or to redirect it to a known location? Which part of Spark's documentation gives me this kind of information?

Get the application Id of your Spark job from the YARN UI, or note the application Id printed after you submit the Spark job.
Then use the command below from the YARN CLI on your edge/gateway node to view the YARN logs (see the YARN commands documentation for more details about the YARN CLI).
yarn logs -applicationId <Application ID> -log_files stderr
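As a minimal sketch (the application id is a placeholder; use the one printed by spark-submit or shown in the YARN UI), assuming YARN log aggregation is enabled, you can dump the complete stderr to a local file and search it there instead of clicking "Load More":
APP_ID=application_1234567890123_0042
# Write the aggregated stderr of all containers to a local file
yarn logs -applicationId "$APP_ID" -log_files stderr > "${APP_ID}_stderr.log"
# Then search it with normal tools
grep -n -i "exception" "${APP_ID}_stderr.log" | head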

Related

Checking yarn application logs

I am new to Spark. I have a 10-node Hadoop cluster with one edge node. I submit the Spark application from the edge node and redirect the spark-submit command output to a local file on the edge node.
So when the Spark application fails I can check the edge node log file and take action.
When I read about YARN application logs, it is said that the NodeManagers running that application log to some location (yarn.nodemanager.log-dir).
How is this NodeManager log different from the edge node log? Can anyone explain YARN application logs in detail?
"Edge node logs" would be Spark driver application logs, which would likely say something like URL to track the Job: <link to YARN UI>
If you want the actual Spark runtime logs, you need to look at the inidivual Spark executors via the Spark UI (which redirect to the YARN UI, if that is how you run Spark)
The NodeManager (and ResourceManager) is a YARN process, with its own logs, and not related to your Spark code
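As a rough sketch (assuming YARN log aggregation is enabled; the application and container ids are placeholders), the container logs that the NodeManagers collect can be pulled centrally with the YARN CLI instead of reading yarn.nodemanager.log-dir on each node:
# All container logs (driver and executors) in one stream
yarn logs -applicationId application_1234567890123_0042 > all_containers.log
# A single executor's container, if you know its container id
# (older Hadoop 2.x releases may also require -nodeAddress)
yarn logs -applicationId application_1234567890123_0042 -containerId container_1234567890123_0042_01_000002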

Writing to a local FS in cluster mode SPARK

For Spark jobs, we are trying to add a logging framework that creates a custom log file on a local FS.
In client mode everything is fine: the files are created on the local FS with the user who launched spark-submit.
However, in cluster mode the local files are created by the yarn user, who does not have permission to write to the local directory...
Is there any solution for writing a local file in cluster mode as the user who submitted the job, without changing permissions to 777 everywhere?
Is cluster mode better in this case (we are in a PROD environment), given that the job is launched from a node of the cluster (so there is no network issue)?
Thank you.
Yes, here is a way: use a shell script to submit Spark jobs.
We use a logger to print all our logs, and we always include a unique tag in the log message,
e.g. log.info("INFO_CUSTOM: Info message"). Once our application has completed, we run the yarn logs command and grep for that unique tag (a combined sketch follows after these steps).
Get the application id using the yarn command with the application name,
e.g. yarn application -list -appStates FINISHED,FAILED,KILLED | grep <application name>
Run the yarn logs command, grep, and redirect the result to the file you want,
e.g. yarn logs -applicationId <application id from step 1> | grep -w "INFO_CUSTOM" >> joblog.log
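A minimal sketch of that wrapper script, with the application name, tag, and output file as placeholders:
#!/bin/bash
APP_NAME="my-spark-app"   # placeholder application name
# 1) Find the application id by name among finished/failed/killed applications
APP_ID=$(yarn application -list -appStates FINISHED,FAILED,KILLED | grep "$APP_NAME" | awk '{print $1}' | head -1)
# 2) Pull the aggregated logs and keep only the tagged lines
yarn logs -applicationId "$APP_ID" | grep -w "INFO_CUSTOM" >> joblog.log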

How to tail yarn logs?

I am submitting a Spark job using the command below. I want to tail the YARN log using the application Id, similar to the tail command in a Linux box.
export SPARK_MAJOR_VERSION=2
nohup spark-submit --class "com.test.TestApplication" --name TestApp --queue queue1 --properties-file application.properties --files "hive-site.xml,tez-site.xml,hbase-site.xml,application.properties" --master yarn --deploy-mode cluster Test-app.jar > /tmp/TestApp.log &
Not easily.
"YARN logs" aren't really in YARN, they are actually on the executor nodes of Spark. If YARN log aggregation is enabled, then logs are in HDFS, and available from Spark History server.
The industry deployment pattern is to configure the Spark log4j properties to write to a file with a log forwarder (like Filebeat, Splunk, Fluentd), then those processes collect data into a search engine like Solr, Elasticsearch, Graylog, Splunk, etc. From these tools, you can approximately tail/search/analyze log messages outside of a CLI.
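A sketch only (my_log4j.properties and the file path inside it are placeholders you would supply; the class and jar are taken from the question): ship a custom log4j configuration with the job so every driver and executor writes to a local file the forwarder can tail.
spark-submit \
  --master yarn --deploy-mode cluster \
  --files my_log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my_log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my_log4j.properties" \
  --class com.test.TestApplication Test-app.jar
(On Spark 3.3+, which moved to Log4j 2, the flag is -Dlog4j.configurationFile and the file is log4j2.properties.)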
yarn logs -applicationId application_1648123761230_0106 -log_files stdout -size -1000
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/data-operating-system/content/use_the_yarn_cli_to_view_logs_for_running_applications.html
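If you only need something tail-like from the CLI while the application is running (the Cloudera page above describes viewing logs for running applications), a crude sketch is to poll the last few kilobytes in a loop; the window overlaps between iterations, so expect repeated lines:
while true; do
  yarn logs -applicationId application_1648123761230_0106 -log_files stdout -size -4096
  sleep 10
done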

Spark Job running even after spark Master process is killed

We are working on a Spark cluster where Spark jobs are submitted successfully even after the Spark "Master" process is killed.
Here are the complete details of what we are doing.
Process details:
jps
19560 NameNode
18369 QuorumPeerMain
22414 Jps
20168 ResourceManager
22235 Master
We submitted one Spark job to this Master using a command like
spark-1.6.1-bin-without-hadoop/bin/spark-submit --class com.test.test --master yarn-client --deploy-mode client test.jar -incomingHost hostIP
where hostIP is the correct IP address of the machine running the "Master" process.
After this we can also see the job in the RM Web UI.
When we kill the "Master" process, we can see the submitted job keeps running fine, which is expected here, since we are using YARN mode and that job will run without any issue.
But when we submit the same spark-submit command once again, pointing to the same Master IP which is now down, we see one more job in the RM web UI (host:8088). This we are not able to understand, because the Spark "Master" is killed and the Spark UI (host:8080) no longer comes up either.
Please note that we are using "yarn-client" mode, as in the code below:
sparkProcess = new SparkLauncher()
.......
.setSparkHome(System.getenv("SPARK_HOME"))
.setMaster("yarn-client")
.setDeployMode("client")
Can someone please explain this behaviour? I did not find an answer after reading many blogs and the official docs (http://spark.apache.org/docs/latest/running-on-yarn.html).
Thanks
Please check the cluster overview. From your description, you are running the Spark application on a YARN cluster in client deploy mode, with the driver placed on the instance where you launch the command, so YARN (the ResourceManager) accepts and runs your jobs regardless of the Spark Master. The Spark Master process belongs to Spark's standalone cluster mode, in which case your launch command would look like
spark-submit --master spark://your-spark-master-address:port
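To make the distinction concrete, a sketch using the class and jar from the question (host and port are placeholders; 7077 is the standalone Master's default port):
# YARN as the cluster manager: the standalone Master process is not involved,
# so killing it does not affect these jobs
spark-submit --class com.test.test --master yarn --deploy-mode client test.jar
# Standalone cluster manager: this is the only case where the Spark Master process matters
spark-submit --class com.test.test --master spark://your-spark-master-address:7077 test.jar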

How to know the state of Spark job

I have a job running on Amazon EC2 and I use PuTTY to connect to the EC2 cluster, but the PuTTY connection was just lost. After reconnecting to the EC2 cluster I see no output from the job, so I don't know whether my job is still running. Does anybody know how to check the state of a Spark job?
Thanks
Assuming you are on a YARN cluster, you could run
yarn application -list
to get a list of applications, and then run
yarn application -status <applicationId>
to check the status.
It is good practice to use GNU Screen (or another similar tool) to keep your session alive (and detached, in case the connection to the machine is lost) when working on remote machines.
The status of a Spark application can be ascertained from the Spark UI (or the YARN UI).
If you are looking for a CLI command:
For a standalone cluster use:
spark-submit --status <app-driver-id>
For YARN:
yarn application -status <app-id>
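For example, a small sketch (the application name and id are placeholders) to locate the application and check its state from the CLI:
# Find the application by name among running applications
yarn application -list -appStates RUNNING | grep "my-spark-app"
# Query its state and final status
yarn application -status application_1234567890123_0042 | grep -E 'State|Final-State'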
