How to tail yarn logs? - apache-spark

I am submitting a Spark job using the command below. I want to tail the YARN log by application id, similar to the tail command on a Linux box.
export SPARK_MAJOR_VERSION=2
nohup spark-submit --class "com.test.TestApplication" --name TestApp --queue queue1 --properties-file application.properties --files "hive-site.xml,tez-site.xml,hbase-site.xml,application.properties" --master yarn --deploy-mode cluster Test-app.jar > /tmp/TestApp.log &

Not easily.
"YARN logs" aren't really in YARN itself; they live on the nodes where the Spark executors ran. If YARN log aggregation is enabled, the logs are aggregated into HDFS and are also reachable from the Spark History Server.
The common deployment pattern is to configure Spark's log4j properties to write to a file, then have a log forwarder (Filebeat, Fluentd, the Splunk forwarder, etc.) ship those files into a search engine such as Solr, Elasticsearch, Graylog, or Splunk. From those tools you can approximately tail, search, and analyze log messages outside of a CLI.

yarn logs -applicationId application_1648123761230_0106 -log_files stdout -size -1000
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/data-operating-system/content/use_the_yarn_cli_to_view_logs_for_running_applications.html
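If log aggregation is enabled, one rough way to approximate tail -f from the edge node is to pull the application id out of the driver output and poll the tail of the aggregated log in a loop. This is only a sketch; the 10-second interval and the grep pattern are arbitrary choices, not part of the original question.
# Grab the YARN application id from the nohup'd driver output above
# (cluster mode typically prints "Application report for application_..." lines).
APP_ID=$(grep -o 'application_[0-9_]*' /tmp/TestApp.log | head -1)
# Poll the last ~1000 bytes of the aggregated stdout every 10 seconds.
while true; do
  yarn logs -applicationId "$APP_ID" -log_files stdout -size -1000
  sleep 10
done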

Related

Standard way to store/upload application jar on Spark cluster on Kubernetes

I have a Spark-based Kubernetes cluster where I use spark-submit to submit jobs to the cluster as needed.
e.g.
spark-submit \
--master spark://my-spark-master-svc:7077 \
--class com.Main \
examples/jars/my-spark-application.jar
Here I have uploaded the file my-spark-application.jar with kubectl cp into the examples/jars directory on the master pod/container before running the spark-submit command.
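For reference, that copy step looks roughly like this (the pod name and the absolute target directory here are placeholders, not taken from my setup):
# Copy the application jar from the local machine into the master pod before submitting.
kubectl cp examples/jars/my-spark-application.jar \
  my-spark-master-0:/opt/spark/examples/jars/my-spark-application.jar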
Another option could be to mount a volume on the cluster and share the jar through that volume.
What is the typical way to share the application jar with the spark cluster while using spark-submit on Kubernetes?

Checking yarn application logs

I am new to Spark. I have a 10-node Hadoop cluster with one edge node. I am submitting the Spark application from the edge node and redirecting the spark-submit command output to a local file on the edge node.
So when the Spark application fails, I can check the edge node log file and take action.
When I read about YARN application logs, it is said that the NodeManagers running that application log to some location (yarn.nodemanager.log-dir).
How is this NodeManager log different from the edge node log? Can anyone explain YARN application logs in detail?
"Edge node logs" would be the Spark driver application logs, which would likely contain something like URL to track the Job: <link to YARN UI>
If you want the actual Spark runtime logs, you need to look at the individual Spark executors via the Spark UI (which redirects to the YARN UI, if that is how you run Spark).
The NodeManager (and ResourceManager) are YARN processes with their own logs, unrelated to your Spark code.
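If YARN log aggregation is enabled, those per-container NodeManager logs can be pulled back to the edge node once the application finishes, for example:
# Fetch the aggregated container logs (driver and executors) for one application;
# the application id here is just an example.
yarn logs -applicationId application_1648123761230_0106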

How to get full worker output in Apache Spark

How do I view / download the complete stderr output from a worker in Apache Spark, deployed in cluster mode?
I've deployed a program with spark-submit --deploy-mode cluster foo.jar, and a worker crashed. To investigate, I go to localhost:8081 and access the worker's log (stderr in particular), but then it shows me only the bottom of the file, and I have to click the "Load More" button a hundred times to scroll up to the first error -- clearly, I shouldn't have to do that. Is there a way to download the whole stderr output, or to redirect it to a known location? Which part of Spark's documentation gives me this kind of information?
Get the application id of your Spark job from the YARN UI, or from the output printed after you submit the Spark job.
Then use the command below from your edge/gateway node to view the YARN logs with the YARN CLI. For more details about the YARN CLI, refer to the YARN commands documentation.
yarn logs -applicationId <Application ID> -log_files stderr
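To avoid paging through the UI, you can also redirect the whole output to a local file and search it there, for example (the output path is arbitrary):
# Dump the complete stderr of all containers into one local file and grep it.
yarn logs -applicationId <Application ID> -log_files stderr > /tmp/app-stderr.log
grep -n 'ERROR\|Exception' /tmp/app-stderr.log | head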

How can I run spark-submit commands using the GCP spark operator on kubernetes

I have a Spark application which I want to deploy on Kubernetes using the GCP Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
I was able to run a Spark application using the command kubectl apply -f example.yaml, but I want to use spark-submit commands.
There are a few options mentioned by https://github.com/big-data-europe/docker-spark which you can use; see if one of them solves your problem:
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/spark-shell --master spark://spark-master:7077 --conf spark.driver.host=spark-client
or
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/spark-submit --class CLASS_TO_RUN --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client URL_TO_YOUR_APP
There is no way to directly manipulate the spark-submit command that the Spark operator generates when it translates the YAML configuration file into Spark-specific options and Kubernetes resources. That is rather the point of using the operator: it lets you use a YAML config file to run either a SparkApplication or a ScheduledSparkApplication as if it were a Kubernetes resource. Most options can be set either with Hadoop or Spark config files in config maps, or as command-line arguments to the JVM in the driver and executor pods. I recommend the latter approach in order to have more flexibility when it comes to fine-tuning Spark jobs.
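As a rough illustration of that resource-style workflow (assuming the operator's CRDs are installed; <app-name> is whatever name your example.yaml declares):
# Inspect the SparkApplication objects managed by the operator.
kubectl get sparkapplications
kubectl describe sparkapplication <app-name>
# Follow the driver pod like any other pod; the "<app-name>-driver" naming is the
# operator's usual default and may differ on your setup (check kubectl get pods).
kubectl logs -f <app-name>-driver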

Oozie spark action Log4j configuration

I am working with Oozie, using a Spark action on a Hortonworks 2.5 cluster. I have configured this job in YARN client mode, with master=yarn and mode=client.
My log4j configuration is shown below.
log4j.appender.RollingAppender.File=/opt/appName/logs/appNameInfo.log
log4j.appender.debugFileAppender.File=/opt/appName/logs/appNameDebug.log
log4j.appender.errorFileAppender.File=/opt/appName/logs/appNameError.log
The expectation is that once we trigger the Oozie job, we should see the application logs at the above locations, split into info, debug, and error logs respectively.
Below is my spark-opts tag in my workflow.xml
<spark-opts>--driver-memory 4G --executor-memory 4G --num-executors 6 --executor-cores 3 --files /tmp/logs/appName/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties' --conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties'</spark-opts>
Once I trigger the Oozie coordinator, I am not able to see my application logs in /opt/appName/logs/ as configured in log4j.properties.
The same configuration works with a plain spark-submit when I run it from a node where /tmp/logs/appName/log4j.properties is available locally. The job is not able to write to the locations configured in the log4j.properties file.
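For comparison, the plain spark-submit that works from a node with the properties file present locally looks roughly like this (class and jar names are placeholders; only the --files and --conf options are taken from the workflow.xml above):
spark-submit --master yarn --deploy-mode client \
  --files /tmp/logs/appName/log4j.properties \
  --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties' \
  --conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties' \
  --class <MainClass> <app.jar>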
Should this log4j.properties file be in HDFS? If so, how do I provide it in spark-opts? Would it be an hdfs:// path?
Can someone please look into the issue?
Copy this log4j.properties into the Oozie sharelib path (oozie.sharelib.path, on HDFS), and Spark should then be able to copy it into the final YARN container.
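A rough sketch of that, assuming the standard HDP sharelib layout (the lib_<timestamp> directory and the Oozie URL depend on your cluster):
# Put log4j.properties next to the Spark sharelib jars in HDFS, then refresh the
# sharelib so Oozie picks it up on the next launch.
hdfs dfs -put -f /tmp/logs/appName/log4j.properties /user/oozie/share/lib/lib_<timestamp>/spark/
oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate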
