Oozie Spark action log4j configuration - apache-spark

I am working on Oozie, using a Spark action on a Hortonworks 2.5 cluster. I have configured the job in YARN client mode, with master=yarn and mode=client.
My log4j configuration is shown below.
log4j.appender.RollingAppender.File=/opt/appName/logs/appNameInfo.log
log4j.appender.debugFileAppender.File=/opt/appName/logs/appNameDebug.log
log4j.appender.errorFileAppender.File=/opt/appName/logs/appNameError.log
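(For context, a full appender declaration looks along these lines; the RollingFileAppender class and layout below are a representative sketch, and only the File paths are from my actual config.)
log4j.rootLogger=INFO, RollingAppender
# Rolling file appender for the info-level log (class and layout are illustrative)
log4j.appender.RollingAppender=org.apache.log4j.RollingFileAppender
log4j.appender.RollingAppender.File=/opt/appName/logs/appNameInfo.log
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c - %m%n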
The expectation is that once the Oozie job is triggered, the application's info, debug, and error logs should appear at the above locations respectively.
Below is the spark-opts tag from my workflow.xml:
<spark-opts>--driver-memory 4G --executor-memory 4G --num-executors 6 --executor-cores 3 --files /tmp/logs/appName/log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties' --conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/tmp/logs/appName/log4j.properties'</spark-opts>
Once I trigger the Oozie coordinator, I am not able to see my application logs in /opt/appName/logs/ as configured in log4j.properties.
The same configuration works with a plain spark-submit when I run it from the node where /tmp/logs/appName/log4j.properties exists. Under Oozie, nothing is written to the locations configured in the log4j.properties file.
Should this log4j.properties file be in HDFS? If so, how do I provide it in spark-opts? Would it be an hdfs:// path?
Can someone please look into this issue?

Copy this log4j.properties into oozie.sharelib.path (in HDFS), and Spark should then be able to copy it into the final YARN container.
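For example, something along these lines (the sharelib path and Oozie host are placeholders; check your cluster's oozie.sharelib.path):
# Copy the log4j.properties into the Spark sharelib directory in HDFS
hdfs dfs -put /tmp/logs/appName/log4j.properties /user/oozie/share/lib/lib_<timestamp>/spark/
# Refresh the sharelib so Oozie picks up the new file
oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate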

Related

What is the command to call Spark2 from the shell

I have two Spark services in my cluster: one named Spark (version 1.6) and another named Spark2 (version 2.0). I am able to start Spark with the command below.
spark-shell --master yarn
But I am not able to connect to the Spark2 service, even after setting export SPARK_MAJOR_VERSION=2.
Can someone help me with this?
I'm using a CDH cluster, and the following command works for me.
spark2-shell --queue <queue-name-if-any> --deploy-mode client
If I remember correctly, SPARK_MAJOR_VERSION only works with spark-submit.
You would need to find the Spark2 installation directory to use the other spark-shell.
Sounds like you are on an HDP cluster, so look under /usr/hdp.
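For instance (the exact path is an assumption; verify what exists on your nodes):
# On HDP, the Spark2 client typically lives under /usr/hdp
ls /usr/hdp/current/spark2-client/bin/
# Launch the Spark2 shell directly from that directory
/usr/hdp/current/spark2-client/bin/spark-shell --master yarn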

Spark Standalone: how to pass a local .jar file to the cluster

I have a cluster with two workers and one master.
To start the master and workers I use sbin/start-master.sh and sbin/start-slaves.sh on the master's machine. The master UI then shows me that the slaves are ALIVE (so, everything is OK so far). The issue comes when I want to use spark-submit.
I execute this command in my local machine:
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar
But the following error pops up: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar
I have been doing some research on Stack Overflow and in Spark's documentation, and it seems I should specify the application-jar of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as indicated at https://spark.apache.org/docs/latest/submitting-applications.html).
My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, Spark Standalone cluster cannot read the files in local filesystem, but the solutions do not work for me.
Also, am I doing something wrong by initialising the cluster on my master's machine using sbin/start-master.sh but then running spark-submit from my local machine? I initialise the master from the master's terminal because I read to do so in Spark's documentation, but maybe this has something to do with the issue. From Spark's documentation:
Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Thank you very much
EDIT:
I have copied the .jar file to every worker and it works. But my point is to find out whether there is a better way, since this method makes me copy the .jar to each worker every time I create a new jar. (This was one of the answers to the already linked question Spark Standalone cluster cannot read the files in local filesystem.)
@meisan, your spark-submit command is missing two things:
your dependency jars should be added with the flag --jars
the file holding your driver code, i.e. the main function.
You have not specified whether you are using Scala or Python, but in a nutshell your command will look something like this:
for Python:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>
for Scala:
spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars>
Also, spark takes care of sending the required files and jars to the executors when you use the documented flags.
If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.
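As an aside, one way to satisfy the "globally visible" requirement quoted in the question, without copying the jar to every worker, is to put it on HDFS first (hosts, paths, and the main class below are placeholders):
# Upload the application jar to HDFS so every worker can fetch it
hdfs dfs -put /home/user/example.jar /apps/example.jar
# Submit using the hdfs:// URL instead of a local path
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster \
  --class <main-class> hdfs://<namenode>:8020/apps/example.jar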

Send Spark driver logs running in k8s to Splunk

I am trying to run a sample spark job in kubernetes by following the steps mentioned here: https://spark.apache.org/docs/latest/running-on-kubernetes.html.
I am trying to send the spark driver and executor logs to Splunk.
Does Spark provide any configuration for this?
How do I send the Splunk configuration, such as the HEC endpoint, port, token, etc., in the spark-submit command?
I did try passing these as args to the spark driver, as shown below:
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.JavaSparkPi \
  --master k8s://http://127.0.0.1:8001 \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.container.image=gcr.io/spark-operator/spark:v2.4.4 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=<account> \
  --conf spark.kubernetes.docker.image.pullPolicy=Always \
  --conf spark.kubernetes.namespace=default \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar \
  --log-driver=splunk \
  --log-opt splunk-url=<url:port> \
  --log-opt splunk-token=<token> \
  --log-opt splunk-index=<index> \
  --log-opt splunk-sourcetype=<sourceType> \
  --log-opt splunk-format=json
but the logs were not forwarded to the desired index.
I am using spark version 2.4.4 to run spark-submit.
Thanks in advance for any inputs!!
Hi, and welcome to Stack Overflow.
I've searched the web for a while trying to find cases similar to your question involving Spark together with Splunk. What I've come to realize is that you may be mixing several things up. Judging by the Docker docs on the Splunk logging driver, it seems you are trying to reproduce those same steps with spark-submit. Unfortunately, it doesn't work that way.
Basically, all the config options after local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar in your script are program arguments for the org.apache.spark.examples.JavaSparkPi#main method, which (unless you customize it) simply ignores them.
What you need to do is connect your Kubernetes cluster to Splunk. One way of doing that is installing the Splunk Connector in your Kubernetes cluster. Depending on your environment specifics there may be other ways of doing it, but reading the docs is a good place to start.
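For illustration, a minimal sketch of installing Splunk Connect for Kubernetes via Helm (the chart repo URL and values keys are assumptions; check the project's documentation for the current ones):
# Add the chart repo and install the connector, pointing the HEC
# settings at your Splunk instance (all values are placeholders)
helm repo add splunk https://splunk.github.io/splunk-connect-for-kubernetes/
helm install splunk-connect splunk/splunk-connect-for-kubernetes \
  --set global.splunk.hec.host=<splunk-host> \
  --set global.splunk.hec.port=8088 \
  --set global.splunk.hec.token=<hec-token>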
Hope it directs you to the right road.

execute Spark jobs, with Livy, using `--master yarn-cluster` without making systemwide changes

I'd like to execute a Spark job, via an HTTP call from outside the cluster using Livy, where the Spark jar already exists in HDFS.
I'm able to spark-submit the job from shell on the cluster nodes, e.g.:
spark-submit --class io.woolford.Main --master yarn-cluster hdfs://hadoop01:8020/path/to/spark-job.jar
Note that the --master yarn-cluster is necessary to access HDFS where the jar resides.
I'm also able to submit commands, via Livy, using curl. For example, this request:
curl -X POST --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main"}' -H "Content-Type: application/json" hadoop01:8998/batches
... executes the following command on the cluster:
spark-submit --class io.woolford.Main hdfs://hadoop01:8020/path/to/spark-job.jar
This is the same as the command that works, minus the --master yarn-cluster params. This was verified by tailing /var/log/livy/livy-livy-server.out.
So, I just need to modify the curl command to include --master yarn-cluster when it's executed by Livy. At first glance, it seems like this should be possible by adding arguments to the JSON dictionary. Unfortunately, these aren't passed through.
Does anyone know how to pass --master yarn-cluster to Livy so that jobs are executed on YARN without making systemwide changes?
I recently tried something similar to your question. I needed to send an HTTP request to Livy's API, with Livy already installed on a cluster (YARN), and then have Livy start a Spark job.
My command to call Livy did not include --master yarn-cluster, but that seems to work for me. Maybe you can try putting your JAR file on the local filesystem instead of on the cluster?
spark.master = yarn-cluster
Set it in the Spark conf; for me that is /etc/spark2/conf/spark-defaults.conf.
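Alternatively, depending on your Livy version, the batches API also accepts a conf map of Spark properties, which may let you set the master per request rather than systemwide (worth verifying against your Livy documentation, since Livy can override the master with its own livy.spark.master setting):
# Pass spark.master in the request's conf map (behavior depends on the Livy version)
curl -X POST -H "Content-Type: application/json" \
  --data '{"file": "/path/to/spark-job.jar", "className": "io.woolford.Main", "conf": {"spark.master": "yarn-cluster"}}' \
  hadoop01:8998/batches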

Understanding spark --master

I have a simple Spark app that reads the master from a config file:
new SparkConf()
.setMaster(config.getString(SPARK_MASTER))
.setAppName(config.getString(SPARK_APPNAME))
What will happen when I run my app as follows:
spark-submit --class <main class> --master yarn <my jar>
Is my master going to be overridden?
I prefer having the master provided in the standard way so I don't need to maintain it in my configuration, but then the question is: how can I run this job directly from IDEA? This isn't an application argument but a spark-submit argument.
Just for clarification, my desired end product should:
when run in a cluster using --master yarn, use that configuration
when run from IDEA, run with local[*]
Do not set the master in your code.
In production you can use the --master option of spark-submit, which tells Spark which master to use (yarn in your case). The value of spark.master in the spark-defaults.conf file will also do the job (--master takes priority, then the property in the configuration file).
In an IDE... well, I know that in Eclipse you can pass a VM argument in the Run Configuration, e.g. -Dspark.master=local[*] (https://stackoverflow.com/a/24481688/1314742).
In IDEA I think it is not much different; you can check how to add VM options there.
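As a sketch of the desired fallback behavior (assuming Scala; new SparkConf() picks up spark.master from system properties, so both spark-submit --master and a -Dspark.master VM option take precedence over the fallback):
// Only set a master when none was supplied externally, so that
// `spark-submit --master yarn` wins on the cluster and the IDE runs local[*].
val base = new SparkConf().setAppName(config.getString(SPARK_APPNAME))
val conf =
  if (base.contains("spark.master")) base    // master came from spark-submit or -Dspark.master
  else base.setMaster("local[*]")            // fallback for running inside the IDE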
