Get spark variables in oozie spark action - apache-spark

I am new to Spark and Oozie.
I am trying to get a few variables from Spark and use them in the next Oozie action.
In the "Decision" node a spark-submit is called, some processing is done, and a counter variable is generated,
e.g. var counter = 8 from Spark.
Now I need to use this variable in the next Oozie action, the "take decision" node.
[Decision] --counter--> [take decision]
When I googled I found a few solutions:
1. Write to HDFS
2. Wrap spark-submit in a shell action and use <capture-output>
(I cannot use this because I use the Oozie Spark action node.)
Are there any other ways to do the same?

The best approach is to store the values in HDFS (Hive) or in HBase/Cassandra and have your decision action read them.
If you wrap spark-submit in a shell action, you will have a problem when you submit the job in cluster mode: spark-submit hands the job to the YARN cluster and the driver runs on an arbitrary node, so you cannot capture its output.
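For illustration, here is a minimal Scala sketch of the HDFS variant, writing the counter to a small file that a later action can read; the output path (and the counter value itself) are assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical: persist the counter produced by the Spark job to HDFS so that a
// downstream Oozie action can read it and take the decision.
val counter = 8  // value computed earlier in the job
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/oozie/decision/counter"))  // path is an assumption
out.writeBytes(counter.toString)
out.close()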

Related

How to chain multiple jobs in Apache Spark

I would like to know whether there is a way to chain jobs in Spark so that the output RDD (or other format) of the first job is passed as input to the second job.
Is there any API for this in Apache Spark? Is this even an idiomatic approach?
From what I found, there is a way to spin up another process through the YARN client, for example Spark - Call Spark jar from java with arguments, but this assumes that you save the data to some intermediate storage between jobs.
There are also runJob and submitJob on SparkContext, but are they a good fit for this?
Use the same RDD definition to define the input/output of your jobs.
You should then be able to chain them.
The other option is to use DataFrames instead of RDDs and figure out the schema at run-time.
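As a rough sketch of what that can look like in code (paths and column names are hypothetical), both options boil down to reusing the first job's output as the second job's input inside one application:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("chained-jobs").getOrCreate()

// Job 1: read the raw input and materialize an intermediate result.
val raw = spark.read.json("hdfs:///data/input")        // input path is an assumption
val intermediate = raw.filter("value > 0").cache()     // shared definition reused below
intermediate.write.mode("overwrite").parquet("hdfs:///data/stage1")

// Job 2: chained directly off the cached intermediate DataFrame.
val summary = intermediate.groupBy("key").count()
summary.write.mode("overwrite").parquet("hdfs:///data/stage2")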

Apache Spark Correlation only runs on driver

I am new to Spark and have learned that transformations happen on the workers and actions on the driver, but that intermediate aggregation can also happen on the workers (if the operation is commutative and associative), which is what gives the actual parallelism.
I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
How can I find out which part of the correlation happens on the driver and which on the executors?
Update 1: The setup I am using to run the correlation is a cluster consisting of multiple VMs.
See the images from the Spark web UI here: Distributed cross correlation matrix computation
Update 2
I set up my cluster in standalone mode: a 3-node cluster with 1 master/driver (an actual machine, a workstation) and 2 VM slaves/executors.
I submit the job from the master node like this:
./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py
My correlation sample file is correlations_example.py:
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="CorrelationsExample")
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000), range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
I always get a sequential timeline in the web UI's event timeline.
Doesn't that mean it is not happening in parallel? Am I doing something wrong with the job submission, or is the correlation computation in Spark not parallel?
Update 3:
I even tried adding another executor, but I still get the same sequential treeAggregate.
I set up the Spark cluster as described here:
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/
Your statement is not entirely accurate. The container for the driver is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode (client or cluster). The actions are executed by the workers and the results are sent back to the driver (e.g. collect).
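As a small illustration of that split (not from the original answer): the transformation below runs on the executors, while the action's final result is assembled on the driver.

val data = sc.parallelize(1 to 1000000, numSlices = 8)  // partitions distributed across the executors
val squares = data.map(x => x.toLong * x)               // transformation: runs on the executors
val total = squares.reduce(_ + _)                       // action: partial sums on executors, final merge on the driver
println(total)                                          // the result is available only on the driver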
This has been answered already. See link below for more details.
When does an action not run on the driver in Apache Spark?

Run spark job in parallel and using single spark context running in local mode

I need to run some HQL statements via Spark. I have a jar containing a class that creates a Dataset from JSON, runs the HQL, and produces JSON. Finally, it saves that JSON to a text file on the local file system.
Spark is running in local mode.
Problem: the jobs run sequentially and every job starts its own SparkContext, so the overall runtime is high.
I want to create a single SparkContext and execute the jobs in parallel.
Option 1: queue-based model
I can create a long-running job that starts the SparkContext and listens on a Kafka queue; the JSON data and HQL are passed as Kafka messages (sketched below).
Option 2: Spark Streaming
Use Spark Streaming with Kafka to propagate the JSON data and HQL.
Or is there any other way to achieve this?
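A minimal Scala sketch of the queue-based model from Option 1, assuming a topic named hql-requests, a local Kafka broker, and a hypothetical output path; it only shows the shape of a single long-lived SparkSession serving HQL requests from a queue:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.SparkSession

// One long-lived SparkSession; each Kafka message carries an HQL statement to run.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("hql-worker")
  .enableHiveSupport()
  .getOrCreate()

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumption
props.put("group.id", "hql-workers")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("hql-requests"))   // topic name is an assumption

while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  records.forEach { record =>
    // Run the HQL on the shared context and append the result as JSON.
    spark.sql(record.value()).write.mode("append").json("/tmp/hql-output")  // path is an assumption
  }
}

For actual parallelism, several such consumer threads could share the same SparkSession, since the Spark scheduler accepts concurrent jobs submitted from multiple threads.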

Measure runtime of algorithm on a spark cluster

How do I measure the runtime of an algorithm in Spark, especially on a cluster? I am interested in measuring the time from when the Spark job is submitted to the cluster to when the submitted job has completed.
If it is important, I am mainly interested in machine learning algorithms using DataFrames.
In my experience, a reasonable approach is to measure the time from the submission of the job to its completion on the driver. This is achieved by surrounding the Spark action with timestamps:
val myRdd = sc.textFile("hdfs://foo/bar/..")
val startt = System.currentTimeMillis
val cnt = myRdd.count() // Or any other "action" such as take(), save(), etc
val elapsed = System.currentTimeMillis - startt
Notice that the initial sc.textFile() is lazy, i.e. it does not cause the Spark driver to submit a job to the cluster, so it does not really matter whether you include it in the timing or not.
A consideration for the results: the approach above is susceptible to variance due to existing load on the Spark scheduler and cluster. A more precise approach would have the Spark job write System.currentTimeMillis from inside its closure (executed on the worker nodes) to an accumulator at the beginning of its processing. This would remove the scheduling latency from the calculation.
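A rough sketch of that accumulator idea (names are illustrative, and myRdd is the RDD from the snippet above):

import scala.collection.JavaConverters._

// Each task records the wall-clock time at which it starts processing its partition;
// the earliest recorded value approximates when real work began on the executors.
val taskStarts = sc.collectionAccumulator[Long]("taskStarts")

val cnt = myRdd.mapPartitions { iter =>
  taskStarts.add(System.currentTimeMillis)  // executed on the worker nodes
  iter
}.count()

val workStarted = taskStarts.value.asScala.min           // earliest task start across the cluster
val elapsedOnCluster = System.currentTimeMillis - workStarted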
To measure the runtime of an algorithm, follow this procedure:
1. Set up a single-node or multi-node cluster.
2. Create a folder and save your algorithm in it (e.g. myalgo.scala/java/python).
3. Build it using sbt (you can follow this link to build your program: https://www.youtube.com/watch?v=1BeTWT8ADfE).
4. Run this command: $SPARK_HOME/bin/spark-submit --class "class name" --master "spark master URL" "target jar file path" "arguments if any"
For example: spark-submit --class "GroupByTest" --master spark://BD:7077 /home/negi/sparksample/target/scala-2.11/spark-sample_2.11-1.0.jar
5. After this, refresh the web UI (e.g. localhost:8080) and you will find all the information about your executed program there, including its runtime.

Spark Executor Id in JAVA_OPTS

I was trying to profile some Spark jobs and I want to collect Java Flight Recorder (JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect the JFR files for each run. I want to write a script that can collect the JFR file from each node in the cluster for a given job.
MR provides a way to name the JFR files generated by each task with the task ID: it replaces '#task#' with the task ID in the Java opts. With this I get a unique name for the JFR files created by each task, and since the task ID also contains the job ID, I can parse it to distinguish files generated by different MR jobs.
I am wondering whether Spark has something similar. Does Spark provide a way to determine the executor ID in the Java opts? Has anyone else tried to do something similar and found a better way to collect all the JFR files for a Spark job?
You can't set an executor ID in the opts, but you can get the executor ID from the event log, as well as the slave node hosting it.
However, I believe the options you give to spark-submit have the same effect on the executor JVMs for a YARN master and a standalone one, so you should be fine!
You can use the {{EXECUTOR_ID}} and {{APP_ID}} placeholders in the spark.executor.extraJavaOptions parameter. Spark will replace them with the executor's ID and the application's ID, respectively.
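For example, a hedged sketch of how that could be wired up when building the session (the exact JFR flags vary by JDK version, and the dump path is an assumption):

import org.apache.spark.sql.SparkSession

// Each executor dumps its recording to a file named after the application and
// executor IDs, via the {{APP_ID}} and {{EXECUTOR_ID}} placeholders.
val spark = SparkSession.builder
  .appName("jfr-profiling")
  .config("spark.executor.extraJavaOptions",
    "-XX:StartFlightRecording=dumponexit=true,filename=/tmp/jfr/{{APP_ID}}-{{EXECUTOR_ID}}.jfr")
  .getOrCreate()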
