How to chain multiple jobs in Apache Spark

I would like to know whether there is a way to chain jobs in Spark so that the output RDD (or other format) of the first job is passed as input to the second job.
Does Apache Spark provide an API for this? Is this even an idiomatic approach?
From what I have found, there is a way to spin up another process through the YARN client (for example, Spark - Call Spark jar from java with arguments), but this assumes that you save the data to some intermediate storage between jobs.
There are also runJob and submitJob on SparkContext, but are they a good fit for this?

Use the same RDD definition to define the input/output of your jobs.
You should then be able to chain them.
The other option is to use DataFrames instead of RDDs and figure out the schema at run-time.
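A minimal sketch of what that looks like (the paths and column names are hypothetical): as long as both stages run inside the same SparkContext/SparkSession, the second stage can consume the RDD or DataFrame produced by the first directly, with no intermediate storage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-jobs").getOrCreate()
sc = spark.sparkContext

# "Job 1": parse the raw input and cache it so later stages reuse the result.
parsed = (sc.textFile("hdfs:///data/input.txt")            # hypothetical path
            .map(lambda line: line.split(","))
            .cache())

# "Job 2": consumes the RDD produced above directly as its input.
counts = (parsed.map(lambda fields: (fields[0], 1))
                .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/wordcounts")           # hypothetical path

# DataFrame variant: build the schema at run time and keep chaining.
df = parsed.map(lambda f: (f[0], len(f))).toDF(["key", "num_fields"])
df.groupBy("key").count().write.mode("overwrite").parquet("hdfs:///data/summary")
```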

Related

Spark: How to write files to s3/hdfs from each executor

I have a use case where I am running some modeling code on each executor and want to store the result in S3/HDFS immediately, rather than waiting for all the executors to finish their tasks.
The DataFrame write API works in exactly the fashion you intend here: if you write the DataFrame to HDFS, the executors independently write the data into files rather than bringing it all to the driver and performing the write there.
Refer to this link for further reading on the topic.
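A minimal sketch of that pattern, assuming a Parquet target (the bucket and paths are placeholders): each executor writes its own part files directly to the destination, and nothing is collected on the driver first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-side-write").getOrCreate()

# Hypothetical input; in your case this would be the DataFrame holding the
# per-partition modeling results.
results = spark.read.parquet("hdfs:///input/features")

# Each executor writes its own part-*.parquet files straight to S3/HDFS;
# the rows never pass through the driver.
results.write.mode("overwrite").parquet("s3a://my-bucket/model-output/")
```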

Spark - jdbc read all happens on driver?

I have Spark reading from a JDBC source (Oracle). I specify lowerBound, upperBound, numPartitions, and partitionColumn, but looking at the web UI all the reading happens on the driver, not on the workers/executors. Is that expected?
In the Spark framework, in general, whatever code you write within a transformation such as map or flatMap is executed on the executors. To invoke a transformation you need an RDD, which is created from the dataset you are trying to compute on. To materialize the RDD you need to invoke an action so that the transformations are applied to the data.
I believe that in your case you have written a Spark application that reads the JDBC data. If that is the case, it will all be executed on the driver and not on the executors.
If you have not already, try creating a DataFrame using this API.
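For reference, a partitioned JDBC read might look like the sketch below (the connection details, table, and bounds are placeholders). With these options Spark splits the fetch into numPartitions range queries on the partition column, each of which runs as a task on an executor once an action is invoked.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

# Placeholder connection details; the read is split into 8 range queries on the
# partition column, each executed by a task on an executor.
df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
    table="MY_SCHEMA.MY_TABLE",
    column="ID",                 # partitionColumn: must be numeric, date, or timestamp
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties={"user": "app_user", "password": "secret",
                "driver": "oracle.jdbc.OracleDriver"},
)

df.count()  # an action triggers the distributed read
```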

Get spark variables in oozie spark action

I am new to spark and oozie technologies.
I am trying to get a few variables from Spark and use them in the next Oozie action.
In the "Decision" node, spark-submit is called, some processing is done, and a counter variable is generated.
Eg: var counter = 8 from spark
So now I need to use this variable in the next Oozie action, which is the "take decision" node.
take decision
[Decision ][counter]
When I googled, I was able to find a few solutions:
1. Write to HDFS
2. Wrap spark-submit in a shell action and use <capture-output>
(I am not able to use the second one because I use the Oozie spark action node.)
Are there any other ways to do the same?
The best approach is to store the values in HDFS (Hive) or HBase/Cassandra and have your decision action read those values.
If you wrap spark-submit in a shell action, there will be a problem if you submit the job in cluster mode, because spark-submit sends the job to the YARN cluster and it can run on any node, where you cannot capture the output.
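As a hedged sketch of the HDFS option (the output path is hypothetical), the Spark action can persist the counter to a well-known location that the downstream action or decision logic then reads:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-counter").getOrCreate()
sc = spark.sparkContext

counter = 8  # value computed by your Spark processing

# Write the value as a single text file under a well-known (hypothetical) HDFS
# path; the next Oozie action reads it from there to take the decision.
sc.parallelize([str(counter)], 1).saveAsTextFile("hdfs:///workflow/run-001/counter")
```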

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run k-means on EC2 using Spark's MLlib KMeans. As I was reading through the tutorial, I found the following code snippet at
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
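(The snippet on that page is roughly the following, paraphrased here from the MLlib clustering guide; sc is the already-created SparkContext and the data file ships with the Spark distribution.)

```python
from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the sample data.
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Cluster the data into two groups.
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
```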
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do the nodes communicate the partial result of each iteration? Is this handled inside the kmeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions typically match the partitioning of the data in HDFS. You can manually change the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data are partitioned by, for example, a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. The creation of data partitions is hidden in the RDD.getPartitions methods.
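A quick sketch of inspecting and changing the partitioning (the file path is a placeholder):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("partitioning-demo").getOrCreate().sparkContext

rdd = sc.textFile("hdfs:///data/points.txt")   # placeholder path
print(rdd.getNumPartitions())                  # how many partitions Spark created

rdd16 = rdd.repartition(16)                    # redistribute into 16 partitions
print(rdd16.getNumPartitions())                # each partition can be processed by a separate task
```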
Resource scheduling depends on the cluster manager. One could write a very long post about it ;) I think that for this question, the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data by launching the closures on each partition.
Currently, after each iteration the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is making it possible to schedule multiple jobs at one time, so that the data is not sent back to the driver: the DAG is created once and, for example, a job with 10 iterations is scheduled, with the shuffle between iterations limited or eliminated entirely. Currently, the DAG is created in each iteration and a job is scheduled to the executors each time.
There is a very helpful presentation about resource scheduling in Spark and about Spark Drizzle.

Apache Spark: Python function serialized automatically

I was going through the Apache Spark documentation. The Spark docs for Python say the following:
...We can pass Python functions to Spark, which are automatically
serialized along with any variables that they reference...
I don't fully understand what this means. Does it have something to do with the RDD type?
What does it mean in the context of spark?
The serialization is necessary when using PySpark because the function you define locally needs to be executed remotely on each of the worker nodes. This concept isn't really related to the RDD type.
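A small sketch of what that means in practice: both the function and the local variable it references are pickled on the driver and shipped to the executors that run the map.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("closure-serialization").getOrCreate().sparkContext

multiplier = 3  # a plain local variable on the driver

def scale(x):
    # This function, together with the `multiplier` it references, is serialized
    # (pickled) on the driver and executed remotely on each worker.
    return x * multiplier

print(sc.parallelize(range(10)).map(scale).collect())
# [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```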

Resources