Can we submit a spark job from spark itself (maybe from another spark program)? - apache-spark

Can anyone clarify a question I was asked in one of my interviews? It could be that the question itself is wrong; I am not sure. However, I have searched everywhere and could not find anything related to it. The question is:
Can we run a spark job from another spark program?

Yup, you are right, it doesn't make much sense. We can run our application through our driver program, but that is the same as running it from any other application using Spark Launcher https://github.com/phalodi/Spark-launcher . The exception is that we can't launch an application inside RDD closures, because those run on the worker nodes, so it will not work.
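As a minimal sketch of that idea, an external program can drive spark-submit through Spark's org.apache.spark.launcher.SparkLauncher API; the paths, class name and master below are placeholders invented for illustration, not taken from the question:

import org.apache.spark.launcher.SparkLauncher

object LaunchFromAnotherApp {
  def main(args: Array[String]): Unit = {
    // All paths and names below are placeholders for illustration.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                    // assumed Spark installation directory
      .setAppResource("/jobs/my-spark-app.jar")      // hypothetical application jar
      .setMainClass("com.example.MySparkApp")        // hypothetical main class
      .setMaster("yarn")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()                                      // spawns spark-submit as a child process

    process.waitFor()                                // wait for the launched application to finish
  }
}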

Can we run a spark job from another spark program?
I'd focus on another part of the question since the following holds for any Spark program:
Can we run a Spark job from any Spark program?
That means that either there was a follow-up question or some introduction to the one you were asked.
If I were you and heard the question, I'd say "Yes, indeed!"
In other words, a Spark application is a launcher of Spark jobs, and the only reason to have a Spark application is to run Spark jobs sequentially or in parallel.
Any Spark application does this and nothing more.
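To make that concrete, here is a minimal sketch (names and numbers are made up, not from the original answer) of a Spark application whose only purpose is to fire off two jobs, one per action:

import org.apache.spark.sql.SparkSession

// Minimal sketch: each action below (count, reduce) triggers its own Spark job.
object TwoJobsApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-jobs")
      .master("local[*]")                 // assumed local master, just for the example
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(1 to 1000)

    val count = nums.count()              // job 1
    val sum   = nums.reduce(_ + _)        // job 2, runs after job 1 completes

    println(s"count=$count sum=$sum")
    spark.stop()
  }
}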
A Spark application is a Scala application (when Scala is the programming language), and as such it is possible to run a Spark program from another Spark program (whether it makes sense in general I put aside, as there could be conflicts with multiple SparkContexts in one single JVM).
Given that the other Spark application is a separate executable jar, you could launch it using the Process API in Scala (as you would any other application):
scala.sys.process This package handles the execution of external processes.
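For example, a rough sketch (with a hypothetical jar path and class name) of invoking spark-submit for the other application through scala.sys.process could look like this:

import scala.sys.process._

object SubmitViaProcess {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar and class name, used only to illustrate the Process API.
    val cmd = Seq(
      "spark-submit",
      "--class", "com.example.OtherSparkApp",
      "--master", "local[*]",
      "/jobs/other-spark-app.jar"
    )

    val exitCode = cmd.!                  // runs the command and waits for it to finish
    println(s"spark-submit exited with code $exitCode")
  }
}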

Related

Run both batch and real time jobs on Spark with jobserver

I have a Spark job that runs every day as part of a pipeline and performs simple batch processing - let's say, adding a column to a DF with another column's value squared (old DF: x; new DF: x, x^2).
I also have a front-end app that consumes these 2 columns.
I want to allow my users to edit x and get the answer from the same code base.
Since the batch job is already written in Spark, I looked for a way to achieve that against my Spark cluster and ran into spark-jobserver, which I thought might help here.
My questions:
Can Spark Jobserver support both batch and single-request processing?
Can I use the same Jobserver-compatible JAR to run a Spark job on AWS EMR?
I'm open to hearing about other tools that can help with such a use case.
Thanks!
Not sure I understood your scenario fully, but with Spark Jobserver you can configure your batch jobs and pass different parameters to them.
Yes, once you have a Jobserver-compatible JAR, you should be able to use it with Jobserver running against Spark in standalone mode, on YARN, or on EMR. But please take into account that you will need to set up Jobserver on EMR yourself; the open source documentation seems to be a bit outdated at the moment.
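As a rough sketch of what a Jobserver-compatible job can look like with the classic spark.jobserver.SparkJob trait (newer Jobserver releases expose a slightly different job API), the config key input.x below is invented for the example; the same jar could then serve both the daily batch run and ad-hoc requests from the front-end:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// Sketch of a Jobserver-compatible job using the classic SparkJob API.
object SquareJob extends SparkJob {

  // Reject requests that do not carry the (invented) "input.x" parameter.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.x")) SparkJobValid
    else SparkJobInvalid("config must contain input.x")

  // Compute (x, x^2) with the shared SparkContext managed by Jobserver.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val x = config.getDouble("input.x")
    sc.parallelize(Seq(x)).map(v => (v, v * v)).collect()
  }
}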

Importing vs Installing spark

I am new to the spark world and to some extent coding.
This question might seem too basic but please clear my confusion.
I know that we have to import spark libraries to write spark application. I use intellij and sbt.
After writing an application, I can also run it and see the output by hitting "run".
My question is, why should I install Spark separately on my (local) machine if I can just import it as a library and run my application?
Also, what is the need for it to be installed on the cluster, since we can just submit the jar file and a JVM is already present on all the machines of the cluster?
Thank you for the help!
I understand your confusion.
Actually, you don't really need to install Spark on your machine if you are, for example, writing it in Scala/Java: you can just import spark-core or any other dependencies into your project, and once you start your Spark job from the main class it will create a standalone Spark runner on your machine and run your job on it (local[*]).
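As a sketch of that setup (the dependency and version are only examples, adjust them to your Scala/Spark setup), this is all you need to run from IntelliJ without a separate Spark installation:

// build.sbt (assumed version, match it to your own setup):
//   libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"

import org.apache.spark.sql.SparkSession

object RunFromIde {
  def main(args: Array[String]): Unit = {
    // No separate Spark installation needed: the imported jars spin up an
    // embedded local[*] runner inside this JVM when you hit "run" in IntelliJ.
    val spark = SparkSession.builder()
      .appName("run-from-ide")
      .master("local[*]")
      .getOrCreate()

    spark.range(10).show()
    spark.stop()
  }
}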
There are many reasons for having spark on your local machine.
One of them is running a Spark job with pyspark, which requires the Spark/Python/etc. libraries and a runner (local[*] or a remote master).
Another reason can be if you want to run your job on-premise.
It might be easier to create a cluster in your local datacenter, perhaps appointing your machine as the master and the other machines connected to it as workers. (This solution might be a bit naive, but you asked for the basics, so it might spark your curiosity to read more about the infrastructure design of data processing systems.)

Spark Job Server multithreading and dynamic allocation

I had pretty big expectations of Spark Job Server, but found out it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through Spark session?
Is it possible to run a few jobs in parallel with Spark Job Server? I saw people facing some troubles, and I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, cores, and executor configs?
Spark Jobserver does not support SparkSession yet. We will be working on it.
You can either create multiple contexts or run a single context with the FAIR scheduler.
Use different contexts with different resource configs.
Basically, Jobserver is just a REST API for creating Spark contexts, so you should be able to do whatever you can do with a Spark context.
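As a sketch of the single-context option, plain Spark lets you enable the FAIR scheduler and submit jobs from separate threads against one context; the pool names and master below are made up, and real pools would normally be tuned via a fairscheduler.xml:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.sql.SparkSession

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-scheduling")
      .master("local[*]")                       // assumed local master for the sketch
      .config("spark.scheduler.mode", "FAIR")   // let concurrent jobs share the context fairly
      .getOrCreate()
    val sc = spark.sparkContext

    // Two jobs submitted from separate threads against the same SparkContext.
    val jobA = Future {
      sc.setLocalProperty("spark.scheduler.pool", "poolA")   // pool names are invented
      sc.parallelize(1 to 1000000).map(_ * 2).count()
    }
    val jobB = Future {
      sc.setLocalProperty("spark.scheduler.pool", "poolB")
      sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count()
    }

    println(s"jobA=${Await.result(jobA, Duration.Inf)} jobB=${Await.result(jobB, Duration.Inf)}")
    spark.stop()
  }
}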

Running spark streaming forever on production

I am developing a spark streaming application which basically reads data off kafka and saves it periodically to HDFS.
I am running pyspark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client, what would you do in order to keep it running forever? You wouldn't just give this file and say "Run this on the terminal". It's too unprofessional.
What I want to do is submit the job to the cluster (or to processors locally) and never have to watch logs on the console, or use a solution like Linux screen to run it in the background (because it seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation here: spark-jobserver.

How to determine the underlying MapReduce jobs in Spark?

Given a Spark application, how to determine how the application is mapped into its underlying MapReduce jobs?
The Spark application itself doesn't know anything about the underlying execution framework. That is part of the abstraction which allows it to run in different modes (local, Mesos, standalone, yarn-client and yarn-cluster).
You will, however, see the YARN application id after submitting your application with spark-submit; it's usually something like this:
application_1453729472522_0110
You can also use the yarn command to list currently running applications like this:
yarn application -list
That will print all applications running in the cluster; Spark applications have the applicationType SPARK.
I would say each stage is a MapReduce job. I cannot give you a reference for this, but from my experience, looking at the stage construction you can see what was cast as a Map phase (chained maps, filters, flatMaps) and what was cast as a Reduce phase (groupBy, collect, join, etc.) and grouped into one stage. You can also deduce Map-only or Reduce-only MapReduce jobs.
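As an illustration (not from the original answer), the following toy program produces a job whose narrow transformations collapse into one stage, while reduceByKey introduces a shuffle boundary, i.e. a second stage you can inspect in the Spark UI:

import org.apache.spark.sql.SparkSession

object StageBoundaries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundaries")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))      // narrow: chained into the "map side" stage
      .map(word => (word, 1))     // narrow: still the same stage
      .reduceByKey(_ + _)         // shuffle: a new "reduce side" stage begins here

    counts.collect().foreach(println)

    // The lineage mirrors the stage/DAG view shown in the Spark UI.
    println(counts.toDebugString)
    spark.stop()
  }
}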
It also helps to output the DAG, as you see the same chaining there again.
You can access the Stages in the Spark UI while your spark job is running.
Disclaimer: this is deduced from experience and deductive reasoning.
