Spark UI for Pyspark - apache-spark

How to access Spark UI for pyspark jobs?
I am trying to open localhost:4040 to track my jobs, but the page does not load. It works when I open spark-shell, but when I open pyspark I cannot reach it.

The Spark UI is only available while your Spark session is alive. Also, Spark looks for ports starting at 4040 and moves to the next port if that one cannot be used. If you start spark-shell, it mentions at the beginning which port it is using for the Spark UI.

The Spark UI provides a real-time view of your Spark job, and if your job terminates you lose that view. To preserve it, add blocking code such as input() at the end of your Spark job. As Relic16 said, Spark starts at port 4040 and, if that port is occupied, tries 4041 and so on. Also, if you look at the logs carefully, Spark mentions the IP and port of the UI.
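For example, a minimal PySpark sketch along these lines (the app name and the summed range are just illustrative) keeps the UI reachable until you press Enter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-demo").getOrCreate()

# Do some work so there is something to inspect in the UI
spark.range(1000000).selectExpr("sum(id)").show()

# Keep the driver, and therefore the UI (http://localhost:4040 by default), alive
input("Press Enter to stop the Spark session and its UI...")
spark.stop()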

use sparkContext.uiWebUrl to get the URL, where sparkContext is an instance of SparkContext
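For example, in the pyspark shell, where sc is the predefined SparkContext:

print(sc.uiWebUrl)  # e.g. http://<driver-host>:4040, or 4041/4042/... if 4040 was taken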

Related

Spark context is created every time a batch job is executed in yarn

I am wondering whether there is any way to create the Spark context once in the YARN cluster so that incoming jobs reuse that context. Context creation takes 20 seconds or sometimes more in my cluster. I am using PySpark for scripting and Livy to submit jobs.
No, you can't just have a standing SparkContext running in YARN. Another idea is to run in client mode, where the client has its own SparkContext (this is the method used by tools like Apache Zeppelin and the spark-shell).
An option would be to use Apache Livy. Livy is an additional server in your YARN cluster that provides an interface for clients that want to run Spark jobs on the cluster. One of Livy's features is that you can:
Have long-running Spark contexts that can be used for multiple Spark jobs, by multiple clients
If the client is written in Scala or Java it is possible to use a programmatic API:
LivyClient client = new LivyClientBuilder()....build();
Object result = client.submit(new SparkJob(sparkParameters)).get();
All other clients can use a REST API.
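As a rough sketch of the REST route (the Livy host/port are assumptions, and the requests library is a third-party HTTP client), a Python client could create one long-running PySpark session and run statements against it:

import json
import time
import requests  # third-party HTTP client

livy = "http://livy-host:8998"  # assumed Livy server address
headers = {"Content-Type": "application/json"}

# Create a long-running PySpark session that several jobs/clients can reuse
resp = requests.post(livy + "/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=headers)
session_id = resp.json()["id"]

# Wait for the session to become idle before submitting code to it
while requests.get(f"{livy}/sessions/{session_id}", headers=headers).json()["state"] != "idle":
    time.sleep(2)

# Run a statement against the already-running SparkContext
stmt = {"code": "print(spark.range(100).count())"}
resp = requests.post(f"{livy}/sessions/{session_id}/statements",
                     data=json.dumps(stmt),
                     headers=headers)
print(resp.json())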

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow for orchestration of jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit.
What is the best way to track a Spark job with Airflow once I have submitted it?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Evaluated against the other possibilities, Livy is arguably the best option for remote spark-submit:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via the REST API and keep printing logs to the console; those will appear in the task logs in the Airflow web UI (View Logs)
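A minimal polling sketch (the Livy address, batch id, terminal-state list and interval are assumptions) could look like this:

import time
import requests  # third-party HTTP client

livy = "http://livy-host:8998"  # assumed Livy server address
batch_id = 42                   # id returned by the earlier POST /batches call

# Poll until the batch reaches a terminal state; the printed lines end up
# in the Airflow task log (View Logs).
state = None
while state not in ("success", "error", "dead", "killed"):
    state = requests.get(f"{livy}/batches/{batch_id}/state").json()["state"]
    print(f"Livy batch {batch_id} state: {state}")
    time.sleep(30)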
Other considerations
Livy doesn't support reusing a SparkSession across POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR

Is there any way to submit spark job using API

I am able to submit a Spark job on a Linux server using the console. But is there any API or framework that would let me submit a Spark job to the Linux server programmatically?
You can use port 7077 to submit Spark jobs to your Spark cluster instead of using spark-submit. For example:
import org.apache.spark.sql.SparkSession
// Connect directly to the standalone master instead of going through spark-submit
val spark = SparkSession
  .builder()
  .master("spark://master-machine:7077")
  .getOrCreate()
You can also look into the Livy server. It is GA in the Hortonworks and Cloudera distributions of Apache Hadoop. We have had good success with it, and its documentation is good enough to get started. Spark jobs start instantaneously when submitted via Livy since it has multiple SparkContexts running inside it.
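As a sketch of that route (the Livy address, JAR path and class name below are placeholders, and requests is a third-party HTTP client), submitting a job through Livy's batch REST API could look like this:

import json
import requests  # third-party HTTP client

livy = "http://livy-host:8998"  # assumed Livy server address

# Submit an application JAR as a Livy batch; the file must be reachable
# by the cluster (e.g. on HDFS). Path and class name are placeholders.
payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",
    "className": "com.example.MySparkJob",
    "args": ["arg1"],
}
resp = requests.post(livy + "/batches",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # contains the batch id and initial state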

Zeppelin - how to connect to spark ui when spark interpreter configured in local mode

I am using Zeppelin with the Spark interpreter configured with master = local[*].
I need to connect to the Spark web UI to observe the tasks and the execution DAG. Does Zeppelin provide access to the Spark web UI with the above configuration?
Local mode means the Spark UI will be accessible on the same host as Zeppelin, and unless the UI port is taken (or configured explicitly), the UI will use the default port 4040.
So, for example, if Zeppelin's host is localhost, try http://localhost:4040
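If you want the exact address rather than guessing the port, a quick check from a Zeppelin paragraph (assuming the %pyspark interpreter and its predefined sc) is:

%pyspark
# sc is the SparkContext created by Zeppelin's Spark interpreter
print(sc.uiWebUrl)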

Flume connections refused to spark streaming job

This concerns the connection between Flume and my Spark Streaming application. I'm working on a cluster with x nodes. The documentation says:
"When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can be configured to push data to a port on that machine."
I understood that my Spark Streaming job must be launched from a possible worker (all nodes are workers, but I don't use all of them), and I have also configured Flume to push data to a hostname/port that is a possible worker for my streaming job. Still, I get a connection refused on this hostname/port, even though there is no firewall and the port is not used by anything else. I'm sure I have misunderstood something. Does anyone have any idea?
PS1: I'm using Spark 1.2.0
PS2: My code is tested locally and runs as expected
PS3: Probably I've understood things wrong since I'm quite new to the whole hadoop/spark thing.
Thanks in advance!

Resources