I am running Spark jobs in a Jupyter notebook deployed in an EKS cluster. JupyterLab provides a Spark UI monitoring extension where I can view my Spark jobs by clicking on the "SparkMonitor" tab. I am also trying to access the History Server, which is deployed on a different pod. What is the best way for me to access the History Server? Is there any way I can route to the History Server from within the Jupyter notebook?
Related
How do I pull these metrics out of the Spark history logs? Is there an API I can pull them from?
I tried downloading the JSON event logs, but I can't grep for the numbers seen in the photo.
The Spark History Server keeps all of that information for you, and you can access it through its REST API.
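As a minimal sketch of what that can look like: the History Server serves JSON under /api/v1, for example /api/v1/applications and /api/v1/applications/<app-id>/stages. The host in HISTORY_SERVER and the helper names below are assumptions made for illustration, and the counters that are summed are only a few of the fields in each stage record, so they may not be the exact numbers you are after.

```python
import json
from urllib.request import urlopen

# Assumption: the History Server is reachable here on its default port 18080.
HISTORY_SERVER = "http://localhost:18080"


def api_url(base, *parts):
    """Join path segments under the /api/v1 REST root."""
    return "/".join([base.rstrip("/"), "api", "v1", *parts])


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_stages(stages):
    """Total a few per-stage counters from a list of stage records."""
    keys = ("inputBytes", "outputBytes", "shuffleReadBytes", "shuffleWriteBytes")
    return {k: sum(s.get(k, 0) for s in stages) for k in keys}


def stage_totals(app_id, base=HISTORY_SERVER):
    """Fetch all stages of one application and total their counters."""
    url = api_url(base, "applications", app_id, "stages")
    return summarize_stages(fetch_json(url))
```

Calling fetch_json(api_url(HISTORY_SERVER, "applications")) should list the known applications, and the "id" of each entry is what stage_totals expects as app_id.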
If you are on EMR:
You can view the Spark web UIs by following the procedures to create
an SSH tunnel or create a proxy in the section called Connect to the
cluster in the Amazon EMR Management Guide and then navigating to the
YARN ResourceManager for your cluster. Choose the link under Tracking
UI for your application. If your application is running, you see
ApplicationMaster. This takes you to the application master's web UI
at port 20888 wherever the driver is located. The driver may be
located on the cluster's primary node if you run in YARN client mode.
If you are running an application in YARN cluster mode, the driver is
located in the ApplicationMaster for the application on the cluster.
If your application has finished, you see History, which takes you to
the Spark HistoryServer UI port number at 18080 of the EMR cluster's
primary node. This is for applications that have already completed.
You can also navigate to the Spark HistoryServer UI directly at
http://master-public-dns-name:18080/.
I have an Apache Spark pool running on web.azuresynapse.net and can use it, for example, to run Spark jobs from my "Synapse Analytics workspace".
I can develop and run Python notebooks using the Spark pool from there.
From what I see when I run a job, Livy is supported, though the link provided by Azure Synapse is inaccessible for some reason.
How could I connect to that pool, e.g. using Livy from my local Jupyter notebook, and use the pool? Or is a pipeline the only way to run Spark code?
I work at a place where I have a laptop, and every day I connect to a remote server over a shell and do everything (run Jupyter notebooks, use PySpark for Spark jobs) on the server.
I want to keep a log of all the server resources that I use when I run my Spark job (memory, CPU usage, etc.).
I thought one way to do this would be to look at the web UI, but I can't connect to the web UI.
I got all the properties of my driver (IP, port, and so on) using sc._conf.getAll().
I am running Spark on YARN in client mode:
(u'spark.master', u'yarn-client'),
I tried those values in a web browser but could not connect.
I created a cluster in Google Cloud and submitted a Spark job. Then I connected to the UI following these instructions: I created an SSH tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via ssh and run spark-shell, this "job" does show up in the hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
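One way to avoid this trap is to never hardcode the master in application code and to set it only when it is supplied from outside. The sketch below assumes a hypothetical SPARK_MASTER environment variable (a name chosen for this example, not something Spark itself reads) and hypothetical helper names. When the variable is unset, the builder leaves the master alone so that the value passed by spark-submit or the cluster defaults applies.

```python
import os


def resolve_master(env=None):
    """Return the master URL from SPARK_MASTER, or None to defer to the launcher."""
    env = os.environ if env is None else env
    # SPARK_MASTER is a hypothetical variable used only by this example.
    return env.get("SPARK_MASTER") or None


def build_session(app_name="example-app"):
    """Create a SparkSession, setting the master only when one was configured."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    master = resolve_master()
    if master:
        builder = builder.master(master)
    return builder.getOrCreate()
```

For local testing you could export SPARK_MASTER="local[*]" before starting Python. On the cluster you would leave it unset, so the job runs on YARN and shows up in the Hadoop UI.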
I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.
I have an IPython notebook running in a Docker container in Google Container Engine. The container is based on the image jupyter/all-spark-notebook.
I also have a Spark cluster created with Google Cloud Dataproc.
The Spark master and the notebook are running in different VMs but in the same region and zone.
My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).
What I found there describes connecting a local browser over an SSH tunnel.
Has anybody already done this kind of setup?
Thank you in advance.
Dataproc runs Spark on YARN, so you need to set the master to 'yarn-client' (Spark 2.0 and later deprecate that value in favor of master 'yarn' with client deploy mode). You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster so that it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent config, you could bake these into a local file 'core-site.xml' as described here, place that in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
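A rough sketch of that approach, generating the file from Python: the helper names and the directory are made up for this example, and the property keys are the same ones used above with the 'spark.hadoop.' prefix removed. HADOOP_CONF_DIR has to be set before the SparkContext (and its JVM) is started, otherwise it is not picked up.

```python
import os
import xml.etree.ElementTree as ET


def write_core_site(conf_dir, properties):
    """Write properties as a Hadoop-style core-site.xml and return its path."""
    os.makedirs(conf_dir, exist_ok=True)
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    path = os.path.join(conf_dir, "core-site.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


def use_hadoop_conf(conf_dir):
    """Point Hadoop clients started from this process at conf_dir."""
    os.environ["HADOOP_CONF_DIR"] = os.path.abspath(conf_dir)
```

For example, write_core_site("hadoop-conf", {"yarn.resourcemanager.address": "<spark-master-hostname>", "fs.default.name": "hdfs://<spark-master-hostname>/"}) followed by use_hadoop_conf("hadoop-conf"), both called before creating the SparkContext.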
It's also worth noting that while being in the same zone matters for performance, what actually allows your VMs to communicate is being in the same network and allowing TCP between internal IP addresses in that network. If you are using the default network, the default-allow-internal firewall rule should be sufficient.
Hope that helps.