Spark 2.1.0 reverse proxy does not work properly - apache-spark

I'm trying to proxy individual Spark applications, meaning I need a single UI per Spark application. To achieve that, I use the Spark reverse proxy feature. With my Spark master UI running at http://localhost:8080, when I click on an application name in that UI I am redirected to http://localhost:8080/proxy/{application-id}/jobs/, where application-id is the ID of the Spark application I'm trying to access. Everything looks good: I get the Spark jobs UI for this particular application, with the other tabs displayed. But when I click on another tab, for instance "Environment", I am redirected to http://localhost:8080/environment instead of http://localhost:8080/proxy/{application-id}/environment/.
This is the single line I added to my spark-defaults.conf file:
spark.ui.reverseProxy=true
I use Spark 2.1.0 in standalone mode and deploy some sample applications to reproduce the issue. Any clue? How can I make this proxy work without this issue? Thanks.

I had this problem.
Make sure that you correctly supply the spark.ui.reverseProxy and spark.ui.reverseProxyUrl properties to all masters, workers and drivers.
In my case I used spark-submit (cluster mode) from a remote machine and forgot to update the local spark-defaults.conf on the machine I was submitting from.
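A minimal spark-defaults.conf sketch of those two properties, assuming the proxy in front of the master is the question's http://localhost:8080 (adjust the URL to your own front end):
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=http://localhost:8080
The same lines need to be present on the machine you submit from as well, not only on the master and workers.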

Related

Cluster-wide Spark configuration in standalone mode

We are running a Spark standalone cluster in a Docker environment. How do I set cluster-wide configurations that every application submitted to the cluster uses? As far as I understand it, the local spark-defaults.conf of the host submitting the application gets used, even when cluster is used as the deploy mode. Can that be changed?
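As an illustration of that behaviour (a hedged sketch; the master URL, file path and script name are placeholders): spark-submit reads the spark-defaults.conf of the host it runs on, and passing --properties-file makes that choice explicit per submission:
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --properties-file /path/to/shared/spark-defaults.conf \
  app.py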

Spark jobs not showing up in Hadoop UI in Google Cloud

I created a cluster in Google Cloud and submitted a Spark job. Then I connected to the UI following these instructions: I created an SSH tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via SSH and run spark-shell, this "job" does show up in the Hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
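A minimal PySpark sketch of that fix (the app name is a placeholder):
from pyspark.sql import SparkSession

# With .master("local[*]") the job runs entirely in the local driver process,
# so it never appears in the YARN / Hadoop web interface:
# spark = SparkSession.builder.master("local[*]").appName("my-app").getOrCreate()

# Without the hard-coded master, spark-submit (e.g. --master yarn) decides:
spark = SparkSession.builder.appName("my-app").getOrCreate()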

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my own machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. Check under the Executors tab; it also shows the node IP addresses.
Could you please share your spark-submit configuration?
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do that, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. Anyway, check this link for the other parameters that spark-submit accepts.
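A slightly fuller sketch of such a submission (the deploy mode and resource numbers are placeholders, not values taken from the question):
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 6 \
  --executor-cores 2 \
  run.py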
For a Dataproc cluster (Google Cloud's managed Hadoop cluster) you have two options to check the job history, including jobs that are still running:
By command line from the master: yarn application -list. This option sometimes needs additional configuration; if you have trouble, this link will be useful.
By UI. Dataproc gives you access to the Spark web UI, which makes monitoring easier. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create an SSH tunnel and configure your browser to use a SOCKS proxy, as sketched below.
Hope the information above helps.
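A hedged sketch of that tunnel, assuming the gcloud CLI, a master node named cluster-m in zone us-central1-a, and an arbitrary SOCKS port of 1080 (all placeholders):
# open a SOCKS proxy through the Dataproc master node
gcloud compute ssh cluster-m --zone=us-central1-a -- -D 1080 -N

# then start a browser that uses the proxy, e.g. Chrome:
# chrome --proxy-server="socks5://localhost:1080" http://cluster-m:8088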

Why does spark UI at port 18080 say "cluster mode" when I launched in client mode in the config file?

In my spark-defaults.conf file I have the following line:
spark.master=yarn-client
Now, I launch a job, and I look at the spark UI (which is at <master ip address>:18080) and I see the following at the top of the page:
REST URL: spark://<master ip address>:6066 (cluster mode)
I restarted all of the spark workers and spark master, and distributed the spark-defaults.conf file to all of the spark workers/slaves.
I cannot tell whether this is running in cluster mode or client mode. And why is my setting not getting picked up by the Spark UI?
The Spark UI running on port 18080 is the Spark history server. If you want to find which mode your particular application ran in, go to <master ip address>:18080 and click on any ID under App ID, which will take you to the Spark Jobs page.
On that page, click on the Environment tab. In that tab, look for the Spark Properties section; under it you will find the spark.master property, which will tell you which mode that application ran in.
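For reference, a hedged spark-defaults.conf sketch of the non-deprecated equivalent (in Spark 2.x the combined yarn-client master string is split into a master and a deploy mode):
spark.master=yarn
spark.submit.deployMode=client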

How to set up the cores of the driver in Spark?

I am using the client deploy mode with Spark standalone from a Jupyter notebook. My configuration in the application looks like this:
from pyspark import SparkConf

# driver/executor resources requested from the standalone master
conf = SparkConf()
conf.setMaster('spark://hadoop-master:7077')
conf.set("spark.driver.cores", "4")
conf.set("spark.driver.memory", "8g")
conf.set("spark.executor.memory", "16g")
But I suspect that spark.driver.cores is not taking effect; the Spark web UI shows zero cores for the driver.
Why are the cores zero? How can I modify this value?
Additionally, the Environment tab of the Spark web UI shows these properties, so the configuration does seem to have been uploaded to the server.
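For comparison, a hedged sketch of the same values passed as spark-submit flags instead of SparkConf calls (app.py is a placeholder); note that spark-submit's help text lists --driver-cores as applying in cluster deploy mode, which may be relevant since the question uses client mode:
spark-submit \
  --master spark://hadoop-master:7077 \
  --driver-cores 4 \
  --driver-memory 8g \
  --executor-memory 16g \
  app.py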
