How to set up the driver cores in Spark? - apache-spark

I am using the client deploy mode with Spark standalone from a Jupyter notebook. The configuration in my application looks like this:
conf.setMaster('spark://hadoop-master:7077')
conf.set("spark.driver.cores","4")
conf.set("spark.driver.memory","8g")
conf.set("spark.executor.memory","16g")
But it seems that spark.driver.cores is not taking effect; the Spark web UI shows zero cores for the driver.
Why is the number of cores zero? How can I modify this value?
Additionally, the Environment tab of the Spark web UI does show the values I set,
so the configuration seems to have been passed to the server.
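For reference, here is a minimal runnable sketch of the configuration described above, assuming pyspark is installed and a standalone master is reachable at spark://hadoop-master:7077 as in the question; the app name is just an illustrative placeholder:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setAppName("driver-cores-test")          # placeholder name, for illustration only
conf.setMaster("spark://hadoop-master:7077")  # standalone master from the question
conf.set("spark.driver.cores", "4")           # per the Spark docs, honored only in cluster mode
conf.set("spark.driver.memory", "8g")         # per the docs, in client mode this must be set
                                              # before the driver JVM (here, the notebook) starts
conf.set("spark.executor.memory", "16g")

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.driver.cores"))  # check what actually reached the context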

Related

Cluster-wide Spark configuration in standalone mode

We are running a Spark standalone cluster in a Docker environment. How do I set cluster-wide configurations that every application submitted to the cluster uses? As far as I understand it, the local spark-defaults from the host submitting the application get used, even if the deploy mode is cluster. Can that be changed?

Determine where a Spark program is failing?

Is there any way to debug a Spark application that is running in cluster mode? I have a program that has been running successfully for a while and processes a couple hundred GB at a time. Recently, some data caused the run to fail due to executors being disconnected. From what I have read, this is likely a memory issue. I'm trying to determine which function/action is causing the memory issue to trigger. I am using Spark on an EMR cluster (which uses YARN); what would be the best way to debug this issue?
For cluster mode you can go to the YARN Resource Manager UI and select the Tracking UI for your specific running application (which points to the Spark driver running on the Application Master within the YARN Node Manager) to open the Spark UI, which is the core developer interface for debugging Spark apps.
For client mode you can also go to the YARN RM UI as mentioned above, or hit the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the master node in EMR and 4040 is the default port (this can be changed).
Additionally, you can access submitted and completed Spark apps via the Spark History Server at its default address: http://master-public-dns-name:18080/
These are the essential resources, with the Spark UI being the main toolkit for your request.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html
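As a related note, the History Server mentioned above only lists applications that write event logs. Below is a minimal sketch of the relevant properties set from the driver side (property names are from the Spark docs; the log directory is a hypothetical example, and EMR normally configures event logging for you already):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("event-log-example")   # placeholder app name
conf.set("spark.eventLog.enabled", "true")           # write an event log for this app
conf.set("spark.eventLog.dir", "hdfs:///var/log/spark/apps")  # example path; must match the
                                                              # directory the History Server reads
sc = SparkContext(conf=conf)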

How to connect to Spark (CDH-5.8 Docker VMs at a remote host)? Do I need to map port 7077 on the container?

Currently, I can access HDFS from inside my application, but instead of running my local Spark I'd also like to use Cloudera's Spark, as it is enabled in Cloudera Manager.
Right now I have HDFS defined in core-site.xml, and I run my app with --master YARN, so I don't need to set the machine address for my HDFS files. This way, my Spark job runs locally and not on the "cluster", and I don't want that for now. When I try to set --master to [namenode]:[port] it does not connect. I wonder if I'm pointing to the correct port, whether I have to map this port in the Docker container, or whether I'm missing something about the YARN setup.
Additionally, I've been testing the SnappyData (Inc.) solution as a Spark SQL in-memory database. My goal is to run the Snappy JVMs locally but redirect the Spark jobs to the VM cluster. The whole idea here is to test performance against a Hadoop implementation. This solution is not a final product (if Snappy is local and Spark is "really" remote, I believe it won't be efficient, but in that scenario I would bring the Snappy JVMs to the same cluster).
Thanks in advance!
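For what it is worth, here is a sketch of the two master settings being discussed, with placeholder hostnames. Connecting to YARN relies on HADOOP_CONF_DIR/YARN_CONF_DIR pointing at the cluster's config files rather than on a host:port URL, while a standalone master listens on port 7077 by default:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("remote-cluster-test")   # placeholder app name

# Option A: run on the cluster's YARN. No host:port here; the cluster is found
# through the Hadoop/YARN config files referenced by HADOOP_CONF_DIR / YARN_CONF_DIR.
conf.setMaster("yarn")

# Option B: a standalone Spark master, which listens on port 7077 by default
# (this is the port that would need to be reachable or mapped from the container).
# conf.setMaster("spark://spark-master-host:7077")

sc = SparkContext(conf=conf)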

Spark 2.1.0 reverse proxy does not work properly

I'm trying to proxy individual Spark applications, meaning I need a single UI per Spark application. To achieve that, I use the Spark reverse proxy feature. So, if my Spark master UI is running at http://localhost:8080, when I click on an application name in this Spark UI, I'm redirected to http://localhost:8080/proxy/{application-id}/jobs/, where application-id is the ID of the Spark application I'm trying to access. Everything looks good: I get the Spark job UI for this particular application and some other tabs are displayed. But when I click on another tab, for instance "Environment", I'm redirected to http://localhost:8080/environment instead of http://localhost:8080/proxy/{application-id}/environment/
This is the single line I added to my spark-defaults.conf file:
spark.ui.reverseProxy=true
I use Spark 2.1.0 in standalone mode and deploy some sample applications to reproduce the issue. Any clue? How can I make this proxy work without this issue? Thanks.
I had this problem.
Make sure that you correctly supply the spark.ui.reverseProxy and spark.ui.reverseProxyUrl properties to all masters, workers and drivers (a sample spark-defaults.conf is sketched below).
In my case I used spark-submit (cluster mode) from a remote machine and forgot to update the local spark-defaults.conf on the machine I was submitting from.
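A sketch of what that could look like in spark-defaults.conf on each machine involved (master, workers, and the machine you submit from); the URL below is only a placeholder for wherever the master UI is actually exposed:

spark.ui.reverseProxy=true
# placeholder: use the externally reachable address of the master UI
spark.ui.reverseProxyUrl=http://localhost:8080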

Datastax Spark Sql Thriftserver with Spark Application

I have an analytics node running with the Spark SQL Thriftserver on it. Now I can't run another Spark application with spark-submit.
It says it doesn't have resources. How do I configure the DSE node so it can run both?
The SparkSqlThriftServer is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
1. Allocate only part of your resources to each application. This is done by setting spark.cores.max to a value smaller than the total cores in your cluster (see the Spark docs and the sketch after this list).
2. Dynamic allocation, which allows applications to change the amount of resources they use depending on how much work they are trying to do (see the Spark docs).
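A rough sketch of option 1 as it could be set from the application side, with the main dynamic-allocation switches from option 2 shown alongside; all values are example numbers, not recommendations:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("capped-app")   # placeholder app name
conf.set("spark.cores.max", "4")              # option 1: never take more than 4 cores in total
conf.set("spark.executor.memory", "4g")       # memory is still reserved per executor

# Option 2 instead of a fixed cap: let the application grow and shrink its executors.
# conf.set("spark.dynamicAllocation.enabled", "true")
# conf.set("spark.shuffle.service.enabled", "true")   # external shuffle service, normally required

sc = SparkContext(conf=conf)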
