I have imported the Databricks jars into my project and have databricks-connect configured against my remote Databricks cluster.
When I run unit tests in IntelliJ, they execute against the remote cluster: whenever I get a Spark session, it connects to the remote server. How do I make the tests run on my local machine instead? Is there any config I need to set so it does not connect to the remote Databricks cluster?
Thanks
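Not an answer from the original thread, but one common approach is to build an explicitly local session in the test setup. A minimal PySpark sketch (the original project may be Scala/Java, and databricks-connect replaces the pyspark package, so this assumes a plain pyspark install is available to the tests; the fixture name is illustrative):

# test_local_spark.py -- minimal sketch: force a local master so the tests
# never reach out to the remote Databricks cluster.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[*]")      # run everything in-process on this machine
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_simple_count(spark):
    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
    assert df.count() == 3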
Related
So I work at a place where I have a laptop, and every day I connect to a remote server over SSH and do everything (run Jupyter notebooks, use PySpark for Spark jobs) on the server.
I want to keep a log of all the server resources I am using when I run my Spark job (memory, CPU usage, etc.).
I thought one way I could do this is by looking at the web UI, but I can't connect to the web UI.
I got all the properties of my driver IP, port, and so on using sc._conf.getAll().
I am running Spark in YARN client mode:
(u'spark.master', u'yarn-client'),
and tried those values in a web browser but could not connect.
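As a starting point (not from the original question), the sketch below prints where the driver's web UI should be reachable; the values will differ per cluster, and reaching the UI from the laptop usually still needs an SSH tunnel to that host and port:

# Minimal sketch: ask the running SparkContext where its web UI lives.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# The URL the driver itself reports for its UI (available in recent Spark versions).
print("Driver-reported UI URL:", sc.uiWebUrl)

# The same information pieced together from the configuration.
host = sc.getConf().get("spark.driver.host")
port = sc.getConf().get("spark.ui.port", "4040")  # 4040 is the default UI port
print("Try http://%s:%s (via an SSH tunnel if the host is remote)" % (host, port))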
I created a cluster in Google Cloud and submitted a Spark job. Then I connected to the UI following these instructions: I created an SSH tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via SSH and run spark-shell, this "job" does show up in the Hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") call that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
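For reference, a minimal sketch (assuming the job uses the SparkSession builder) of leaving the master out of the code so that spark-submit / the Dataproc job submission decides it, which keeps the job on YARN where the Hadoop UI can track it:

# Minimal sketch: no hard-coded .master("local[*]") in the application code.
from pyspark.sql import SparkSession

# The master comes from spark-submit / the cluster's job submission (YARN on
# Dataproc), so the job is tracked by the resource manager and shows up in
# the Hadoop web interface.
spark = (
    SparkSession.builder
    .appName("dataproc-job")  # hypothetical application name
    .getOrCreate()
)

print(spark.range(1000).count())
spark.stop()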
How can I connect Spark from my local machine in Eclipse to a remote HiveServer?
Get a copy of the hive-site.xml from the remote server, and add it to $SPARK_HOME/conf
Then, assuming Spark 2, you need to use the SparkSession.builder.enableHiveSupport() method, and any spark.sql() queries should then be able to communicate with Hive.
Also see my answer here
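A minimal PySpark sketch of what that looks like, assuming hive-site.xml is already in $SPARK_HOME/conf; the database and table names are hypothetical:

# Minimal sketch: a Hive-enabled SparkSession reading remote Hive metadata.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("remote-hive-example")
    .enableHiveSupport()   # picks up hive-site.xml from $SPARK_HOME/conf
    .getOrCreate()
)

# These queries go through the metastore described in hive-site.xml.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()  # hypothetical table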
I am trying to figure out whether it is possible to work locally in Python with the Spark context of a remote EMR cluster (AWS). I've set up the cluster, but a locally defined SparkContext with a remote master doesn't seem to work. Does anybody have experience with that? Working on a remote notebook is limited because you cannot create Python modules and files, and working locally is limited by computing resources. There is the option to SSH to the master node, but then I cannot use a graphical IDE such as PyCharm.
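For context (not an answer from the thread), this is roughly the setup the question describes; the path below is hypothetical, and since EMR runs Spark on YARN, the laptop would need the cluster's Hadoop configuration plus two-way network access to the cluster, which is usually what makes this fail from outside the VPC:

# Sketch of a locally defined SparkContext pointed at a remote EMR/YARN master.
import os
from pyspark import SparkConf, SparkContext

# Hypothetical path to a copy of the EMR cluster's Hadoop configuration files.
os.environ["HADOOP_CONF_DIR"] = "/path/to/emr/hadoop-conf"

conf = SparkConf().setAppName("local-driver-remote-emr").setMaster("yarn")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())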
I am trying to run a PySpark job on a Mesosphere cluster, but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try to submit a PySpark job, I get the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe a Spark job running in client mode needs to connect to the nodes directly, and that this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver runs wherever you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
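A minimal sketch of that environment setup, assuming the variables are exported before the SparkContext (and therefore the driver JVM) starts; the IP address and Mesos master URL are placeholders:

# Sketch: advertise an address the Mesos agents can reach the driver on.
import os
from pyspark import SparkConf, SparkContext

DRIVER_IP = "10.0.0.5"  # hypothetical VPN address reachable from the cluster

os.environ["LIBPROCESS_IP"] = DRIVER_IP   # address libprocess (Mesos) advertises
os.environ["SPARK_LOCAL_IP"] = DRIVER_IP  # address Spark binds driver services to

conf = (
    SparkConf()
    .setAppName("mesos-client-mode")
    .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")  # placeholder
)
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).count())

Equivalently, the two variables can simply be exported in the shell before running spark-submit.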