I'm running Spark with Apache Zeppelin and Hadoop. My understanding is that Zeppelin is like a kube app that sends commands to a remote machine that's running Spark and accessing files with Hadoop.
I often run into a situation where the Spark Context gets stopped. In the past, I believed it was because I overloaded the system with a data pull that required too much data, but now I'm less enthusiastic about that theory. I've frequently had it happen after running totally reasonable and normal queries.
In order to restart the Spark Context, I've gone to the interpreter binding settings and restarted the Spark interpreter.
I've also run this command:
%python
# Session cookie copied from the browser (placeholder value)
JSESSIONID = "09123q-23se-12ae-23e23-dwtl12312"
YOURFOLDERNAME = "[myname]"
import requests
import json
cookies = {"JSESSIONID": JSESSIONID}
# List the jobs known to the Zeppelin job manager
notebook_response = requests.get('http://localhost:8890/api/notebook/jobmanager', cookies=cookies)
body = json.loads(notebook_response.text)["body"]["jobs"]
# Keep only the Spark notebooks in my folder
notebook_ids = [(note["noteId"]) for note in body if note.get("interpreter") == "spark" and YOURFOLDERNAME in note.get("noteName", "")]
# Restart the Spark interpreter for each of those notebooks
for note_id in notebook_ids:
    requests.put("http://localhost:8890/api/interpreter/setting/restart/spark", data=json.dumps({"noteId": note_id}), cookies=cookies)
I've also gone to the machine running Spark and run yarn top, and I don't see my username in the list of running applications.
I know that I can get it working if I restart the machine, but that'll also restart the machine for everyone else using it.
What other ways can I restart a Spark Context?
I assume that you have configured your Spark interpreter to run in isolated mode:
In this case you get separate instances for each user:
You can restart your own instance and get a new SparkContext from the interpreter binding menu of a notebook by pressing the refresh button (tested with Zeppelin 0.82):
Related
Is there any way to debug a Spark application that is running in cluster mode? I have a program that has been running successfully for a while, which processes a couple hundred GB at a time. Recently some data caused the run to fail because executors were disconnected. From what I have read, this is likely a memory issue. I'm trying to determine which function/action triggers the memory issue. I am using Spark on an EMR cluster (which uses YARN); what would be the best way to debug this issue?
For cluster mode you can go to the YARN Resource Manager UI and select the Tracking UI for your specific running application (which points to the spark driver running on the Application Master within the YARN Node Manager) to open up the Spark UI which is the core developer interface for debugging spark apps.
For client mode you can also go to the YARN RM UI as mentioned above, or reach the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the Master Node in EMR and 4040 is the default port (this can be changed).
Additionally you can access submitted and completed spark apps via the Spark History Server via this default address => http://master-public-dns-name:18080/
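The History Server also exposes the application list as JSON through Spark's monitoring REST API under /api/v1/applications, which can help when browsing the UI through a tunnel is awkward. A minimal sketch, assuming the default port 18080; the helper names are made up for illustration, and the "attempts"/"completed" fields are read defensively in case your Spark version reports them differently:

```python
import json
from urllib.request import urlopen


def applications_url(host, port=18080):
    """Build the History Server REST endpoint that lists applications."""
    return f"http://{host}:{port}/api/v1/applications"


def incomplete_apps(apps):
    """Return (id, name) for applications with at least one unfinished attempt."""
    result = []
    for app in apps:
        attempts = app.get("attempts", [])
        if any(not attempt.get("completed", True) for attempt in attempts):
            result.append((app["id"], app["name"]))
    return result


def fetch_applications(host, port=18080, timeout=10):
    """Fetch and decode the application list (needs network access to the master)."""
    with urlopen(applications_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```

Calling incomplete_apps(fetch_applications("master-public-dns-name")) would then list the applications that are still running, provided the host is reachable from where you run it.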
These are the essential resources with the Spark UI being the main toolkit for your request.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html
I created a cluster in Google Cloud and submitted a spark job. Then I connected to the UI following these instructions: I created an ssh tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via ssh and run spark-shell, this "job" does show up in the hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
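A master set in code takes precedence over the one passed by spark-submit, which is why this is easy to miss. A rough sketch of a check that flags such lines before submitting; the function name and the regular expression are my own and only cover the common .master("local...") and .setMaster("local...") spellings:

```python
import re

# Matches .master("local...") or .setMaster('local...'), but not e.g. "localhost"
LOCAL_MASTER = re.compile(r"""\.(?:master|setMaster)\(\s*["']local\b""")


def find_local_master(source):
    """Return (line_number, stripped_line) pairs that hard-code a local master."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if LOCAL_MASTER.search(line)
    ]
```

Running it over the job's source file before each submit is enough to catch a leftover local master from testing.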
I am working with Spark and Cassandra and in general things are straightforward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at using the Spark Job Server; I have the Job Server up and running and working as expected for both the test items and some initial, simple .scala programs I developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent to the spark shell package switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) on the Spark Job Server so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt compile) I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message board advice, but with no real change. If that is the answer I'm after, what should be added to the base Job Server to enable use of the Cassandra connector?
I think you need spark-submit to do this. I am also working with Spark and Cassandra, but only for the past month, so I've had to read a lot of information. I have compiled this information in a repository; maybe it could help you. It is an alpha version, though, sorry about that.
I am using the isolated mode of Zeppelin's Spark interpreter; in this mode it starts a new job in the Spark cluster for each notebook. I want to kill the job via Zeppelin when the notebook execution is completed. For this I called sc.stop, which stopped the SparkContext and also stopped the job in the Spark cluster. But the next time I try to run the notebook, it does not start the SparkContext again. How can I do that?
It's a bit counterintuitive, but you need to use the interpreter menu tab instead of stopping the SparkContext directly:
Go to the interpreter list.
Find the Spark interpreter and click restart in the upper right corner:
You can restart the interpreter for the notebook in the interpreter bindings (gear in upper right hand corner) by clicking on the restart icon to the left of the interpreter in question (in this case it would be the spark interpreter).
While working with Zeppelin and Spark I stumbled upon the same problem and did some investigating. After some time, my first conclusion was that:
Stopping the SparkContext can be accomplished by using sc.stop() in a paragraph
Restarting the SparkContext only works by using the UI (Menu -> Interpreter -> Spark Interpreter -> click on restart button)
However, since the UI allows restarting the Spark interpreter with a button press, why not just reverse engineer the API call behind the restart button? It turns out that restarting the Spark interpreter sends the following HTTP request:
PUT http://localhost:8080/api/interpreter/setting/restart/spark
Fortunately, Zeppelin can work with multiple interpreters, one of which is a shell interpreter. Therefore, I created two paragraphs:
The first paragraph was for stopping the SparkContext whenever needed:
%spark
// stop SparkContext
sc.stop()
The second paragraph was for restarting the SparkContext programmatically:
%sh
# restart SparkContext
curl -X PUT http://localhost:8080/api/interpreter/setting/restart/spark
After stopping and restarting the SparkContext with the two paragraphs, I ran another paragraph to check whether restarting worked... and it worked! So while this is no official solution and more of a workaround, it is still legit, as we do nothing other than "pressing" the restart button from within a paragraph!
Zeppelin version: 0.8.1
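The same request can be sent from a %python paragraph instead of %sh. A sketch using only the standard library; the function names are hypothetical, and the optional noteId body (to restart only the instance bound to one note) is my reading of the Zeppelin REST API, so treat it as an assumption. A secured Zeppelin would additionally need a session cookie, which is not handled here:

```python
import json
from urllib.request import Request, urlopen


def build_restart_request(base_url="http://localhost:8080", interpreter="spark", note_id=None):
    """Build the PUT request that the restart button sends."""
    url = f"{base_url}/api/interpreter/setting/restart/{interpreter}"
    # Only send a JSON body when a specific note is targeted
    data = json.dumps({"noteId": note_id}).encode("utf-8") if note_id else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, headers=headers, method="PUT")


def restart_interpreter(**kwargs):
    """Send the request and return the HTTP status code."""
    with urlopen(build_restart_request(**kwargs), timeout=30) as resp:
        return resp.status
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before anything is actually sent.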
I investigated why sc stops in Spark in yarn-client mode. I found that it is a problem in Spark itself (Spark version >= 1.6). In Spark client mode, the AM connects to the driver via RPC, and there are two connections. It sets up a NettyRpcEndPointRef to connect to the driver's 'YarnSchedulerBackEnd' service on the 'SparkDriver' server, and the other connection is the endpoint 'YarnAM'.
In these RPC connections between the AM and the driver, there are no heartbeats. So the only way the AM knows whether the driver is still connected is the OnDisconnected method in the endpoint 'YarnAM'. The disconnect message for the driver-AM connection through NettyRpcEndPointRef is passed by 'postToAll' through the RPCHandler to the endpoint 'YarnAM'. When the TCP connection between them is dropped, or a keep-alive message finds that the TCP connection is no longer alive (maybe after 2 hours on a Linux system), it marks the application as SUCCESS.
So when the driver monitor process finds that the YARN application state has changed to SUCCESS, it stops the sc.
So the root cause is that, in Spark client mode, there is no retry to reconnect to the driver and check whether it is still alive; the YARN application is simply marked as finished as quickly as possible. Maybe Spark can fix this issue.
I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a spark job running in client mode needs to connect to the nodes directly and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
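One way to apply this without changing the shell profile is to build the environment in Python and pass it to the process that launches the driver. A small sketch; the helper name and the example address are hypothetical:

```python
import os


def driver_env(driver_ip, base_env=None):
    """Return a copy of the environment with both variables set to driver_ip."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_IP"] = driver_ip  # address Mesos agents use to reach the framework
    env["SPARK_LOCAL_IP"] = driver_ip  # address Spark binds to on the driver host
    return env
```

It could then be used as subprocess.run(["spark-submit", "job.py"], env=driver_env("10.0.0.5")), where job.py and 10.0.0.5 stand in for your own script and the driver's routable IP.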