I am using the isolated mode of Zeppelin's Spark interpreter; in this mode it starts a new job on the Spark cluster for each notebook. I want to kill that job via Zeppelin when the notebook execution is completed. To do this I called sc.stop(), which stopped the SparkContext, and the job was also removed from the Spark cluster. But the next time I try to run the notebook, the SparkContext does not start again. How can I do that?
It's a bit counterintuitive, but you need to use the interpreter menu instead of stopping the SparkContext directly:
Go to the interpreter list.
Find the Spark interpreter and click restart in the upper right corner:
You can restart the interpreter for the notebook in the interpreter bindings (the gear in the upper right hand corner) by clicking the restart icon to the left of the interpreter in question (in this case, the Spark interpreter).
While working with Zeppelin and Spark I also stumbled upon the same problem and did some investigating. After some time, my first conclusion was that:
Stopping the SparkContext can be accomplished by using sc.stop() in a paragraph
Restarting the SparkContext only works by using the UI (Menu -> Interpreter -> Spark Interpreter -> click on restart button)
However, since the UI allows restarting the Spark interpreter via a button press, why not just reverse engineer the API call behind the restart button? It turned out that restarting the Spark interpreter sends the following HTTP request:
PUT http://localhost:8080/api/interpreter/setting/restart/spark
Fortunately, Zeppelin can work with multiple interpreters, one of which is a shell interpreter. Therefore, I created two paragraphs:
The first paragraph was for stopping the SparkContext whenever needed:
%spark
// stop SparkContext
sc.stop()
The second paragraph was for restarting the SparkContext programmatically:
%sh
# restart SparkContext
curl -X PUT http://localhost:8080/api/interpreter/setting/restart/spark
After stopping and restarting the SparkContext with these two paragraphs, I ran another paragraph to check whether the restart worked... and it did! So while this is not an official solution and more of a workaround, it is still legitimate, since we do nothing more than "press" the restart button from within a paragraph.
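The check paragraph itself is not shown above; a minimal sketch of what it could look like, assuming the PySpark interpreter (%pyspark) is bound to the same note and Zeppelin injects sc as usual:
%pyspark
# Sketch of a verification paragraph: both lines fail immediately if the
# restarted interpreter did not hand out a working SparkContext.
print(sc.version)
print(sc.parallelize(range(10)).sum())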
Zeppelin version: 0.8.1
I investigated why sc stops in Spark in yarn-client mode, and found that it is a problem in Spark itself (Spark version >= 1.6). In Spark client mode, the AM connects to the driver via RPC, and there are two connections: it sets up a NettyRpcEndpointRef to the driver's 'YarnSchedulerBackend' service of the 'SparkDriver' server, and the other connection is the 'YarnAM' endpoint.
There are no heartbeats on these RPC connections between the AM and the driver. So the only way the AM knows whether the driver is still connected is the onDisconnected method in the 'YarnAM' endpoint: the disconnect message for the driver/AM connection through the NettyRpcEndpointRef is 'postToAll'-ed through the RPC handler to the 'YarnAM' endpoint. When the TCP connection between them drops, or a keep-alive probe finds the TCP connection dead (which can take about 2 hours on Linux), the AM marks the application as SUCCESS.
So when the driver's monitor process sees the YARN application state change to SUCCESS, it stops the sc.
The root cause, then, is that in Spark client mode there is no retry to reconnect to the driver and check whether it is alive; the YARN application is simply marked finished as quickly as possible. Maybe Spark could fix this issue.
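If you want to detect this situation from a paragraph before re-running anything, a rough PySpark workaround (my own sketch, not an official API) is to fire a trivial action and treat any failure as a dead context:
%pyspark
# Rough probe (a workaround sketch, not official Spark API): a stopped
# SparkContext fails on any action, so run a tiny job and catch the error.
def context_is_alive(sc):
    try:
        sc.parallelize([1]).count()  # tiny job; raises if the context was stopped
        return True
    except Exception:
        return False

print("SparkContext alive:", context_is_alive(sc))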
Related
How to access Spark UI for pyspark jobs?
I am trying to open the Spark UI at localhost:4040 to track my jobs. It loads when I open spark-shell, but when I open pyspark it does not load.
The Spark UI is available only while your Spark session is present. Also, Spark looks for ports starting from 4040 and iterates if it cannot use a port. If you start spark-shell, it will mention at the beginning which port it is using for the Spark UI.
The Spark UI provides a real-time view of your Spark job, and if your job terminates you lose that view. To preserve it, you have to add blocking code at the end of your Spark job, like input(). As Relic16 said, Spark starts from port 4040 and, if that is occupied, tries port 4041 and so on. Also, if you look at the logs carefully, Spark mentions the IP and port there.
use sparkContext.uiWebUrl to get the URL, where sparkContext is an instance of SparkContext
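Putting the last two suggestions together, a minimal PySpark sketch that prints the UI address and then blocks so the UI stays reachable (the input() trick only makes sense for interactive runs; the app name is illustrative):
# Minimal sketch: report the Spark UI address and keep the driver alive.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-demo").getOrCreate()

# uiWebUrl reports the host:port the UI actually bound to (4040, 4041, ...).
print("Spark UI:", spark.sparkContext.uiWebUrl)

# ... run the actual job here ...

# Block until Enter is pressed so the UI stays up, then release everything.
input("Press Enter to stop the application...")
spark.stop()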
I'm running Spark with Apache Zeppelin and Hadoop. My understanding is that Zeppelin is like a kube app that sends commands to a remote machine that is running Spark and accessing files with Hadoop.
I often run into a situation where the Spark Context gets stopped. In the past, I believed it was because I overloaded the system with a data pull that required too much data, but now I'm less enthusiastic about that theory. I've frequently had it happen after running totally reasonable and normal queries.
In order to restart the Spark Context, I've gone to the interpreter binding settings and restarted spark.
I've also run this command
%python
import requests
import json

# Session cookie and folder name (example/redacted values)
JSESSIONID = "09123q-23se-12ae-23e23-dwtl12312"
YOURFOLDERNAME = "[myname]"

cookies = {"JSESSIONID": JSESSIONID}

# List all notebook jobs known to Zeppelin
notebook_response = requests.get("http://localhost:8890/api/notebook/jobmanager", cookies=cookies)
body = json.loads(notebook_response.text)["body"]["jobs"]

# Keep only my notes that are bound to the spark interpreter
notebook_ids = [note["noteId"] for note in body
                if note.get("interpreter") == "spark" and YOURFOLDERNAME in note.get("noteName", "")]

# Restart the spark interpreter for each of those notes
for note_id in notebook_ids:
    requests.put("http://localhost:8890/api/interpreter/setting/restart/spark",
                 data=json.dumps({"noteId": note_id}), cookies=cookies)
I've also gone to the machine running spark and entered yarn top and I don't see my username listed within the list of running applications.
I know that I can get it working if I restart the machine, but that'll also restart the machine for everyone else using it.
What other ways can I restart a Spark Context?
I assume that you have configured your Spark interpreter to run in isolated mode:
In this case you get separate instances for each user:
You can restart your own instance and get a new SparkContext from the interpreter binding menu of a notebook by pressing the refresh button (tested with Zeppelin 0.8.2):
I am using a notebook environment to try out some commands against Spark. Can someone explain how the overall flow works when we run a cell from the notebook? In a notebook environment, which component acts as the driver?
Also, can we call the code snippets we run from a notebook a "Spark Application", or is a code snippet a "Spark Application" only when we use spark-submit to submit it to Spark? Basically, I am trying to find out what qualifies as a "Spark Application".
A notebook environment like Zeppelin creates a SparkContext during the first execution of a cell. Once a SparkContext is created, all further cell executions are submitted to that same SparkContext.
Where the driver program starts depends on whether you're using the Spark cluster in standalone mode or in cluster mode with a resource manager managing your Spark cluster. In standalone mode the driver program is started on the host where the notebook is running; in cluster mode it is started on one of the nodes in the cluster.
You can consider each running SparkContext on the cluster to be a different application. Notebooks like Zeppelin provide the capability to share the same SparkContext across all cells in all notebooks, or you can configure one to be created per notebook.
Most notebooks internally just call spark-submit anyway.
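To make that concrete: a minimal PySpark script submitted with spark-submit is one complete Spark application, while a notebook keeps one such application (one SparkContext) alive and feeds it paragraphs. An illustrative sketch (the file name is made up):
# my_app.py -- submitted with: spark-submit my_app.py
# Everything below runs as exactly one Spark application (one driver).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("minimal-app").getOrCreate()

print(spark.sparkContext.applicationId)  # the application's ID in the cluster UI
print(spark.range(100).count())          # work submitted to the same application

spark.stop()  # the application ends; a notebook would keep the context alive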
I have started working on Spark using Python. I'm working on an application that uses SparkML Linear Regression APIs. When I submit my job in YARN cluster mode, during the execution phase, many pyspark-shell apps get created with YARN as the user. I could see them in the YARN UI. They eventually get finished with succeeded status and my main application which I actually submitted then gets finished with succeeded status. Is this an expected behavior? This is kinda interesting to me since I create the singleton sparkSession instance and use it throughout my application so I don't know why pyspark-shell sessions/apps get created.
The immediate solution would be to use sparkContext instead of sparkSession. But it would be interesting to see your configuration lines to see how you're creating your sessions to be able to tell why multiple apps are being created.
We just updated to Spark 2.2 from Spark 1.6, so we have yet to delve seriously into sparkSessions (which are new in 2+).
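For reference, the usual singleton pattern in PySpark looks like the sketch below (illustrative, not the asker's actual code). getOrCreate() reuses the already-active session rather than starting a new one, which is why a single script normally shows up as a single YARN application:
# Illustrative singleton pattern; names are made up.
from pyspark.sql import SparkSession

def get_spark():
    # getOrCreate() returns the already-active session if there is one,
    # so calling this from many modules still yields a single YARN app.
    return (SparkSession.builder
            .appName("linear-regression-job")
            .getOrCreate())

spark = get_spark()
print(spark.sparkContext.applicationId)  # same ID no matter how often it's called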
I cannot launch a Spark job on Mesos; when it starts, it automatically gives this error:
"Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find
endpoint: spark://CoarseGrainedScheduler#10.32.8.178:59737"
Could it be because of a mismatch between versions? If I launch one of the examples that ships with the distribution, it works perfectly.
Thanks
It works now. It was the application's fault: I had not specified the data input path correctly.
In Mesos you have two deploy options, cluster mode or client mode. I chose cluster mode, and I have a Spark daemon (MesosClusterDispatcher) that is always listening for Spark jobs; this is why I use mesos://spark-mesos-dispatcher.marathon.mesos:7077.
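For completeness, a submission against such a dispatcher typically looks like the comment at the top of this sketch (the dispatcher host is from my setup; the script name and input path are purely illustrative, and getting the input path right was exactly the part I had missed):
# Submitted in cluster mode to the dispatcher, e.g.:
#   spark-submit --master mesos://spark-mesos-dispatcher.marathon.mesos:7077 \
#                --deploy-mode cluster my_job.py hdfs:///data/input
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mesos-cluster-job").getOrCreate()

input_path = sys.argv[1]                    # must point at data that actually exists
print(spark.read.text(input_path).count())  # minimal use of the input

spark.stop()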
Thanks Jacek!