Can I start another cluster from the current notebook in Databricks? - azure

I have notebook1 assigned to cluster1 and notebook2 assigned to cluster2.
I want to trigger notebook2 from notebook1 but notebook2 should use only cluster2 for execution.
Currently it gets triggered using cluster1.
Please let me know if you need more information.

Unfortunately, you cannot start another cluster from the current notebook.
This is expected behaviour: when you trigger notebook2 from notebook1, it will use cluster1 and not cluster2.
Reason: any command you run from notebook1 always runs on the cluster that notebook1 is attached to.
Notebooks cannot be statically assigned to a cluster; the attachment is runtime state only. If you want to run some code on a different cluster (in this case, the code is a notebook), your first notebook has to submit it as a separate job rather than using dbutils.notebook.run or %run.
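For example, here is a minimal sketch of that approach using the Jobs API runs/submit endpoint, assuming you can create a personal access token; the workspace URL, token, cluster ID, and notebook path below are placeholders:

import requests

# Sketch: submit a one-time run of notebook2 pinned to cluster2 via the
# Databricks Jobs API (runs/submit). All values below are placeholders.
workspace = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "run_name": "run-notebook2-on-cluster2",
    "existing_cluster_id": "<cluster2-id>",
    "notebook_task": {"notebook_path": "/Users/you/notebook2"},
}
resp = requests.post(
    f"{workspace}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())  # returns a run_id you can poll for status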
Hope this helps.

Related

Unable to set environment variables in Spark using livy and sparkmagic

Scenario:
I have set up a Spark cluster in my Kubernetes environment:
Livy pod for submission of jobs
Spark master pod
Spark worker pod for execution
What I want to achieve is as follows:
I have a Jupyter notebook with a PySpark kernel running as a pod in the same environment. On execution of a cell, a Spark session is created and all my code gets executed through Livy POST requests to /statements. I was able to achieve this.
Note: there is no YARN, HDFS, or Hadoop in my environment. I have used only Kubernetes, Spark standalone, and Jupyter.
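For context, here is roughly what sparkmagic does under the hood when a cell runs (a sketch only; the Livy service URL and session id are placeholders for my environment):

import requests

# Sketch: submit one code statement to an existing Livy session, the same
# call sparkmagic issues when a notebook cell executes.
livy_url = "http://livy-service:8998"
session_id = 0
resp = requests.post(
    f"{livy_url}/sessions/{session_id}/statements",
    json={"code": "print(spark.version)"},
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # contains the statement id and, later, its output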
Issue:
Now, when I run my PySpark code and it gets executed on the Spark worker, I would like to send the following into that execution environment:
the environment variables I have used in the notebook
the pip packages I have used in the notebook
or a custom virtualenv in which I could provide all the packages used together
I am unable to do this.
Things I have tried so far:
Since I am using sparkmagic, I have tried to set environment variables in the following ways, found in the documentation and other answers:
%%configure
{
  "conf": {
    "spark.executorEnv.TESTVAR": "<value>",
    "spark.appMasterEnv.TESTVAR": "<value>",
    "spark.driver.TESTVAR": "<value>",
    "spark.driverenv.TESTVAR": "<value>",
    "spark.kubernetes.driverenv.TESTVAR": "<value>",
    "spark.kubernetes.driver.TESTVAR": "<value>",
    "spark.yarn.executorEnv.TESTVAR": "<value>",
    "spark.yarn.appMasterEnv.TESTVAR": "<value>",
    "spark.workerenv.TESTVAR": "<value>"
  }
}
I have bunched these up here for reference; I tried each of the above options individually.
I have also tried hitting the Livy pod's service name directly with a normal POST request, but still no luck: the variables are not detected.
After this, I tried setting the same properties manually in spark-defaults.conf on the Spark cluster, but that did not work either.
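To illustrate how I check whether the variable has landed anywhere, something like the following sketch (assuming spark is the session created by sparkmagic) reads TESTVAR on the driver and on an executor, and both come back empty:

import os

# Check the driver environment inside the Livy session.
print("driver:", os.environ.get("TESTVAR"))

# Check an executor environment; the lambda runs on a worker.
print("executors:",
      spark.sparkContext.parallelize([0])
           .map(lambda _: os.environ.get("TESTVAR"))
           .collect())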
Would appreciate any help on the matter. Also, this is my first SO question, so please let me know in case of issues.

Using Databricks for Twitter sentiment analysis - issue running the official tutorial

I am starting to use Databricks and tried to implement one of the official tutorials (https://learn.microsoft.com/en-gb/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services) from the website. However, I ran into an issue - not even sure if I can call it an issue - when I run the second notebook (analysetweetsfromeventhub): all subsequent commands (2nd, 3rd, 4th ...) sit in the "waiting to run" state and never execute. Any idea what the cause might be? Thanks.
After you cancel a running streaming cell in a notebook attached to a Databricks Runtime cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook.
Note that this issue occurs only when you cancel a single cell; it does not apply when you run all and cancel all cells.
In the meantime, you can do either of the following:
To remediate an affected notebook without restarting the cluster, go to the notebook's Clear menu and select Clear State.
If restarting the cluster is acceptable, you can solve the issue by turning off idle context tracking. Set the following Spark configuration value on the cluster:
spark.databricks.chauffeur.enableIdleContextTracking false
Then restart the cluster.

Get information about the current dataproc cluster created after workflow submission

Suppose I run a PySpark job using a Dataproc workflow template and an ephemeral cluster... How can I get the name of the created cluster from inside my PySpark job?
One way would be to fork out and run this command:
/usr/share/google/get_metadata_value attributes/dataproc-cluster-name
The only output will be the cluster name, with no newline characters or anything else to clean up. See Running shell command and capturing the output
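A minimal sketch of that fork-and-capture approach from inside the PySpark job (assuming the metadata helper script is present on the Dataproc image, as it normally is):

import subprocess

# Shell out to the Dataproc metadata helper and capture the cluster name.
cluster_name = subprocess.check_output(
    ["/usr/share/google/get_metadata_value", "attributes/dataproc-cluster-name"]
).decode("utf-8").strip()

print(cluster_name)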

How does a notebook send code to Spark?

I am using a notebook environment to try out some commands against Spark. Can someone explain how the overall flow works when we run a cell from the notebook? In a notebook environment, which component acts as the driver?
Also, can we call the code snippets we run from a notebook a "Spark Application", or is something a "Spark Application" only when we use spark-submit to submit it to Spark? Basically, I am trying to find out what qualifies as a "Spark Application".
A notebook environment like Zeppelin creates a SparkContext during the first execution of a cell. Once a SparkContext is created, all further cell executions are submitted to that same SparkContext.
Where the driver program starts depends on whether you use the Spark cluster in client mode or in cluster mode, where a resource manager is managing your Spark cluster. In client mode the driver program is started on the host where the notebook is running; in cluster mode it is started on one of the nodes in the cluster.
You can consider each running SparkContext on the cluster a different application. Notebooks like Zeppelin provide the ability to share the same SparkContext across all notebooks, or you can configure one to be created per notebook.
Most notebook environments internally call spark-submit.
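As a rough sketch of what a notebook cell effectively does on first execution (assuming a PySpark notebook; later cells calling getOrCreate() receive the same session and SparkContext):

from pyspark.sql import SparkSession

# First cell: create (or reuse) the session backing the notebook.
spark = SparkSession.builder.appName("notebook-session").getOrCreate()

# Any later cell that calls getOrCreate() gets the same SparkContext,
# so the application id stays the same across cells.
print(spark.sparkContext.applicationId)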

Zeppelin notebook: execute without manual interaction

Is there a way to execute the Spark code in a Zeppelin notebook without having to do it interactively? I'm looking for something specific, or if anyone could point me in the right direction. Alternatively, are there other ways to submit Spark code that currently lives in a Zeppelin notebook? The reason I can't use spark-submit is that there is no command-line access, for security reasons.
Zeppelin provides a REST API which, among other functions, can be used to run individual paragraphs, either synchronously
http://[zeppelin-server]:[zeppelin-port]/api/notebook/run/[noteId]/[paragraphId]
or asynchronously
http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[noteId]/[paragraphId]
as well as the whole notebook:
http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[noteId]
It is also possible to define CRON jobs, both from notebook itself and from the REST API.
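For example, a minimal sketch of triggering a whole note asynchronously over that API (the Zeppelin host, port, and note id are placeholders):

import requests

# Sketch: start an asynchronous run of every paragraph in a note via the
# Zeppelin REST API. Host, port, and note id are placeholders.
zeppelin = "http://zeppelin-server:8080"
note_id = "<noteId>"

resp = requests.post(f"{zeppelin}/api/notebook/job/{note_id}")
print(resp.status_code, resp.json())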
