How to host Spark Structured Streams in production - Databricks

Assume I have a notebook running a stream in a cell, such as
...readStream...writeStream....
With an eye on production: is it okay to keep the stream running in a notebook on, e.g., Databricks?
Here it is stated that checkpoints should be used for production, while here it is simply stated that the notebook has to be attached to a cluster.
Is it okay to use a notebook in production (for Structured Streaming)?
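For reference, a minimal sketch of what such a streaming cell could look like with a checkpoint configured (the source format, schema, and all paths below are made up for illustration):

# Minimal Structured Streaming cell with a checkpoint location (paths are hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already provided

# Read a stream of JSON files landing in a directory
events = (spark.readStream
          .format("json")
          .schema("id STRING, ts TIMESTAMP, value DOUBLE")
          .load("/mnt/landing/events"))

# Write the stream with a checkpoint location so the query can recover after a restart
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events_stream")
         .outputMode("append")
         .start("/mnt/tables/events"))

# The cell returns once start() is called; the query keeps running on the cluster
# until it is stopped or the cluster terminates.
# query.awaitTermination()  # typically used when running as a scheduled job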

Related

Spark Structured Streaming with EFS is causing delay in jobs

I'm using Spark Structured Streaming 2.4.4 with Kubernetes. When I enable local checkpointing in a /tmp/ folder, jobs finish in 7-8 s. If EFS is mounted and the checkpoint location is placed on it, jobs take more than 5 minutes and are quite unstable.
Please find the screenshot from the Spark SQL tab.
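One way to see where the time goes per micro-batch is to inspect the streaming query's progress metrics: durationMs breaks each trigger down into phases such as walCommit, which covers the offset-log writes that hit the checkpoint directory. A rough sketch, assuming query is the handle returned by writeStream.start():

# Timing breakdown of the most recent micro-batch (None until a batch has completed)
progress = query.lastProgress
if progress:
    for phase, ms in sorted(progress["durationMs"].items(), key=lambda kv: -kv[1]):
        print(phase, ms)
# A large 'walCommit' or 'addBatch' relative to the rest would point at slow checkpoint I/O.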

Which of my Databricks notebooks uses the cluster nodes?

I run several notebooks on an Azure Databricks Spark cluster at the same time.
How can I see the cluster node usage rate of each notebook / app over a period of time?
Neither the "Spark Cluster UI - Master" nor the "Spark UI" tab provides such information.
There is no automated or built-in support today for isolating the usage of particular notebooks on Databricks.
That being said, one approach would be to use the Ganglia Metrics available for Databricks clusters.
If you run both notebooks at the same time it will be difficult to discern which is responsible for a specific quantity of usage. I would recommend running one notebook to completion and taking note of the usage on the cluster. Then, run the second notebook to completion and observe its usage. You can then compare the two and have a baseline for how each one is utilizing resources on the cluster.
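A complementary trick, not specific to Databricks, is to tag the jobs each notebook submits so that activity in the Spark UI can at least be attributed to a notebook. A sketch using the standard SparkContext API; the group name and description are made up:

# Run near the top of each notebook so its jobs are labelled in the Spark UI
spark.sparkContext.setJobGroup("reporting-notebook", "Jobs submitted by the reporting notebook")

# Any action executed afterwards in this notebook carries the group and description,
# which show up on the Spark UI's Jobs page and make per-notebook activity easier to spot.
df = spark.range(1_000_000)
print(df.count())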

Spark and HDFS on Kubernetes data locality

I'm trying to run Spark on K8s and struggling a bit with data locality. I'm using the native Spark support, but I just watched https://databricks.com/session/hdfs-on-kubernetes-lessons-learned. I've followed the steps there in setting up my HDFS cluster (namenode on the first K8s node, using host networking). Does anyone know whether the fix to the Spark driver presented there has been merged into the mainline Spark code?
I ask because I still see ANY locality in places where I'd expect NODE_LOCAL.
The code has been part of version v2.2.0-kubernetes-0.4.0.

How does a notebook send code to Spark?

I am using a notebook environment to try out some commands against Spark. Can someone explain how the overall flow works when we run a cell from the notebook? In a notebook environment, which component acts as the driver?
Also, can we call the code snippets we run from a notebook a "Spark Application", or do we call a code snippet a "Spark Application" only when we use spark-submit to submit it to Spark? Basically, I am trying to find out what qualifies as a "Spark Application".
A notebook environment like Zeppelin creates a SparkContext during the first execution of a cell. Once the SparkContext is created, all further cell executions are submitted to that same SparkContext.
Where the driver program runs depends on the deploy mode: in client mode the driver program is started on the host where the notebook is running, while in cluster mode, where a resource manager places the driver, it is started on one of the nodes in the cluster.
You can consider each running SparkContext on the cluster a different application. Notebooks like Zeppelin provide the capability to share the same SparkContext across all cells in all notebooks, or you can configure one to be created per notebook.
Most notebook environments internally call spark-submit.
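A quick way to convince yourself that all cells share one application: the application ID reported by the SparkContext stays the same across cell executions. A sketch (the app name is illustrative):

# Cell 1: first use of Spark creates (or gets) the session and its SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook-demo").getOrCreate()
print(spark.sparkContext.applicationId)

# Cell 2: getOrCreate() returns the same session, so the application ID is unchanged
spark2 = SparkSession.builder.getOrCreate()
print(spark2.sparkContext.applicationId == spark.sparkContext.applicationId)  # True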

How to connect to Spark EMR from the locally running Spark Shell

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or on the EMR cluster.
Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?
It looks like others have also failed at this and ended up running the Spark driver on EMR instead, making use of e.g. Zeppelin or Jupyter running on EMR.
Setting up our own machines as Spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this turned out to be impossible, and we gave up after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: If you go down this route, I would consider using S3 for storing notebooks; then you don't have to manage EBS volumes.
One way of doing this is to add your Spark job as an EMR step to your EMR cluster. For this, you need the AWS CLI installed on your local computer (see here for the installation guide) and your jar file on S3.
Once you have the AWS CLI, assuming the Spark class to run is com.company.my.MySparkJob and your jar file is located on S3 at s3://hadi/my-project-0.1.jar, you can run the following command from your terminal:
aws emr add-steps --cluster-id j-************* --steps Type=Spark,Name=My_Spark_Job,Args=[--class,com.company.my.MySparkJob,s3://hadi/my-project-0.1.jar],ActionOnFailure=CONTINUE
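For reference, the same step can be added programmatically; a Type=Spark step effectively runs spark-submit via command-runner.jar. A hedged sketch using boto3 with the same example class and jar path (region and cluster ID are placeholders):

import boto3

# Assumes AWS credentials and permissions are already configured locally
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # your EMR cluster ID
    Steps=[{
        "Name": "My_Spark_Job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR runs spark-submit through command-runner
            "Args": [
                "spark-submit",
                "--class", "com.company.my.MySparkJob",
                "s3://hadi/my-project-0.1.jar",
            ],
        },
    }],
)
print(response["StepIds"])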
