Azure Synapse Spark: run a locally hosted Jupyter notebook (e.g. on localhost) against the Spark pool - apache-spark

I have an Apache Spark pool running on web.azuresynapse.net and can use it, for example, to run Spark jobs from my "Synapse Analytics workspace".
I can develop and run Python notebooks against the Spark pool from there.
From what I can see when I run a job, Livy is supported, though the Livy link provided by Azure Synapse is for some reason inaccessible.
How could I connect to that pool, e.g. via Livy, from my local Jupyter notebook and use it? Or is using a pipeline the only way to run Spark code?
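For reference, here is a rough sketch of what driving a Livy endpoint from a local notebook could look like, using the plain REST API with `requests`. The Synapse-specific URL, API version and the Azure AD bearer-token auth are assumptions about how the pool is exposed, so treat this as an illustration of the Livy session flow rather than a verified Synapse recipe:

```
import time
import requests

# Hypothetical Synapse Livy endpoint -- workspace name, pool name and API
# version are placeholders; the real URL and auth requirements may differ.
LIVY_URL = ("https://<workspace>.dev.azuresynapse.net"
            "/livyApi/versions/2019-11-01-preview/sparkPools/<pool>/sessions")
HEADERS = {
    "Authorization": "Bearer <aad-access-token>",  # e.g. obtained via azure-identity
    "Content-Type": "application/json",
}

# 1. Ask Livy to start an interactive PySpark session.
session = requests.post(LIVY_URL,
                        json={"kind": "pyspark", "name": "local-jupyter"},
                        headers=HEADERS).json()
session_url = f"{LIVY_URL}/{session['id']}"

# 2. Wait until the session is idle, then submit a code statement.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(5)

stmt = requests.post(f"{session_url}/statements",
                     json={"code": "spark.range(10).count()"},
                     headers=HEADERS).json()
print(requests.get(f"{session_url}/statements/{stmt['id']}",
                   headers=HEADERS).json())
```

The sparkmagic Jupyter kernels wrap this same Livy protocol, so pointing their configuration at the same endpoint (if it is reachable from your machine) would be the notebook-friendly variant of the above.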

Related

Spark History Server within Jupyterlab

I am running Spark jobs in a Jupyter notebook deployed in an EKS cluster. JupyterLab provides a Spark UI monitoring extension where I can view my Spark jobs by clicking on the "SparkMonitor" tab. I am also trying to access the History Server, which is deployed on a different pod. What is the best way for me to access the History Server? Is there any way I can route to it from within the Jupyter notebook?
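One low-tech possibility, sketched here under the assumptions that `kubectl` is available to the notebook and that the History Server is exposed through a Kubernetes Service (the service name, namespace and port below are placeholders), is to port-forward the UI and reach it through the notebook server's proxy:

```
import subprocess

# Forward the Spark History Server UI (default port 18080) to localhost.
# "svc/spark-history-server" and "-n spark" are hypothetical names.
proc = subprocess.Popen([
    "kubectl", "port-forward",
    "svc/spark-history-server", "18080:18080",
    "-n", "spark",
])

# With the jupyter-server-proxy extension installed, the UI should then be
# reachable in the browser at <notebook-base-url>/proxy/18080/.
```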

Unable to Connect Remote Spark Session with YARN mode on Kubeflow

The main problem is that we are unable to run Spark in client mode.
Whenever we try to connect to Spark in YARN mode from a Kubeflow notebook, we get the following error:
`Py4JJavaError: An error occurred while calling o81.showString.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)`
It seems we have the exact same issue as described here:
Up to now:
we have managed to submit Spark jobs from the notebook.
It is also possible to connect in cluster mode from the Kubeflow notebook.
We have also managed to run a Spark session with the Python shell on one of the worker servers on Kubernetes. We are able to connect to the remote edge node, which is managed by Cloudera.
We have checked that there is no network issue between the Hadoop clusters and the Kubernetes clusters.
However, we still cannot get interactive Spark working from the Jupyter notebook.
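A common cause of exactly this symptom in client mode is that the YARN side cannot reach back to the driver running inside the notebook pod, so the SparkContext gets shut down. As a hedged sketch of the driver-reachability settings usually involved (the host name and port numbers are placeholders for whatever your pod actually exposes):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    # Client mode: the driver lives inside this notebook pod.
    .config("spark.submit.deployMode", "client")
    # Address the Hadoop nodes can use to reach the pod, e.g. a headless
    # Service or a routable IP -- placeholder value.
    .config("spark.driver.host", "notebook-svc.kubeflow.svc.cluster.local")
    # Bind on all interfaces inside the pod.
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Pin the driver and block-manager ports so they can be exposed/opened.
    .config("spark.driver.port", "29413")
    .config("spark.blockManager.port", "29414")
    .getOrCreate()
)
```

If the YARN node managers cannot resolve or route to `spark.driver.host`, the executors die and the job is cancelled with a message very much like the one above.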

PySpark too slow in Google Cloud Dataproc

I deployed a PySpark ML model into a Google Cloud Dataproc cluster and it has been running for over an hour, even though my data is only about 800 MB.
Is there anything I need to declare as the master in my SparkSession? I set the default option 'local'.
When you pass 'local' as the master to SparkContext, it executes your application locally on a single VM. To avoid this, do not pass any master option in the SparkContext constructor; it will then use the properties pre-configured by Dataproc and run your application on YARN, utilizing all cluster resources/nodes.
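In code, the difference is roughly this (a minimal sketch; the application name is arbitrary):

```
from pyspark.sql import SparkSession

# Runs everything on a single VM -- this is what makes the job slow:
# spark = SparkSession.builder.master("local").appName("my-ml-model").getOrCreate()

# Let Dataproc's pre-configured defaults select YARN as the master instead:
spark = SparkSession.builder.appName("my-ml-model").getOrCreate()
```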

How to submit spark job from Airflow server to hadoop cluster

I have installed Airflow on a server. I'm able to ping the Hadoop cluster from the Airflow server. I want to submit a Spark job from the Airflow server to the Hadoop cluster. Can someone list the steps I need to follow for that? Do I need to install the Spark client on the Airflow server?
Apache Livy can be utilized to submit Spark jobs; take a look at the following blog post: Spark Job submission via Airflow Operators.
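For the Livy route, the Airflow Livy provider ships a LivyOperator that wraps batch submission. A rough sketch, assuming the `apache-airflow-providers-apache-livy` package is installed and a `livy_default` connection points at the cluster's Livy endpoint (the DAG name and file path are placeholders):

```
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG("spark_via_livy", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    submit_job = LivyOperator(
        task_id="submit_pyspark_job",
        livy_conn_id="livy_default",    # e.g. http://<livy-host>:8998
        file="hdfs:///jobs/my_job.py",  # placeholder application file
        polling_interval=30,            # poll the Livy batch until it finishes
    )
```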
The simplest way to go about this is via establishing SSH connectivity between the Airflow server and (the edge node of) the Hadoop cluster. Now,
1. Create a SSH Connection from the Airflow UI (under the admin tab).
2. Use the above created connection in your Airflow pipeline via SSHHook.
3. Compose the spark-submit command.
4. Use the outputs of (2) & (3) in the constructor of SSHOperator (see the sketch below).
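A minimal sketch of steps 2-4; the connection id, script path and spark-submit arguments are placeholders for your environment:

```
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

# Step 2: wrap the UI-defined SSH connection in a hook.
ssh_hook = SSHHook(ssh_conn_id="ssh_edge_node")

# Step 3: compose the spark-submit command to run on the edge node.
SPARK_SUBMIT_CMD = (
    "spark-submit --master yarn --deploy-mode cluster "
    "/path/on/edge/node/my_job.py"
)

with DAG("spark_submit_over_ssh", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Step 4: feed the hook and the command to SSHOperator.
    submit_spark_job = SSHOperator(
        task_id="submit_spark_job",
        ssh_hook=ssh_hook,
        command=SPARK_SUBMIT_CMD,
    )
```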

How to schedule a PySpark job in a Jupyter notebook in Microsoft Azure for a Spark cluster?

I am new to Spark. I have developed a PySpark script through the Jupyter notebook interactive UI installed in our HDInsight cluster. As of now I run the code from Jupyter itself, but now I have to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried Oozie but could not figure out how to use it. I have tried saving the notebook, reopening it and running all cells, but that is still a manual process.
Please help me schedule a PySpark job in Microsoft Azure.
I found a discussion about the best practice for running scheduled jobs, crontab-style, with Apache Spark for PySpark, which you might review.
If you don't want to use Oozie, a simple idea is to save the Jupyter notebook locally as a Python script and write a shell script that submits it to HDInsight Spark via Livy, with Linux crontab as the scheduler. For reference, see the links and the sketch below.
IPython Notebook save location
How can I configure pyspark on livy to use anaconda python instead of the default one
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
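As a rough illustration of the Livy-plus-crontab idea (the cluster name, credentials and storage path below are placeholders), the following script posts the exported .py file as a Livy batch to the HDInsight cluster; crontab then just invokes it on whatever schedule you need:

```
# submit_job.py -- submit the exported notebook script as a Livy batch.
import requests

LIVY_URL = "https://<cluster-name>.azurehdinsight.net/livy/batches"

payload = {
    # The script must already be uploaded to storage the cluster can read.
    "file": "wasbs:///example/jobs/my_pyspark_script.py",
    "name": "scheduled-pyspark-job",
}

resp = requests.post(
    LIVY_URL,
    json=payload,
    headers={"Content-Type": "application/json", "X-Requested-By": "admin"},
    auth=("admin", "<cluster-login-password>"),  # HDInsight gateway basic auth
)
print(resp.status_code, resp.json())
```

A crontab entry such as `0 2 * * * /usr/bin/python /home/user/submit_job.py` would then take care of the scheduling.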
Hope it helps.
