Azure Synapse Apache Spark: Pipeline-level Spark configuration

I am trying to configure Spark for an entire Azure Synapse pipeline. I found Spark session config magic command and How to set Spark / Pyspark custom configs in Synapse Workspace spark pool. The %%configure magic command works fine for a single notebook. Example:
Insert a cell with the content below at the beginning of the notebook:
%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "32g",
    "executorCores": 4,
    "numExecutors": 5
}
Then the following emits the expected values:
spark_executor_instances = spark.conf.get("spark.executor.instances")
print(f"spark.executor.instances {spark_executor_instances}")
spark_executor_memory = spark.conf.get("spark.executor.memory")
print(f"spark.executor.memory {spark_executor_memory}")
spark_driver_memory = spark.conf.get("spark.driver.memory")
print(f"spark.driver.memory {spark_driver_memory}")
However, if I add that notebook as the first activity in an Azure Synapse pipeline, the Apache Spark application that executes that notebook gets the correct configuration, while the rest of the notebooks in the pipeline fall back to the default configuration.
How can I configure Spark for the entire pipeline? Should I copy-paste the %%configure cell above into each and every notebook in the pipeline, or is there a better way?

Yes, as far as I know this is the only well-known option: you need to define %%configure -f at the beginning of each notebook in order to override the default settings for your job.
Alternatively, you can go to the Spark pool in the Azure portal and set the configuration at the pool level by uploading a text file of Spark properties, which looks like this:
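A representative pool-level properties file, one "property value" pair per line (the values here simply mirror the %%configure example above):
spark.driver.memory 28g
spark.driver.cores 4
spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 5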
Please refer to this third-party article for more details.
Moreover, it looks like one cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are reserved nevertheless.

Related

Azure Databricks cluster spark configuration is disabled

When creating an Azure Databricks workspace and configuring its cluster, I chose Python and SQL as the default languages for Spark. But now I want to add Scala as well. When running a Scala script I got the error below. So my online search took me to an article describing that you can change the cluster configuration by going to the Advanced options section of the cluster settings page and clicking on the Spark tab there. But I find the Spark section there greyed out (disabled):
Question: How can I enable the Spark section of the Advanced options on the cluster settings page so I can edit the last line of that section? Note: I created the Databricks workspace and its cluster, so I am the admin.
Databricks Notebook error: Your administrator has only allowed sql and python commands on this cluster.
You need to click the "Edit" button in the cluster controls; after that you should be able to change the Spark configuration. But you can't enable Scala on High Concurrency clusters with credential passthrough, as they support only Python and SQL (doc); the primary reason is that with Scala you can bypass user isolation.
If you need credential passthrough plus Scala, then you need to use a Standard cluster, but it will work only with a single specific user (doc).
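For reference, the line to edit in that Spark config box is the one restricting notebook languages; it typically looks like the following (the exact value shown is an assumption):
spark.databricks.repl.allowedLanguages sql,python,scala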

How to get cluster details like clusterID from code running on an HDInsight cluster

I need the ClusterInner object from the Azure API, or some cluster information such as the cluster ID.
But to get the ClusterInner object or the cluster ID I need to provide an authentication object to the API, and this code will be running on the same HDInsight cluster, so ideally it won't ask for credentials or rely on environment variables (my Spark job is already running on this cluster, and the job needs this information).
Is there any API or alternative way to get this information from within the running HDInsight cluster?
Editing my answer as per the comment.
This particular method is not a clean way, but you can get the details.
Note: This is applicable only to HDInsight clusters.
The deployment details can be extracted only from the head nodes. You have to look for the field server.jdbc.database_name in /etc/ambari-server/conf/ambari.properties. For example, if its value is v40e8b2c1e26279460ca3e8c0cbc75af8f8AmbariDb, you can trim the first 3 characters and the last 8 characters of the string; what is left is your cluster ID.
You can use a Linux script within your job to extract the details from the file.
Below is the shell script:
#!/bin/bash
# Read the database name from ambari.properties (head nodes only).
string=$(sed -n 's/server.jdbc.database_name=//p' /etc/ambari-server/conf/ambari.properties)
# Keep 32 chars after the 3-char prefix, dropping the "AmbariDb" suffix.
POS=3
LEN=32
clusterid=${string:$POS:$LEN}
echo "$clusterid"  # print the id so a caller can capture it
You can embed the script in Python or Java. I am using Python to achieve this:
import subprocess

# Run the shell script and capture the cluster id it prints.
clusterid = subprocess.check_output(['sh', '/path/to/script.sh']).decode().strip()
print(clusterid)
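Alternatively, a minimal Python-only sketch (assuming the same ambari.properties path and the prefix/suffix convention described above) avoids the shell script entirely:
# Derive the cluster id by parsing ambari.properties directly.
with open('/etc/ambari-server/conf/ambari.properties') as f:
    for line in f:
        if line.startswith('server.jdbc.database_name='):
            db_name = line.strip().split('=', 1)[1]
            clusterid = db_name[3:-8]  # drop 3-char prefix and 'AmbariDb' suffix
            print(clusterid)
            break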

How to set apache spark config to run in cluster mode as a databricks job

I have developed an Apache Spark app and compiled it into a jar, and I want to run it as a Databricks job. So far I have been setting master=local to test. What should I set this property (or others) to in the Spark config for it to run in cluster mode on Databricks? Note that I do not have a cluster created in Databricks; I only have a job that will run on demand, so I do not have the URL of a master node.
For a Databricks job, you do not need to set master to anything. You only need to do the following:
val spark = SparkSession.builder().getOrCreate()
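The same applies from PySpark; a minimal sketch that keeps a local master only for test runs outside Databricks (the RUN_LOCAL environment flag is a hypothetical convention, not a Databricks setting):
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("my-app")
if os.environ.get("RUN_LOCAL"):  # hypothetical flag for local testing only
    builder = builder.master("local[*]")
spark = builder.getOrCreate()  # on Databricks the job cluster supplies the master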

Configure external jars with HDI Jupyter Spark (Scala) notebook

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session.
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$@$account$.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case, as I was able to successfully include a Spark package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question: this can work if you put the jar in the default storage account of your HDI cluster.
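For example, a sketch assuming the jar was uploaded to the default container (which a wasb:/// path with no account resolves to):
%%configure -f
{"jars": ["wasb:///folder/my-custom-jar.jar"]}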
HTH!
In case people come here for adding jars on EMR:
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
Contrary to the documentation, using jars directly won't work.

How to submit Apache Spark job to Hadoop YARN on Azure HDInsight

I am very excited that HDInsight switched to Hadoop version 2, which supports Apache Spark through YARN. Apache Spark is a much better-fitting parallel programming paradigm than MapReduce for the task I want to perform.
I was unable to find any documentation, however, on how to do remote job submission of an Apache Spark job to my HDInsight cluster. For remote job submission of standard MapReduce jobs I know there are several REST endpoints like Templeton and Oozie. But as far as I was able to find, running Spark jobs is not possible through Templeton. I did find it possible to incorporate Spark jobs into Oozie, but I've read that this is a very tedious thing to do, and I've also read some reports of job-failure detection not working in this case.
There probably is a more appropriate way to submit Spark jobs. Does anyone know how to do remote job submission of Apache Spark jobs to HDInsight?
Many thanks in advance!
You can install Spark on an HDInsight cluster. You have to do it by creating a custom cluster and adding a script action that installs Spark on the cluster at the time it creates the VMs for the cluster.
Installing with a script action at cluster creation is pretty easy; you can do it in C# or PowerShell by adding a few lines of code to a standard custom cluster-creation script/program.
powershell:
# ADD SCRIPT ACTION TO CLUSTER CONFIGURATION
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1
C#:
// ADD THE SCRIPT ACTION TO INSTALL SPARK
clusterInfo.ConfigActions.Add(new ScriptAction(
    "Install Spark",                                    // name of the config action
    new ClusterNodeType[] { ClusterNodeType.HeadNode }, // nodes to install Spark on
    new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // location of the install script
    null                                                // the script requires no parameters
));
You can then RDP into the head node and use spark-shell, or use spark-submit to run jobs. I am not sure how you would run a Spark job without RDPing into the head node, but that is another question.
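For instance, once on the head node, a submission might look like this (the class name and jar path are placeholders):
spark-submit --master yarn --deploy-mode cluster --class com.example.MySparkApp /path/to/my-spark-app.jar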
I also asked the same question of the Azure guys. The following is the solution from them:
"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li
Currently, this functionality is not supported. One workaround is to build the job-submission web service yourself:
Create a Scala web service that will use the Spark APIs to start jobs on the cluster.
Host this web service in a VM inside the same VNet as the cluster.
Expose the web service endpoint externally through some authentication scheme. You could also employ an intermediate MapReduce job, though it would take longer.
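The Livy route referenced in the previous question ("Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy") is one concrete way to get such an endpoint; a minimal sketch of a remote batch submission (cluster name, credentials, jar path, and class name are all placeholders):
import requests

# POST a batch job to Livy through the cluster's public HTTPS gateway.
url = "https://mycluster.azurehdinsight.net/livy/batches"  # placeholder cluster name
payload = {
    "file": "wasb:///jars/my-spark-app.jar",  # jar in the cluster's default storage
    "className": "com.example.MySparkApp",    # placeholder main class
}
resp = requests.post(
    url,
    json=payload,
    auth=("admin", "password"),           # cluster HTTP login (placeholder)
    headers={"X-Requested-By": "admin"},  # needed when Livy's CSRF protection is on
)
print(resp.json())  # returns the batch id and its current state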
You might consider using Brisk (https://brisk.elastatools.com), which offers Spark on Azure as a provisioned service (with support available). There's a free tier, and it lets you access blob storage with a wasb://path/to/files just like HDInsight.
It doesn't sit on YARN; instead, it is a lightweight, Azure-oriented distribution of Spark.
Disclaimer: I work on the project!
Best wishes,
Andy