When creating an Azure Databricks workspace and configuring its cluster, I chose the default languages for Spark to be Python and SQL. Now I want to add Scala as well. When running a Scala script I got the error below. My online search took me to this article, which describes that you can change the cluster configuration by going to the Advanced options section of the cluster settings page and clicking on the Spark tab there (as shown in the image below). But I find the Spark section there greyed out (disabled):
Question: How can I enable the Spark section of the Advanced options of the cluster settings page (shown in the image below) so I can edit the last line of the section? Note: I created the Databricks workspace and its cluster, and hence I am the admin (as shown in image 2 below).
Databricks Notebook error: Your administrator has only allowed sql and python commands on this cluster.
You need to click the "Edit" button in the cluster controls - after that you should be able to change the Spark configuration. But you can't enable Scala on High Concurrency clusters with credential passthrough, as they support only Python & SQL (doc) - the primary reason is that with Scala you can bypass user isolation.
If you need credential passthrough + Scala, then you need to use a Standard cluster, but it will work only for a single specific user (doc).
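As a hedged illustration (not quoted from the answer above): on clusters where the Spark config box is editable, the error in question typically corresponds to the spark.databricks.repl.allowedLanguages setting, and allowing Scala would look something like:
spark.databricks.repl.allowedLanguages python,sql,scala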
Related
I am trying to configure Spark for an entire Azure Synapse pipeline. I found Spark session config magic command and How to set Spark / Pyspark custom configs in Synapse Workspace spark pool. The %%configure magic command works fine for a single notebook. Example:
Insert a cell with the content below at the beginning of the notebook:
%%configure -f
{
"driverMemory": "28g",
"driverCores": 4,
"executorMemory": "32g",
"executorCores": 4,
"numExecutors" : 5
}
Then the code below prints the expected values:
spark_executor_instances = spark.conf.get("spark.executor.instances")
print(f"spark.executor.instances {spark_executor_instances}")
spark_executor_memory = spark.conf.get("spark.executor.memory")
print(f"spark.executor.memory {spark_executor_memory}")
spark_driver_memory = spark.conf.get("spark.driver.memory")
print(f"spark.driver.memory {spark_driver_memory}")
However, if I add that notebook as the first activity in an Azure Synapse pipeline, the Apache Spark application that executes that notebook gets the correct configuration, but the rest of the notebooks in the pipeline fall back to the default configuration.
How can I configure Spark for the entire pipeline? Should I copy-paste the %%configure cell above into each and every notebook in the pipeline, or is there a better way?
Yes, as far as I know that is the well-known option: you need to define %%configure -f at the beginning of each notebook in order to override the default settings for your job.
Alternatively, you can go to the Spark pool in the Azure Portal and set the configuration on the pool by uploading a text file which looks like this:
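A minimal sketch of such a file, assuming standard Spark property names (the values here just mirror the %%configure example above):
spark.driver.memory 28g
spark.driver.cores 4
spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 5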
Please refer to this third-party article for more details.
Moreover, it looks like one cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are nevertheless reserved.
As part of performance tuning, one suggestion in the Spark documentation is to make pointers 4 bytes instead of 8, as shown in the figure.
I'm working on Azure Databricks. Where do I add this config?
I tried adding the following parameter in the Advanced options of a cluster, under Spark config:
jvm -XX:+UseCompressedOops
Am I adding this config in the right location? If not, where should I add it?
Edit:
Document link
https://spark.apache.org/docs/latest/tuning.html
Edit the Databricks cluster, go to Advanced options and then to Environment variables, and add there:
JAVA_OPTS="$JAVA_OPTS -XX:+UseCompressedOops"
Note that an example is already shown there when you start to create a brand new cluster.
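As an alternative sketch (not from the answer above, but using standard Spark properties instead of an environment variable), the same JVM flag can usually be set in the cluster's Spark config box:
spark.driver.extraJavaOptions -XX:+UseCompressedOops
spark.executor.extraJavaOptions -XX:+UseCompressedOops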
What I'm doing
I'm working on Microsoft Azure, and here is the thing: I'm trying to create an R cluster on Azure with Hadoop 3.6, but I need some default tools like NiFi, Kafka and Storm which are available on an HDF.
Problem
When I create the cluster, I can't choose the Ambari instance, so I tried to create the cluster with a template which I activate every morning to create the cluster, and another one to delete the cluster every night. I was wondering if it's possible to choose the Ambari instance while using the template.
Does anyone have an idea?
AFAIK, you cannot change the version of Ambari, since it comes by default with the HDInsight version.
You can find more details in this documentation regarding the Hadoop components available with the different HDInsight versions.
I've used the Windows version of HDInsight before, and that has a tab where you can set the number of cores and the RAM per worker node for Zeppelin.
I followed this tutorial to get Zeppelin working:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
The Linux version of HDInsight uses Ambari to manage the resources, but I can't seem to find a way to change the settings for Zeppelin.
Zeppelin is not selectable as a separate service in the list of services on the left. It also seems like it isn't available to be added when I choose 'add service' in actions.
I tried editing the general spark configs in Ambari by using override, then adding the worker nodes to my new config group and increasing the number of cores and RAM in custom spark-defaults. (Then clicked save and restarted all affected services.)
I tried editing the spark settings using
vi /etc/spark/conf/spark-defaults.conf
on the headnode, but that wasn't picked up by Ambari.
The performance in Zeppelin seems to stay the same for a query that takes about 1000-1100 seconds every time.
Zeppelin is not a service, so it shouldn't show up in Ambari. If you are committed to managing it that way, you may be able to get this to work:
https://github.com/tzolov/zeppelin-ambari-plugin
To edit it via SSH, you'll need to edit the zeppelin-env.sh file. First give yourself write permissions:
sudo chmod u+w /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
and then edit the Zeppelin configs using
vi /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
Here you can configure the ZEPPELIN_JAVA_OPTS variable, adding:
-Dspark.executor.memory=1024m -Dspark.executor.cores=16
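For example, the resulting line in zeppelin-env.sh might look like this (export form assumed; the values are just illustrative):
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1024m -Dspark.executor.cores=16"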
All that being said... any reason you can't just use a Jupyter notebook instead?
Every few days the Azure HDInsight cluster is (randomly?) restarted by Microsoft, and in the process any custom changes to hive-site.xml (such as adding a JsonSerde) are lost without any prior warning; as a result, the Hive queries from Excel/PowerPivot start breaking.
How are you supposed to deal with this scenario? Are we forced to store our data as CSV files?
In order to preserve customizations during an OS update or node re-image, you should consider using a script action. Here is the link: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
If you specify the Hive config parameters with a custom configuration object at the time of cluster creation, they should persist. The link here http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management has some more details on creating a cluster with a custom configuration.
This blog post on MSDN has a table showing which customizations are supported via the different methods, as well as examples of using PowerShell or the SDK to create a cluster with custom Hive configuration parameters (lines 62-64 in the PowerShell example): http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx
This is the only way to persist these settings because the cluster nodes can be reset for Azure servicing events such as security updates, and the configurations are set back to the initial values when this occurs.