JVM settings for Azure Databricks - apache-spark

As part of performance tuning, one suggestion in the Spark documentation is to make pointers 4 bytes instead of 8, as shown in the figure.
I'm working on Azure Databricks. Where do I add this config?
I tried adding the following parameter in the Advanced options of a cluster, under Spark config:
jvm -XX:+UseCompressedOops
Am I adding this config in the right location? If not, where should I add it?
Edit:
Document link
https://spark.apache.org/docs/latest/tuning.html

Edit the Databricks cluster: go to Advanced options and then to Environment variables. Add there:
JAVA_OPTS="$JAVA_OPTS -XX:+UseCompressedOops"
Note that an example is shown there when you start to create a brand-new cluster.
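The same JVM flag can also be passed through Spark's standard extra-Java-options properties in the Spark config box of the same Advanced options section; a sketch (these are stock Spark properties, though this route is not from the original answer):
spark.driver.extraJavaOptions -XX:+UseCompressedOops
spark.executor.extraJavaOptions -XX:+UseCompressedOops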

Related

Azure Databricks cluster spark configuration is disabled

When creating an Azure Databricks workspace and configuring its cluster, I had chosen the default languages for Spark to be Python and SQL. But now I want to add Scala as well. When running a Scala script I got the following error. My online search took me to an article describing that you can change the cluster configuration by going to the Advanced options section of the cluster settings page and clicking on the Spark tab there (as shown in the image below). But I find the Spark section there greyed out (disabled):
Question: How can I enable the Spark section of the Advanced options on the cluster settings page (shown in the image below) so I can edit the last line of the section? Note: I created the Databricks workspace and its cluster, and hence I am the admin (as shown in image 2 below).
Databricks Notebook error: Your administrator has only allowed sql and python commands on this cluster.
You need to click the "Edit" button in the cluster controls - after that you should be able to change the Spark configuration. But you can't enable Scala on High Concurrency clusters with credential passthrough, as they support only Python & SQL (doc) - the primary reason is that with Scala you can bypass user isolation.
If you need credential passthrough plus Scala, then you need to use a Standard cluster, but it will work only for a single specific user (doc).
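For reference, the setting behind that error message is the cluster's allowed-languages Spark config. On a cluster where the Spark config box is editable, the line looks something like this (the exact value list here is an assumption):
spark.databricks.repl.allowedLanguages python,sql,scala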

How to set user login credentials for the Spark web UI in an open source Apache Spark cluster

We are using an open source Apache Spark cluster in our project, and need some help with the following:
1. How do we enable login credentials for the Spark web UI?
2. How do we disable the "kill" button in the Spark web UI?
Can someone help with question 1, question 2, or both?
Thanks in advance.
Sure. According to this, you need to set the spark.ui.filters setting to refer to a filter class that implements the authentication method you want to deploy; Spark does not provide any built-in authentication filters.
You can see a filter example here.
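As a concrete illustration, here is a minimal sketch of such a filter doing HTTP Basic authentication. The package, class name, credentials, and parameter names are all illustrative; compile it into a jar on the driver's classpath and wire it up via the config lines further below.

package com.example

import java.nio.charset.StandardCharsets
import java.util.Base64
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Protects the Spark UI with HTTP Basic authentication.
class BasicAuthFilter extends Filter {
  private var user: String = _
  private var pass: String = _

  // Spark hands the filter any parameters declared as
  // spark.<filter class name>.param.<name>=<value>
  override def init(conf: FilterConfig): Unit = {
    user = conf.getInitParameter("user")
    pass = conf.getInitParameter("pass")
  }

  override def doFilter(req: ServletRequest, res: ServletResponse,
                        chain: FilterChain): Unit = {
    val httpRes = res.asInstanceOf[HttpServletResponse]
    val expected = "Basic " +
      Base64.getEncoder.encodeToString(s"$user:$pass".getBytes(StandardCharsets.UTF_8))
    if (expected == req.asInstanceOf[HttpServletRequest].getHeader("Authorization")) {
      chain.doFilter(req, res) // credentials match: let the request through
    } else {
      httpRes.setHeader("WWW-Authenticate", "Basic realm=\"Spark UI\"")
      httpRes.sendError(HttpServletResponse.SC_UNAUTHORIZED)
    }
  }

  override def destroy(): Unit = ()
}

The corresponding spark-defaults.conf entries would then be (names again illustrative):
spark.ui.filters com.example.BasicAuthFilter
spark.com.example.BasicAuthFilter.param.user admin
spark.com.example.BasicAuthFilter.param.pass secret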
Separately, you need ACLs to control who can see the UI and who can modify (e.g. kill) a running Spark application. That is done by setting spark.acls.enable together with the view ACLs (spark.ui.view.acls, spark.ui.view.acls.groups) and, for the kill button specifically, the modify ACLs (spark.modify.acls, spark.modify.acls.groups). You can read more about it here.
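In spark-defaults.conf that could look like the sketch below (user names are made up):
spark.acls.enable true
spark.ui.view.acls alice,bob
spark.modify.acls alice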

Spark memory fraction parameters

I am using the Cloudera distribution of Hadoop, and the Spark2 version in use is 2.2. I was searching for the memory management parameters listed in the "Memory Management" section of the link below:
https://spark.apache.org/docs/2.2.0/configuration.html
However, I don't see these configuration parameters under the Spark2 > Configuration link of Cloudera Manager.
I think there is a gap in my understanding. Please suggest where to look if these parameters need to be changed manually.
Hi,
I have provided the memory tuning parameters this way in the Spark configuration of Cloudera Manager. If you save this, the spark-defaults.conf file gets updated on the server, and upon restart of the Spark2 service you get the benefit of the configuration change.
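For Spark 2.2, the relevant spark-defaults.conf entries would look like this sketch (the values shown are simply the documented defaults):
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
You can check what a running application actually picked up from spark-shell with sc.getConf.get("spark.memory.fraction", "0.6").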
Hope it helps.

How to change the resources for Zeppelin using Linux HDInsight

I've used the Windows version of HDInsight before, and that has a tab where you can set the number of cores and RAM per worker node for Zeppelin.
I followed this tutorial to get Zeppelin working:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
The Linux version of HDInsight uses Ambari to manage the resources, but I can't seem to find a way to change the settings for Zeppelin.
Zeppelin is not selectable as a separate service in the list of services on the left. It also doesn't seem to be available when I choose 'Add Service' from the Actions menu.
I tried editing the general spark configs in Ambari by using override, then adding the worker nodes to my new config group and increasing the number of cores and RAM in custom spark-defaults. (Then clicked save and restarted all affected services.)
I tried editing the spark settings using
vi /etc/spark/conf/spark-defaults.conf
on the headnode, but that wasn't picked up by Ambari.
The performance in Zeppelin seems unchanged: a test query takes about 1000-1100 seconds every time.
Zeppelin is not a service, so it shouldn't show up in Ambari. If you are committed to managing it that way, you may be able to get this to work:
https://github.com/tzolov/zeppelin-ambari-plugin
To edit via SSH, you'll need to edit the zeppelin-env.sh file. First give yourself write permissions:
sudo chmod u+w /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
and then edit the Zeppelin configs using
vi /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
Here you can configure the ZEPPELIN_JAVA_OPTS variable, adding:
-Dspark.executor.memory=1024m -Dspark.executor.cores=16
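Put together, the line in zeppelin-env.sh would look something like this sketch (the values are just examples; size them to your worker nodes):
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1024m -Dspark.executor.cores=16"
Restart Zeppelin afterwards for the change to take effect.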
All that being said... any reason you can't just use a Jupyter notebook instead?

HDInsight persistent Hive settings

Every few days the Azure HDInsight cluster is being (randomly?) restarted by Microsoft, and in the process any custom changes to hive-site.xml (such as adding a JsonSerde) are lost without any prior warning; as a result, the Hive queries from Excel/PowerPivot start breaking.
How are we supposed to deal with this scenario - are we forced to store our data as CSV files?
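(For illustration, a typical serde registration of this kind in hive-site.xml looks like the snippet below; the property is standard Hive, the jar path is hypothetical:)
<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/lib/hive/lib/json-serde.jar</value>
</property>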
In order to preserve customizations during an OS update or node re-image, you should consider using a Script Action. Here is the link: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
If you specify the Hive config parameters with a custom configuration object at the time of cluster creation, they should persist. The link here http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management has some more details on creating a cluster with a custom configuration.
This blog post on MSDN has a table showing which customizations are supported via the different methods, as well as examples of using PowerShell or the SDK to create a cluster with custom Hive configuration parameters (lines 62-64 in the PowerShell example): http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx
This is the only way to persist these settings, because the cluster nodes can be reset for Azure servicing events such as security updates, and the configurations are set back to their initial values when that occurs.
