HDInsight persistent Hive settings - azure

Every few days the Azure HDInsight cluster is (randomly?) restarted by Microsoft, and in the process any custom changes to hive-site.xml (such as adding a JsonSerde) are lost without any prior warning. As a result, the Hive queries from Excel/PowerPivot start breaking.
How are you supposed to deal with this scenario - are we forced to store our data as CSV files?

To preserve customizations across OS updates or node re-images, you should consider using a Script Action. Here is the link: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
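A rough sketch of wiring that up at provisioning time with the classic (ASM) Azure PowerShell cmdlets is below; the storage account, key, container, cluster name and the script URI are placeholders, and the referenced script (e.g. one that registers your JsonSerde) is hypothetical:

```powershell
# Sketch only: attach a custom Script Action at cluster creation so it is
# re-applied whenever Azure re-images a node.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "<account>.blob.core.windows.net" `
        -StorageAccountKey "<key>" -StorageContainerName "<container>"

# The customization script lives in blob storage; this URI is a placeholder.
$config = Add-AzureHDInsightScriptAction -Config $config `
    -Name "Customize Hive" `
    -ClusterRoleCollection HeadNode `
    -Uri "https://<account>.blob.core.windows.net/scripts/customize-hive.ps1"

$config | New-AzureHDInsightCluster -Name "<clustername>" -Location "West US" -Credential (Get-Credential)
```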

If you specify the Hive config parameters with a custom configuration object at the time of cluster creation, they should persist. The link here http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management has some more details on creating a cluster with custom configuration.

This blog post on MSDN has a table showing which customizations are supported via the different methods, as well as examples for using PowerShell or the SDK to create a cluster with custom Hive configuration parameters (lines 62-64 in the PowerShell example): http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx
This is the only way to persist these settings because the cluster nodes can be reset for Azure servicing events such as security updates, and the configurations are set back to the initial values when this occurs.
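As a rough sketch of that approach (based on the classic ASM PowerShell cmdlets used in that blog post; the storage account, key, container, cluster name and the example Hive setting are placeholders):

```powershell
# Sketch only: pass custom Hive settings at provisioning time so they survive
# node re-images and Azure servicing events.
$hiveConfig = New-Object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration'
$hiveConfig.Configuration = @{ "hive.exec.compress.output" = "true" }  # example setting only

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "<account>.blob.core.windows.net" `
        -StorageAccountKey "<key>" -StorageContainerName "<container>" |
    Add-AzureHDInsightConfigValues -Hive $hiveConfig |
    New-AzureHDInsightCluster -Name "<clustername>" -Location "West US" -Credential (Get-Credential)
```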

Related

Databricks Lakehouse JDBC and Docker

Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a docker image for some tests I want to write. Is it possible to get a Databricks / spark docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you can use OSS Spark with the Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database and creating a bunch of tables - just issue commands such as create database if not exists or create table if not exists when your application starts up (see the documentation for the exact syntax).
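As a minimal sketch (the database, table and column names are purely illustrative), the startup bootstrap can simply run idempotent statements like these over the same JDBC connection:

```sql
-- Idempotent bootstrap: safe to run on every application start.
CREATE DATABASE IF NOT EXISTS testdb;

CREATE TABLE IF NOT EXISTS testdb.customers (
  id   INT,
  name STRING
);
```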

Azure Databricks cluster spark configuration is disabled

When creating an Azure Databricks workspace and configuring its cluster, I had chosen the default languages for Spark to be python,sql. But now I want to add Scala as well. When running a Scala script I got the following error. My online search took me to this article, which describes that you can change the cluster configuration by going to the Advanced options section of the cluster settings page and clicking on the Spark tab there (as shown in the image below). But I find the Spark section there greyed out (disabled):
Question: How can I enable the Spark section of the Advanced options on the cluster settings page (shown in the image below) so I can edit the last line of the section? Note: I created the Databricks workspace and its cluster, and hence I am the admin (as shown in image 2 below).
Databricks Notebook error: Your administrator has only allowed sql and python commands on this cluster.
You need to click the "Edit" button in the cluster controls - after that you should be able to change the Spark configuration. But you can't enable Scala on High Concurrency clusters with credential passthrough, as they support only Python & SQL (doc) - the primary reason is that with Scala you can bypass user isolation.
If you need credential passthrough + Scala, then you need to use a Standard cluster, but it will work only for a single specific user (doc).
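For context, the "last line of the section" the question refers to is normally the allowed-languages entry in the Spark config box. Assuming the standard spark.databricks.repl.allowedLanguages key, on an editable (non-passthrough) cluster the change would look roughly like this:

```
# Before (what a python,sql-only cluster shows):
spark.databricks.repl.allowedLanguages python,sql

# After (Scala enabled as well):
spark.databricks.repl.allowedLanguages python,sql,scala
```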

Can we configure Presto's database connector information from its GUI?

I am using Presto version 179, and I need to manually create a database.properties file in /etc/presto/catalog through the CLI.
Can I do the same from Presto's GUI?
Presto's built-in web interface does not provide any configuration capabilities.
Usually, such things are handled as part of deployment/configuration management on the cluster. Thus, configuration is provided by some external means, just as the Presto installation itself is.
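For completeness, such a catalog file is just a small properties file placed in /etc/presto/catalog and picked up when the server restarts; a minimal sketch for, say, a MySQL catalog (host and credentials are placeholders) looks like this:

```properties
# /etc/presto/catalog/database.properties (illustrative example)
connector.name=mysql
connection-url=jdbc:mysql://db.example.com:3306
connection-user=presto
connection-password=<secret>
```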

Azure: Is it possible to choose the Ambari instance while creating an R cluster?

What I'm doing
I'm working on Microsoft Azure, and here is the thing: I'm trying to create an R cluster on Azure with Hadoop 3.6, but I need some default tools like NiFi, Kafka and Storm, which are available on HDF.
Problem
When I create the cluster, I can't choose the Ambari instance, so I tried to create the cluster with a template which I activate every morning to create the cluster, and another one to delete the cluster every night. I was wondering if it's possible to choose the Ambari instance while using the template.
Does anyone have an idea?
AFAIK, you cannot change the version of Ambari, since it comes by default with the HDInsight version.
You can find more details in this documentation regarding the Hadoop components available with different HDInsight versions.

HDInsight SparkHistory on Azure shows no applications

I have created a Spark HDInsight Cluster on Azure. The cluster was used to run different jobs (either Spark or Hive).
Until a month ago, the history of the jobs could be seen in the Spark History Server dashboard. It seems that following the update that introduced Spark 1.6.0, this dashboard is no longer showing any applications.
I have also tried to bypass this issue by executing the PowerShell cmdlet Get-AzureHDInsightJob as suggested here. The output is again an empty list of applications.
I would appreciate any help as this dashboard used to work and now all my experiments are stalled.
I managed to solve the issue by deleting everything inside wasb:///hdp/spark-events. Maybe the issue was related to the size of the folder, as no other log files could be appended.
All the following jobs are now appearing successfully in the Spark History Server dashboard.
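If anyone needs the same workaround: from an SSH session on the cluster, the directory can be emptied with something like the following (double-check the path on your cluster first):

```
hdfs dfs -rm -r "wasb:///hdp/spark-events/*"
```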
