Upgrade/Migrate HDInsight Cluster to Latest Version - azure

I'm sure this is posted somewhere or has been communicated, but I just can't seem to find anything about upgrading/migrating an HDInsight cluster from one version to the next.
A little background. We've been using Hive with HDInsight to store all of our IIS logs since 1/24/2014. We love it and it provides good insight to our teams.
I was recently reviewing http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/ and noticed that our version of HDInsight (2.1.3.0.432823) is no longer supported and will be deprecated in May. That got me thinking about how to get onto version 3.2, but I can't find anything about how to go about doing this.
Does anyone know whether this is possible and, if so, how?

HDInsight uses Azure Storage for persistent data, so you should be able to create a new cluster and point to the old data, as long as you are using wasb://*/* for your storage locations. This article has a great overview of the storage architecture: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/
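To make the wasb:// addressing concrete, here is a minimal Python sketch of how such a location is usually written. The account name `mylogsaccount`, the container `iislogs`, and the path are made-up placeholders, not values from the question; the URI layout follows the storage article linked above.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a fully qualified WASB URI for a blob path.

    Layout: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Placeholder names: substitute your own storage account, container, and path.
old_data = wasb_uri("iislogs", "mylogsaccount", "/hive/warehouse/iis_logs")
print(old_data)
```

Because the URI names the account and container explicitly, a new cluster that has been given access to that storage account can reach the same data.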
If you are using Hive and have not set up a customized metastore, then you may need to save or recreate some of the tables. Here's a blog post that covers some of those scenarios: http://blogs.msdn.com/b/bigdatasupport/archive/2014/05/01/hdinsight-backup-and-restore-hive-table.aspx
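As a rough illustration of recreating a table on the new cluster, the sketch below generates a HiveQL `CREATE EXTERNAL TABLE` statement that points at data already sitting in Blob storage. The table name, columns, delimiter, and location are hypothetical; your real IIS log schema will differ. An external table is used here because dropping it leaves the underlying files in place.

```python
def external_table_ddl(table: str, columns: dict, location: str, delimiter: str = " ") -> str:
    """Return HiveQL that defines an external table over existing files."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical schema and location, for illustration only.
ddl = external_table_ddl(
    "iis_logs",
    {"log_date": "STRING", "cs_uri_stem": "STRING", "sc_status": "INT"},
    "wasb://iislogs@mylogsaccount.blob.core.windows.net/hive/warehouse/iis_logs",
)
print(ddl)
```

The generated statement can be run from the Hive CLI or any other Hive client on the new cluster.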
You can configure a new cluster and add the existing cluster's storage container as an "additional" storage account to test this out without first taking down the current cluster. Just be sure not to have both clusters using the same container as their default storage.

Related

Customizing nodes of an Azure Synapse Workspace Spark Cluster

When creating a Spark cluster within an Azure Synapse workspace, is there a means to install arbitrary files and directories onto its cluster nodes and/or onto the nodes' underlying distributed filesystem?
By arbitrary files and directories, I literally mean arbitrary files and directories, not just extra Python libraries as demonstrated here.
Databricks smartly provides a means to do this on its cluster nodes (described in this document). Now I'm trying to see if there's a means to do the same on an Azure Synapse Workspace Spark Cluster.
Thank you.
Unfortunately, Azure Synapse Analytics doesn't support arbitrary binary installs or writing to Spark local storage.
I would suggest submitting feedback about this:
https://feedback.azure.com/forums/307516-azure-synapse-analytics
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

How does an HDInsight cluster map Azure Storage as HDFS?

I have a fair idea of how Hadoop works, as I studied the on-premises model, since that's how everyone learns. In that model the top-level idea is fairly straightforward. We have a set of machines (nodes), we run certain processes on each of them, and we configure those processes so that the whole thing behaves as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster.
But when we start thinking about the same cluster in the cloud, this becomes a little confusing. Taking the case of an HDInsight Hadoop cluster, let's say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. The whole idea behind Hadoop is processing as close as possible to where the data lives. When we create the Hadoop cluster, a bunch of Azure Virtual Machines start behind the scenes with their own underlying storage (though in the same region). But while creating the cluster we also specify a default storage account and a few other storage accounts to attach, which is where the data to be processed lies. So ideally the data to be processed would need to exist on the disks of the virtual machines.
How does this work in Azure? I guess the virtual machines create disks that are actually pointers to the Azure storage accounts (default + attached)? This part is not really explained well and is really cloudy, so a lot of people, including myself, are left in the dark when they learn the classic on-premises Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right on the cluster Overview page in the Azure portal, it would help the understanding.
I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
An underlying driver acts as the bridge: it presents Azure Storage as an HDFS-compatible file system to the other services running in HDInsight.
You can read more about this driver's functionality on the official page below.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), a different driver is used, described on the official page below. It exposes some advanced ADLS Gen2 capabilities that can improve your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
Finally, just like an on-premises Hadoop installation, HDInsight also has a local HDFS deployed across the hard drives of the cluster VMs. You can access this local HDFS using a URI like the one below.
hdfs://mycluster/
For example, you can run the following to list the root-level contents of your local HDFS.
hdfs dfs -ls hdfs://mycluster/
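To tie the three storage options together, here is a small Python sketch that inspects a URI and reports which backend its scheme points to. The mapping table is my own summary of the answer above, and the account, container, and filesystem names in the examples are placeholders.

```python
from urllib.parse import urlparse

# Summary of the answer above: URI scheme -> storage backend.
STORAGE_BY_SCHEME = {
    "wasb": "Azure Blob storage (WASB driver)",
    "wasbs": "Azure Blob storage (WASB driver, TLS)",
    "abfs": "ADLS Gen2 (ABFS driver)",
    "abfss": "ADLS Gen2 (ABFS driver, TLS)",
    "hdfs": "local HDFS on the cluster VM disks",
}


def describe(uri: str) -> str:
    """Describe which storage backend a Hadoop-style URI refers to."""
    parsed = urlparse(uri)
    backend = STORAGE_BY_SCHEME.get(parsed.scheme, "unknown scheme")
    if parsed.username:
        # wasb/abfs form: <container-or-filesystem>@<account>.<endpoint>
        account = parsed.hostname.split(".")[0]
        return f"{backend}: container/filesystem '{parsed.username}' in account '{account}'"
    return f"{backend}: authority '{parsed.netloc}'"


for uri in (
    "wasb://data@myaccount.blob.core.windows.net/logs",  # placeholder names
    "abfs://fs@myaccount.dfs.core.windows.net/logs",     # placeholder names
    "hdfs://mycluster/",
):
    print(describe(uri))
```

The point of the sketch is that the cluster VMs do not mount the storage account as disks; the scheme in the path decides which driver handles each read or write.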

HDInsight - Azure blob storage

I have some basic questions about Azure HDInsight.
The following article gives a basic introduction to using HDInsight:
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDInsight internally uses Azure Blob storage.
With this in mind, my question is as follows:
I have an HDInsight cluster hd1 which uses storage account stg1.
If I just want to upload and download files to stg1 using Azure Storage Explorer, then what's the use of having hd1? I can do that without even creating an HDInsight cluster, which is expensive.
So, is Hadoop on HDInsight only used for processing data stored in stg1 to produce results, such as a word count? Is that the only reason to use HDInsight?
If you want to understand HDInsight and Blob storage better, read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there are 4 different base types: Hadoop, HBase, Storm, and Spark. You can always install additional components on top of the base types.
Your question is really about why to use Hadoop. Hadoop shines when you need to process a lot of data - big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You still need to copy the data to Azure Blob storage (or store it there directly). When you are ready to process it, you create an HDInsight cluster, submit a job, and then delete the cluster so that you no longer pay for it. Even after the cluster is deleted, your data stored in Blob storage is retained.
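The cost argument behind the create, run, delete pattern can be sketched with simple arithmetic. The hourly rate, node count, and run times below are invented placeholders, not Azure prices, and storage charges (which continue after the cluster is deleted) are left out.

```python
def cluster_cost(nodes: int, rate_per_node_hour: float, hours: float) -> float:
    """Compute cost = nodes * hourly rate per node * hours the cluster exists."""
    return nodes * rate_per_node_hour * hours


RATE = 0.50  # hypothetical price per node-hour, NOT a real Azure rate
NODES = 4    # hypothetical cluster size

# Cluster left running all month vs. created for a 2-hour job each day.
always_on = cluster_cost(NODES, RATE, hours=24 * 30)
transient = cluster_cost(NODES, RATE, hours=2 * 30)

print(f"always on: {always_on:.2f}, transient: {transient:.2f}")
```

Under these made-up numbers the compute bill scales directly with the hours the cluster exists, which is why deleting it between jobs matters.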
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is only one aspect.

How to create a new HDInsight cluster on an existing storage container

I am using HDInsight on Azure to research the scalability of ranking machine learning methods (learning to rank, for the insiders) on Hadoop. I managed to test-run my implementation of a learning to rank algorithm on an HDInsight cluster and clocked the time it took to complete.
Now I want to run the same code over and over again with different numbers of cores to see how the running time scales as a function of the number of cores. From other questions on this forum I understood that HDInsight does not allow changing the number of cores of a cluster. Would it instead be possible in some way to delete the current cluster, and then create a new cluster that makes use of the exact same container on my Azure Storage? I tried to do this by simply giving the new cluster the same name as the previous one (as the container that is created for a new cluster is automatically named after the cluster at creation time), but that doesn't work as the new container created for this new cluster will have "-1" appended to the cluster name. The datafile that I am trying to process is around 15GB in size, so it would be a real pain in the ass if I would need to upload this file to the cluster container for each cluster that I create.
Any help on how I can run my algorithms on HDInsight with varying numbers of cores without having to re-upload my input data for each point of measurement would be very much appreciated!
Kind Regards,
Niek Tax
You should be able to link your existing storage container to an HDInsight cluster. According to http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/#benefits:
Using the custom create, you have one of the following options for the default storage account:
Use existing storage
Create new storage
Use storage from another subscription.
You also have the option to create your own Blob container or use an existing one.
The link shows how you can do that through the Windows Azure Portal.
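Once every cluster reads the same container, the timing runs from the question can be compared with a short calculation. The sketch below computes speedup and parallel efficiency relative to the smallest cluster; the timings in it are placeholders to show the arithmetic, not measurements.

```python
def scaling_report(timings: dict) -> list:
    """Return (cores, speedup, efficiency) rows relative to the smallest run.

    timings maps number of cores -> wall-clock seconds for the same job.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


# Placeholder timings in seconds, invented for illustration.
placeholder_timings = {4: 1200.0, 8: 640.0, 16: 360.0}

for cores, speedup, efficiency in scaling_report(placeholder_timings):
    print(f"{cores:>3} cores  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

An efficiency close to 1.0 means the running time shrinks almost linearly as cores are added.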

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering in general how all of this large raw data can get sent to hadoop clusters. I am using hadoop on azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has suitable query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. This will probably only work well if the queries reduce the data substantially; otherwise network bandwidth will limit performance.
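As a hedged illustration of pushing the data reduction into the query, the sketch below builds a request URL that selects only a few columns and rows. It assumes the portal accepts Socrata-style query parameters (`$select`, `$where`, `$limit`); the dataset id `abcd-1234` and the column names are hypothetical, so check the portal's own API documentation before relying on any of this.

```python
from urllib.parse import urlencode


def reduced_query_url(base: str, dataset_id: str, columns: list, where: str, limit: int) -> str:
    """Build a query URL that asks the service to filter before sending data."""
    params = {
        "$select": ",".join(columns),  # only the columns the job needs
        "$where": where,               # filter rows on the server side
        "$limit": limit,               # cap the number of rows returned
    }
    # Keep '$' and ',' unescaped so the URL stays readable.
    return f"{base}/resource/{dataset_id}.json?{urlencode(params, safe='$,')}"


# Hypothetical dataset id and column names, for illustration only.
url = reduced_query_url(
    "https://data.seattle.gov",
    "abcd-1234",
    ["permit_type", "issue_date"],
    "year = 2012",
    1000,
)
print(url)
```

Only the URL is constructed here; fetching it and writing the result to cluster storage is left out.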
In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
Check out the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster":
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get data from the Azure Marketplace, e.g. government data sets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx
