Accessing Raw Data for Hadoop - azure

I am looking at the data.seattle.gov data sets, and I'm wondering in general how this much raw data can be sent to Hadoop clusters. I am using Hadoop on Azure.

It looks like data.seattle.gov is a self-contained data service, not built on top of the public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or
to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has suitable query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. This is likely to work well only if the queries reduce the data substantially; otherwise network bandwidth will limit performance.
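As a rough illustration of "serious data reduction" on the server side, the sketch below builds a query URL that asks the service to aggregate before anything crosses the network. It assumes the portal exposes a Socrata-style SODA endpoint (`/resource/<id>.json` with `$select`, `$where`, `$group`, `$limit` parameters), which the answer above does not confirm; the dataset id `abcd-1234` and the column names are hypothetical.

```python
from urllib.parse import urlencode


def build_reduced_query(domain, dataset_id, select, where=None, group=None, limit=1000):
    """Build a query URL that pushes filtering/aggregation to the data service.

    Assumes a Socrata-style endpoint; dataset_id and column names are placeholders.
    """
    params = {"$select": select, "$limit": str(limit)}
    if where:
        params["$where"] = where      # filter rows on the server
    if group:
        params["$group"] = group      # aggregate on the server
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


# Hypothetical example: fetch per-category counts instead of every raw row.
url = build_reduced_query(
    "data.seattle.gov",
    "abcd-1234",                      # hypothetical dataset id
    select="category, count(*)",
    where="year >= 2012",
    group="category",
)
print(url)
```

Each mapper would then fetch only a small aggregated result rather than the raw rows, which is the case where querying on demand can beat a bulk download.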

In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
See the blog post "Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster":
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get your data from the Azure Marketplace, e.g. government data sets:
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx

Related

How does an HDInsight cluster map Azure Storage as HDFS?

I have a fair idea of how Hadoop works, having studied the on-premises model, since that's how everyone learns. At that level the idea is fairly straightforward: we have a set of machines (nodes), we run certain processes on each of them, and we configure those processes so that the whole thing behaves as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster.
When we think about the same cluster in the cloud, this becomes a little confusing. Take an HDInsight Hadoop cluster: say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I spin up a Hadoop cluster in the same region as the storage account. The whole idea behind Hadoop is to process data as close as possible to where it lives. When we create the cluster, a bunch of Azure virtual machines start behind the scenes with their own underlying storage (though in the same region). But while creating the cluster we also specify a default storage account, plus any additional storage accounts to attach, where the data to be processed lies. Ideally, the data to be processed would need to exist on the disks of the virtual machines.
How does this work in Azure? I guess the virtual machines create disks that are actually pointers to the Azure Storage accounts (default + attached)? This part is not explained well and is really cloudy, so a lot of people, including me, are in the dark when they learn the classic on-premises Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help.
I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that acts as a bridge, exposing Azure Storage as an HDFS-compatible file system to the other services running in HDInsight.
You can read more about this driver on the official page below.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), a different driver is used, documented on the following official page. It exposes some advanced capabilities of ADLS Gen2 that can improve your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
Finally, just like an on-premises Hadoop installation, HDInsight also has a local HDFS, deployed across the hard drives of the cluster VMs. You can access this local HDFS using the URI below.
hdfs://mycluster/
For example, you can run the following to list the root-level contents of your local HDFS.
hdfs dfs -ls hdfs://mycluster/
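To make the contrast with the local `hdfs://mycluster/` URI concrete, here is a minimal sketch of how paths on the two Azure-backed drivers are typically addressed. The URI layouts follow the Hadoop Azure documentation linked above (`wasb[s]://container@account.blob.core.windows.net/path` and `abfs[s]://filesystem@account.dfs.core.windows.net/path`); the helper names, account, and container names are made up for illustration.

```python
def wasb_uri(container, account, path="", secure=True):
    """URI for a blob path served by the WASB driver (Azure Blob Storage)."""
    scheme = "wasbs" if secure else "wasb"   # 's' variants use TLS
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def abfs_uri(filesystem, account, path="", secure=True):
    """URI for a path served by the ABFS driver (ADLS Gen2)."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account and container names.
print(wasb_uri("rawdata", "mystorageacct", "/logs/2020/"))
print(abfs_uri("rawdata", "mylakeacct", "logs/2020/"))
```

A URI of this form can be passed wherever an HDFS path is accepted, for example to `hdfs dfs -ls`, so the VMs never need a copy of the data on their own disks.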

How to connect locally installed Apache Hive to Azure Data Lake?

I have installed Apache Hive on my local system and I need to connect to Azure Data Lake to query the data in it. How do I configure it?
Details on how you can connect Hadoop to Azure Data Lake are available here: https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.
You will need to have a recent version of Hadoop running in order to have the modules natively available.
There are blogs that cover enabling this connectivity, e.g. https://medium.com/azure-data-lake/connecting-your-own-hadoop-or-spark-to-azure-data-lake-store-93d426d6a5f4.
But unless you are running Hadoop in the Azure region where the Azure Data Lake Store (ADLS) account is located, your solution will be suboptimal. You will incur latency on data reads and writes, as well as costs, since you will be egressing data out of an Azure region during reads. I trust you have factored these into your planning.
Thanks,
Sachin Sheth,
Program Manager, Azure Data Lake.
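For readers wondering what the configuration from the linked Hadoop page might look like, the sketch below generates a `core-site.xml` fragment for the `adl://` connector using service-principal (client credential) authentication. The property names are taken from the Hadoop Azure Data Lake documentation as best I recall them, so verify them against your Hadoop version; the function name, the token endpoint format, and the tenant/client values are placeholders.

```python
import xml.etree.ElementTree as ET


def adl_core_site(tenant_id, client_id, client_secret):
    """Return a core-site.xml fragment for adl:// access (assumed property names)."""
    props = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        # Assumed Azure AD token endpoint format for the given tenant.
        "fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
        "fs.adl.oauth2.client.id": client_id,
        "fs.adl.oauth2.credential": client_secret,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder identifiers; in practice keep the secret out of plain-text files.
print(adl_core_site("my-tenant-id", "my-client-id", "my-client-secret"))
```

With these properties merged into the `core-site.xml` that Hive reads, table locations can point at `adl://` paths, subject to the latency and egress caveats in the answer above.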

Configuration of Azure for Storage and Hadoop Clustered Processing

Hope someone can offer some advice. I have been asked to scope out a possible infrastructure for a new Azure platform. We are also going to be using HDFS / Hadoop for our ETL and storage.
Can anyone offer advice on the following:
Set up a storage-optimised server (e.g. L4: 4 cores, 32 GB RAM, 678 GB storage) to hold our raw data, reference tables and final cleansed data within HDFS. This server could be running 24/7 to feed our analytics platforms.
Then, to utilise the power of Hadoop, could we spin up a set of processing servers (e.g. once a week) to read from the storage server, process the data, write back to the storage server, and then shut down until the next load and process task?
Would really appreciate anyone's thoughts or advice on this, or on any other configurations we could consider.
Many thanks
Fiorano
Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data. In an on-premises cluster, the storage and compute resources are combined on the same hardware, which ties them together. With HDInsight, storage is wholly separated from compute. This is a very important distinction of HDInsight. It means that I can completely turn off the compute portion of the cluster and the data will remain accessible.
Note: To analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Store, or both.
For more details, refer to the "Azure HDInsight Documentation".

HDInsight - Azure blob storage

I have some basic questions about Azure HDInsight.
The following article gives some basic guidance on using HDInsight.
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDInsight internally uses Azure Blob storage.
With this in mind, my question is as follows:
I have an HDInsight cluster hd1 which uses storage account stg1.
If I just want to upload and download files to stg1 using Azure Storage Explorer, then what is the use of having hd1? I can do that without even creating an HDInsight cluster, which is costly.
So, is HDInsight only used for processing data stored in stg1 to produce results, like a word count? Is that the only reason to use HDInsight?
If you want to understand HDInsight and Blob storage better, read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there are four base cluster types: Hadoop, HBase, Storm, and Spark. You can always install additional components on top of the base types.
Your question is really about why to use Hadoop. Hadoop shines when you need to process a lot of data - big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You still need to copy the data to Azure Blob storage (or store it there directly). When you are ready to process, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so that you no longer pay for it. Even after the cluster is deleted, your data stored in Blob storage is retained.
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is only one aspect.

How to efficiently move big data from a data center to Azure Blob Storage for later processing via HDInsight?

I need to set up scheduled tasks whose purpose is to copy/move large amounts of data from an on-premises data center to Windows Azure Blob Storage.
The options I've explored are WebHDFS and Flume (the latter does not seem to be supported by HDInsight currently).
What is the most efficient way to transfer unstructured files from a data center to Windows Azure Blob Storage?
If you are using HDInsight, you don't need to involve HDFS at all. In fact, you don't need your cluster to be running to upload the data. The best way of getting data into HDInsight is to upload it to Azure Blob Storage, using either the standard .NET clients or a tool such as Azure Management Studio or AzCopy.
If you want to stream the data constantly, then you are probably better off setting up something like Flume, Kafka or Storm to work against an HDInsight cluster, but that will require a certain amount of customisation on the cluster itself, which means you'll run into problems with reboots and will need a permanent cluster.
You didn't mention how much data you're talking about (you just said large amounts). But assuming it's hundreds of terabytes or petabytes, Azure has an Import/Export Service that lets you ship disks.
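A quick back-of-the-envelope calculation shows why shipping disks becomes attractive at that scale. The sketch below estimates how long a network transfer would take; the link speeds and the `efficiency` factor (the share of the link actually usable) are illustrative assumptions, not measurements.

```python
def transfer_days(size_tb, link_mbps, efficiency=1.0):
    """Days needed to move size_tb terabytes over a link of link_mbps megabits/s.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400


# 100 TB over a fully used 1 Gbps link: roughly 9.3 days.
print(round(transfer_days(100, 1000), 1))
# Same data over 100 Mbps at 80% efficiency: roughly 115.7 days.
print(round(transfer_days(100, 100, efficiency=0.8), 1))
```

If the estimate runs to weeks or months for your link, disk shipping is worth pricing; if it is a matter of hours or days, a network upload is usually simpler.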
Outside of that, you'd need to use your own code or a tool such as Microsoft's AzCopy to transfer your content to blobs. Remember that you can perform parallel uploads to reduce transfer time (as long as your data center's bandwidth is large enough for you to see the benefit).
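If you go the "own code" route, the parallel-upload idea can be sketched as below. The actual upload call is deliberately left as a caller-supplied function (`upload_one` is a hypothetical name), since the right client library depends on your environment; the sketch only shows the fan-out and failure collection.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(paths, upload_one, max_workers=8):
    """Upload files concurrently; returns (results, failures) keyed by path.

    upload_one(path) is supplied by the caller and performs a single upload.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:   # keep going; report failures at the end
                failures[path] = exc
    return results, failures


# Stand-in uploader for demonstration only: "uploads" by returning a fake blob name.
def fake_upload(path):
    if path.endswith(".bad"):
        raise IOError(f"cannot read {path}")
    return "blob/" + path


ok, failed = upload_all(["a.csv", "b.csv", "c.bad"], fake_upload, max_workers=4)
print(sorted(ok), sorted(failed))
```

Threads suit this workload because uploads are I/O-bound; raise `max_workers` only while throughput keeps improving, since the data center's outbound bandwidth is the real ceiling.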
You could use CloudBerry Drive and Flume to stream data to an HDInsight cluster / Azure Blob Storage:
http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx
No, you cannot use Flume to stream data directly to HDInsight. A post on the Microsoft blog says:
a vast majority of Flume consumers will land their streaming data into HDFS – and HDFS is not the default file system used with HDInsight. Even if it were - we do not expose public facing Name Node or HDFS endpoints so the Flume agent would have a terrible time reaching the cluster! So, for these reasons and a few others , the answer is typically "no. …it won't work or its not supported"
Source: http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx?CommentPosted=true#commentmessage
It is also worth mentioning the ExpressRoute option. Microsoft now has a program called ExpressRoute, in which your data center can be connected straight to Azure over a much faster connection, in cooperation with your ISP. See also http://azure.microsoft.com/en-us/services/expressroute/

Resources