Hope someone can offer some advice. I have been asked to scope out a possible infrastructure for a new Azure platform. We are also going to be using HDFS / Hadoop for our ETL and storage.
Can anyone offer advice on the following:
Set up a storage-optimised server (e.g. an L4: 4 cores, 32 GB RAM, 678 GB storage) to hold our raw data, reference tables and final cleansed data within HDFS. This server could be running 24/7 to feed our analytics platforms.
Then, to utilise the power of Hadoop, could we spin up a set of processing servers (e.g. once a week) to read from the storage server, process the data, write the results back to the storage server, and then shut down until the next load-and-process task?
Would really appreciate anyone's thoughts or advice on this, or any other configurations we should consider.
Many thanks
Fiorano
Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data. In an on-premises cluster, the storage and compute resources are combined in the same hardware, tying them together. With HDInsight, the storage is wholly separated from the compute. This is a very important distinction: you can completely turn off the compute portion of the cluster and the data remains accessible.
Note: to analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Store, or both.
For more details, refer to the "Azure HDInsight Documentation".
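As a rough illustration of what that separation means in practice, here is a minimal Azure CLI sketch; the cluster, resource group, storage account and container names are placeholders, and exact flags can vary by CLI version:

# Tear down the compute; the attached storage account is not touched
az hdinsight delete --name weekly-processing --resource-group my-rg

# The data is still there, readable without any cluster running
az storage blob list --account-name mystorageacct --container-name mycontainer --output table

The next cluster you point at the same storage account and container sees exactly the same files.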
I have a fair idea of how Hadoop works, having studied the on-premises model, since that's how everyone learns. In that sense the top-level idea is fairly straightforward: we have a set of machines (nodes), we run certain processes on each of them, and we configure those processes so that the whole thing behaves as a single logical entity that we call a Hadoop (YARN) cluster. Here, HDFS is a logical layer on top of the individual storage of all the machines in the cluster.

But when we start thinking about the same cluster in the cloud, it becomes a little confusing. Taking the case of an HDInsight Hadoop cluster: let's say I already have an Azure storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. Now, the whole idea behind Hadoop is to process data closest to where it exists. In this case, when we create the Hadoop cluster, a bunch of Azure virtual machines start behind the scenes with their own underlying storage (though in the same region). But while creating the cluster we specify a default storage account, and possibly a few other attached storage accounts, where the data to be processed lives. So ideally the data to be processed would need to exist on the disks of the virtual machines. How does this work in Azure? I guess the virtual machines create disks that are actually pointers to the Azure storage accounts (default + attached)?

This part is not explained well and is quite cloudy, so a lot of people, including myself, are in the dark when they learn the classic on-premises Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help the understanding. I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that acts as a bridge, exposing Azure Storage as HDFS to the other services running in HDInsight.
You can read more about this driver's functionality on the official page below.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure storage account is ADLS Gen2 (Azure Data Lake Storage Gen2), a different driver is used, documented on the following official page. It offers some advanced capabilities of ADLS Gen2 to boost your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
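As a small, hedged example of what these drivers look like from the command line (the container and account names below are placeholders), data in an attached storage account can be addressed directly with the scheme each driver registers:

# WASB driver, for a classic Azure Blob storage account
hdfs dfs -ls wasb://mycontainer@myaccount.blob.core.windows.net/

# ABFS driver, for an ADLS Gen2 account
hdfs dfs -ls abfs://mycontainer@myaccount.dfs.core.windows.net/

Whichever account was chosen as the cluster's default file system can also be reached with a plain relative path, e.g. hdfs dfs -ls /.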
Finally, just like an on-prem Hadoop installation, HDInsight also has a local HDFS deployed across the hard drives of the cluster VMs. You can access this local HDFS using the following URI:
hdfs://mycluster/
For example, you can issue the following to view the root-level content of the local HDFS:
hdfs dfs -ls hdfs://mycluster/
I am learning from this course. It asks me to create a new HDInsight cluster (options are Hadoop, HBase, Storm or Spark) and also a storage account. What is the difference between a cluster and a storage account? Does the cluster provide the processors to run my jobs, and does the storage account provide the space to store my data? Why can't I connect the same storage account to different clusters?
Also, under Microsoft Azure >> New >> Data + Analytics, I see two options that deal with big data: HDInsight and Data Lake Analytics. What is the difference between those two? They look similar:
HDInsight
Microsoft's cloud-based Big Data service. Apache Hadoop and other popular Big Data solutions.
Data Lake Analytics
Big data analytics made easy
There are a lot of questions in here, so let me answer them one by one.
What is Blob Storage vs HDInsight Cluster?
Blob storage is a distributed file store very similar to HDFS and is used to store data/videos/things. An HDInsight cluster is a number of Hadoop virtual machines created to run MapReduce code over a DFS (HDFS or blob storage). Having two separate services allows you to scale each independently, saving money in the long term. Data storage is cheap, but a 500-node VM cluster can get pricey quickly. Being able to kill the cluster but keep your data is helpful.
Why can't I connect the same storage account with different clusters?
You can have multiple clusters pointed at the same storage account, but it's an anti-pattern. Storage accounts have data and IO limits, and if you have multiple clusters pulling against a single storage account, it's more likely you'll hit them. Also, storage accounts only cost money if you have data in them, so having multiple accounts isn't a cost increase.
What is Azure Data Lake (ADL) and ADL storage?
Azure Data Lake is another option for both storage and compute. ADL storage can be thought of as blob storage v2: you get higher limits on IO and file size than blob storage, while still being able to use Hadoop for compute. ADL Analytics is a second option for compute that is completely different from Hadoop. You don't have to worry about cluster creation or clusters in general. You write a query, specify the amount of parallelization you'd like, and the data is returned.
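To make that concrete, here is a rough sketch of submitting a U-SQL job to a Data Lake Analytics account from the Azure CLI; the account name, job name and script file are placeholders, and the exact parameter names may differ by CLI version:

# Submit a U-SQL script and ask for up to 10 parallel units
az dla job submit --account myadlaaccount --job-name weekly-agg \
    --script "$(cat weekly_agg.usql)" --degree-of-parallelism 10

There is no cluster to provision or tear down; you pay for the analytics units the job uses while it runs.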
References:
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits
https://azure.microsoft.com/en-us/services/hdinsight/
https://azure.microsoft.com/en-us/solutions/data-lake/
My team is currently creating a solution that would use HDInsight. We will be getting 5 TB of data daily and will need to run some map/reduce jobs on this data. Would there be any performance/cost difference if our data were stored in Azure Table Storage instead of Azure HBase?
The main differences will be in both functionality and cost.
Azure Table Storage doesn't have a MapReduce engine attached to it, though of course you could take the MapReduce approach and write your own.
You can use Azure HDInsight to connect MapReduce to Table Storage. There are a couple of connectors around, including one written by me, which is Hive-focused, requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/), and a less performance-focused but more complete version from someone at Microsoft (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).
The main advantage of Table Storage is that you aren't constantly paying a processing cost.
If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage; however, you will get some functionality and performance gains, plus you will have something a bit more portable should you wish to use other Hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.
HDInsight (HBase/Hadoop) uses Azure Blob storage, not Azure Table Storage (ATS). For your data storage you will be charged only the applicable blob storage cost, based on your subscription.
P.S. Don't forget to delete your cluster once the job has completed, to avoid charges. Your data will persist in blob storage and can be used by the next cluster you build.
When using HDInsight and choosing Azure Blob Storage to store the data that needs to be processed, you still have to choose the number of data nodes when provisioning a new cluster. If your data is stored in Azure Blob Storage, what impact does the number of data nodes have? Is the data from the blob actually replicated onto the data nodes?
If you put data on the Azure Blob Store, it stays there, and is read directly from Azure Storage.
The data nodes in the HDInsight cluster have two purposes. Firstly, they run the actual compute jobs, which read from Azure Storage directly. This is not as crazy as it might sound to an HDFS user, because Azure's consistent underlying fabric keeps the storage nice and close to the compute.
Secondly, the data nodes are running an HDFS filesystem on their local disk. This is generally only used for intermediate and tmp files in HDInsight, since it is transitory (only lasts as long as the cluster).
So, choosing the number of data nodes is essentially choosing how many job-running nodes (YARN application containers, or JobTracker slots, depending on version) you want to be able to handle, and to a lesser extent, how much temp space your jobs need.
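For what it's worth, that sizing decision is just a parameter at provisioning time. A hedged sketch with the Azure CLI (all names are placeholders, and the flag names can differ by CLI version):

az hdinsight create --name proc-cluster --resource-group my-rg \
    --type hadoop --workernode-count 8 \
    --storage-account mystorageacct \
    --http-user admin --http-password '<cluster-password>' \
    --ssh-user sshuser --ssh-password '<ssh-password>'

Scaling the worker node count up or down changes how many containers can run in parallel, not how much data the cluster can address in blob storage.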
I need to set up scheduled tasks whose purpose is to copy/move large amounts of data from an on-premises data center to Windows Azure Blob Storage.
The options I've explored are WebHDFS and Flume (the latter does not seem to be supported by HDInsight currently).
What is the most efficient way to transfer unstructured files from a data center to Windows Azure Blob Storage?
If you are using HDInsight, you don't need to involve HDFS at all. In fact you don't need your cluster to be running to upload the data. The best way of getting data into HDInsight is to upload it to Azure Blob Storage, using either the standard .NET clients, or something third-party like Azure Management Studio or AzCopy.
If you want to stream the data constantly, then you are probably better setting up something like Flume, Kafka or Storm to work against an HDInsight cluster, but that will require a certain amount of customisation on the cluster itself, which means you'll run into problems with reboots, and require a permanent cluster.
You didn't mention how much data you're talking about (you just said large amounts). But... assuming it's hundreds of TB or petabytes, Azure has an Import/Export Service which offers disk shipping.
Outside of that, you'd need to use your own code or use a 3rd-party tool such as Microsoft's AzCopy to transfer your content to blobs. Remember that you'll be able to perform parallel uploads, to compress time (as long as your data center's bandwidth is large enough for you to see the benefits).
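As a concrete sketch of the AzCopy route (the local path, storage account, container and SAS token below are placeholders; older Windows-only AzCopy releases used a /Source: /Dest: style instead):

azcopy copy "D:\exports\daily" "https://mystorageacct.blob.core.windows.net/ingest?<sas-token>" --recursive

AzCopy handles the parallelism for you, so the main variable left is the outbound bandwidth from your data center.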
You could use CloudBerry Drive and Flume to stream data to an HDInsight cluster / Azure Blob storage:
http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx
No, you cannot use Flume to stream data directly to HDInsight. A post from the Microsoft blog says that
a vast majority of Flume consumers will land their streaming data into HDFS – and HDFS is not the default file system used with HDInsight. Even if it were - we do not expose public facing Name Node or HDFS endpoints so the Flume agent would have a terrible time reaching the cluster! So, for these reasons and a few others, the answer is typically "no. …it won't work or its not supported"
Source: http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx?CommentPosted=true#commentmessage
It is also worth mentioning the ExpressRoute option. Microsoft now has a program called ExpressRoute where your data center can be connected straight to Azure with a much faster connection, in cooperation with your ISP. See also http://azure.microsoft.com/en-us/services/expressroute/