When using HDInsight and choosing Azure Storage Blob to store the data that needs to be computed, you still have to choose the number of data nodes when provisioning a new cluster. If your data is being stored on an Azure Storage Blob, what impact does the number of data nodes have? Is the data from the blob actually replicated onto the data nodes?
If you put data on the Azure Blob Store, it stays there, and is read directly from Azure Storage.
The data nodes in the HDInsight cluster have two purposes. Firstly, they run the actual compute jobs, which read directly from Azure Storage. This is not as crazy as it might sound to an HDFS user, because Azure's consistent underlying fabric keeps the storage nice and close to the compute.
Secondly, the data nodes are running an HDFS filesystem on their local disk. This is generally only used for intermediate and tmp files in HDInsight, since it is transitory (only lasts as long as the cluster).
So, choosing the number of data nodes is essentially choosing how many job-running nodes (YARN application containers, or job tracker slots, depending on version) you want to have available, and, to a lesser extent, choosing how much temp space your jobs need.
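To make the sizing point concrete, here is a rough sketch of how the number of worker nodes bounds the number of containers that can run at once. The function name and all the figures in the usage note are hypothetical, and the model is deliberately simplified: it ignores memory reserved for the OS and the application master, and some scheduler configurations only take memory into account.

```python
def estimate_concurrent_containers(worker_nodes, node_memory_mb, node_vcores,
                                   container_memory_mb, container_vcores=1):
    """Rough upper bound on containers that can run at the same time.

    Simplified model (an assumption, not how any particular scheduler is
    configured): each node can host as many containers as fit into both
    its memory and its vcores.
    """
    if worker_nodes < 0:
        raise ValueError("worker_nodes must not be negative")
    if container_memory_mb <= 0 or container_vcores <= 0:
        raise ValueError("container sizes must be positive")

    by_memory = node_memory_mb // container_memory_mb  # memory-limited count
    by_vcores = node_vcores // container_vcores        # CPU-limited count
    per_node = min(by_memory, by_vcores)               # tighter limit wins
    return worker_nodes * per_node
```

With made-up node sizes of 24576 MB and 8 vcores and 2048 MB containers, going from 4 to 8 worker nodes doubles the estimate, while the data in Azure Storage stays exactly where it is.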
I have a fair idea of how Hadoop works, as I have studied the on-premises model, since that's how everyone learns. In that sense the top-level idea is fairly straightforward. We have a set of machines (nodes), we run certain processes on each of them, and we configure those processes so that the whole thing behaves as a single logical entity that we call a Hadoop (YARN) cluster. Here HDFS is a logical layer on top of the individual storage of all the machines in the cluster. But when we start thinking of the same cluster in the cloud, this becomes a little confusing. Take the case of an HDInsight Hadoop cluster: let's say I already have an Azure Storage account with lots of text data and I want to do some analysis, so I go ahead and spin up a Hadoop cluster in the same region as the storage account. Now, the whole idea behind Hadoop is that of processing closest to where the data exists. In this case, when we create the Hadoop cluster, a bunch of Azure virtual machines start behind the scenes with their own underlying storage (though in the same region). But then, while creating the cluster, we specify a default storage account and a few other storage accounts to be attached, where the data to be processed lies. So ideally the data to be processed needs to exist on the disks of the virtual machines. How does this work in Azure? I guess the virtual machines create disks that are actually pointers to the Azure storage accounts (default + attached)? This part is not really explained well and is really cloudy, so a lot of people, including myself, are in the dark when they learn the classic on-premises Hadoop model academically and then start using cloud-based clusters in the real world. If we could see more information about these virtual machines right from the cluster Overview page in the Azure portal, it would help understanding.
I know it's visible from Ambari, but Ambari is blind to Azure; it's an independent component, so that is not very helpful.
There is an underlying driver that acts as a bridge, presenting Azure Storage as an HDFS-compatible file system to the other services running in HDInsight.
You can read more about this driver's functionality on the official page below.
https://hadoop.apache.org/docs/current/hadoop-azure/index.html
If your Azure Storage account is of type ADLS Gen2 (Azure Data Lake Storage Gen2), a different driver is used, documented on the official page below. It exposes some advanced capabilities of ADLS Gen2 that can improve your HDInsight performance.
https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
Finally, just like an on-prem Hadoop installation, HDInsight also has a local HDFS deployed across the hard drives of your HDInsight cluster VMs. You can access this local HDFS using the URI below.
hdfs://mycluster/
For example, you can issue the following command to list the root-level contents of your local HDFS.
hdfs dfs -ls hdfs://mycluster/
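To put the three storage options side by side, here is a small sketch that builds the fully qualified URI for each one. The helper name and the account and container names in the usage note are made up; the URI shapes follow the Hadoop Azure driver pages linked above (wasbs for Blob storage, abfss for ADLS Gen2) and the hdfs://mycluster/ form shown above for the cluster-local HDFS.

```python
def storage_uri(kind, path="", account=None, container=None):
    """Build a fully qualified URI for the given kind of storage.

    kind is one of:
      "local" -> cluster-local HDFS (lives only as long as the cluster)
      "blob"  -> Azure Blob storage through the WASB driver (TLS variant)
      "adls2" -> ADLS Gen2 through the ABFS driver (TLS variant)
    """
    path = path.lstrip("/")
    if kind == "local":
        return f"hdfs://mycluster/{path}"

    if not account or not container:
        raise ValueError("account and container are required for Azure Storage")
    if kind == "blob":
        return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"
    if kind == "adls2":
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    raise ValueError(f"unknown storage kind: {kind}")
```

For instance, storage_uri("blob", "data/input", account="myaccount", container="mycontainer") produces a path that could be passed to hdfs dfs -ls in the same way as the local URI above, assuming that storage account is attached to the cluster.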
Hope someone can offer any advice. At the moment I have been asked to scope out a possible infrastructure for a new Azure Platform. We are also going to be using HDFS / Hadoop for our ETL and Storage.
Can anyone offer any advice on the following:
Set up a Storage Optimised Server (e.g. L4, 4 cores, 32 GB RAM, 678 GB storage) to hold our raw data, reference tables and final cleansed data within HDFS. This server could run 24/7 to feed our analytics platforms.
Then, to utilise the power of Hadoop, could we spin up a set of processing servers (e.g. once a week) to read from the storage server, process the data, write back to the storage server, and then shut down until the next load and process task?
I would really appreciate anyone's thoughts or advice on this, or on any other configurations we could consider.
Many thanks
Fiorano
Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data. In an on-premises cluster, storage and compute are combined on the same hardware, which ties them together. With HDInsight, the storage is wholly separated from the compute resource. This is a very important distinction of HDInsight: it means that you can completely turn off the compute portion of the cluster and the data will remain accessible.
Note: To analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Store, or both.
For more details, refer to the "Azure HDInsight Documentation".
I am learning from this course. It asks me to create a new HDInsight cluster (the options are Hadoop, HBase, Storm or Spark) and also a storage account. What is the difference between a cluster and a storage account? Does the cluster include processors to process my jobs, and does the storage account mean space to store my data? Why can't I connect the same storage account to different clusters?
Also, under Microsoft Azure >> New >> Data + Analytics, I see two options that deal with big data: HDInsight and Data Lake Analytics. What is the difference between those two? They look similar.
HDInsight: Microsoft's cloud-based Big Data service. Apache Hadoop and other popular Big Data solutions.
Data Lake Analytics: Big data analytics made easy
There are a lot of questions in here, so let me answer them one by one.
What is Blob Storage vs HDInsight Cluster?
Blob storage is a distributed file store, very similar to HDFS, used to store data, videos and other things. An HDInsight cluster is a number of Hadoop virtual machines created to run MapReduce code over a DFS (HDFS or Blob storage). Having two separate services allows you to scale each independently, saving money in the long term. Data storage is cheap, but a 500-node VM cluster can get pricey quickly. Being able to kill the cluster but keep your data is helpful.
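As a back-of-the-envelope illustration of why being able to kill the cluster matters, the sketch below compares the compute bill of an always-on cluster with one that is created only for a weekly job. The function name, node count, hourly price and run time are all invented for the example and are not Azure prices; storage cost is left out because it is the same in both cases.

```python
HOURS_PER_WEEK = 24 * 7

def weekly_compute_cost(nodes, price_per_node_hour, hours_running):
    """Compute-only cost of keeping `nodes` VMs up for `hours_running` hours."""
    if not 0 <= hours_running <= HOURS_PER_WEEK:
        raise ValueError("hours_running must fit within one week")
    return nodes * price_per_node_hour * hours_running

# Hypothetical figures: 10 nodes at 0.50 currency units per node-hour.
always_on = weekly_compute_cost(10, 0.50, HOURS_PER_WEEK)  # never shut down
transient = weekly_compute_cost(10, 0.50, 6)               # one 6-hour weekly run

savings = always_on - transient
```

The data stays in the storage account in both cases, so the only thing given up by deleting the cluster is compute that would otherwise sit idle.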
Why can't I connect the same storage account with different clusters?
You can have multiple clusters pointed at the same storage account, but it's an anti-pattern. Storage accounts have data and IO limits, and if you have multiple clusters pulling against a single storage account, you are more likely to hit them. Also, storage accounts only cost money if you have data in them, so having several of them isn't a cost increase.
What is Azure Data Lake (ADL) and ADL storage?
Azure Data Lake is another option for both storage and compute. ADL storage can be thought of as blob storage v2: you get higher limits on IO and file size than with blob storage, while still being able to use Hadoop for compute. ADL Analytics is a second option for compute that is completely different from Hadoop. You don't have to worry about cluster creation or clusters in general. You write a query, specify the amount of parallelization you'd like, and the data is returned.
References:
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits
https://azure.microsoft.com/en-us/services/hdinsight/
https://azure.microsoft.com/en-us/solutions/data-lake/
Currently my team is creating a solution that will use HDInsight. We will be getting 5 TB of data daily and will need to run some map/reduce jobs on this data. Would there be any performance or cost difference if our data were stored in Azure Table Storage instead of Azure HBase?
The main differences will be in both functionality and cost.
Azure Table Storage doesn't have a map reduce engine of its own, though of course you could write your own using the map reduce approach.
You can use Azure HDInsight to connect Map Reduce to table storage. There are a couple of connectors around. One, written by me, is Hive-focused, requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/). Another, from someone at Microsoft, is less performance-focused but more complete (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).
The main advantage of Table Storage is that you aren't constantly paying for processing.
If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage. However, you will get some functionality and performance gains, plus something a bit more portable should you wish to move to other Hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.
HDInsight (HBase/Hadoop) uses Azure Blob storage, not Azure Table Storage (ATS). For your data storage you will be charged only the applicable blob storage cost, based on your subscription.
P.S. Don't forget to delete your cluster once the job has completed, to avoid charges. Your data will persist in Blob storage and can be used by the next cluster you build.
We have a logging system called Xtrace. We use it to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limit that SQL Azure has, we are thinking of using the HDInsight (Big Data) service.
If we dump the data in Azure Table Storage (ATS), will the HDInsight service work against ATS?
Or will it work only against blob storage, which means the log records need to be created as files on blob storage?
Last question: considering the scenario I explained above, is it a good candidate for the HDInsight service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
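The hourly-blob layout described above could be sketched like this. The naming scheme, the "logs" prefix, and the account and container names are hypothetical; the wasbs URI form is also an assumption, taken from the current Hadoop Azure driver rather than from the ASV naming used in this answer.

```python
from datetime import datetime

def hourly_blob_name(ts, prefix="logs"):
    """Blob name for the log file covering the hour that contains `ts`."""
    return f"{prefix}/{ts:%Y/%m/%d}/{ts:%H}.log"

def daily_input_dir(account, container, ts, prefix="logs"):
    """Job input directory that covers every hourly blob written on ts's date."""
    return (f"wasbs://{container}@{account}.blob.core.windows.net/"
            f"{prefix}/{ts:%Y/%m/%d}/")

# Hypothetical usage: one blob per hour, one input directory per day.
stamp = datetime(2024, 1, 15, 13, 42)
blob = hourly_blob_name(stamp)
input_dir = daily_input_dir("myaccount", "xtrace", stamp)
```

Because each day's blobs share a common prefix, a job can take one day's directory as its input without listing the individual hourly files.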
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, since it skips the bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store it initially in Table Storage, then extract the specific data you need whenever you launch your HDInsight jobs.
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading with regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least two implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball