I am learning from this course. It asks me to create a new HDInsight cluster (the options are Hadoop, HBase, Storm, or Spark) and also a storage account. What is the difference between a cluster and a storage account? Does a cluster include the processors that run my jobs, and does a storage account mean space to store my data? Why can't I connect the same storage account to different clusters?
Also, under Microsoft Azure >> New >> Data + Analytics, I see two options that deal with big data: HDInsight and Data Lake Analytics. What is the difference between those two? They look similar.
HDInsight
Microsoft's cloud-based Big Data service. Apache Hadoop and other popular Big Data solutions.
Data Lake Analytics
Big data analytics made easy
There are a lot of questions in here, so let me answer them one by one.
What is Blob Storage vs HDInsight Cluster?
Blob storage is a distributed file store, very similar to HDFS, used to store data, videos, and other objects. An HDInsight cluster is a set of Hadoop virtual machines created to run MapReduce code over a distributed file system (HDFS or Blob storage). Having two separate services allows you to scale each independently, which saves money in the long term. Data storage is cheap, but a 500-node VM cluster can get pricey quickly. Being able to kill the cluster but keep your data is helpful.
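To make the cost argument concrete, here is a rough sketch in Python. Both prices are made-up placeholders, not Azure's actual rates, and the 10,000 GB data size is also an assumption; only the shape of the comparison matters.

```python
# Hypothetical prices, for illustration only (NOT real Azure rates).
NODE_PRICE_PER_HOUR = 0.50          # assumed cost of one cluster node per hour
STORAGE_PRICE_PER_GB_MONTH = 0.02   # assumed cost of keeping 1 GB for a month
HOURS_PER_MONTH = 730


def monthly_cost(nodes, cluster_hours, stored_gb):
    """Compute cost for `nodes` VMs running `cluster_hours`, plus storage for `stored_gb`."""
    compute = nodes * cluster_hours * NODE_PRICE_PER_HOUR
    storage = stored_gb * STORAGE_PRICE_PER_GB_MONTH
    return compute + storage


# Same data, same cluster size; only the number of hours the cluster exists differs.
always_on = monthly_cost(nodes=500, cluster_hours=HOURS_PER_MONTH, stored_gb=10_000)
on_demand = monthly_cost(nodes=500, cluster_hours=4 * 8, stored_gb=10_000)  # four 8-hour runs

print(f"always on: ${always_on:,.2f}")
print(f"on demand: ${on_demand:,.2f}")
```

With any realistic prices the storage term stays small next to the compute term, which is why deleting the cluster between runs is where the savings come from.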
Why can't I connect the same storage account with different clusters?
You can have multiple clusters pointed at the same storage account, but it's an anti-pattern. Storage accounts have data and IO limits, and if multiple clusters pull from a single storage account, you are more likely to hit them. Also, storage accounts only cost money if you have data in them, so having several isn't a cost increase.
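As a minimal sketch of why sharing is risky: the limit below is an assumed figure and the per-cluster request rates are invented, so check the service-limits page in the references for the real numbers.

```python
# Assumed per-account request limit; the real value is in the Azure service-limits docs.
ASSUMED_ACCOUNT_LIMIT_RPS = 20_000


def exceeds_account_limit(cluster_rps, limit=ASSUMED_ACCOUNT_LIMIT_RPS):
    """True if the clusters sharing one storage account together exceed its request limit."""
    return sum(cluster_rps) > limit


# Hypothetical request rates (requests per second) for three clusters.
clusters = [9_000, 8_000, 7_000]

print(exceeds_account_limit(clusters))                      # all three on one account
print([exceeds_account_limit([rps]) for rps in clusters])   # one account per cluster
```

Each cluster is comfortably under the limit on its own account; it is only the sum on a shared account that gets throttled.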
What are Azure Data Lake (ADL) and ADL storage?
Azure Data Lake is another option for both storage and compute. ADL storage can be thought of as Blob storage v2: you get higher IO and file-size limits than Blob storage while still being able to use Hadoop for compute. ADL Analytics is a second option for compute that is completely different from Hadoop. You don't have to worry about cluster creation, or clusters in general. You write a query, specify the amount of parallelization you'd like, and the data is returned.
References:
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits
https://azure.microsoft.com/en-us/services/hdinsight/
https://azure.microsoft.com/en-us/solutions/data-lake/
Related
I was going through the Microsoft documents:
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
I'm new to Azure Data Lake and HDInsight. There is a statement on that page which says:
"Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs."
As per my initial understanding, Data Lake Store is a store in which any kind of data can be stored. I think HDInsight does something similar.
My question is: what is the difference between Azure Data Lake and Azure HDInsight? If HDInsight can be used for file storage or any kind of storage, then why use Data Lake? It would be great if someone could clarify this in detail. Thanks.
The easiest way to think of a data lake is as a large container, like a real lake with rivers flowing into it: you never know where the rivers are coming from, or what "type" of river each one is. Azure Data Lake was introduced to make it easy for developers, data scientists, and analysts to store data of any size. It removes the complexities of ingesting and storing all your data while making it faster to get up and running with big data. Data Lake is able to store many different types of data (structured data, unstructured data, log files, real-time data, images, etc.) and to blend them together, correlating many different data types. This matters as we move from traditional approaches to modern tools (like Hadoop, Cassandra, NoSQL databases, etc.). Azure Data Lake includes three services:
Azure Data Lake Store, a no-limits data lake that powers big data analytics
Azure Data Lake Analytics, a massively parallel on-demand job service
Azure HDInsight, a fully managed cloud Hadoop and Spark offering
Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size. We can run services on top of the data in that store. So you could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake Analytics service, which is a complement to the Azure Data Lake Store. That service lets you run jobs that query the data you have stored in the Azure Data Lake Store and generate output results.
In a nutshell,
HDInsight is a managed Hadoop service (it provides compute).
Azure Data Lake (ADL) is a managed storage service (it provides large amounts of storage).
Instead of ADL, you can alternatively use Blobs with HDInsight, but Blobs have some limitations (for example, streaming files to storage via an HDInsight cluster is not supported).
Here is the definition from Azure documentation (below):
Azure uses a "decomposed hardware method".
You can think of HDInsight as a Hadoop cluster and Azure Data Lake (ADL) as HDFS, except that the two are detached from each other.
If you want to relate this to AWS, HDInsight is equivalent to EMR and ADL is equivalent to EMRFS or S3.
If you terminate the cluster, the ADL storage remains, along with the files stored in it. You can access the storage directly using another service or tool (like Azure Databricks), or you can create another HDInsight cluster on top of the data.
HDInsight accesses ADL using adl://, and HDInsight never stores the file blocks on its nodes (as Hadoop does); rather, it has mappings to the storage service.
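To illustrate what such a mapping looks like, here is a small Python sketch that takes apart the two URI forms a cluster uses to address external storage. The adl:// scheme is the one mentioned above; the wasb:// form is the corresponding scheme for Blob storage. The account, container, and file names are hypothetical, and the function name is just for this example.

```python
from urllib.parse import urlsplit


def describe_storage_uri(uri):
    """Split a Hadoop-style storage URI into the parts that identify where the data lives."""
    parts = urlsplit(uri)
    if parts.scheme in ("wasb", "wasbs"):
        # Blob storage: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
        return {
            "service": "Azure Blob storage",
            "account": parts.hostname.split(".")[0],
            "container": parts.username,
            "path": parts.path,
        }
    if parts.scheme == "adl":
        # Data Lake Store: adl://<account>.azuredatalakestore.net/<path>
        return {
            "service": "Azure Data Lake Store",
            "account": parts.hostname.split(".")[0],
            "container": None,
            "path": parts.path,
        }
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")


# Hypothetical account and container names.
print(describe_storage_uri("adl://mydatalake.azuredatalakestore.net/raw/events.csv"))
print(describe_storage_uri("wasb://logs@mystorage.blob.core.windows.net/2016/01.log"))
```

Nothing in either URI refers to a cluster node, which is the point: the path resolves to the storage service, so it stays valid after the cluster is gone.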
Azure Data Lake Store is just that: a data store. HDInsight can also store data in the cluster that you spin up. However, when you stop that cluster, the data also goes away.
It is common that customers use either Azure Data Lake Store, or Azure storage to provide permanent storage separate from the cluster (compute) used to process the data.
Guy
HDInsight is the analytics service, whereas Azure Data Lake Storage is the storage service. You most likely need both to have a functional analytics cluster.
HDInsight provides the cluster and fully manages the open-source analytics packages (Hadoop, Spark, etc.), and you set up your cluster to use Azure Data Lake Storage, which supports the HDFS API (Hadoop FileSystem) on top of cloud storage.
Azure Data Lake Storage Gen2 is what you should start looking at; it merges the benefits of both Azure Storage and ADLS in one service.
ADLS Gen 2 documentation - https://learn.microsoft.com/en-us/azure/storage/data-lake-storage/introduction
Azure Data Lake Analytics provides serverless compute while using Azure Data Lake Store for data storage, whereas in HDInsight we need to specify and design the compute virtual machine nodes according to our processing requirements. It may be advantageous for developers to work with serverless compute in Azure Data Lake Analytics, as the scaling needs of an analytics job are taken care of out of the box.
I hope someone can offer some advice. I have been asked to scope out a possible infrastructure for a new Azure platform. We are also going to be using HDFS / Hadoop for our ETL and storage.
Can anyone offer advice on the following:
Set up a storage-optimised server (e.g. L4: 4 cores, 32 GB RAM, 678 GB storage) to hold our raw data, reference tables, and final cleansed data within HDFS. This server could run 24/7 to feed our analytics platforms.
Then, to utilise the power of Hadoop, could we spin up a set of processing servers (e.g. once a week) to read from the storage server, process the data, write back to the storage server, and then shut down until the next load and process task?
I would really appreciate anyone's thoughts or advice on this, or any other configurations we could consider.
Many thanks
Fiorano
Whether your Hadoop cluster is on-premises or in the cloud, it contains two main resources: compute resources to process jobs, and storage resources to hold data. In an on-premises cluster, the storage and compute resources are combined into the same hardware tying them together. With HDInsight the storage is wholly separated from the compute resource. This is a very important distinction of HDInsight. It means that I can completely turn off the compute portion of the cluster and the data will remain accessible.
Note: To analyze data in HDInsight cluster, you can store the data either in Azure Storage, Azure Data Lake Store, or both.
For more details, refer to the "Azure HDInsight Documentation".
When creating an HDInsight Hadoop cluster in Azure, there are two storage options: Azure Data Lake Store (ADLS) or Azure Blob Storage.
What are the real differences between these two options and how do they affect the performance?
I found this page https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage
But it is not very specific; it only uses very general terms like "ADLS is optimized for analytics".
Does that mean it's better for storing the HDInsight file system? And if ADLS is indeed faster, then why not use it for non-analytics data as well?
As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.
Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.
Hope this helps.
In addition to Ashok's answer: ADLS is currently only available in a few regions, compared to Azure Storage. So if you need your HDInsight account in a specific region, you should make sure your storage is in the same region.
Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.
In addition to the other answers: it's not possible to use the Spark Data Factory activity on HDInsight clusters that use Data Lake as the primary storage. This limitation applies to both ADFv1 and v2, as seen here: https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-spark
I need some basic clarifications about Azure HDInsight.
The following article gives some basic input on using HDInsight.
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDInsight internally uses Azure Blob storage.
Having this in mind, my question is as follows:
I have an HDInsight cluster hd1 which uses storage account stg1.
If I just want to upload and download files to stg1 using Azure Storage Explorer, then what's the use of having hd1? I can do that without even creating an HDInsight cluster, which costs a lot.
So, is Hadoop on HDInsight only used for processing data stored in stg1 to produce results, like a word count? Is that the only reason to use HDInsight?
If you want to understand the HDInsight and blob storage better, you need to read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there are four different base types: Hadoop, HBase, Storm, and Spark. You can always install additional components on top of the base types.
Your question is really about why to use Hadoop. Hadoop shines when you need to process a lot of data: big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You would still need to copy the data to Azure Blob storage (or store it there directly). When you are ready to process, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so that you stop paying for it. Even after the cluster is deleted, your data stored in Blob storage is retained.
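The create, submit, delete cycle can be sketched with a toy model in Python. This is not the Azure SDK; the BlobStorage and Cluster classes are hypothetical stand-ins that only show which object owns the data.

```python
class BlobStorage:
    """Toy stand-in for a storage account: data lives here, independent of any cluster."""

    def __init__(self):
        self.files = {}


class Cluster:
    """Toy stand-in for an HDInsight cluster: compute only, attached to existing storage."""

    def __init__(self, storage, nodes):
        self.storage = storage
        self.nodes = nodes
        self.running = True

    def word_count(self, source, target):
        """Read `source` from storage, count words, and write the result back to storage."""
        if not self.running:
            raise RuntimeError("cluster has been deleted")
        counts = {}
        for word in self.storage.files[source].split():
            counts[word] = counts.get(word, 0) + 1
        self.storage.files[target] = counts

    def delete(self):
        # Compute goes away here; the storage object is left untouched.
        self.running = False


storage = BlobStorage()
storage.files["input.txt"] = "to be or not to be"   # upload data first, no cluster needed

cluster = Cluster(storage, nodes=4)                 # create the cluster when ready to process
cluster.word_count("input.txt", "output/counts")    # submit a job
cluster.delete()                                    # delete the cluster to stop paying for it

print(storage.files["output/counts"])               # input and output are both still in storage
```

Uploading and downloading files never touches the Cluster class at all, which matches the observation in the question: the cluster is only needed for the processing step.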
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is only one aspect.
When using HDInsight and choosing Azure Storage Blob to store the data that needs to be computed, you still have to choose the number of data nodes when provisioning a new cluster. If your data is being stored on an Azure Storage Blob, what impact does the number of data nodes have? Is the data from the blob actually replicated onto the data nodes?
If you put data on the Azure Blob Store, it stays there, and is read directly from Azure Storage.
The data nodes in the HDInsight cluster have two purposes. Firstly, they run the actual compute jobs, which read from Azure Storage directly. This is not as crazy as it might sound to an HDFS user because of Azure's consistent underlying fabric, which keeps the storage nice and close to the compute.
Secondly, the data nodes are running an HDFS filesystem on their local disk. This is generally only used for intermediate and tmp files in HDInsight, since it is transitory (only lasts as long as the cluster).
So, choosing the number of data nodes is essentially choosing how many job-running slots (YARN application containers, or JobTracker slots, depending on the version) you want to have available and, to a lesser extent, how much temp space your jobs need.
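As a back-of-the-envelope sketch of that trade-off in Python: the node memory, container size, and local disk figures below are hypothetical, and the calculation ignores CPU limits and the memory reserved for the OS and Hadoop daemons, so treat the result as an upper bound.

```python
def concurrent_containers(worker_nodes, node_memory_gb, container_memory_gb):
    """Rough upper bound on containers that fit in memory across all worker nodes."""
    per_node = node_memory_gb // container_memory_gb
    return worker_nodes * per_node


def temp_space_gb(worker_nodes, local_disk_gb):
    """Total local (transient) HDFS space available for intermediate and temp files."""
    return worker_nodes * local_disk_gb


# Hypothetical node size: 28 GB of memory, 200 GB of local disk, 3 GB per container.
for nodes in (4, 8, 16):
    slots = concurrent_containers(nodes, node_memory_gb=28, container_memory_gb=3)
    temp = temp_space_gb(nodes, local_disk_gb=200)
    print(f"{nodes} nodes: up to {slots} containers, {temp} GB temp space")
```

Note that the size of the data in Blob storage does not appear anywhere in the calculation; the node count only changes how much of it can be processed at once.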