HDInsight: HBase or Azure Table Storage?

My team is currently building a solution that would use HDInsight. We will be receiving 5 TB of data daily and will need to run some map/reduce jobs on it. Would there be any performance or cost difference if our data is stored in Azure Table Storage instead of Azure HBase?

The main differences will be in both functionality and cost.
Azure Table Storage has no MapReduce engine attached to it, though of course you could write your own processing using the MapReduce approach.
You can use Azure HDInsight to run MapReduce against Table Storage. There are a couple of connectors around: one written by me, which is Hive-focused, requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/), and a less performance-focused but more complete version from someone at Microsoft (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).
The main advantage of Table Storage is that you aren't constantly paying for processing capacity.
If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage. However, you will get some functionality and performance gains, plus something a bit more portable should you wish to use other Hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.
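To make the cost trade-off concrete, here is a rough sketch comparing a cluster that runs all month with one that only exists while jobs run. The node count, hourly price, and job hours are made-up placeholders, not Azure prices; substitute figures from the current pricing page.

```python
# Rough cost sketch: always-on cluster (HBase style) vs. on-demand cluster
# (created for the job, deleted afterwards). All numbers are hypothetical
# placeholders, not real Azure prices.

HOURS_PER_MONTH = 730  # average number of hours in a month


def cluster_cost(nodes: int, price_per_node_hour: float, hours: float) -> float:
    """Compute cost for `nodes` machines running for `hours` hours."""
    return nodes * price_per_node_hour * hours


def monthly_costs(nodes, price_per_node_hour, job_hours_per_day, days=30):
    """Return (always_on, on_demand) compute cost for one month."""
    always_on = cluster_cost(nodes, price_per_node_hour, HOURS_PER_MONTH)
    on_demand = cluster_cost(nodes, price_per_node_hour, job_hours_per_day * days)
    return always_on, on_demand


if __name__ == "__main__":
    always_on, on_demand = monthly_costs(
        nodes=10, price_per_node_hour=0.50, job_hours_per_day=4
    )
    print(f"always-on: {always_on:.2f}, on-demand: {on_demand:.2f}")
```

Storage charges are left out because they apply in both cases; only the compute side differs between the two options.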

HDInsight (HBase/Hadoop) uses Azure Blob storage, not ATS. For data storage you will be charged only the applicable Blob storage cost, based on your subscription.
P.S. Don't forget to delete your cluster once the job has completed, to avoid charges. Your data will persist in Blob storage and can be used by the next cluster you build.

Related

Databricks Delta Tables - Where are they normally stored?

I'm beginning my journey into Delta Tables, and one thing that is still confusing me is where the best place is to save your Delta tables if you need to query them later.
For example, I'm migrating several tables from on-prem to Azure Databricks into individual Delta tables. My question is: should I save the individual Delta tables, which could be significant in size, into DBFS (the Databricks internal storage), or should I mount a blob storage location and save the Delta Lake tables there? What do people normally do in these situations?
I usually recommend storing data in a separate storage account (either mounted or used directly) rather than using the workspace's internal storage for that. The primary reason is that it's easier to share this data with other workspaces or other systems if necessary. Internal storage should be used primarily for temp files, libraries, init scripts, etc.
There are a number of useful guides available that can help:
Azure Databricks Best Practices, which specifically discusses internal storage
About securing access to Azure Data Lake

Which Azure storage technology for weather forecast data

I would like some advice/tips about the right technology to select in order to store some forecast data on Azure technologies.
My team and I scrape weather forecast data every day from various sources and store it as is in Azure File Storage. The file format is "grib2", which is a standard format for weather forecast data.
We are able to extract the data from those "grib2" files using a Python script running on an Azure VM.
We now have several files that represent hundreds of gigabytes of data to store, and I'm struggling to find which Azure data store best suits our needs in terms of practicality and cost.
We started with "Azure Table Storage" because it's a cheap solution,
but I've read in many posts that it is a bit old and not well suited to our solution: for example, it does not allow more than 1,000 entities per query and offers no aggregation on data.
I considered using Azure SQL Database, but it seems that it can become very expensive very fast.
I also considered Azure Data Lake Storage Gen2 (and HDInsight), but I am not very at ease with those blob storages and can't really say whether they suit my needs in terms of practicality or whether they are "easy to query".
For now we just plan to do the following:
1) Extract data from grib2 files thanks to a python script running on an Azure VM
2) Insert the transformed data into [Azure storage]
3) Query the [Azure storage] from Azure Machine Learning Service or a local R script (for example)
4) Insert the computed data into [Azure storage]
where the [Azure storage] technology is still to be determined.
Any help or advice would be much appreciated, thanks.
A few things I would suggest here:
Store the downloaded files in raw format (grib2 in your case) on good ol' Azure Blob Storage. It is cheap storage that fits exactly this need.
Use Azure Databricks to load the data from the storage account and unpack it into memory (Python or Scala).
Keep the data in memory, still in Databricks, to run your ML inference. You could also use SparkR if you really want to.
Store the computed files in a serving layer. Which one really depends on what you want to do with the data later. Often Azure SQL Database is an obvious choice. There is a native Spark connector which efficiently writes data from Databricks to SQL DB.
In addition to using Databricks as your inference environment, it's also a good choice for ML training (e.g. utilizing Azure ML Service).
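On the 1,000-entity limit mentioned in the question: Table Storage returns at most 1,000 entities per response, along with a continuation token when more results remain, so a client pages through results rather than being capped at 1,000 overall. A minimal sketch of that loop follows, using an in-memory stand-in for the service; `fake_query_page` and `FAKE_TABLE` are hypothetical names for illustration, not Azure SDK calls.

```python
from typing import Iterator, List, Optional, Tuple

PAGE_SIZE = 1000  # Table Storage returns at most 1,000 entities per response

# In-memory stand-in for a table; a real client would call the service instead.
FAKE_TABLE = [
    {"RowKey": f"{i:06d}", "temperature": 10 + i % 20} for i in range(2500)
]


def fake_query_page(token: Optional[int]) -> Tuple[List[dict], Optional[int]]:
    """Hypothetical page fetch: returns one page and the next token (or None)."""
    start = token or 0
    page = FAKE_TABLE[start:start + PAGE_SIZE]
    next_start = start + PAGE_SIZE
    next_token = next_start if next_start < len(FAKE_TABLE) else None
    return page, next_token


def query_all() -> Iterator[dict]:
    """Follow continuation tokens until no further page is reported."""
    token = None
    while True:
        page, token = fake_query_page(token)
        yield from page
        if token is None:
            break


if __name__ == "__main__":
    entities = list(query_all())
    # Aggregation happens client-side here, since the service offers none.
    mean_temp = sum(e["temperature"] for e in entities) / len(entities)
    print(len(entities), round(mean_temp, 2))
```

Paging removes the hard cap, but every entity still travels to the client before it can be aggregated, which is part of why an engine such as Spark or a SQL database is the more practical choice for analytical queries.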

Use Data Lake or Blob on an HDInsight cluster on Azure

When creating an HDInsight Hadoop cluster in Azure there are two storage options: either Azure Data Lake Store (ADLS) or Azure Blob Storage.
What are the real differences between these two options and how do they affect the performance?
I found this page https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-comparison-with-blob-storage
But it is not very specific and only uses very general terms like "ADLS is optimized for analytics".
Does it mean that it's better for storing the HDInsight file system? And if ADLS is indeed faster, then why not use it for non-analytics data as well?
As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.
Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.
Hope this helps.
In addition to Ashok's answer: ADLS is currently only available in a few regions, compared to Azure Storage. So if you need your HDInsight account in a specific region, you should make sure your storage is in the same region.
Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.
The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.
In addition to the other answers: it's not possible to use the Spark Data Factory activity on HDInsight clusters that use Data Lake as the primary storage. This limitation applies to both ADFv1 and v2, as seen here: https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-spark

How to efficiently move big data from a data center to Azure Blob Storage for later processing via HDInsight?

I need to set up scheduled tasks whose purpose is to copy/move large amounts of data from an on-premises data center to Windows Azure Blob Storage.
The options I've explored are WebHDFS and Flume (the latter does not seem to be supported by HDInsight currently).
What is the most efficient way to transfer unstructured files from a data center to Windows Azure Blob Storage?
If you are using HDInsight, you don't need to involve HDFS at all. In fact you don't need your cluster to be running to upload the data. The best way of getting data into HDInsight is to upload it to Azure Blob Storage, using either the standard .NET clients, or something third-party like Azure Management Studio or AzCopy.
If you want to stream the data constantly, then you are probably better off setting up something like Flume, Kafka or Storm to work against an HDInsight cluster, but that will require a certain amount of customisation on the cluster itself, which means you'll run into problems with reboots and will need a permanent cluster.
You didn't mention how much data you're talking about (you just said large amounts). But assuming it's hundreds of TB or petabytes, Azure has an Import/Export Service which lets you ship disks.
Outside of that, you'd need to use your own code or a tool such as Microsoft's AzCopy to transfer your content to blobs. Remember that you'll be able to perform parallel uploads to reduce transfer time (as long as your data center's bandwidth is large enough for you to see the benefits).
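If you go the own-code route, parallel uploads can be sketched with a thread pool, as below. This is only an outline under assumptions: `upload_one` is a placeholder that reads the file and returns its size, and a real version would call a storage client at that point. The worker count is arbitrary and should be tuned to the available bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def upload_one(path: Path) -> int:
    """Placeholder upload: a real version would send the file to blob storage.

    Here it only reads the file and returns its size in bytes.
    """
    return len(path.read_bytes())


def upload_in_parallel(paths: Iterable[Path],
                       upload: Callable[[Path], int] = upload_one,
                       max_workers: int = 8) -> Dict[Path, int]:
    """Run `upload` for every path on a thread pool and collect the results."""
    results: Dict[Path, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload, p): p for p in paths}
        for future in as_completed(futures):
            # .result() re-raises any exception raised in the worker thread.
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(4):
            f = Path(tmp) / f"part-{i}.dat"
            f.write_bytes(b"x" * (i + 1))
            files.append(f)
        print(upload_in_parallel(files))
```

Threads suit this job because uploads are I/O-bound; the pool keeps several transfers in flight while each one waits on the network.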
You could use CloudBerry Drive and Flume to stream data to an HDInsight cluster/Azure Blob storage:
http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx
No, you cannot use Flume to stream data directly to HDInsight. A post on the Microsoft blog says that
a vast majority of Flume consumers will land their streaming data into HDFS – and HDFS is not the default file system used with HDInsight. Even if it were - we do not expose public facing Name Node or HDFS endpoints so the Flume agent would have a terrible time reaching the cluster! So, for these reasons and a few others , the answer is typically "no. …it won't work or its not supported"
Source: http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx?CommentPosted=true#commentmessage
It is also worth mentioning the ExpressRoute option. Microsoft now has a program called ExpressRoute where your datacenter can be connected straight to Azure with a much faster connection, in cooperation with your ISP. See also http://azure.microsoft.com/en-us/services/expressroute/

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limitation that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against blob storage, which means the log records need to be created as files on blob storage?
Last question: considering the scenario I explained above, is it a good candidate for HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
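As an illustration of the hourly-blob layout, the sketch below builds one blob name per hour and the URI a job would use as its input directory. The container name, account name, and path layout are assumptions for the example. The `wasb` scheme is the one used by later HDInsight versions and `asv` is the older scheme this answer refers to, so check which one your cluster expects.

```python
from datetime import datetime


def hourly_blob_name(ts: datetime, prefix: str = "logs") -> str:
    """Blob name for the hour containing `ts`, e.g. logs/2014/03/18/13.log."""
    return f"{prefix}/{ts:%Y/%m/%d/%H}.log"


def input_uri(container: str, account: str, path: str, scheme: str = "wasb") -> str:
    """Build a cluster-side URI for a path inside a blob container.

    The container and account values are supplied by the caller; the ones
    used in the example below are hypothetical.
    """
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path}"


if __name__ == "__main__":
    ts = datetime(2014, 3, 18, 13, 5)
    print(hourly_blob_name(ts))
    # Input directory covering every hourly blob written on that day.
    print(input_uri("xtrace", "myaccount", "logs/2014/03/18"))
```

Putting the date parts in the blob name lets a job select a day or a month simply by pointing its input directory at the matching prefix.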
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
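The converter idea from this answer (read entities from Table Storage, write text files) could look roughly like the sketch below. It covers only the serialization half and takes entities as plain dictionaries; fetching them from the table is left out, and the column names in the example are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def entities_to_tsv(entities: Iterable[Dict[str, object]], columns: List[str]) -> str:
    """Serialize table entities as tab-separated lines, one entity per line.

    Missing properties become empty fields, since entities in the same
    table need not share a schema.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for entity in entities:
        writer.writerow([entity.get(col, "") for col in columns])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [
        {"PartitionKey": "web01", "RowKey": "0001", "Level": "Error", "Message": "boom"},
        {"PartitionKey": "web01", "RowKey": "0002", "Message": "ok"},
    ]
    print(entities_to_tsv(rows, ["PartitionKey", "RowKey", "Level", "Message"]), end="")
```

The `csv` module is used instead of a plain `"\t".join` so that values containing tabs or newlines are quoted rather than silently breaking the line structure.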
The answer above is slightly misleading with regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least two implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball

Resources