Hadoop on Azure using IaaS - azure

I am looking at setting up a Hadoop cluster for Big Data analytics using virtual machines in Azure. As the data volume is very high, I am looking at storing data in secondary storage such as Azure Data Lake Store, with the Hadoop cluster storage acting as the primary storage.
I would like to know how this can be configured so that when I create a Hive table with partitions, part of the data can reside in primary storage and the rest in secondary storage.
Thanks
Regards,
Madhu

You can't mix file systems within a Hive table by default. The Hive metastore holds only one filesystem location for a database / table definition.
You might try to use Waggle Dance to set up a federated Hive solution, but that is probably more work than simply letting all of the Hive data live in Azure storage.

I don't know about Hadoop and Hive, but you could combine Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse (ADW), i.e. use PolyBase in ADW to create an external table on the 'cold' data in ADLS and an internal table for your 'warm' data. ADW has the advantage that you can pause it.
Optionally create a view over the top to combine the external and internal table.

Related

How to access Delta Lake tables without a Databricks cluster running

I have created Delta Lake tables on a Databricks cluster, and I am able to access these tables from an external system/application. However, I need to keep the cluster up and running all the time to be able to access the table data.
Question:
Is it possible to access the Delta Lake tables when the cluster is down?
If yes, how can I set this up?
I tried to look this up in the docs and found that the Premium tier of Databricks has Table Access Control, which is disabled otherwise. It says:
Enabling Table Access Control will allow users to control who can
select, create, and modify databases, tables, views, and functions
that they create.
I also found this doc
I don't think this is the right option for my requirement. Please suggest.
The solution I found is to store all Delta Lake tables on ADLS Gen2 storage. This makes them accessible to external resources regardless of the Databricks cluster.
The cluster only needs to be up and running while reading a file or writing into a table; the rest of the time it can be shut down.
From the docs: in Databricks we can create Delta tables of two types, managed and unmanaged. Managed tables are those for which the data is stored in DBFS (Databricks File System), while unmanaged tables are those for which an external ADLS Gen2 location can be specified.
dataframe.write.format("delta").mode("overwrite").option("path","abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net").saveAsTable("table")
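As a small illustration of the external location used above, here is a minimal sketch of a helper that assembles the ADLS Gen2 URI. The function name abfss_uri and the container, account and folder values are hypothetical placeholders, not something from the original post.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    if not container or not account:
        raise ValueError("container and account must be non-empty")
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    # Strip leading slashes so we never emit a double slash after the host.
    clean = path.lstrip("/")
    return f"{base}/{clean}" if clean else base


# Placeholder names; substitute your own container, account and folder.
table_path = abfss_uri("mycontainer", "mystorageaccount", "delta/events")
print(table_path)
```

The resulting string can then be passed as the "path" option when writing the table. Note that the separator between container and storage account is "@".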

Is it possible to access Databricks DBFS from Azure Data Factory?

I am trying to use the Copy Data activity to copy data from Databricks DBFS to another place on DBFS, but I am not sure if this is possible.
When I select Azure Delta Storage as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but on validation it says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?

Is it possible to read an Azure Databricks table from Azure Data Factory?

I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action. In the next step of your Data Factory pipeline, you can then read the dumped data from the storage account and process it further.
Another option may be databricks delta although I have not tried this yet...
If you register the table in the Databricks Hive metastore, then ADF could read from it using the ODBC source in ADF, though this would require an integration runtime (IR).
Alternatively, you could write the table to external storage such as Blob storage or a data lake. ADF can then read that file and push it to your SQL database.
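The "write the table to external storage" step suggested in the answers above could be sketched roughly as follows in a Databricks notebook. This is only an illustration under assumptions: the function name dump_table_to_blob and all argument values are hypothetical, the cluster is assumed to already have credentials for the storage account, and Parquet is just one possible output format.

```python
def dump_table_to_blob(spark, table_name, container, account, folder):
    """Write a registered table to Blob storage as Parquet; return the target URI.

    `spark` is the SparkSession that Databricks notebooks expose by default.
    """
    target = (
        f"wasbs://{container}@{account}.blob.core.windows.net/"
        f"{folder.strip('/')}"
    )
    # Overwrite so that re-running the job replaces the previous dump.
    spark.table(table_name).write.mode("overwrite").parquet(target)
    return target
```

Calling dump_table_to_blob(spark, "sales", "exports", "myaccount", "sales") as the last step of the job would leave Parquet files that a Copy activity in ADF can pick up from the storage account.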

Azure Data Lake vs Azure HDInsight

I was going through the Microsoft documents:
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
I'm new to Azure Data Lake and HDInsight. There is a statement at that URL which says:
"Azure Data Lake Store can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs."
As per my initial understanding, Data Lake Store is a store in which any kind of data can be stored. I think HDInsight does more or less the same thing.
My question is: what is the difference between Azure Data Lake and Azure HDInsight? If HDInsight can be used for file storage or any other kind of storage, then why use Data Lake? It would be great if someone could clarify this in detail. Thanks.
The easiest way to think of a Data Lake is as a large container, like a real lake with rivers flowing into it: you never know where the rivers are coming from (or what "type" of river each one is). Azure Data Lake was introduced to make it easy for developers, data scientists, and analysts to store data of any size. It removes the complexities of ingesting and storing all your data while making it faster to get up and running with big data. A Data Lake is able to store massive amounts of different types of data (structured data, unstructured data, log files, real-time data, images, etc.) and to blend them together, correlating many different data types. The key thing here is that we are moving from traditional tools to modern ones (like Hadoop, Cassandra, NoSQL databases, etc.). Azure Data Lake includes three services:
Azure Data Lake Store, a no-limits data lake that powers big data analytics
Azure Data Lake Analytics, a massively parallel on-demand job service
Azure HDInsight, a fully managed cloud Hadoop and Spark offering
Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size. We can run services on top of the data that's in that store. So you could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake analytic service, which is a complement to the Azure Data Lake Store. And what that service will let you do is to run jobs that effectively query the data you have stored in the Azure Data Lake store and generate output results.
In a nutshell,
HDInsight is a managed Hadoop service (it provides compute).
Azure Data Lake (ADL) is a managed storage service (it provides large amounts of storage).
(Instead of ADL, you can alternatively choose to use Blobs with HDInsight, but Blobs have some limitations, e.g. file streaming to storage via an HDInsight cluster is not supported.)
Here is the definition from Azure documentation (below):
Azure uses "decomposed hardware method"
You can think of HDInsight as a Hadoop cluster and Azure Data Lake (ADL) as HDFS, but the two are detached.
If you want to relate this to AWS, HDInsight is equivalent to EMR and ADL is equivalent to EMRFS or S3.
If you terminate the cluster, the ADL storage stays, along with the files stored in it. You can access the storage directly using another service or tool (like Azure Databricks), or you can create another HDInsight cluster on top of the data.
HDInsight accesses ADL using adl://, and HDInsight never stores the file blocks on the nodes (like Hadoop does); rather, it has mappings to the storage service.
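Because the store lives outside the cluster, it can also be reached without any cluster at all through the WebHDFS-compatible REST API quoted in the question. Here is a minimal sketch that only builds such a request URL for Data Lake Store (Gen1). The helper name and the account and folder values are hypothetical, and a real request would additionally need an Azure AD bearer token in the Authorization header.

```python
from urllib.parse import quote, urlencode


def adls_gen1_webhdfs_url(account: str, path: str, op: str = "LISTSTATUS") -> str:
    """Build a WebHDFS-style URL for an Azure Data Lake Store (Gen1) account."""
    # Keep "/" as a path separator but percent-encode other special characters.
    encoded_path = quote(path.strip("/"))
    query = urlencode({"op": op})
    return (
        f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
        f"{encoded_path}?{query}"
    )


# Placeholder account and folder names.
print(adls_gen1_webhdfs_url("mystore", "/logs/2018"))
```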
Azure Data Lake Store is just that: a data store. HDInsight can also store data in the cluster that you spin up. However, when you stop that cluster, the data also goes away.
It is common that customers use either Azure Data Lake Store, or Azure storage to provide permanent storage separate from the cluster (compute) used to process the data.
Guy
HDInsight is the analytics service, whereas Azure Data Lake Storage is the storage service. You most likely need both to have a functional analytics cluster.
HDInsight provides the cluster and fully manages the open-source analytics packages (Hadoop, Spark, etc.), and you set up your cluster to use Azure Data Lake Storage, which supports the HDFS API (Hadoop FileSystem) on top of cloud storage.
Azure Data Lake Storage Gen2 is what you should start looking at; it merges the benefits of both Azure Storage and ADLS in one service.
ADLS Gen 2 documentation - https://learn.microsoft.com/en-us/azure/storage/data-lake-storage/introduction
Azure Data Lake Analytics provides serverless compute while using Azure Data Lake Store for data storage, whereas in HDInsight we need to specify and size the compute virtual machine nodes according to processing requirements. It may be advantageous for developers to work with serverless compute in Azure Data Lake Analytics, as the scaling needs of an analytics job are taken care of out of the box.

How to read/write data from/to Azure Table using Hadoop?

I'd like to have a Hadoop job which reads data from Azure Table storage and writes data back into it. How can I do that?
I'm mostly interested in writing data into Azure tables from HDInsight.
