Azure Databricks with Storage Account as data layer

I have just started working on a data analysis project that requires analyzing high-volume data using Azure Databricks. While planning to use a Databricks notebook for the analysis, I have come across different storage options for loading the data: a) DBFS, the default file system from Databricks, b) Azure Data Lake Storage (ADLS), and c) Azure Blob Storage. It looks like options (b) and (c) can be mounted into the workspace to retrieve the data for our analysis.
With the above understanding, may I get the following questions clarified please?
What's the difference between these storage options when using them in the context of Databricks? Do DBFS and ADLS incorporate HDFS file management principles under the hood, like breaking files into chunks, name nodes, data nodes, etc.?
If I mount an Azure Blob Storage container to analyze the data, would I still get the same performance as with the other storage options? Given that Blob Storage is an object-based store, does it still break the files into blocks and load those chunks as RDD partitions into Spark executor nodes?

DBFS is just an abstraction on top of scalable object storage such as S3 on AWS, ADLS on Azure, and Google Cloud Storage on GCP.
By default, when you create a workspace you get an instance of DBFS, the so-called DBFS root. You can also mount additional storage accounts under the /mnt folder; data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, it's recommended that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data, as it has limitations: there is no fine-grained access control, you can't access the storage account mounted as the DBFS root from outside the workspace, etc.
The actual implementation of the storage service (name nodes, etc.) is abstracted away: you work with an HDFS-compatible API, but under the hood the implementation differs depending on the cloud and the flavor of storage. For Azure, you can find some details about the implementation in this blog post.
Regarding the second question: yes, you should still get the splitting of files into chunks, etc. There are differences between Blob Storage and Data Lake Storage, especially with ADLS Gen2, which has a better security model and may be better optimized for big data workloads. This blog post describes the differences between them.
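To make the mounting concrete, here is a minimal sketch (in a Databricks Python notebook) of mounting a Blob Storage container under /mnt and letting Spark partition the reads; the container, storage account, secret scope, and key names are placeholders, not values from the question:

    # Mount a Blob Storage container under /mnt (all names are placeholders).
    dbutils.fs.mount(
        source="wasbs://<container>@<storage-account>.blob.core.windows.net",
        mount_point="/mnt/rawdata",
        extra_configs={
            "fs.azure.account.key.<storage-account>.blob.core.windows.net":
                dbutils.secrets.get(scope="<secret-scope>", key="<storage-key>")
        },
    )

    # Spark reads the mounted files through the HDFS-compatible API and
    # splits them into partitions across the executor nodes.
    df = spark.read.parquet("/mnt/rawdata/events/")
    print(df.rdd.getNumPartitions())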

Related

DBFS Azure Databricks - difference in FileStore and DBFS

I am using Azure Databricks with ADLS as the storage layer. I have a doubt: what is the difference between DBFS and FileStore? Any idea what the max size of a file that can be stored in FileStore is?
Can we store output files in FileStore and then overwrite them?
Thank you.
DBFS is an abstraction over the cloud storage implementations that allows you to work with files in cloud storage using simple paths instead of full URLs. From the documentation:
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.
Under the hood, on Azure it uses the same ADLS, so the same limits should apply (the current limit is 200 TB per file).
P.S. Please note that there is the so-called DBFS root, backed by the storage account that is created automatically during workspace creation, and there are DBFS mounts to "external" storage accounts. It's generally recommended to use the DBFS root only for temporary files, because if you delete the workspace, that storage account will be removed as well.
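To answer the overwrite part of the question, here is a small sketch, assuming a hypothetical /FileStore/reports path; both dbutils.fs.put and the Spark writers can replace an existing file on each run:

    # Write (and later overwrite) a small text file under /FileStore.
    summary = "metric,value\nrows_processed,12345\n"
    dbutils.fs.put("/FileStore/reports/summary.csv", summary, True)  # True = overwrite

    # A DataFrame can likewise overwrite its previous output on every run.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.write.mode("overwrite").parquet("dbfs:/FileStore/reports/latest/")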

Azure Data Lake Gen2 vs Storage account

I have a requirement to process some big data and am planning to deploy a Databricks cluster and a storage technology. I am currently evaluating Data Lake Gen2, which supports both object and file storage. A storage account (blob, file, table, queue) also has similar capabilities and can handle both file-based and object-based storage requirements. I am a bit puzzled about which option to go for because of these similarities. Can someone clarify the following questions, please?
Apart from HDFS support, what other significant feature should make me use Data Lake Gen2 instead of a Storage Account?
Storage Account v2 with hierarchical namespace enabled == Data Lake Gen2. If so, can I use it to create file shares and mount them in my VM, like a storage account's File service?
For accessing data from Databricks, which one of these two will be better for big data workloads? I can see that a Storage Account can also be mounted as DBFS, which can still leverage distributed processing.
Apart from HDFS support, what other significant feature should make me use Data Lake Gen2 instead of a Storage Account?
Answer: There are also other benefits. In short, the benefits are performance, management, and security, as well as cost. For more details, you can refer to this official article.
Storage Account v2 with hierarchical namespace enabled == Data Lake Gen2. If so, can I use it to create file shares and mount them in my VM, like a storage account's File service?
Answer: Of course, ADLS Gen2 supports mounting file shares just as Blob Storage does.
For accessing data from Databricks, which one of these two will be better for big data workloads? I can see that a Storage Account can also be mounted as DBFS, which can still leverage distributed processing.
Answer: ADLS Gen2 can also be mounted as DBFS, and as per Answer 1, the better one should be ADLS Gen2.
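For reference, a hedged sketch of mounting an ADLS Gen2 container as DBFS with a service principal; the application ID, tenant ID, secret scope, container, and account names are placeholders:

    # OAuth configuration for an ADLS Gen2 (abfss) mount; all values are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<secret-scope>", key="<client-secret>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )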

What is the Data size limit of DBFS in Azure Databricks

I read here that the storage limit on AWS Databricks is 5 TB for an individual file and that we can store as many files as we want.
So does the same limit apply to Azure Databricks, or is there some other limit applied on Azure Databricks?
Update:
@CHEEKATLAPRADEEP Thanks for the explanation, but can someone please share the reason behind: "we recommend that you store data in mounted object storage rather than in the DBFS root"?
I need to use DirectQuery (because of the huge data size) in Power BI, and ADLS doesn't support that as of now.
From Azure Databricks Best Practices: Do not Store any Production Data in Default DBFS Folders
Important Note: Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root.
The reason for recommending that you store data in a mounted storage account, rather than in the storage account located in the ADB workspace:
Reason 1: You don't have write permission when you use the same storage account externally via Storage Explorer.
Reason 2: You cannot use the same storage account for another ADB workspace, or use the same storage account linked service for Azure Data Factory or an Azure Synapse workspace.
Reason 3: In the future, you might decide to use Azure Synapse workspaces instead of ADB.
Reason 4: What if you want to delete the existing workspace?
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage, i.e. ADLS Gen2.
There is no restriction on the amount of data you can store in Azure Data Lake Storage Gen2.
Note: Azure Data Lake Storage Gen2 is able to store and serve many exabytes of data.
For the Azure Databricks Filesystem (DBFS): only files less than 2 GB in size are supported through local file I/O APIs.
Note: If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder.
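As a small illustration of that note, here is a sketch of copying and reading a large file through dbutils.fs and Spark instead of local file I/O; the paths are hypothetical:

    # Copy a large file with dbutils.fs instead of open()/shutil on /dbfs/... paths.
    dbutils.fs.cp("dbfs:/mnt/rawdata/big_file.parquet",
                  "dbfs:/mnt/archive/big_file.parquet")

    # Spark reads the file in a distributed fashion, so the 2 GB limit on
    # local file I/O does not apply here.
    df = spark.read.parquet("dbfs:/mnt/rawdata/big_file.parquet")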
For Azure Storage: the maximum storage account capacity is 5 PiB.
Azure documents default limits for general-purpose v1, v2, Blob storage, and block blob storage accounts; the ingress limit refers to all data that is sent to a storage account, and the egress limit refers to all data that is received from a storage account.
Note: the limit on a single block blob is 4.75 TB.
Databricks documentation states:
Supports only files less than 2 GB in size. If you use local file I/O APIs to read or write files larger than 2 GB you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder.
You can read more here: https://learn.microsoft.com/en-us/azure/databricks/data/databricks-file-system

The best practice to be followed when reading data from Azure Data Lake Gen1 through Azure Databricks

I am new to Azure Databricks. I was trying to read data from Data Lake into Databricks. I found that there are mainly two methods:
Mounting the files present in Data Lake into DBFS (the advantage being that authentication is required just once)
Using a service principal and OAuth (authentication required for each request)
I am interested to know if there is significant memory consumption when we choose to mount folders in DBFS. I learnt that the mounted data is persisted, so I am guessing that might lead to some memory consumption. I would like it if somebody could explain to me what's going on in the backend when we mount a file in DBFS.
The question of persistent data:
As far as I have understood based on the DBFS documentation, the data read in from the mount point through DBFS is not persisted:
"Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root."
Instead, you can write data directly to DBFS (which is, under the hood, just a storage account), and that data will persist between restarts of your cluster. For example, you could store some example dataset directly in DBFS.
Best practice with Data Lake Gen 1
As there shouldn't be any performance implications, I don't know that there is a "best practice" overall. Based on my experience, it is good to keep in mind that both solutions might seem confusing to new users who don't know how authentication was or is done.
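For completeness, a minimal sketch of the second approach from the question (direct access to ADLS Gen1 with a service principal and OAuth, no mount), assuming placeholder application, tenant, account, and path names:

    # Per-session OAuth configuration for direct ADLS Gen1 access (placeholders throughout).
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
    spark.conf.set("fs.adl.oauth2.credential",
                   dbutils.secrets.get(scope="<secret-scope>", key="<client-secret>"))
    spark.conf.set("fs.adl.oauth2.refresh.url",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    # Read directly via the adl:// URI instead of a /mnt path.
    df = spark.read.csv("adl://<adls-account>.azuredatalakestore.net/data/sample.csv",
                        header=True)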

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limitation that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data in Azure Table Storage, will HDInsight Service work against ATS?
Or will it work only against Blob Storage, which means the log records need to be created as files on Blob Storage?
Last question: considering the scenario I explained above, is it a good candidate for the HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
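To make the converter idea concrete, here is a rough sketch using the current Azure Python SDKs (azure-data-tables and azure-storage-blob, which post-date the original answer); the connection string, table, container, and blob names are placeholders:

    import json
    from azure.data.tables import TableServiceClient
    from azure.storage.blob import BlobServiceClient

    conn = "<storage-connection-string>"  # placeholder

    # Read all entities from the logging table.
    table = TableServiceClient.from_connection_string(conn).get_table_client("XtraceLogs")
    lines = "\n".join(json.dumps(dict(entity), default=str)
                      for entity in table.list_entities())

    # Write them as a line-delimited text blob that an HDInsight job can consume.
    blob = BlobServiceClient.from_connection_string(conn).get_blob_client(
        container="logs", blob="xtrace/2012-01-01.jsonl")
    blob.upload_blob(lines, overwrite=True)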
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
The answer above is slightly misleading in regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least 2 implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball
