I have a directory on Azure Data Lake mounted to an Azure Databricks cluster. Browsing the file system with the CLI tools, or running dbutils commands from a notebook, I can see that there are files and data in that directory. Furthermore, executing queries against those files succeeds: data is successfully read in and written out.
I can also successfully browse to the root of my mount ('/mnt', just because that's what the documentation used here: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html) in the 'Create New Table' UI (via Data -> Add Table -> DBFS).
However, there are no subdirs listed under that root directory.
Is this a quirk of DBFS? A quirk of the UI? Or do I need to reconfigure something to allow me to add tables via that UI?
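For reference, this is roughly what works from a notebook while the UI shows nothing; a minimal sketch, with the mount path as a placeholder for my actual mounted directory:

```python
# Listing the mount from a notebook works and shows the files,
# even though the 'Create New Table' UI lists no subdirectories under /mnt.
# The path below is a placeholder for my actual mounted directory.
display(dbutils.fs.ls("/mnt/my-datalake-dir"))

# Querying the files also works end to end.
df = spark.read.csv("/mnt/my-datalake-dir/", header=True)
df.count()
```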
The Data UI currently does not support mounts; it only works for the internal DBFS, so there is no configuration option for this right now. If you want to use this UI for data upload (rather than, e.g., Storage Explorer), the only solution is to move the data afterwards from the internal DBFS to the mount directory via dbutils.fs.mv.
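For example, a minimal sketch of that move; both paths below are placeholders for wherever the upload UI put the file and for your mount point:

```python
# Data uploaded through the 'Create New Table' UI lands under /FileStore/tables
# in the internal DBFS root; move it onto the ADLS mount afterwards.
# Both paths are placeholders - adjust to your upload location and mount.
dbutils.fs.mv(
    "dbfs:/FileStore/tables/my_upload.csv",    # where the UI put the file (assumption)
    "dbfs:/mnt/mydatalake/raw/my_upload.csv",  # your ADLS mount point (assumption)
)
```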
I am able to see my containers in an Azure Data Lake Storage Gen2 endpoint by signing in with Azure AD.
I am able to see the files and select a single file while browsing, but the question is: is there a way to select a folder of the container and bring in every file from that container to build my dataset, given that they all have the same structure?
Or do I need something like an external table in Azure Data Explorer?
Just drag the first file from your collection into the data pane, then right-click and use the edit union option.
When ingesting data and transforming the various layers of our data lake, built on top of an Azure ADLS Gen2 storage account (hierarchical namespace enabled), I can organize files in Containers or File Shares. We currently ingest raw files into a RAW container in their native ".csv" format. We then merge those files into a QUERY container in compressed Parquet format so that we can virtualize all the data using PolyBase in SQL Server.
It is my understanding that only files stored within File Shares can be accessed using the typical SMB/UNC paths. When building out a data lake such as this, should Containers within ADLS be avoided in order to gain the additional benefit of being able to access those same files via File Shares?
I did notice that files located under File shares do not appear to support metadata key/values (unless it's just not exposed through the UI). Other than that, I wonder if there are any other real differences between the two types.
Thanks to @Gaurav for sharing the knowledge in the comments section.
(Posting the answer using the details provided in the comments to help other community members.)
Earlier, only files stored in an Azure Storage File Share could be accessed using the typical SMB/UNC paths. Recently, however, it has become possible to mount a Blob Container as well, using the NFS 3.0 protocol. This official Microsoft document provides step-by-step guidance.
Limitation: you can mount a container in Blob storage only from a Linux-based Azure Virtual Machine (VM) or a Linux system that runs on-premises. There is no support for Windows or macOS.
Disclaimer: not a code question, but directly related to one.
I find it difficult in Databricks to handle scenarios like this, where there is no shell prompt, just notebooks. I have two clusters on Azure: dev & prod. The databases & tables can be accessed via the Databricks notebooks of the respective environments.
The problem arises when I want to:
Query data in dev, but from the prod environment, & vice versa. From a SQL prompt, this just seems impossible to achieve.
Populate a dev table from a prod table; there's no way to establish a connection from within the dev notebook to query a table in the prod environment.
The workaround I've established for now to copy the prod data into dev is:
Download a full dump from production as CSV to my local machine.
Upload it to DBFS in the dev environment.
Create a temp table or directly insert the CSV into the dev table (as sketched below).
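For context, a minimal sketch of that last step in a dev notebook; the file path and table names are placeholders:

```python
# Read the CSV that was uploaded to DBFS in the dev workspace
# and load it into the dev table. All names are placeholders.
df = spark.read.csv("dbfs:/FileStore/tables/prod_dump.csv",
                    header=True, inferSchema=True)

# Either register a temp view...
df.createOrReplaceTempView("prod_dump_tmp")

# ...or write it straight into the dev table.
df.write.mode("overwrite").saveAsTable("dev_db.my_table")
```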
Any comments on how I can remove this download-upload process & query prod directly from the dev notebook?
The DBFS root is not really a production-grade solution; it's recommended that you always mount external storage (e.g. Azure Storage, blob or ADLS Gen2) and use it to store your tables.
If you use external storage, the problem becomes quite simple: all you have to do is mount the production storage on the dev cluster, and you can access it there, since tables can be defined over both the root DBFS and mounted data sources. So you can have a notebook that copies data from one to the other (and hopefully does all of the data anonymization / sampling that you need). You can also set up a more explicit process for that using Azure Data Factory, in most cases with only a simple copy activity.
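As an illustration, here is a minimal sketch of that mount-and-copy approach on the dev cluster, assuming an ADLS Gen2 account accessed with a service principal; the storage account, container, secret scope/keys, tenant id and table names are all placeholders:

```python
# Mount the *production* ADLS Gen2 container on the dev cluster.
# Storage account, container, secret scope/keys and tenant id are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("prod-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("prod-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://tables@prodstorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/prod",
    extra_configs=configs,
)

# Copy (and ideally anonymize / sample) prod data into a dev table.
prod_df = spark.read.parquet("/mnt/prod/sales")  # placeholder path
prod_df.sample(0.1).write.mode("overwrite").saveAsTable("dev_db.sales")
```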
In Azure Data Lake Storage Gen1 I can see the folder structure: folders, files, etc.
I can perform actions on the files like renaming them, deleting them, and more.
One operation that is missing in the Azure portal, and elsewhere, is the option to create a copy of a folder or a file.
I have tried to do it using PowerShell and the portal itself, and it seems that this option is not available.
Is there a reason for that?
Are there any other options to copy a folder in Data Lake?
The Data Lake storage is used as part of an HDInsight cluster.
You can use Azure Storage Explorer to copy files and folders.
Open Storage Explorer.
In the left pane, expand Local and Attached.
Right-click Data Lake Store and, from the context menu, select Connect to Data Lake Store....
Enter the URI; the tool then navigates to the location you just entered.
Select the file/folder you want to copy and choose Copy.
Navigate to your desired destination.
Click Paste.
Other options for copying files and folders in a data lake include:
Azure Data Factory
AdlCopy (command line tool)
My suggestion is to use Azure Data Factory (ADF). It is the fastest way if you want to copy large files or folders.
Based on my experience, a 10 GB file is copied in approximately 1 min 20 sec.
You just need to create a simple pipeline with one data store, which is used as both the source and the destination data store.
Using Azure Storage Explorer (ASE) to copy large files is too slow; 1 GB takes more than 10 min.
Copying files with ASE is the operation most similar to what you do in most file explorers (Copy/Paste), unlike ADF copying, which requires creating a pipeline.
I think creating a simple pipeline is worth the effort, especially because the pipeline can be reused for copying other files or folders with minimal editing.
I agree with the above comment: you can use ADF to copy the file. You just need to watch that it doesn't add to your costs. Microsoft Azure Storage Explorer (MASE) is also a good option to copy blobs.
If you have very big files, then the option below is faster:
AzCopy:
Download a single file from blob to local directory:
AzCopy /Source:https://<StorageAccountName>.blob.core.windows.net/<BlobFolderName(if any)> /Dest:C:\ABC /SourceKey:<BlobAccessKey> /Pattern:"<fileName>"
If you are using Azure Data Lake Store with HDInsight, another very performant option is to use the native Hadoop file system commands, like hdfs dfs -cp or, if you want to copy a large number of files, distcp. For example:
hadoop distcp adl://<data_lake_storage_gen1_account>.azuredatalakestore.net:443/sourcefolder adl://<data_lake_storage_gen1_account>.azuredatalakestore.net:443/targetfolder
This is also a good option, if you are using multiple storage accounts. See also the documentation.
I have a few CSV files on my local system. I want to upload them into Azure blobs in a particular directory structure. I need to create the directory structure on Azure as well.
Please suggest the possible options to achieve that.
1 - Create your storage account in Azure
2 - Get Azure Storage Explorer:
https://azure.microsoft.com/en-us/features/storage-explorer/
3 - Start the app, log in, and navigate to your blob container
4 - Drag and drop your folders and files :)
Basically, this Microsoft-provided software lets you work with your storage account using a classic folders-and-files structure.
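If you prefer a scripted route instead of the Storage Explorer UI, here is a minimal sketch using the azure-storage-blob Python SDK; the connection string, container name and paths are placeholders, and the directory structure is simply encoded in the blob names (blob "directories" are virtual, so nothing has to be created up front):

```python
import os
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Placeholders: use your own connection string and container name.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("mycontainer")

# Uploading to a blob name containing slashes creates the "directory structure";
# no folders need to be created beforehand.
local_file = "data/2023/01/sales.csv"  # placeholder local path
with open(local_file, "rb") as f:
    container.upload_blob(name="raw/2023/01/sales.csv", data=f, overwrite=True)
```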