Copy Files/Folders in Azure Data Lake Gen1

In Azure Data Lake Storage Gen1 I can see the folder structure, browse folders and files, and so on.
I can perform actions on the files like renaming them, deleting them, and more.
One operation that is missing in the Azure portal and by other means is the option to create a copy of a folder or a file.
I have tried to do it using PowerShell and using the portal itself,
and it seems that this option is not available.
Is there a reason for that?
Are there any other options to copy a folder in Data Lake?
The Data Lake storage is used as part of an HDInsight cluster.

You can use Azure Storage Explorer to copy files and folders.
Open Storage Explorer.
In the left pane, expand Local and Attached.
Right-click Data Lake Store and, from the context menu, select Connect to Data Lake Store....
Enter the URI; the tool then navigates to the location you entered.
Select the file or folder you want to copy and click Copy.
Navigate to your desired destination.
Click Paste.
Other options for copying files and folders in a data lake include:
Azure Data Factory
AdlCopy (command-line tool; see the example below)
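For reference, AdlCopy can copy data from Blob storage into Data Lake Store and between two Data Lake Store accounts. A rough sketch of a standalone invocation, where the account names and paths are placeholders:
AdlCopy /source swebhdfs://<source data lake store account>.azuredatalakestore.net/sourcefolder/ /dest swebhdfs://<destination data lake store account>.azuredatalakestore.net/targetfolder/
(When the source is a blob container, you also pass /sourcekey <storage account key>.)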

My suggestion is to use Azure Data Factory (ADF). It is the fastest way if you want to copy large files or folders.
Based on my experience, a 10 GB file is copied in approximately 1 min 20 sec.
You just need to create a simple pipeline with one data store, which is used as both the source and the destination data store.
Using Azure Storage Explorer (ASE) to copy large files is too slow; 1 GB takes more than 10 minutes.
Copying files with ASE is the operation most similar to most file explorers (copy/paste), unlike ADF copying, which requires creating a pipeline.
I think creating a simple pipeline is worth the effort, especially because the pipeline can be reused to copy other files or folders with minimal editing.
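To give an idea of the effort involved, such a pipeline boils down to a single Copy activity whose input and output datasets point at the same Data Lake Store linked service. A rough sketch of what the activity definition might look like (the dataset names are placeholders):
{
  "name": "CopyAdlsFolder",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceFolderDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "TargetFolderDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "AzureDataLakeStoreSource", "recursive": true },
    "sink": { "type": "AzureDataLakeStoreSink", "copyBehavior": "PreserveHierarchy" }
  }
}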

I agree with the above answer: you can use ADF to copy the files. Just keep an eye on it so that it doesn't add to your costs. Microsoft Azure Storage Explorer (MASE) is also a good option for copying blobs.
If you have very big files, then the option below is faster:
AzCopy:
Download a single file from blob to local directory:
AzCopy /Source:https://<StorageAccountName>.blob.core.windows.net/<BlobFolderName(if any)> /Dest:C:\ABC /SourceKey:<BlobAccessKey> /Pattern:"<fileName>"

If you are using Azure Data Lake Store with HDInsight, another very performant option is to use the native Hadoop file system commands, like hdfs dfs -cp or, if you want to copy a large number of files, distcp. For example:
hadoop distcp adl://<data_lake_storage_gen1_account>.azuredatalakestore.net:443/sourcefolder adl://<data_lake_storage_gen1_account>.azuredatalakestore.net:443/targetfolder
This is also a good option if you are using multiple storage accounts. See also the documentation.

Related

Copying new Azure blobs to different container

We have 5 vendors that are SFTPing files to Blob Storage. When the files come in, I need to copy them to another container and create a folder in that container named with the date to put the files in. From the second container, I need to copy the files to a file share on an Azure server. What is the best way to go about this?
I'm very new to Azure and unsure what the best way is to accomplish what I am being asked to do. Any help would be greatly appreciated.
I'd recommend using Azure Synapse for this task. It will let you move data to and from different storage services securely and with little to no code.
Specifically, I'd put a blob storage trigger on the SFTP blob container so that the Synapse pipeline that moves the data runs automatically when your vendors drop their files.
Note that when you look for documentation on how to do things in Synapse, most of the time the Azure Data Factory documentation will also be applicable, since most of Data Factory's functionality is now in Synapse.
The ADF and Synapse YouTube channels are excellent resources, as well as the Microsoft Learn courses on Data Engineering.
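For reference, a storage event trigger of that kind is defined in Synapse/ADF as JSON roughly like the sketch below; the trigger name, pipeline name, container path, and scope are placeholders rather than values from the question:
{
  "name": "OnVendorFileArrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/vendor-sftp/blobs/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.Storage/storageAccounts/<storage account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyVendorFiles", "type": "PipelineReference" } }
    ]
  }
}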
I need to copy them to another container and create a folder in that container named with the date to put the files in.
You can use AzCopy to copy files to another container by using a SAS token.
Command:
azcopy copy 'https://<storage account>.blob.core.windows.net/test/files?SAS' 'https://<storage account >.blob.core.windows.net/mycontainer/12-01-2023?SAS' --recursive
I need to copy the files to a file share on an Azure server
You can also copy the files from a container to a file share by using AzCopy.
Command:
azcopy copy 'https://<storage account>.blob.core.windows.net/test?SAS' 'https://<storage account >.file.core.windows.net/fileshare/12-01-2023?SAS' --recursive
You can get the SAS token through the portal:
Go to the portal -> your storage account -> Shared access signature -> check the resource types -> click Generate SAS and connection string.
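If you prefer the command line, an account-level SAS can also be generated with the Azure CLI; the account name, permissions, and expiry below are placeholders you would adjust:
az storage account generate-sas --account-name <storage account> --services bf --resource-types sco --permissions rwdlac --expiry <yyyy-MM-ddTHH:mmZ>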
Probably AzCopy is a good way to move all or part of the blobs from one container to another. But I would suggest automating it with Azure Functions. I think it can be automated by triggering an Azure Function every time a blob or set of blobs (Azure can process a batch of blobs) is uploaded to the source container.
A note on Azure Functions: depending on the quantity of blobs to be moved and the time it could take, Durable Functions may be the better solution to avoid timeout exceptions. A durable function returns an immediate response but keeps running in the background.
Consider this article for a better approach to this solution:
https://build5nines.com/azure-functions-copy-blob-between-azure-storage-accounts-in-c/
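The linked article uses C#; as a rough illustration of the same idea, here is a minimal sketch using the Azure Functions Python v2 programming model and the azure-storage-blob SDK. The container names and the reuse of the AzureWebJobsStorage connection are assumptions, and a cross-account copy would additionally need a SAS on the source URL:
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

# Runs whenever a blob lands in the (hypothetical) source container
@app.blob_trigger(arg_name="myblob", path="source-container/{name}",
                  connection="AzureWebJobsStorage")
def copy_new_blob(myblob: func.InputStream):
    # myblob.name looks like "source-container/<blob name>"
    blob_name = myblob.name.split("/", 1)[1]
    service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
    source = service.get_blob_client("source-container", blob_name)
    destination = service.get_blob_client("destination-container", blob_name)
    # Server-side copy; within the same storage account no SAS is needed on the source URL
    destination.start_copy_from_url(source.url)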

Tableau + Azure Data Lake Gen2 multiple files or folder

I am able to see my containers in an Azure Data Lake Storage Gen2 endpoint by signing in with Azure AD.
I am able to see the files and select one single file while browsing, but the question is: is there a way to select a folder in the container and bring in every single file from that container to build my dataset, given that they all have the same structure?
Or do I require something like an external table in Azure Data Explorer of some sort?
Just drag the first file from your collection into the data pane,
then right-click it and use the Edit Union option.

Moving files among azure blob without downloading

Currently, I have a blob container with about 5 TB of archive files. I need to move some of those files to another container. Is there a way to avoid downloading and uploading the files involved? I do not need to access the data in those files. I do not want to get any bill for reading archive files either.
Thanks.
I suggest that you use Data Factory. It is usually used to transfer big data.
For the copy performance and scalability achievable using ADF, see the documentation.
You can learn from the tutorial below:
Copy and transform data in Azure Blob storage by using Azure Data Factory
Hope this helps.
You can use AzCopy for that. It is a command-line utility that you can use to initiate server-to-server transfers:
AzCopy uses server-to-server APIs, so data is copied directly between storage servers. These copy operations don't use the network bandwidth of your computer.
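For example, a server-side copy of a set of blobs from one container to another looks roughly like this; the account, container names, and SAS tokens are placeholders:
azcopy copy 'https://<storage account>.blob.core.windows.net/<source container>/<folder>?<SAS>' 'https://<storage account>.blob.core.windows.net/<destination container>?<SAS>' --recursive
Note that azcopy copy leaves the source blobs in place; to make it a true move you would delete them afterwards (for example with azcopy remove).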

Store csv files on Azure blobs

I have a few CSV files on my local system. I want to upload them into Azure blobs in a particular directory structure. I need to create the directory structure on Azure as well.
Please suggest the possible options to achieve that.
1 - Create your storage account in Azure.
2 - Get Azure Storage Explorer:
https://azure.microsoft.com/en-us/features/storage-explorer/
3 - Start the app, log in, and navigate to your blob container.
4 - Drag and drop folders and files :)
Basically, this Microsoft-provided software allows you to use your storage account with a classic folders-and-files structure.
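If you would rather script the upload than drag and drop, AzCopy can also upload a local folder and recreate its directory structure in the container; the local path, account, container, and SAS below are placeholders:
azcopy copy "C:\local\csv-folder" "https://<storage account>.blob.core.windows.net/<container>/<target directory>?<SAS>" --recursive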

Create New Table from DBFS mount to Azure Data Lake

I have a directory on Azure Data Lake mounted to an Azure Databricks cluster. Browsing through the file system using the CLI tools, or just running dbfs utils through a notebook, I can see that there are files and data in that directory. Further, executing queries against those files is successful; data is successfully read in and written out.
I can also successfully browse to the root of my mount ('/mnt', just because that's what the documentation used here: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html) in the 'Create New Table' UI (via Data -> Add Table -> DBFS).
However, there are no subdirs listed under that root directory.
Is this a quirk of DBFS? A quirk of the UI? Or do I need to reconfigure something to allow me to add tables via that UI?
The Data UI currently does not support mounts; it only works with the internal DBFS. So currently there is no configuration option. If you want to use this UI for data upload (and not e.g. Storage Explorer), the only solution is to move the data afterwards from the internal DBFS to the mount directory via dbutils.fs.mv.
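For example, from a notebook cell; both paths are placeholders for wherever the upload landed and wherever your mount lives:
dbutils.fs.mv("dbfs:/FileStore/tables/my_upload", "dbfs:/mnt/<your mount>/my_upload", recurse=True)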
