Databricks parallel mv files using dbutils.fs.mv - azure

I have an Azure storage account with a very large number of files (millions) in a single folder.
I want to use dbutils.fs.mv to move them to another folder. What's the fastest way to do that?

You can try the following methods.
azcopy, as suggested by @Kombajn zbożowy. Use the sample command below:
azcopy copy "https://rakeshgen2.blob.core.windows.net/mysource2/<SAS Key>" "https://rakeshgen2.blob.core.windows.net/targetdata/<SAS Key>" --recursive=true
You can go through the azcopy documentation to learn more about azcopy performance.
You can also use dbutils.fs.mv or dbutils.fs.cp after mounting the storage account.
Example:
try:
    dbutils.fs.mv("/mnt/mysource2/", "/mnt/targetadb", recurse=True)
except Exception:
    # ignore failures (for example, if the source path has already been moved)
    pass
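To speed up the per-file moves the question asks about, you can also fan them out over a thread pool on the driver. A rough sketch, assuming the mounts above exist (dbutils.fs.mv still runs once per file, so for millions of files azcopy will usually be faster):

from concurrent.futures import ThreadPoolExecutor

files = dbutils.fs.ls("/mnt/mysource2/")             # list the source folder

def move(f):
    # move one file into the target folder, keeping its name
    dbutils.fs.mv(f.path, "/mnt/targetadb/" + f.name)

with ThreadPoolExecutor(max_workers=32) as pool:     # tune the worker count
    list(pool.map(move, files))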
An Azure Data Factory Copy activity may also work for you if there are no nested folders in your structure.
Go through the documentation to understand Copy activity performance and scalability.

Related

Azure blob soft delete and versioning- how to restore files easily?

I am trying to understand how soft delete and versioning work within Azure Blob Storage.
It seems that if you have both soft delete and versioning turned on... you can't just ‘undelete’ deleted files, as versioning actually saves a new version as a deleted file.
So instead you have to promote the last version of each deleted file.
But what if you have a structure of nested folders and thousands of blobs... you can't just promote the top version of the top-level folder... you need to use PowerShell to list files with no current version and promote them? How would you do this?
This seems awfully complicated, when without versioning a simple ‘undelete’ command is available from the GUI.
Am I missing something? What is the easiest way to ‘undelete’ a nested folder structure of thousands of blobs in folders, when versioning is turned on?
Thanks
As Rob Minson pointed out, the approach involves copying a blob version to the same container. For PowerShell, use the Copy-AzStorageBlob cmdlet; for Azure CLI, use the az storage blob copy start command. You can pass an account key or SAS token, or use Azure AD.
We've updated the documentation to shed some light on an approach to restoring blobs when soft-delete and/or versioning is enabled. Code samples are available for both PowerShell and Azure CLI.
Simply put, no.
The first point that needs to be emphasized is that blobs in blob storage are not nested the way you might think. Blob storage looks like a local file system: some nested folders, with many files inside. But that hierarchy is an illusion; the storage structure of blob storage is flat. Blob storage is not a matter of putting a small box inside a box and then putting items in the small box. In fact, all blobs are flat items of the container, and there is no such thing as a "small box".
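To illustrate, here is a small sketch using the azure-storage-blob Python SDK (the account, container and SAS token are placeholders): listing with a prefix shows that "folders" are just name prefixes on flat blob names.

from azure.storage.blob import ContainerClient

container = ContainerClient("https://accountname.blob.core.windows.net",
                            "containername", credential="<sastoken>")
# Every blob comes back as a single flat item; the "/" in its name is just a prefix.
for blob in container.list_blobs(name_starts_with="folder1/subfolder/"):
    print(blob.name)   # e.g. folder1/subfolder/file.txt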
The second point: for blob storage, the soft-delete operation only applies to two kinds of objects, blobs and containers.
Check out this document:
https://learn.microsoft.com/en-us/azure/storage/blobs/soft-delete-container-overview?tabs=azure-cli#how-container-soft-delete-works
However, you can only use container soft delete to restore blobs if the container itself was deleted. To restore a deleted blob when its parent container has not been deleted, you must use blob soft delete or blob versioning.
So unfortunately, there is no so-called easy way. You need to operate on each blob; the nested structure does not actually exist.
If you are interested, you can read this blog:
https://medium.com/@loopjockey/structuring-azure-blobs-for-functions-8305ba427356
I completely agree that this seems really under-documented at the moment. I've raised a GitHub issue against this docs page to see if they can get the situation improved.
The best path through that I've found is something like the following:
Using Azure Storage Explorer, open up the container with the soft-deleted, versioned blobs, then change the drop down to "All blobs and blobs without current version". Now you can select a blob and hit 'Promote Version'. The deleted blob will be restored and in the Activities pane you can expand the operation and hit 'Copy AzCopy Command to Clipboard'.
The result will show you something like the following:
./azcopy.exe copy
"https://accountname.blob.core.windows.net/containername/blobname?<sastoken>&versionid=2021-04-22T11%3A35%3A36.9385599Z"
"https://accountname.blob.core.windows.net/containername/blobname?<sastoken>"
--overwrite=true
--recursive
--trusted-microsoft-suffixes=;
Now, based on this you can see you have a building block for automating the process you're talking about. Your problem at this point is finding this thing:
versionid=2021-04-22T11%3A35%3A36.9385599Z
Unfortunately that's a timestamp to nanosecond precision which you're not going to be able to infer. There's no functionality I can find in PowerShell, in the REST APIs or in AzCopy to get this data; the only way I have found is this sample for the .NET SDK.
All this probably means you can either:
Implement your own C# console app using the Azure.Storage.Blobs library to list the versions for each blob, then perform the relevant copy command now that you know the magic version string (a rough Python equivalent is sketched below)
Wait for the REST API or PowerShell library to get the ability to list versions
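For reference, here is roughly what that could look like with the azure-storage-blob Python SDK instead of C#. This is a sketch only; it assumes a recent SDK version and a SAS token with list/read/write permissions, and the account, container and token are placeholders.

from urllib.parse import quote
from azure.storage.blob import ContainerClient

sas_token = "<sastoken>"
container = ContainerClient("https://accountname.blob.core.windows.net",
                            "containername", credential=sas_token)

# Collect every version of every blob, including blobs with no current version.
versions = {}
for blob in container.list_blobs(include=["deleted", "versions"]):
    versions.setdefault(blob.name, []).append(blob)

for name, blobs in versions.items():
    if any(getattr(b, "is_current_version", False) for b in blobs):
        continue                                      # blob still has a live current version
    latest = max(blobs, key=lambda b: b.version_id)   # version ids are sortable timestamps
    target = container.get_blob_client(name)
    # Copying a version over the base blob promotes it back to being the current version.
    source_url = f"{target.url}?versionid={quote(latest.version_id)}&{sas_token.lstrip('?')}"
    target.start_copy_from_url(source_url)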

Use Azure Data Factory to copy files and place a csv of files copied

I am trying to implement the following flow in an Azure Data Factory pipeline:
Copy files from an SFTP to a local folder.
Create a comma-separated file in the local folder with the list of files and their sizes.
The first step was easy enough, using a 'Copy Data' step with 'SFTP' as source and 'File System' as sink.
The files are being copied, but in the output of this step, I don't see any file information.
I also don't see an option to create a file using data from a previous step.
Maybe I'm using the wrong technology?
One of the reasons I'm using Azure Data Factory, is because of the integration runtime, which allows us to have a single fixed IP to connect to the external SFTP. (easier firewall configuration)
Is there a way to implement step 2?
Thanks for any insight!
There is no built-in feature to achieve this.
You need to use ADF together with another service; I suggest using an Azure Function to inspect the files and then do the copy.
You can get the sizes of the files and save them to a CSV file.
Get the size of the files (Python):
How to fetch sizes of all SFTP files in a directory through Paramiko
Use pandas to save the results as CSV (Python):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Writing a pandas DataFrame to CSV file
Simple HTTP trigger for an Azure Function (Python):
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook-trigger?tabs=python
(Put the processing logic in the body of the Azure Function. You can do almost anything there and use whichever language you are familiar with; the point is that ADF itself has no feature that satisfies your requirement. A rough sketch of such a function follows.)
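For example, the body of an HTTP-triggered function could look roughly like this. It is a sketch only: the SFTP host, credentials, remote directory and output path are placeholders, and it assumes the function can write to the same folder the Copy activity targets.

import paramiko
import pandas as pd
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Connect to the SFTP server and stat every entry in the remote folder.
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="user", password="password")
    sftp = paramiko.SFTPClient.from_transport(transport)
    rows = [{"file": a.filename, "size_bytes": a.st_size}
            for a in sftp.listdir_attr("/upload")]
    sftp.close()
    transport.close()

    # Write the file list as CSV next to the copied files.
    pd.DataFrame(rows).to_csv("/mnt/local-folder/file_list.csv", index=False)
    return func.HttpResponse(f"Listed {len(rows)} files.")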

azcopy list function gives a different count (almost double) of objects than Storage Explorer

I am uploading files with AzCopy (one by one, as and when they are provided) to Azure Data Lake Storage Gen2, and I keep track with Storage Explorer and an individual log for each file.
There have been 6253 file uploads, and Storage Explorer shows the same, as does the number of per-file logs.
But when I use azcopy list it gives me 11254.
This makes it difficult to script and automate.
Is there a logical explanation for this?
There is no access issue; in fact, the same AzCopy setup is copying the files successfully.
I have tried redownloading, if that makes sense.
This is a known bug, scheduled for fixing in our next release: https://github.com/Azure/azure-storage-azcopy/issues/692

How can we save or upload .py file on dbfs/filestore

We have a few .py files on my local machine that need to be stored/saved on the FileStore path on DBFS. How can I achieve this?
I tried copy actions from the dbutils.fs module.
I tried the code below but it did not work; I know something is not right with my source path. Or is there a better way of doing this? Please advise.
'''
dbUtils.fs.cp ("c:\\file.py", "dbfs/filestore/file.py")
'''
It sounds like you want to copy a file from your local machine to the DBFS path on the Azure Databricks servers. However, because the Azure Databricks notebook is a browser-based interface running in the cloud, it cannot directly operate on files on your local machine.
So here are some solutions you can try.
As @Jon said in the comment, you can follow the official Databricks CLI documentation to install the Databricks CLI locally via pip install databricks-cli and then copy a file to DBFS (see the sketch after this list).
Follow the official document Accessing Data to import data via "Drop files into or browse to files in the Import & Explore Data box" on the landing page, although the CLI is still recommended.
Upload your files to Azure Blob Storage, then follow the official document Data sources / Azure Blob Storage to perform operations such as dbutils.fs.cp.
Hope it helps.
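For the first option, the commands look roughly like this (a sketch; the local and DBFS paths are placeholders, and databricks configure --token will prompt for your workspace URL and a personal access token):

pip install databricks-cli
databricks configure --token
databricks fs cp "c:\file.py" dbfs:/FileStore/file.py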

Azure convert blob to file

Some large disks containing hundreds of 30GB tar files have been prepared and ready to ship.
The disks have been prepared as BLOB using the WAImportExport tool.
The Azure share is expecting files.
Ideally we don't want to redo the disks as FILE instead of BLOB. Are we able to upload as BLOBs to one storage area and extract the millions of files from the tarballs to a FILE storage area without writing code?
Thanks
Kevin
azcopy will definitely do it and has been tested. We were able to move files from blobs to files using the CLI in Azure with the azcopy command.
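A command along the lines of the following should do it; this is a sketch with placeholder account, container, share and SAS tokens, and it relies on AzCopy v10 supporting service-to-service copies between Blob storage and Azure Files:

azcopy copy "https://accountname.blob.core.windows.net/sourcecontainer/<SAS Key>" "https://accountname.file.core.windows.net/targetshare/<SAS Key>" --recursive=true

Note that AzCopy only copies the blobs as-is; extracting the tar files on the file share is a separate step.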
The information provided below was proven not to be true.
Microsoft Partner told me yesterday there is no realistic way to convert Blobs to Files in the above-mentioned scenario.
Essentially, it is important to select either WAImportExport.exe Version 1 for BLOBS or WAImportExport.exe Version 2 for files. Information on this can be found at this location.
The mistake was easily made, and a number of people here made it: the link to the tool that was sent pointed to binary version 1. Search results tended to direct users to version 1, and version 2 appears only after a deeper dig. Version 2 seems to be an afterthought by Microsoft when they added the Files option to Azure. It's a pity they didn't use different binary names, or build a switch into version 2 to handle both and retire version 1.
