Azure Blob Storage mass delete items with specific names

Let's say that I have a blob container that has files with the following names:
2cfe4d2c-2703-4b0f-bed0-6b239f238206
1a18c31f-bf28-4f64-a796-79237fabc66a
20a300dd-c032-405f-b67d-9c077623c26c
6f4041dd-92da-484a-966d-d5a168a9a2ec
(Let's say there are around 15,000 files in total.)
I want to delete around 1,200 of them. I have a file with all the names I want to delete. In my case it is JSON, but the format does not really matter; I know which files I want to delete.
What is the most efficient/safe way to delete these items?
I can think of a few ways, for example using az storage blob delete-batch or az storage blob delete. I am sure the former is more efficient, but I would not know how to use it here because there is not really a pattern, just a big list of GUIDs (names) that I want to delete.
I guess I would have to write some code to iterate over my list of files to delete and then use the CLI or some Azure Storage library to delete them.
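Roughly, I imagine something like the loop below (account and container names are placeholders), but that makes one API call per blob, which feels slow for ~1,200 names:

    # names-to-delete.txt holds one blob name per line
    while read -r name; do
      az storage blob delete \
        --account-name <storageaccount> \
        --container-name <container> \
        --name "$name" \
        --auth-mode login    # assumes signed-in RBAC access; otherwise use --sas-token or --account-key
    done < names-to-delete.txt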
I'd prefer to just use some built-in tooling but I can't find a way to do this without having to write code to talk with the API.
What would be the best way to do this?

The tool azcopy would be perfect for this. With the command azcopy remove you can pass the path of a text file using the parameter --list-of-files={path} to restrict the operation to specific files. The file should be plain text, with one blob name per line.
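For example (the storage account, container, and SAS token below are placeholders), with names-to-delete.txt containing one blob name per line:

    # names-to-delete.txt, one name per line, e.g.:
    # 2cfe4d2c-2703-4b0f-bed0-6b239f238206
    # 1a18c31f-bf28-4f64-a796-79237fabc66a

    azcopy remove "https://<storageaccount>.blob.core.windows.net/<container>?<SAS>" \
        --list-of-files=names-to-delete.txt

If your azcopy version supports it, adding --dry-run first lets you preview exactly which blobs would be removed before running the real delete.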

Related

How to Export Multiple files from BLOB to Data lake Parquet format in Azure Synapse Analytics using a parameter file?

I'm trying to export multiple .csv files from Blob Storage to Azure Data Lake Storage in Parquet format based on a parameter file, using ADF with a ForEach activity to iterate over each file in the blob container and a Copy activity to copy from source to sink (I have tried using the Get Metadata and ForEach activities).
As I'm new to Azure, could someone please help me implement a parameter file that can be used in the Copy activity?
Thanks a lot
I created a simple test:
I have a paramfile that contains the names of the files to be copied later.
In ADF, we can use a Lookup activity to read the paramfile; its dataset points at the paramfile.
The output of the Lookup activity is an array with one entry per line of the paramfile.
In the ForEach activity, we add the dynamic content @activity('Lookup1').output.value to the Items setting. It will iterate over the output array of the Lookup activity.
Inside the ForEach activity, on the Copy activity's Source tab, we need to select Wildcard file path and add the dynamic content @item().Prop_0 to the Wildcard paths.
That's all.
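For illustration (the file names below are made up): assuming the paramfile is a delimited text dataset with no header row, ADF auto-names the single column Prop_0, so the Lookup output is an array with one object per line, which is why @item().Prop_0 resolves to each file name:

    # paramfile contents:
    sales2021.csv
    orders2021.csv

    # @activity('Lookup1').output.value then looks like:
    [
      { "Prop_0": "sales2021.csv" },
      { "Prop_0": "orders2021.csv" }
    ]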
I think you are asking how to loop through multiple files and merge all similar files into one data frame so you can push it into Azure Synapse SQL. Is that right? You can loop through files in a data lake by putting wildcard characters in the path to match similar files.
The Copy activity picks up only files that match the defined naming pattern, for example "*2020-02-19.csv" or "???20210219.json".
See the link below for more details.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/

Copying file from SFTP to Azure Data Lake Gen2

So my problem is quite stupid, but I cannot find a way to resolve it. I have one 15 GB file on an external SFTP server that I need to copy to my data lake. The thing is that the column delimiter is a comma, and I have some nested lists as well. So when I try to use the ADF Copy activity, most of my data is gone (the nested structures get cut at the first occurrence of a comma). So maybe I could ignore the delimiter. I have tried setting a pipe as the delimiter just to get the whole dataset as one column, but this doesn't work either.
PowerShell? I have tried different scripts that used to work with smaller files, and I am getting an error every time.
I have even tried to upload the file manually via Azure Storage Explorer, but it fails as well after some time. I am not really sure how to make this work at this point.
Thank you for any advice!

Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The Source is set to the folder where the files reside, so it grabs each file and loads it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available to use while traversing the files?
Thanks
This is not possible in a regular Copy activity. Mapping Data Flows has this capability; it's still in preview, but maybe it can help you out. If you check the documentation, you will find an option to specify a column to store the file name.

Export Sharepoint list to .csv and upload to Azure Data Lake Using Flow

I am trying to use Microsoft Flow to export a SharePoint list to Azure Data Lake.
I want it so that anytime a particular online list is changed, its entire contents are loaded into a file in Data Lake. If the file already exists, I want to overwrite it. Can someone please explain how I can go about doing this? I have tried multiple ways, but they are not getting the job done.
Thanks
I was able to get the items in the SharePoint list to near perfection. I will post the Flow here in case anyone in the future needs it.
So what I did is that every 5 minutes I "create" a file in Azure Data Lake, which overwrites the file if it already exists. The content of the file cannot be blank, so I add a newline as the content. Then I use Get items to retrieve all the items in the SharePoint list. From there, using an Apply to each loop, I append the content of the current row of the SharePoint list to the Data Lake file (separated by | and ending with a new line once all of the row's content is added). This works to near perfection, with the only caveat being the newline at the beginning of the file, which I eliminate using Power Query.
This is exactly what I needed. If anybody sees a way to make this better, please post so that we can get this to perfection.

Find a Blob in Azure Container

I have thousands & thousands of Blobs in a container, something like
A/Temp/A001-1.log
A/Temp/A001-2.log
A/Temp/A001-3.log
B/Tmp/B001-1.log
B/Tmp/B001-2.log
B/Tmp/B002-1.log
Now my problem is that I want to find blobs having A001 in their name. I understand that ListBlobsWithPrefix looks for blobs starting with some text, which is not the case for me. ListBlobs would bring all the blobs to my code, and then I would have to search for the ones I want. Is there any way I can just get the blobs I am looking for?
There's really no easy way to search a container for a specific blob (or set of blobs with a name pattern) aside from brute-force. And name prefixes, as you've guessed, won't help you either in this case.
What I typically advise folks to do is keep their searchable metadata somewhere else (maybe SQL DB, maybe MongoDB, doesn't really matter as long as it provides the search capability they need), with that data store containing a reference link to the exact blob. The blob name itself can also be stored in the metadata as one of the searchable properties.
Also: Once you get into the "thousands & thousands of blobs in a container," you'll find that pulling the blob names is going to take a while (which, again, I think you're seeing). Containers can certainly hold as many blobs as you want, but in that case, you really want to be accessing them directly, based on some other metadata, and not enumerating through the name list.
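If a one-off, brute-force search is acceptable, it can at least be done without writing code. A minimal sketch with the az CLI (account and container names are placeholders, credentials omitted); note that this still enumerates the blobs and only filters the names client-side:

    # list every blob name containing 'A001' in the given container
    az storage blob list \
        --account-name <storageaccount> \
        --container-name <container> \
        --query "[?contains(name, 'A001')].name" \
        --output tsv
    # very large containers may need --num-results raised (or '*') to avoid a truncated listing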
Instead of searching, construct the blob name if its prefix is known and then try downloading the blob. If the blob is not found, you will get a 404 Not Found exception.
As of today there's a preview feature, available in a few regions, that allows for Blob Storage indexing:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-manage-find-blobs?tabs=azure-portal
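If the preview is available on your account, the idea is to tag the blobs (for example with a tag holding the A001/B001 series) and then query by tag instead of enumerating names. A rough sketch, assuming the az storage blob filter command and its --tag-filter syntax (both from memory, so verify against the linked doc; the tag name and value here are made up):

    # find all blobs whose index tag "series" equals "A001"
    az storage blob filter \
        --account-name <storageaccount> \
        --tag-filter "\"series\"='A001'" \
        --auth-mode login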
Hope they make it available soon,
regards
