Get blob content from file using wildcard - azure

Example:
Blob: container/folder/myfile123.txt [where 123 is dynamic]
I am trying to get the content of an Azure blob file by using a wildcard, since the middle part of the name changes, while the leading name (like myfile) and the extension (like .txt) stay the same. I've tested patterns like myfile*.txt and myfile?.txt, but with no success when specifying the path.
Using the Get blob content action in Logic Apps, how can I get a file by its leading name and extension, with any possible combination in between?

You must use the exact name of the file.
What you can do is get a list of all blobs in the container and then loop over that list, getting each individual file.

You can use the "List blobs" connector and then the "Filter array" collections connector to get the wildcard functionality via the "contains"-operator. Then just use the "Get blob using path" and type in the expression: body('Filter_array')[0]['name']
Or in code view:
"path": "/my_catalogue/#{body('Filter_array')[0]['name']}"
to get the first filename that matches your wildcard.
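For comparison, here is a minimal sketch of the same wildcard logic done with the azure-storage-blob Python SDK instead of a Logic App; the connection string, container and folder names are placeholders:

from fnmatch import fnmatch
from azure.storage.blob import ContainerClient

# Placeholders: substitute your own connection string and container name.
container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="container",
)

# The storage API only supports prefix filtering, so list by prefix and
# apply the wildcard pattern client-side.
matches = [
    b.name
    for b in container.list_blobs(name_starts_with="folder/myfile")
    if fnmatch(b.name, "folder/myfile*.txt")
]

if matches:
    content = container.download_blob(matches[0]).readall()
    print(content)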

Related

Graph API DriveItem: How can I only get query results from the root directory (PREVENT recursive searching)?

https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/search(q='Matrix')
The above correctly returns all drive files with the word "Matrix" in them within the Shared%20Documents directory for the provided site ID (00000000-1111-2222-3333-444444444444).
However, it's recursive: it returns files with the word "Matrix" in them within subfolders too. I only want to query files in the root directory.
How do I search for file names, only within the root directory? I tried changing /drive to /drive/root like below, but it did not make a difference:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/root/search(q='Matrix')
ChatGPT recommended adding the filter $filter=parentReference/path eq '/drive/root':
https://graph.microsoft.com/v1.0/sites/MySite.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/search(q='Matrix')?$filter=parentReference/path eq '/drive/root'
...but I got the error "Only createdDateTime,remoteItem.shared.sharedBy.group.id,remoteItem.shared.sharedBy.user.id is supported for filtering", which ChatGPT didn't know how to get past.
I solved this by obtaining the folder ID of the root folder and using the /drive/items/{folderId}/children?$filter URI instead of /drive/search. I obtained the root folder's ID from the output of my first command, by copying the id inside the parentReference of an item that lies in my root directory.
Then I queried the files in the root directory with the following format:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,{siteId}/drive/items/{folderId}/children?$filter=startswith(name,'MyWord')
So in my case, the URI ended up looking like below:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/items/01NCSFADN6Y2GOVW7725BZO354PWSELRRZ/children?$filter=startswith(name,'Install')
Unfortunately, I couldn't use the contains function (which behaves similarly to /search) and had to use startswith, because contains isn't supported in $filter for text fields.
Finally, you can optionally append whichever field(s) you're interested in retrieving with the select parameter:
&select=name,@microsoft.graph.downloadUrl
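For illustration, here is a rough Python sketch of that final request; the access token, site ID and folder ID are placeholders, and the requests library is assumed:

import requests

token = "<access-token>"        # placeholder: obtain via your usual auth flow
site_id = "MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444"
folder_id = "<root-folder-id>"  # placeholder: the id from parentReference

url = (
    f"https://graph.microsoft.com/v1.0/sites/{site_id}"
    f"/drive/items/{folder_id}/children"
)
params = {
    "$filter": "startswith(name,'Install')",
    "$select": "name,@microsoft.graph.downloadUrl",
}
resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params)
resp.raise_for_status()

for item in resp.json().get("value", []):
    print(item["name"], item.get("@microsoft.graph.downloadUrl"))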

How to filter blobs in Azure SDK for Python

I want to search for blobs in my Azure blob storage according to a specific tag (like: .name, .creation_date, .size...)
My current way is returning all blobs from the container with MyContainerClient.list_blobs and searching for the corresponding tag afterwards. Since my container stores around 800000 blobs, this takes me around 20 min, which is not usable for a live view of the content.
But I also found another ContainerClient function, .find_blobs_by_tags(filter_expression: str), where I can search for a specific blob whose tags match the specified condition.
In the Azure API this filter_expression is specified as "\"yourtagname\"='firsttag'", therefore I specified "\"name\"='example.jpg'" or "\"creation_date\"='2021-07-04 09:35:19+00:00'".
Azure SDK Python - ContainerClient.find_blobs_by_tags
Unfortunately I always get an error:
azure.core.exceptions.HttpResponseError: Error parsing query at or near character position 1: unexpected 'creation_time'
RequestId:63bd850b-401e-005f-745e-400d5a000000
Time:2022-03-25T15:40:22.4156367Z
ErrorCode:InvalidQueryParameterValue
queryparametername:where
queryparametervalue:'creation_time'='0529121f-7676-46c7-8a52-424664774240/0529121f-7676-46c7-8a52-424664774240.json'
reason:This query parameter value is invalid.
Content: <?xml version="1.0" encoding="utf-8"?>
<Error><Code>InvalidQueryParameterValue</Code><Message>Error parsing query at or near character position 1: unexpected &apos;creation_time&apos;
RequestId:63bd850b-401e-005f-745e-400d5a000000
Time:2022-03-25T15:40:22.4156367Z</Message><QueryParameterName>where</QueryParameterName><QueryParameterValue>&apos;creation_time&apos;=&apos;0529121f-7676-46c7-8a52-424664774240/0529121f-7676-46c7-8a52-424664774240.json&apos;</QueryParameterValue><Reason>This query parameter value is invalid.</Reason></Error>
Does anyone have experience with these Azure function calls?
Looking at the GitHub code (in the find_blobs_by_tags function), it says:
:param str filter_expression:
The expression to find blobs whose tags matches the specified condition.
eg. "\"yourtagname\"='firsttag' and \"yourtagname2\"='secondtag'"
It looks like you are missing the escape characters. Can you try including them?
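To make the escaping concrete, here is a minimal sketch of that docstring syntax; the tag name "project" and its value are illustrative and assume such a blob index tag was set when the blobs were uploaded:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    container_name="mycontainer",    # placeholder
)

# The tag name goes in escaped double quotes and the value in single quotes,
# all inside one Python string, exactly as in the docstring example.
filter_expression = "\"project\"='invoices'"

for blob in container.find_blobs_by_tags(filter_expression):
    print(blob.name)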

Azure Data Factory removing spaces from column names of csv file

I'm a bit new to Azure Data Factory, so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
The situation is that we have an existing pipeline that takes the path to a csv file and passes it in as a delimited dataset. As a sink it uses a parquet dataset. This is a generic process that we can pass any delimited file into, and it will output it as parquet.
This has been working well, but now we have started receiving files with spaces and special characters in the header, which causes the output to parquet to fail. Unfortunately we don't have control over the format of the files we receive, so I can't handle this at source.
What I would like to do is, on ingestion of the file, replace any spaces and other special characters in the header with an underscore. If I were doing this on premise, I could quickly create a PowerShell script to do it. I had thought about creating a custom task in ADF to call a PowerShell script to do this in the blob storage, but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?
As @Joel Cochran mentioned, you can use the below expression in a Select transformation to replace spaces and special characters in the header:
regexReplace($$,'[^a-zA-Z]','_')
In the Select transformation, remove the auto mappings and add a new rule-based mapping that uses this expression.
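As a quick illustration of what that expression does to a header row, here is a small Python sketch using the same pattern; the example column names are made up:

import re

def clean_header(columns):
    # Mirror regexReplace($$,'[^a-zA-Z]','_'): anything that is not a letter
    # becomes an underscore.
    return [re.sub(r"[^a-zA-Z]", "_", c) for c in columns]

print(clean_header(["Order Id", "Amount ($)", "Ship-To Address"]))
# ['Order_Id', 'Amount____', 'Ship_To_Address']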
You can't change the output filename directly in the Copy activity, assuming you are using that activity.
The workaround is to use a parameter for the output filename that you can clean up.
You can use the Get Metadata activity to get all filenames from the source csv files.
Then loop over these files with a ForEach activity.
Within the ForEach activity you can set the output filename to the cleaned value.
The function could look like this:
@replace(item().name, ' ', '_')
More information on the replace function
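If you would rather clean the names up in blob storage itself instead of inside the pipeline, a rough sketch with the Python SDK could look like this (connection string and container name are placeholders); blob storage has no rename, so it copies each blob to a cleaned name and then deletes the original:

import time
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    container_name="mycontainer",    # placeholder
)

for blob in container.list_blobs():
    cleaned = blob.name.replace(" ", "_")
    if cleaned == blob.name:
        continue  # nothing to clean
    source = container.get_blob_client(blob.name)
    target = container.get_blob_client(cleaned)
    target.start_copy_from_url(source.url)
    # Wait for the server-side copy to finish before removing the original.
    while target.get_blob_properties().copy.status == "pending":
        time.sleep(1)
    source.delete_blob()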

Get blob contents from last modified folder in Azure container via Azure logic apps

I have an Azure logic app that's getting blob contents from my Azure storage account on a regular basis. However, my blobs are getting stored in sub-directories.
Eg. MyContainer > Invoice > 20200101 > Invoice1.csv
Every month the third sub-directory, '20200101', will change to '20200201', '20200301', and so forth.
I need my Logic app to return the blob contents of the latest folder that gets created in my container.
Any advice regarding this?
Thanks!!
For this requirement, please refer to my logic app below:
1. List all of the folders under /mycontainer/Invoice/.
2. Initialize two Integer variables, one named maxNum and the other named numberFormatOfName.
3. Use "For each" to loop over the value from "List blobs" above. In the "For each" loop, first set numberFormatOfName with the expression int(replace(items('For_each')?['Name'], '/', '')). Then add an "If" condition to check whether numberFormatOfName is greater than maxNum. If true, set maxNum to numberFormatOfName.
4. After the "For each" loop, use another "List blobs" to list all of the blobs in the latest (max number) folder. The folder path uses the expression string(variables('maxNum')); the same idea is sketched in code below.
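For comparison, a minimal sketch of the same "find the latest folder" logic with the Python SDK (outside the Logic App); the connection string is a placeholder and the container/folder names follow the example above:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    container_name="mycontainer",
)

# walk_blobs with a delimiter yields one virtual-folder prefix (name ending
# in "/") per sub-directory under Invoice/.
folders = [
    p.name
    for p in container.walk_blobs(name_starts_with="Invoice/", delimiter="/")
    if p.name.endswith("/")
]

# Folder names are dates like Invoice/20200101/, so the string max is the latest.
latest = max(folders)

for blob in container.list_blobs(name_starts_with=latest):
    print(blob.name)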
If you do not want to list the blobs but want to get the blob content instead, you can do it in a similar way with the "Get blob content" action.
==============================Update==============================
Running the logic app, I get the expected result. I created three folders 20200101, 20200202, 20200303 under /mycontainer/Invoice in my blob storage. The contents of the three csv files are 111,111, 222,222 and 333,333, and the logic app responds with the third csv file's content, 333,333.

Unable to copy file from SFTP in Azure Data Factory when using wildcard(*) in the filename

I am unable to copy csv files from an SFTP connection to blob storage when using the wildcard(*) in the filename.
More specifically, I receive csv files in the SFTP on a daily basis, and they are of the format "ddMMyyyyxxxxxx.csv", where "xxxxxx" is the timestamp. More concretely, my csv file for the 13th of March is "13032019083647.csv", while for the 14th of March it is "14032019083556.csv". Obviously, the timestamp is different every day, so I want to copy the file regardless of whatever string exists between the date and the file extension.
In the "File" subfield of the "File path" on the "Connection" tab of my dataset, I give as input "13032019*.csv", as instructed by the help icon next to the field.
When I do so, my Debug run fails with:
{"errorCode": "2200", "message":
"ErrorCode=UserErrorInvalidCopyBehaviorBlobNameNotAllowedWithPreserveOrFlattenHierarchy,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot
adopt copy behavior PreserveHierarchy when copying from folder to a
single file.,Source=Microsoft.DataTransfer.ClientLibrary}
I receive a similar error no matter which type of copy behaviour I choose. I have also tried experimenting with the fileFilter parameter (even though ADF warns that the same behaviour can be achieved with the fileName option), but I still end up getting the same error.
For further clarification, I am attaching the Code segment that ADF produces for this configuration:
I should also mention that when using the full fileName in the corresponding field, namely the value "13032019083647.csv", copying works normally.
Any help would be greatly appreciated!
My guess is that the wildcard operation might match more than one file.
In such cases we need to use a Get Metadata activity, a Filter activity and a ForEach activity to copy these files.
1. Get Metadata activity: use a dataset in this activity to point to the location of the files and pass Child Items as the field.
2. Filter activity: use the filter to keep only the files that match your needs.
3. ForEach activity: in the ForEach activity, take the items from the previous activity and add a Copy activity inside the loop.
In the Copy activity the source dataset should use @item().name.
I hope this will solve your issue.
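To make the Filter step concrete, here is a tiny Python illustration (outside ADF) of what the wildcard filter does to the child item names returned by Get Metadata; the file names are the ones from the question plus one extra:

import fnmatch

child_items = ["13032019083647.csv", "14032019083556.csv", "notes.txt"]
matches = fnmatch.filter(child_items, "13032019*.csv")
print(matches)  # ['13032019083647.csv']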
What worked for me was the following: I kept the same regex for the input file, but I set "Copy behaviour: Merge Files". Since, as mentioned, there is only one file that satisfies the regex condition, only one file was created as output. I am aware that this is a somewhat "dirty" solution, but it did the trick for me.
