Get custom metadata for blob files in Azure Data Factory

I have a few hundred files in a folder in Blob Storage. Each file has custom metadata (a dictionary of key/value pairs), and as I traverse the files I need to read that metadata for each one.
How can I read those details? I tried the Get Metadata activity, but it only exposes a fixed set of fields such as exists, filename, last modified, etc. I need the custom metadata of those files.
Please share some ideas.

#Sandeep
I am assuming you're looking to get the user-defined metadata through the Get Metadata activity.
Unfortunately, this is currently not possible - the activity only returns a predefined set of metadata fields.
The list of supported metadata fields is documented here.
Workaround
One workaround I can think of is to use the Web Activity in ADF and call the Get Blob Properties REST API.
The Get Blob Properties API returns all user-defined metadata, standard HTTP properties, and system properties for the blob.
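To see what you will get back, here is a rough sketch of the same Get Blob Properties call made outside ADF (assuming the Python requests library and a read SAS token; the account, container and blob names are placeholders). The user-defined metadata comes back as x-ms-meta-* response headers, which is also what you would parse out of the Web Activity output:

import requests

account = "mystorageaccount"   # placeholder: your storage account
container = "mycontainer"      # placeholder: your container
blob_name = "myfile.csv"       # placeholder: one of your files
sas_token = "sv=...&sig=..."   # placeholder: a read SAS token

url = f"https://{account}.blob.core.windows.net/{container}/{blob_name}?{sas_token}"
resp = requests.head(url, headers={"x-ms-version": "2020-10-02"})
resp.raise_for_status()

# Every x-ms-meta-* header is one of the blob's custom metadata pairs.
custom_metadata = {
    k[len("x-ms-meta-"):]: v
    for k, v in resp.headers.items()
    if k.lower().startswith("x-ms-meta-")
}
print(custom_metadata)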

Related

How to retrieve latest file from SharePoint to blob using logic app?

I get one new file in my SharePoint every day. The file is stored in "Shared Document/Data".
Below is my Logic App flow. I used Get files, selected the document library, and included nested items, because the new data arrives under the "Data" folder inside the Shared Document library.
I used a Filter Array to keep files whose last modified time is less than or equal to 5 minutes ago, so I can get the latest file.
I am facing two issues:
Get files takes all the files from SharePoint, and the Filter Array is not working.
I may have used Create blob incorrectly.
Can anyone advise me on how to do this?
Follow the workaround below.
You can use the following trigger and connector:
SharePoint (trigger) - When a file is created or modified in a folder
Use this trigger to select the exact directory and fetch the most recently modified/created file.
Azure Blob Storage (connector) - Create blob (V2)
Use this connector to create the blob.
Result
The modified file is fetched and added as a blob.
Refer here for more information.
Updated Answer
Here is the list of available directories in SharePoint that will be shown in the SharePoint trigger. You can select the one that matches your requirement.

Azure Data Factory - Add dynamic metadata in Copy Task

I am currently using ADF to copy a bunch of files from FTP to an Azure Storage account, and I have to add metadata to each file. I have been able to do this by adding metadata under the sink tab.
The problem is that this metadata is dynamic for each file and is derived from the file name. Can I do something like this in ADF, or do I need a separate Azure Function / API to update the metadata for each file?
Regards, Tarun
I think you can use an ADF expression here - the metadata value fields under the copy activity's sink tab accept dynamic content, so you can derive each value from the file name.
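If the expression route doesn't cover your case and you end up going with the Azure Function you mention, a minimal sketch of deriving metadata from the file name (assuming the azure-storage-blob Python SDK; the connection string, container name and naming rule are placeholders) could look like this:

from azure.storage.blob import BlobClient

conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"  # placeholder
blob_name = "sales_2023_eu.csv"  # placeholder: example file name

blob = BlobClient.from_connection_string(conn_str, container_name="staging", blob_name=blob_name)

# Derive the metadata value from the file name, e.g. the region is the last
# underscore-separated token before the extension.
region = blob_name.rsplit(".", 1)[0].split("_")[-1]
blob.set_blob_metadata({"region": region, "source": "ftp"})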

CreatedBy/LastModifiedBy information for a Blob in Azure Storage Container

I am trying to process some blobs in Azure Storage container. Our business users upload csv files to a blob container. The task is to process these files and persist the data in staging tables in Azure SQL DB for them to analyse later. This involves creating tables dynamically matching the file structure of the csv files. I have got this part working correctly. I am using python to accomplish this part of the task.
The next part of the task is to notify the user (who uploaded the blob) via an email once the blob has been processed in the DB by providing them with the table name corresponding to the blob. Ideally, I should also be able to set the permissions in the DB by giving read permissions to the user only on the table corresponding to the blob he uploaded.
To accomplish this, I thought I would read the blob owner or last-modified-by attributes from the blob properties and use that information for notifications/DB permissions. But I am not able to find any such property in the blob properties. I tried using diagnostic logging at the storage account level, but the logs also don't show any information about who created or modified a blob.
Can someone please guide me on how I can go about getting this working?
As the information about who created/last modified a blob is not available as a system property, you will need to come up with your own implementation. I can think of a few solutions for that (without using an external database to store this information):
Store this information as the blob's metadata: each blob can have custom metadata. You can store this information in the blob's metadata by creating two keys, CreatedBy and LastModifiedBy, and storing the appropriate values. Please note that a blob's metadata is not queryable and is also very easy to overwrite. Still, this is by far the easiest approach I can think of (a minimal sketch follows below).
Make use of x-ms-client-request-id: with each request to Azure Storage, you can pass a custom value in the x-ms-client-request-id request header. If storage analytics is enabled, this information gets logged. You could then query the analytics data to find it. However, it is extremely cumbersome to dig this out of the analytics logs, because the information is saved as a line item in a blob in the $logs container: you would first need to find the blob containing the entry, then download it, locate the appropriate log entry, and extract the information.
Considering neither solution is perfect, I would recommend that you go with saving this information in an external database. It would be much simpler to accomplish your goal that way.
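For the first option, a minimal sketch (assuming the azure-storage-blob Python SDK and that the upload goes through your own code, since the uploader's identity has to be known at upload time; all names and values are placeholders):

from azure.storage.blob import BlobServiceClient

conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"  # placeholder
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("uploads")  # placeholder container

# Write the uploader's identity into the blob's metadata at upload time.
with open("report.csv", "rb") as data:
    container.upload_blob(
        name="report.csv",
        data=data,
        metadata={"CreatedBy": "user@contoso.com", "LastModifiedBy": "user@contoso.com"},
        overwrite=True,
    )

# Read it back later while processing the blob.
props = container.get_blob_client("report.csv").get_blob_properties()
print(props.metadata.get("CreatedBy"))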
Blobs in Azure support custom metadata as a dictionary of key/value pairs you can save for each file, but in my experience it is not handy in all cases, especially because you cannot query over the metadata without reading each blob (which Azure will charge you for), not to mention the network transfer.
from:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-properties-metadata
Objects in Azure Storage support system properties and user-defined metadata, in addition to the data they contain.
System properties: System properties exist on each storage resource. Some of them can be read or set, while others are read-only. Under the covers, some system properties correspond to certain standard HTTP headers. The Azure storage client library maintains these for you.
User-defined metadata: User-defined metadata is metadata that you specify on a given resource in the form of a name-value pair. You can use metadata to store additional values with a storage resource. These additional metadata values are for your own purposes only, and do not affect how the resource behaves.
I had something very similar to do once, and to avoid creating and connecting to an external database I just created a table in the storage account to save each file's URL from Blob Storage along with the properties you need (such as user permissions), in an unstructured way.
You should find it extremely straightforward to query information from the table with Python (I did it with .NET, but it is pretty much the same).
https://learn.microsoft.com/en-us/azure/cosmos-db/table-storage-how-to-use-python
Azure Table storage and Azure Cosmos DB are services that store structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. Because Table storage and Azure Cosmos DB are schemaless, it's easy to adapt your data as the needs of your application evolve. Access to Table storage and Table API data is fast and cost-effective for many types of applications, and is typically lower in cost than traditional SQL for similar volumes of data.
Example code for filtering:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

# Connect with the storage account's connection string (placeholder values shown).
table_service = TableService(connection_string='DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;TableEndpoint=myendpoint;')

# Query entities in the table, filtering on the partition key.
tasks = table_service.query_entities('tasktable', filter="PartitionKey eq 'tasksSeattle'")
for task in tasks:
    print(task.description)
    print(task.priority)
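For completeness, the write side - one row per file - could look something like this (a sketch that reuses the table_service client from the snippet above; the table name, keys and columns are placeholders):

# One entity per processed file; the PartitionKey/RowKey scheme is up to you.
entity = {
    'PartitionKey': 'uploads',
    'RowKey': 'report.csv',
    'blob_url': 'https://myaccount.blob.core.windows.net/uploads/report.csv',
    'created_by': 'user@contoso.com',
    'permissions': 'read',
}
table_service.insert_entity('tasktable', entity)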
So you only need to create the table and use the keys from Azure to connect to it.
Hope it helps you.

Storing Documents on Azure with custom metadata

I am trying to find the best way to implement a small site that lets the user upload a file and then search it.
I used Azure Search with Blob Storage.
The file is stored in Blob Storage and then gets indexed by the Azure Search indexer - so far so good.
The problem is that I would like to add some custom data to each document, like a file ID and other business data; this data is not part of the document itself. Is there a way to achieve this?
Someone suggested I use Cosmos DB, though I am not sure it's the best way to go when it comes to documents.
Thanks
If you would like to keep using blob storage, you can store metadata with the blobs - just add custom metadata to your blobs, add corresponding fields to the search index, and the blob indexer will pick up the metadata.
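As a rough illustration of the index side (assuming the azure-search-documents Python SDK; the service name, key, and field/metadata names are placeholders), the custom fields just need the same names as the metadata keys you set on the blobs so the blob indexer can map them:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchFieldDataType,
)

client = SearchIndexClient("https://<service>.search.windows.net",
                           AzureKeyCredential("<admin-key>"))

index = SearchIndex(
    name="documents",
    fields=[
        # metadata_storage_path is a standard blob indexer field; it is commonly
        # used as the key (base64-encoded via a field mapping).
        SimpleField(name="metadata_storage_path", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        # These match the custom metadata keys set on each blob (file_id, business_unit).
        SimpleField(name="file_id", type=SearchFieldDataType.String, filterable=True),
        SimpleField(name="business_unit", type=SearchFieldDataType.String, filterable=True),
    ],
)
client.create_or_update_index(index)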

Delete remote files using Azure Data Factory

How do I delete all files in a source folder (located on an on-premises file system)? I would need help with a .NET custom activity or any out-of-the-box solution in Azure Data Factory.
PS: I did find a delete custom activity, but it is geared more towards Blob Storage.
Please help.
Currently, there is no support for a custom activity on Data Management Gateway. Data Management Gateway only supports the Copy activity and the Stored Procedure activity as of today (02/22/2017).
Workaround: as I do not have a delete facility for on-premises files, I am planning to keep source files in folders named yyyy-MM-dd. So every date folder (e.g. the 2017-02-22 folder) will contain all the related files. I will then configure my Azure Data Factory job to pull data based on the date.
Example: the ADF job on Feb 22nd will look for the 2017-02-22 folder; on its next run it will look for the 2017-02-23 folder. This way I don't need to delete the processed files.
Actually, there is a straightforward way to do it. You will have to create an Azure Functions app that accepts a POST with your FTP/SFTP settings (in case you use one) and the name of the file to remove. You then parse the request content as JSON, extract the settings, and use the SSH.NET library to remove the desired file. In case you have just a file share, you do not even need to bother with SSH.
Later on, in Data Factory, you add a Web Activity with dynamic content in the Body section constructing the JSON request in the form mentioned above. For the URL you specify the published Azure Function URL + ?code=<your function key>
We actually ended up creating a whole bunch of Azure Functions that serve as custom activities for our DF pipelines.
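As a rough sketch of such a function - the answer uses SSH.NET from .NET; here paramiko stands in for it, and the JSON field names are assumptions rather than the actual contract:

import json
import paramiko
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    host = body["host"]           # assumed field names
    port = int(body.get("port", 22))
    username = body["username"]
    password = body["password"]
    file_path = body["filePath"]  # remote file to delete

    transport = paramiko.Transport((host, port))
    try:
        transport.connect(username=username, password=password)
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.remove(file_path)    # delete the remote file over SFTP
    finally:
        transport.close()

    return func.HttpResponse(json.dumps({"deleted": file_path}),
                             mimetype="application/json")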
