I have a Data Factory pipeline that is triggered by a storage blob event. In the triggered event, I see two properties, TriggerTime and EventPayload. Since I need to read the storage-blob-related information, I am trying to process the EventPayload in the Data Factory. I would like to access a property like 'url' from the data tag.
A sample payload looks like this:
{
    "topic": "/subscriptions/7xxxxe5bbccccc85/resourceGroups/das00/providers/Microsoft.Storage/storageAccounts/datxxxxxx61",
    "subject": "/blobServices/default/containers/raw/blobs/sample.parquet",
    "eventType": "Microsoft.Storage.BlobCreated",
    "id": "a1c320d7-501f-0047-362c-xxxxxxxxxxxx",
    "data": {
        "api": "FlushWithClose",
        "requestId": "5010",
        "eTag": "0x8D82743B5D86E72",
        "contentType": "application/octet-stream",
        "contentLength": 203665463,
        "contentOffset": 0,
        "blobType": "BlockBlob",
        "url": "https://mystorage.dfs.core.windows.net/raw/sample.parquet",
        "sequencer": "000000000000000000000000000066f10000000000000232",
        "storageDiagnostics": {
            "batchId": "89308627-6e28-xxxxx-96e2-xxxxxx"
        }
    },
    "dataVersion": "3",
    "metadataVersion": "1",
    "eventTime": "2020-07-13T15:45:04.0076557Z"
}
Is there any shorthand for processing the EventPayload in the Data Factory? For example, the fileName and folderPath of an event can be accessed using @triggerBody() in the Data Factory. Does this require custom code such as an Azure Function?
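For reference, those built-in properties are surfaced by mapping them to pipeline parameters in the trigger definition; a minimal sketch of the relevant portion (pipeline and parameter names here are only illustrative):

    "pipelines": [
        {
            "pipelineReference": {
                "referenceName": "ProcessBlobPipeline",
                "type": "PipelineReference"
            },
            "parameters": {
                "sourceFolder": "@triggerBody().folderPath",
                "sourceFile": "@triggerBody().fileName"
            }
        }
    ]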
Related
I have an Azure Data Factory pipeline with two triggers:
schedule trigger
blob event trigger
I would like the blob event trigger to wait for a marker file in the storage account under a dynamic path, e.g.:
landing/some_data_source/some_dataset/@{formatDateTime(@trigger().scheduledTime, 'yyyyMMdd')}/_SUCCESS
Referring to @trigger().scheduledTime doesn't work.
How can I pass the scheduledTime parameter value from the schedule trigger to the blob event trigger?
If I understand correctly, you are trying to set the blob event trigger fields Blob path begins with or Blob path ends with using the scheduledTime from the schedule trigger.
Unfortunately, as we can confirm from the official MS doc Create a trigger that runs a pipeline in response to a storage event:
Blob path begins with and ends with are the only pattern matching
allowed in Storage Event Trigger. Other types of wildcard matching
aren't supported for the trigger type.
It takes the literal values.
Putting the dynamic expression in those fields does not work: the trigger stores it as a literal string, so it would only ever match a blob that literally had that expression in its name, which is unlikely. Only a plain literal path works.
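For context, in the trigger's JSON those fields are plain literal strings; a rough sketch of the documented BlobEventsTrigger shape, with placeholder values:

{
    "name": "MarkerFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/landing/blobs/some_data_source/",
            "blobPathEndsWith": "_SUCCESS",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
        }
    }
}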
Workaround:
As discussed with @marknorkin earlier, since this is not available out of the box in the blob event trigger, we can try an Until activity composed of GetMetadata and Wait activities, where the GetMetadata activity checks for the existence of the dynamic path.
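A rough, untested sketch of that pattern (activity and dataset names are placeholders; MarkerFileDataset would parameterize its folder path with the formatDateTime expression from the question):

{
    "name": "WaitForMarkerFile",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@activity('CheckMarkerFile').output.exists",
            "type": "Expression"
        },
        "timeout": "0.02:00:00",
        "activities": [
            {
                "name": "CheckMarkerFile",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "MarkerFileDataset",
                        "type": "DatasetReference"
                    },
                    "fieldList": [ "exists" ]
                }
            },
            {
                "name": "WaitBeforeRetry",
                "type": "Wait",
                "dependsOn": [
                    {
                        "activity": "CheckMarkerFile",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "typeProperties": {
                    "waitTimeInSeconds": 60
                }
            }
        ]
    }
}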
I have copied data to an Azure blob container using the Copy activity. I was able to use that to trigger my Azure Function with a blob trigger. However, my requirement is to call the Azure Function activity that can be configured in an Azure Data Factory pipeline. To do that, I need to pass the blob container path so that the Azure Function, based on an HTTP trigger, can read from this path. The blob trigger works but isn't allowed. Any idea how to get the path of the container and pass it to the Azure Function activity?
Edit:
I added the path to the request sent to the HTTP trigger of the Azure Function. This is where I need the fully formed path after the copy, say folder/myfolder/2010/10/01; however, the output of the path in the request doesn't contain it.
UPDATE:
The sink dataset, its connection, and the copy pipeline are all configured with a dynamic folder path built from a variable. When I ran a debug, instead of folder/myfolder/2020/10/01 the copy produced folder/myfolder/@variables('data').
According to the description of your question, it seems you do not know the target blob path of the Copy activity. I guess you use a pipeline parameter (say, one named testPath) to input the blob path in your data factory.
So in the HTTP trigger function's request body, you just need to reference that testPath parameter.
If your function request body needs to look like {"path":"xxx"}, you can use the concat() function in Data Factory to join the string together.
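For example (assuming the pipeline parameter is named testPath, as above), the dynamic content for the function body could be written roughly as:
@concat('{"path":"', pipeline().parameters.testPath, '"}')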
I want to create a blob storage trigger that takes any files put into blob storage (a fast process) and transfers them to Data Lake storage (NOT to another Blob Storage location).
Can this be done?
Can it be done using JavaScript, or does it require C#?
Does sample code exist showing how to do this? If so, would you be so kind as to point me to it?
Note: we've created a pipeline that will go from Blob Storage to Data lake storage. That's not what I'm asking about here.
You could potentially use an Azure Function or Azure Logic App to detect new files on Blob Storage and either call your webhook to trigger the pipeline or do the move itself.
Can this be done?
As jamesbascle mentioned, we could use an Azure Function to do that.
Can it be done using JavaScript, or does it require C#?
It can be done with JavaScript or C#.
Does sample code exist showing how to do this? If so, would you be so kind as to point me to it?
For how to create a Blob storage triggered function, please refer to this document. We can also get the C#/JavaScript demo code from that document.
JavaScript code
module.exports = function(context) {
context.log('Node.js Blob trigger function processed', context.bindings.myBlob);
context.done();
};
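For the JavaScript version, the blob trigger binding also needs a function.json next to the script; a minimal sketch (using the default AzureWebJobsStorage connection setting and the same sample path as in the C# sample below):

{
    "disabled": false,
    "bindings": [
        {
            "name": "myBlob",
            "type": "blobTrigger",
            "direction": "in",
            "path": "samples-workitems/{name}",
            "connection": "AzureWebJobsStorage"
        }
    ]
}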
C# code
[FunctionName("BlobTriggerCSharp")]
public static void Run([BlobTrigger("samples-workitems/{name}")] Stream myBlob, string name, TraceWriter log)
{
log.Info($"C# Blob trigger function Processed blob\n Name:{name} \n Size: {myBlob.Length} Bytes");
}
I have some data being collected that is in an XML format, something that looks like this:
<OLDI_MODULE xmlns="">
  <StStoHMI_IBE>
    <PRack>0</PRack>
    <PRackSlotNo>0</PRackSlotNo>
    <RChNo>0</RChNo>
    <RChSlotNo>0</RChSlotNo>
This data is sent to Azure Event Hub. I wanted to send this data to a SQL database. I created a stream in Azure Stream Analytics that takes this input and puts it in a SQL database. But when the input format is asked for the input stream, the only options are JSON, CSV, and Avro. Which of these formats can I use? Or which of the Azure services should I use to move data from Event Hub to a SQL database?
By far the easiest option is to use Azure Stream Analytics, as you intended to do. But yes, you will have to convert the XML to JSON or another supported format before you can use the data.
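For example, the XML fragment above could be translated into JSON along these lines before it is sent to the Event Hub:

{
    "OLDI_MODULE": {
        "StStoHMI_IBE": {
            "PRack": 0,
            "PRackSlotNo": 0,
            "RChNo": 0,
            "RChSlotNo": 0
        }
    }
}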
The other option is more complex: it requires some code and a way to host the code (using a worker role or web job, for instance), but it gives the most flexibility. That option is to use an EventProcessor to read the data from the Event Hub and put it in a database.
See https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/ for how to set this up.
The main work is done in the Task IEventProcessor.ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages) method. Based on the example, it will be something like:
async Task IEventProcessor.ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
foreach (EventData eventData in messages)
{
string xmlData = Encoding.UTF8.GetString(eventData.GetBytes());
// Parse the xml and store the data in db using Ado.Net or whatever you're comfortable with
}
//Call checkpoint every 5 minutes, so that worker can resume processing from 5 minutes back if it restarts.
if (this.checkpointStopWatch.Elapsed > TimeSpan.FromMinutes(5))
{
await context.CheckpointAsync();
this.checkpointStopWatch.Restart();
}
}
JSON would be a good data format to use with Azure Event Hub. Once you receive the data in Azure Event Hub, you can use Azure Stream Analytics to move the data to SQL DB.
Azure Stream Analytics consists of three parts: input, query, and output, where the input is the Event Hub and the output is the SQL DB. You write the query to select the desired fields and output them.
Check out the below article:
https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-define-outputs/
Stream Analytics is the Azure resource you should look into for moving the data from Event Hub.
How can you fetch data from an HTTP REST endpoint as an input for a data factory?
My use case is to fetch new data hourly from a REST HTTP GET and update/insert it into a DocumentDB in Azure.
Can you just create a linked service like this and point it at the REST endpoint?
{
    "name": "OnPremisesFileServerLinkedService",
    "properties": {
        "type": "OnPremisesFileServer",
        "description": "",
        "typeProperties": {
            "host": "<host name which can be either UNC name e.g. \\\\server or localhost for the same machine hosting the gateway>",
            "gatewayName": "<name of the gateway that will be used to connect to the shared folder or localhost>",
            "userId": "<domain user name e.g. domain\\user>",
            "password": "<domain password>"
        }
    }
}
And what kind of component do I add to create the data transformation job? I see that there are a bunch of things like HDInsight, Data Lake, and Batch, but I am not sure what the differences are or which service would be appropriate to simply upsert the new set into Azure DocumentDB.
I think the simplest way will be to use Azure Logic Apps.
You can make a call to any RESTful service using the HTTP connector among the Azure Logic Apps connectors.
So you can do a GET and POST/PUT, etc., in a flow based on a schedule or on some other GET listener.
Here is the documentation for it:
https://azure.microsoft.com/en-us/documentation/articles/app-service-logic-connector-http/
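For illustration, a stripped-down Logic App workflow definition with an hourly recurrence trigger and an HTTP GET action might look roughly like this (the URI is a placeholder, and the schema version may differ from what the linked article shows):

{
    "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
    "contentVersion": "1.0.0.0",
    "triggers": {
        "Hourly": {
            "type": "Recurrence",
            "recurrence": {
                "frequency": "Hour",
                "interval": 1
            }
        }
    },
    "actions": {
        "FetchData": {
            "type": "Http",
            "inputs": {
                "method": "GET",
                "uri": "https://example.com/api/new-data"
            },
            "runAfter": {}
        }
    },
    "outputs": {}
}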
To do this with Azure Data Factory you will need to utilize Custom Activities.
Similar question here:
Using Azure Data Factory to get data from a REST API
If Azure Data Factory is not an absolute requirement, Aram's suggestion of utilizing Logic Apps might serve you better.
Hope that helps.
This can be achieved with Data Factory. This is especially good if you want to run batches on a schedule and have a single place for monitoring and management. There is sample code in our GitHub repo for an HTTP loader to blob here: https://github.com/Azure/Azure-DataFactory. Then the act of moving the data from the blob to DocumentDB will do the insert for you using our DocDB connector. There is a sample on how to use this connector here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-documentdb-connector/. Here are the brief steps you will take to fulfill your scenario:
Create a custom .NET activity to get your data to blob.
Create a linked service of type DocumentDb.
Create a linked service of type AzureStorage.
Use input dataset of type AzureBlob.
Use output dataset of type DocumentDbCollection.
Create and schedule a pipeline that includes your custom activity and a Copy activity that uses BlobSource and DocumentDbCollectionSink; schedule the activities to the required frequency and availability of the datasets (a rough sketch of the copy pipeline follows).
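For the last step, a rough ADF v1-style sketch, assuming the datasets from the earlier steps are named AzureBlobInput and DocumentDbOutput (the custom activity is omitted and the start/end window values are placeholders):

{
    "name": "HttpToDocDbPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToDocDb",
                "type": "Copy",
                "inputs": [ { "name": "AzureBlobInput" } ],
                "outputs": [ { "name": "DocumentDbOutput" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "DocumentDbCollectionSink" }
                }
            }
        ],
        "start": "2016-01-01T00:00:00Z",
        "end": "2016-01-02T00:00:00Z"
    }
}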
Aside from that, choosing where to run your transforms (HDInsight, Data Lake, Batch) will depend on your I/O and performance requirements. You can choose to run your custom activity on Azure Batch or HDInsight in this case.