I have a Copy activity that copies data from Blob Storage to Azure Data Lake. The blob container is populated by an Azure Function with an Event Hub trigger, and the blob file names are appended with a UNIX timestamp, which is the event's enqueued time in the Event Hub. Azure Data Factory is triggered every hour to merge the files and move them over to the Data Lake.
The source dataset offers a filter by Last Modified date (in UTC) out of the box. I can use this, but it limits me to the blob's Last Modified date. I want to use my own date filters and decide where I want to apply them. Is this possible in Data Factory? If yes, can you please point me in the right direction?
Staying within ADF, the only idea that comes to mind is using a combination of the Lookup activity, ForEach activity, and Filter activity. It may be somewhat complex:
1. Use a Lookup activity to retrieve the data from the blob file.
2. Use a ForEach activity to loop over the result and apply your own date-time filters.
3. Inside the ForEach activity, do the copy task.
Please refer to this blog to get some clues.
Reviewing your description of all the tasks you are doing now, I suggest getting acquainted with the Azure Stream Analytics service. Whether the data source is Event Hubs or Azure Blob Storage, ASA supports it as an input, and it supports Azure Data Lake as an output.
You could create a job, configure the input and output, and then use the familiar SQL-like query language to filter your data however you want, for example with the WHERE clause or the date and time functions.
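For illustration, a minimal sketch of such a query; the input/output aliases and column names below are assumptions, not taken from your setup:

```sql
-- Hypothetical ASA query: read from the Event Hub (or Blob Storage) input,
-- apply a custom date filter on a timestamp field in the payload,
-- and write the result to the Data Lake output.
SELECT
    deviceId,
    eventTime,
    payload
INTO
    [datalake-output]
FROM
    [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
WHERE
    DATEDIFF(hour, eventTime, System.Timestamp()) <= 24
```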
I am using Azure Data Factory to copy data from a Blob Storage account to a Data Lake Storage Gen2 account.
I have created a pipeline with a Copy activity inside it. I trigger this pipeline from a timer-triggered Azure Function using the C# SDK.
I am copying only incremental data by making use of the Filter by last modified feature, passing a UTC StartTime and EndTime.
Now, the question is - I don't want to trigger the second activity if no files are found within this range. How can I do that?
You can use an If Condition activity and check whether any files were written with this expression: @greater(activity('Copy data1').output.filesWritten, 0). Then put the second activity inside the True case of the If Condition.
I currently have pipelines that leverage Azure Data Factory for orchestration and Azure Databricks for compute, and they perform the following actions: I receive tens of thousands of single-record JSON files into Azure Blob Storage in real time. On a 15-minute basis I check the folders for any new files, and once found I load them into a dataframe using Databricks and write them as a single file into SQL DB, before other ADF jobs trigger stored procedures which transform my data into the final SQL tables.
We are looking to move away from Databricks because we are not using it for its true capabilities but are of course paying the Databricks costs. I am looking for ideas on other solutions to load tens of thousands of JSON files into SQL DB (with minimal to no transformation) on a periodic (i.e. 15-minute) basis. We are a Microsoft shop, so we are not necessarily looking to move away from Azure tools.
Here are a few ideas:
Use Azure Functions with a Blob trigger or Event Grid to process the JSON files in real time (every time a new JSON file arrives, it triggers your function). The function could then insert into either the final table or a staging table (see the sketch after this list).
Another idea would be to combine Azure Functions with a Blob trigger or Event Grid to sink the data to a data lake, and then use ADF to load it into the final SQL tables.
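For the first idea, a minimal sketch of the table-side piece the function could call; the table, procedure, and JSON field names are assumptions for illustration:

```sql
-- Hypothetical staging table and procedure; the function passes the raw JSON
-- document of one file as the @json parameter.
CREATE TABLE dbo.EventsStaging (
    Id        INT            NOT NULL,
    DeviceId  NVARCHAR(50)   NOT NULL,
    EventTime DATETIME2      NOT NULL,
    Payload   NVARCHAR(MAX)  NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.InsertEventJson
    @json NVARCHAR(MAX)
AS
BEGIN
    -- Shred the single-record JSON document and insert it into the staging table.
    INSERT INTO dbo.EventsStaging (Id, DeviceId, EventTime, Payload)
    SELECT Id, DeviceId, EventTime, Payload
    FROM OPENJSON(@json)
    WITH (
        Id        INT            '$.id',
        DeviceId  NVARCHAR(50)   '$.deviceId',
        EventTime DATETIME2      '$.eventTime',
        Payload   NVARCHAR(MAX)  '$.payload' AS JSON
    );
END;
```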
Azure SQL DB is actually pretty capable as far as JSON goes, so you could just use OPENROWSET to import the data directly from blob storage and OPENJSON to shred it. You could then use a Logic App running on a schedule to call the proc, say, every 15 minutes; you wouldn't even need ADF as part of the solution.
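A minimal sketch of that approach; the external data source name, file path, table, and columns are assumptions, and the data source must already exist with a database scoped credential for the blob container:

```sql
-- Read one JSON file straight from blob storage and shred it into rows.
-- 'MyBlobStore' and the path are illustrative placeholders.
INSERT INTO dbo.EventsStaging (Id, DeviceId, EventTime)
SELECT j.Id, j.DeviceId, j.EventTime
FROM OPENROWSET(
        BULK 'input/event-000123.json',
        DATA_SOURCE = 'MyBlobStore',
        SINGLE_CLOB
     ) AS doc
CROSS APPLY OPENJSON(doc.BulkColumn)
WITH (
    Id        INT           '$.id',
    DeviceId  NVARCHAR(50)  '$.deviceId',
    EventTime DATETIME2     '$.eventTime'
) AS j;
```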
I've worked up a couple of similar answers previously, eg here and here, but let me know if you want to progress more down this route and we can work up something more detailed.
Is there a method that gives me the list of files copied into Azure Data Lake Storage after a Copy activity in Azure Data Factory? I have to copy data from a data source and afterwards skip files based on a particular condition. The condition must also check the file path and name against other data from a SQL database. Any idea?
As of now, there is no function to get the list of files after a Copy activity. You can, however, use a Get Metadata activity or a Lookup activity and chain a Filter activity to it to get the list of files that match your condition.
There's a workaround that you can check out here.
"The solution was actually quite simple in this case. I just created another pipeline in Azure Data Factory, which was triggered by a Blob Created event, and the folder and filename passed as parameters to my notebook. Seems to work well, and a minimal amount of configuration or code required. Basic filtering can be done with the event, and the rest is up to the notebook.
For anyone else stumbling across this scenario, details below:
https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger"
I want to copy data from Blob Storage (Parquet format) to Cosmos DB. The trigger is scheduled to run every hour, but all the files/data get copied in every run. How can I skip the files that have already been copied?
There is no unique key in the data, and we should not copy the same file content again.
Based on your requirements, you could look into the modifiedDatetimeStart and modifiedDatetimeEnd properties of the Blob Storage dataset.
However, you would need to modify the dataset configuration via the SDK at every interval so that the values of these properties keep moving forward.
Two other solutions you could consider:
1. Use a blob-triggered Azure Function. It is triggered whenever a blob file is created or modified, and you could then transfer the data from Blob Storage to Cosmos DB with SDK code.
2. Use Azure Stream Analytics. You could configure Blob Storage as the input and Cosmos DB as the output.
I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incremental number in its name. So after some time my storage will contain files such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load this data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will contain overlapping data: they are sent every 4 hours and contain data for the past 24 hours. I need to always read only the latest file (the one with the highest number) and prepare a lookup so that the data in the warehouse is updated properly. What is the best approach to make Stream Analytics read only the latest file and to update records in the DB?
EDIT:
To clarify: I am fully aware that ASA is not capable of acting as an ETL job. My question is what the best approach would be for my case using IoT tools.
I would suggest one of these 2 ways:
1. Use ASA to write to a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff (see the sketches below).
2. Remove duplicates by adding a unique constraint, as described here: https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
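Minimal sketches of both options; all table, column, and index names are assumptions for illustration:

```sql
-- Option 1 (sketch): ASA writes into a staging table, and an AFTER INSERT trigger
-- merges only the differences into the main warehouse table.
CREATE TRIGGER dbo.trg_StagingReadings_Merge
ON dbo.StagingReadings
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.Readings AS tgt
    USING inserted AS src
        ON tgt.DeviceId = src.DeviceId
       AND tgt.ReadingTime = src.ReadingTime
    WHEN MATCHED AND tgt.Value <> src.Value THEN
        UPDATE SET tgt.Value = src.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (DeviceId, ReadingTime, Value)
        VALUES (src.DeviceId, src.ReadingTime, src.Value);
END;
```

```sql
-- Option 2 (sketch): a unique index with IGNORE_DUP_KEY makes the SQL output
-- silently discard rows that have already been delivered, as in the linked article.
CREATE UNIQUE INDEX UX_Readings_Dedup
ON dbo.Readings (DeviceId, ReadingTime)
WITH (IGNORE_DUP_KEY = ON);
```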
Thanks,
JS - Azure Stream Analytics