I've made a pipeline to copy data from one blob storage to another. I want to do an incremental copy if possible, but I haven't found a way to specify it. The reason is that I want to run this on a schedule and only copy new data since the last run.
If your blob names include a timestamp, you could follow this doc to copy partitioned data. You could use the Copy Data tool to set up the pipeline: select a tumbling window, then in the file path field enter {year}/{month}/{day}/fileName and choose the right pattern. The tool will help you construct the parameters.
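As a side note, those {year}/{month}/{day} placeholders simply resolve from the tumbling window's start time; a minimal Python sketch of the equivalent formatting (the file name data.csv is just a placeholder):

```python
from datetime import datetime, timezone

# Assumption: this is the tumbling-window start time that the trigger passes to the pipeline.
window_start = datetime(2024, 5, 1, tzinfo=timezone.utc)

# {year}/{month}/{day}/fileName resolves to a dated folder path such as "2024/05/01/data.csv".
blob_path = f"{window_start:%Y}/{window_start:%m}/{window_start:%d}/data.csv"
print(blob_path)  # -> 2024/05/01/data.csv
```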
If your blob names do not include a timestamp, you could use the Get Metadata activity to check the last modified time. Please reference this post.
An event trigger is just one way to control when the pipeline should run. You could also use a tumbling window trigger or a schedule trigger in your scenario.
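If you want to prototype the last-modified check outside the pipeline first, a rough sketch with the azure-storage-blob Python SDK might look like this (the connection string, container name, and stored watermark are assumptions):

```python
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

# Assumptions: connection string, container name, and the watermark persisted from the last run.
SOURCE_CONN_STR = "<source-connection-string>"
last_run = datetime(2024, 5, 1, tzinfo=timezone.utc)

container = ContainerClient.from_connection_string(SOURCE_CONN_STR, container_name="source")

# Keep only blobs modified since the last run; these are the candidates for the incremental copy.
new_blobs = [b.name for b in container.list_blobs() if b.last_modified > last_run]
print(new_blobs)
```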
I'm going to presume that by 'incremental' you mean new blobs added to a container. There is no easy way to copy changes to a specific blob.
So, this is not possible automatically when running on a schedule since 'new' is not something the scheduler can know.
Instead, you can use a Blob created Event Trigger, then cache the result (Blob name) somewhere else. Then, when your schedule runs, it can read those names and copy only those blobs.
You have many options for the cache: a SQL table, another blob, and so on.
Note: The complication here is trying to do this on a schedule. If you can adjust the parameters to merely copy every new file, it's very, very easy because you can just copy the blob that created the trigger.
Another option is to use the trigger to copy the blob, on creation, to a temporary/staging container, then use a schedule to move those files to the final destination.
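For illustration only, here is a rough sketch of the "read the cached names, then copy" step done with the azure-storage-blob Python SDK instead of a Copy activity; the control container, file name, and container names are all assumptions:

```python
from azure.storage.blob import BlobServiceClient

# Assumptions: a "control" container holds "staging-names.txt" with one cached blob name per line,
# written by whatever handles the Blob Created event; source and destination share one account here.
service = BlobServiceClient.from_connection_string("<connection-string>")
cached = service.get_blob_client("control", "staging-names.txt").download_blob().readall()
names = cached.decode().splitlines()

src = service.get_container_client("source")
dst = service.get_container_client("destination")

for name in names:
    # Server-side copy; if source and destination live in different accounts,
    # the source URL needs a SAS token appended.
    dst.get_blob_client(name).start_copy_from_url(src.get_blob_client(name).url)
```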
I am entirely new to Azure, so if this is easy please just tell me to RTFM, but I'm not used to the terminology yet so I'm struggling.
I've created a data factory and pipeline to copy data, using a simple query, from my source data. The target data is a .txt file in my blob storage container. This part is all working quite well.
Now, what I'm attempting to do is store each row that's returned from my query in an individual file in blob storage. This is where I'm getting stuck; it seems like it should be pretty easy, but as I said I'm new to Azure and so far I'm not sure where to look.
You can set Max rows per file to 1 in the sink settings and leave the file name unset in the sink dataset. If you need to, you can specify a file name prefix in the File name prefix setting.
Screenshots:
The dataset of sink
Sink setting in the copy data activity
Result:
I'm trying to export multiple .csv files from blob storage to Azure Data Lake Storage in Parquet format, based on a parameter file, using ADF: a ForEach activity to iterate over each file in the blob container and a Copy activity to copy from source to sink (I have tried using the Get Metadata and ForEach activities).
As I'm new to Azure, could someone please help me implement a parameter file that will be used in the Copy activity?
Thanks a lot
If so, I created a simple test:
I have a paramfile that contains the names of the files to be copied later.
In ADF, we can use a Lookup activity to read the paramfile.
The dataset is as follows:
The output of Lookup activity is as follows:
In the ForEach activity, we should add the dynamic content @activity('Lookup1').output.value in the Items field. It will iterate over the output array of the Lookup activity.
Inside the ForEach activity, on the source tab we need to select Wildcard file path and add the dynamic content @item().Prop_0 in the Wildcard paths field.
That's all.
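If it helps to see the same Lookup + ForEach + Copy pattern as plain code, here is a rough Python sketch that reads the parameter file and converts each listed CSV to Parquet in the lake; it uses pandas/pyarrow rather than the Copy activity itself, and the container names, file system name, and paramfile name are assumptions:

```python
import io

import pandas as pd  # plus pyarrow for the Parquet writer
from azure.storage.blob import ContainerClient
from azure.storage.filedatalake import FileSystemClient

# Assumptions: connection strings, the "source" container, the "sink" file system,
# and a paramfile.txt listing one CSV file name per line.
src = ContainerClient.from_connection_string("<blob-connection-string>", "source")
sink = FileSystemClient.from_connection_string("<adls-connection-string>", "sink")

# "Lookup": read the parameter file.
file_names = src.download_blob("paramfile.txt").readall().decode().splitlines()

# "ForEach" + "Copy": convert each listed CSV to Parquet and write it to the lake.
for name in file_names:
    df = pd.read_csv(io.BytesIO(src.download_blob(name).readall()))

    buf = io.BytesIO()
    df.to_parquet(buf, index=False)

    parquet_name = name.rsplit(".", 1)[0] + ".parquet"
    sink.get_file_client(parquet_name).upload_data(buf.getvalue(), overwrite=True)
```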
I think you are asking for an idea of how to loop through multiple files and merge all similar files into one data frame, so you can push it into Azure Synapse. Is that right? You can loop through files in a lake by putting wildcard characters in the path to the files that are similar.
The Copy activity will pick up only files that match the defined naming pattern, for example "*2020-02-19.csv" or "???20210219.json".
See the link below for more details.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
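If you want to preview locally which blobs a pattern such as "*2020-02-19.csv" would pick up, here is a small sketch with the Python SDK and fnmatch (the connection string and container name are placeholders; the Copy activity filter itself supports only * and ?):

```python
import fnmatch

from azure.storage.blob import ContainerClient

# Placeholders: connection string and container name.
container = ContainerClient.from_connection_string("<connection-string>", "source")

# Same idea as the Copy activity filter: * matches any run of characters, ? a single character.
pattern = "*2020-02-19.csv"
matches = [b.name for b in container.list_blobs() if fnmatch.fnmatchcase(b.name, pattern)]
print(matches)
```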
Let's say that I have a blob container that contains files with the following names:
2cfe4d2c-2703-4b0f-bed0-6b239f238206
1a18c31f-bf28-4f64-a796-79237fabc66a
20a300dd-c032-405f-b67d-9c077623c26c
6f4041dd-92da-484a-966d-d5a168a9a2ec
(Let's say there are around 15,000 files.)
I want to delete around 1,200 of them. I have a file with all the names I want to delete. In my case it's JSON, but the format doesn't really matter; I know which files I want to delete.
What is the most efficient/safe way to delete these items?
I can think of a few ways, for example using az storage blob delete-batch or az storage blob delete. I am sure the former is more efficient, but I don't know how to use it here because there is no real pattern, just a big list of GUIDs (names) that I want to delete.
I guess I would have to write some code to iterate over my list of files to delete and then use the CLI or some azure storage library to delete them.
I'd prefer to just use some built-in tooling but I can't find a way to do this without having to write code to talk with the API.
What would be the best way to do this?
The tool AzCopy would be perfect for that. Using the azcopy remove command, you can specify a path to a text file with the --list-of-files={path} parameter to filter on specific files. The file should be plain text and line-delimited.
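If you would rather script it against the storage API instead, the azure-storage-blob Python SDK can batch the deletes for you; a rough sketch, assuming your list is a flat JSON array of names and the container is called mycontainer:

```python
import json

from azure.storage.blob import ContainerClient

# Assumptions: the container is called "mycontainer" and to_delete.json is a flat
# JSON array of blob names, e.g. ["2cfe4d2c-2703-4b0f-bed0-6b239f238206", ...].
container = ContainerClient.from_connection_string("<connection-string>", "mycontainer")

with open("to_delete.json") as f:
    names = json.load(f)

# delete_blobs() issues batched delete requests; the service accepts at most
# 256 sub-requests per batch, so chunk the list accordingly.
BATCH = 256
for i in range(0, len(names), BATCH):
    container.delete_blobs(*names[i:i + BATCH])
```

A quick safety check is to print the chunks first, or enable soft delete on the account, before running the real deletes.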
I am trying to set up a pipeline with copy data activity in Azure Data Factory and I am confused by the different view of mapping in the copy activity. I have created the pipeline from the template "Copy data from on premise SQL Server to SQL Azure" and I am cloning the activity so there shouldn't be any differences. The source is the same in both activities and I use query against the source database.
Here's how I see it:
Original copy activity:
Cloned copy activity:
I would like to understand why I see different views of mapping.
Thanks in advance!
I don't think that's a problem. When we clone a copy activity, before we debug or run the pipeline we need to check all the settings manually.
From your screenshots, the original copy activity is missing the source schema in the mapping. Just importing the schema will solve it.
And the cloned copy activity seems to import the schema automatically. I'm not sure whether all the columns are mapped (I think not). Some suggestions:
Import the schema in the source dataset first, and fully set up one copy activity.
Then clone that copy activity, which may avoid the problem.
Data Factory may not be very smart here; even when we clone an activity, we should still check all the settings in each activity.
Update:
As JeffRamos said, if the sinks are different, the mapping will be different.
We are glad to hear that you have figured it out:
"I have figured it out - I was using the query that contained the
"count(*)" aggregate. Removing it and the "group by" clause made the
mapping view the same as for the original Copy activity."
Thanks again to JeffRamos for the useful comment.
HTH.
I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The source is set to the folder where the files reside, so it grabs each file and loads it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available while traversing through the files?
Thanks
This is not possible in a regular Copy activity. Mapping Data Flows has this capability; it's still in preview, but maybe it can help you out. If you check the documentation, you'll find an option to specify a column to store the file name.
It looks like this:
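For comparison, if you end up doing this outside ADF, the same effect of carrying the originating file name in a column is straightforward with pandas; a rough sketch (connection string and container name are placeholders, and this bypasses the Copy activity entirely):

```python
import io

import pandas as pd
from azure.storage.blob import ContainerClient

# Placeholders: connection string and container name.
container = ContainerClient.from_connection_string("<connection-string>", "csv-files")

frames = []
for blob in container.list_blobs():
    data = container.download_blob(blob.name).readall()
    df = pd.read_csv(io.BytesIO(data))
    df["source_file"] = blob.name  # keep the originating file name alongside every row
    frames.append(df)

# One combined frame, ready to bulk-load into the database table.
combined = pd.concat(frames, ignore_index=True)
```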