Azure Data Factory Wildcard Characters

I have an SFTP location where .csv files are generally placed, and we process or pull the file present there using an ADF Copy activity. When no file is present and we give the exact filename and run the pipeline, it fails, which is as expected. But when we give a wildcard such as abc*.csv and run the pipeline with no file present in the SFTP location, the Copy activity passes, though rows written is 0. Can anyone tell me why this happens? We are using ADF v2.

The answer to this is nuanced. Here is the difference:
When you give an exact filename but the file doesn't exist, Data Factory tries to get it, and the request returns a 'file not found' error. This is passed up to the activity and is recognized as a failure.
When you give a wildcard, you are really asking "Get me a list of files that fit this pattern, then copy each of them". When no files match the pattern, the result is an empty list. Since the list length is 0, no requests to fetch any file are made, so there is no opportunity to be served a 'file not found' error.
This is my reasoning from my experience with Data Factory. I am not a member of the development team.
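That list-then-copy reasoning can be sketched in a few lines of Python. This is only an illustration, not ADF's actual implementation; `fnmatch` stands in for the service's wildcard matching, and the function name is hypothetical.

```python
from fnmatch import fnmatch

def wildcard_copy(listing, pattern):
    # Step 1: ask for a list of files that fit the pattern.
    matches = [name for name in listing if fnmatch(name, pattern)]
    # Step 2: copy each match. An empty list means zero fetch requests,
    # so there is never a chance to be served 'file not found'.
    return len(matches)  # stand-in for "files copied" / rows written
```

With no matching files, `wildcard_copy(["readme.txt"], "abc*.csv")` returns 0 without raising, whereas an exact-name fetch would surface an error.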

Related

Azure Data Factory Counting number of Files in Folder

I am attempting to determine if a folder is empty.
My current method involves using a GetMeta shape and running the following to set a Boolean.
@greater(length(activity('Is Staging Folder Empty').output.childItems), 0)
This works great when files are present.
When the folder is empty (a state I want to test for) I get
"The required Blob is missing".
Can I trap this condition?
What alternatives are there to determine if a folder is empty?
I have reproduced the above and got the same error.
This error occurs when the folder is empty and the source is Blob storage. You can see it works fine for me when the source is ADLS.
For the sample I have used a Set Variable activity inside the False branch of an If activity.
If the folder is empty:
Can I trap this condition?
What alternatives are there to determine if a folder is empty?
One alternative can be to use ADLS instead of Blob storage as source.
(or)
You can do the following if you want to avoid this error with Blob storage as the source. Add an If activity on the failure path of the Get Metadata activity and check the error in its expression:
@startswith(string(activity('Get Metadata1').error.message), 'The required Blob is missing')
In the True activities (the expected error occurred, i.e. the folder is empty) I have used a Set Variable activity for the demo.
In the False activities (any other error occurred) use a Fail activity to fail the pipeline.
Fail Message: @string(activity('Get Metadata1').error.message)
On success of the Get Metadata activity there is no need to check the count of childItems, because Get Metadata fails if the folder is empty. So, on success, continue with your normal activity flow.
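The branching logic can be sketched in Python, treating the activity as a callable that raises on failure. This is only an illustration of the control flow, assuming the error text matches what the Get Metadata activity reports; the function name and `RuntimeError` stand-in are hypothetical.

```python
def classify_get_metadata(run_get_metadata):
    try:
        # Success implies the folder has files (Get Metadata fails
        # on an empty Blob folder), so no childItems count is needed.
        run_get_metadata()
        return "has files"
    except RuntimeError as err:  # stand-in for activity('...').error.message
        # True branch of the If: the expected error means an empty folder.
        if str(err).startswith("The required Blob is missing"):
            return "empty"
        # False branch: any other error should fail the pipeline
        # (the Fail activity).
        raise
```

Only the "missing blob" failure is swallowed; anything else propagates, mirroring the Fail activity on the False branch.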
An alternative would be:
Blob:
Dataset: where test is the container and test is the folder inside the container which I am trying to scan (which ideally doesn't exist, as seen above).
Use a Get Metadata activity to check whether the folder exists.
If false, exit; else count the files.

Copy files from AWS s3 sub folder to Azure Blob

I am trying to copy files out of an S3 bucket using Azure Data Factory. First I want a list of the directories.
Using the CLI I would use aws s3 ls.
From there I can work through the list in a ForEach and push each entry into a variable.
In ADF I have tried to use Get Metadata, and this works in theory. In practice there are 76 files in each directory and the loop runs over 1.5 million items. It just isn't worth it; it takes far too long, especially as listing only the directories takes about 20 seconds for 20,000 directories.
Is there a method to get just this list? When creating the dataset we get a 'no permissions' error; however, when we use a specific location it works.
Many thanks
I have found another way of completing this task.
To begin with, I use Get Metadata with the childItems option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick it apart, using OPENJSON to get just the values. These can then be pulled apart further to get the directory names.
I then merge these into a table.
Using a Lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
If anyone wants a further explanation, please ask; I will try to create a walkthrough to assist.
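The unpacking step can be sketched in Python: the stringified childItems array is parsed and only folder entries are kept, which is essentially what OPENJSON does server-side in the stored procedure. The function name is hypothetical, and the childItems shape (`name`/`type` keys) is assumed from the Get Metadata output format.

```python
import json

def directory_names(child_items_string):
    # child_items_string: the Get Metadata 'childItems' array after
    # being stringified into a pipeline variable.
    items = json.loads(child_items_string)
    # Keep only the folder entries, discarding plain files.
    return [item["name"] for item in items if item.get("type") == "Folder"]
```

A Lookup against the resulting table then retrieves whichever directory name the pipeline needs next.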

Parametrization using Azure Data Factory

I have a pipeline in Azure Data Factory which I want to run while passing through all the files for a specific month, for example.
I have a folder called 2020/01 inside this folder is numerous files with different names.
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
Excellent, thanks Jay, it worked and I can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno
The question is: Can one pass a parameter through to only extract and
load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You didn't mention which connector you are using in the pipeline, but you did mention a folder in your question. As far as I know, most folder paths can be parameterized in the ADF Copy activity configuration.
You could create a parameter:
Then apply it in the wildcard folder path:
Even if your files' names have the same prefix, you could apply 01*.json to the wildcard file name property.
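The combination of a parameterized folder path and a wildcard file name can be sketched in Python. This is purely illustrative; `fnmatch` approximates the wildcard file name matching, and the function and parameter names are hypothetical.

```python
from fnmatch import fnmatch
from posixpath import join

def files_for_day(listing, month_path, day_prefix):
    # month_path plays the role of the pipeline parameter injected
    # into the wildcard folder path, e.g. '2020/01'.
    # day_prefix drives the wildcard file name, e.g. '01' -> '01*.json'.
    pattern = day_prefix + "*.json"
    return [join(month_path, name) for name in listing if fnmatch(name, pattern)]
```

Changing only the parameter values re-targets the same pipeline at a different month or day, which matches the questioner's goal.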

Delete CSV File from FTP server in Azure Data Factory

I can't seem to figure out how to do this.
I'm trying to use the delete activity to remove the CSV file I just processed in a pipeline.
After setting up the delete activity, I see nothing that indicates that it'll delete the file from my FTP server. After I debug/run the pipeline, I get an error. Everything I've seen related to using this activity is in regard to some other storage type.
Here's the actual error:
{
    "errorCode": "3703",
    "message": "Invalid delete activity payload with 'folderPath' that is required and cannot be empty.",
    "failureType": "UserError",
    "target": "DeleteCSVFromFTPServer"
}
But there's nowhere on the activity to specify the folder path.
Can anyone point me to an FTP-specific example of how to use the delete activity?
I figured I'd answer my own question since I figured this out about 5 minutes after I posted the question.
Hopefully this will help someone else out down the road.
The issue was that in the Dataset I had not supplied a value for the folder path. Leaving it empty worked on import, but would not work for the delete.
I supplied a . in the dataset's file path field as shown below.
Now the pipeline will run completely as I expect.
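The fix can be sketched in Python: the Delete activity rejects an empty folder path, and supplying '.' makes the target resolve to the file in the server's current directory. This mirrors the observed validation only; the function name is hypothetical and this is not the activity's actual code.

```python
from posixpath import join, normpath

def delete_target(folder_path, file_name):
    # Mirrors the validation seen in the error: folderPath
    # is required and cannot be empty.
    if not folder_path:
        raise ValueError("folderPath is required and cannot be empty")
    # With '.', the path normalizes to just the file name,
    # i.e. the file in the server's current directory.
    return normpath(join(folder_path, file_name))
```

So an empty dataset path fails the activity, while a '.' path deletes the expected file.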

Using Logic Apps to get specific files from all sub(sub)folders, load them to SQL-Azure

I'm quite new to Data Factory and Logic Apps (though I have many years of experience with SSIS),
I succeeded in loading a folder with 100 text files into SQL Azure with Data Factory,
but the files themselves are untouched.
Now, another requirement is that I loop through the folders to get all files with a certain file extension.
In the end I should move (= copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder.
I cannot find where to put 'wildcards' and such:
for example, get all files with file extensions .001, .002, .003, .004, .005, ... until ..., .996, .997, .998, .999 (a thousand files)
--> also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App ? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
Thanks in advance helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wildcards to your Azure Data Factory datasets, you have 3 attributes available within the JSON type properties block, as follows.
Folder Path - to specify the directory. Which can work with a partition by clause for a time slice start and end. Required.
File Name - to specify the file. Which again can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used for single and multiple characters. (*) for multi and (?) for single. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that, separately, none of the above is ideal for what you require, and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 values above into 1, allowing wildcards in various places and a partition-by condition that works with more than just date/time values.
That said. Try something like the below.
"typeProperties": {
    "folderPath": "TO_BE_PROCESSED",
    "fileFilter": "17-SKO-??-MD1.*" // looks like the 2 middle values in the image above
}
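The File Filter semantics described above ('*' matches any run of characters, '?' matches exactly one) are the same rules Python's `fnmatch` implements, so the filter can be sanity-checked locally. This is an assumption-labeled sketch, not the connector's own matching code.

```python
from fnmatch import fnmatch

def passes_file_filter(file_name, file_filter):
    # '*' matches any run of characters, '?' matches exactly one,
    # matching the File Filter wildcard rules described above.
    return fnmatch(file_name, file_filter)
```

For instance, `17-SKO-01-MD1.003` passes the filter `17-SKO-??-MD1.*`, but `17-SKO-1-MD1.003` does not, because `??` requires exactly two characters.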
On a side note, there is already a Microsoft feedback item that's been raised for a file move activity, which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through 'App Services' -> WebJobs.
It is much easier to iterate through folders that way. To load into SQL we used bulk insert.
