ADF Get Metadata shows that a file exists even if no file is present

I have a folder containing many .txt files, and I am using Get Metadata to check whether any file exists.
so in data set I mention
but even if no file exists it gives exists=true.
How can I provide the file name dynamically? I tried giving *.txt, but this also fails.

It works for me when I pass *.txt in the DelimitedText dataset settings.
Steps I followed:
Create a Get Metadata activity with a DelimitedText dataset.
In the dataset settings, pass *.txt as the file name to test all the files in the given blob.
In the Get Metadata activity, add Exists to the Field list.
Now run the pipeline and you will get the desired output.
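To make the intent of the *.txt check concrete, here is a minimal Python sketch of "does at least one file name match the wildcard". It does not call any ADF API; the function name and the sample listings are hypothetical, and the matching rules of Python's fnmatch only approximate whatever the connector does.

```python
from fnmatch import fnmatchcase


def any_file_matches(file_names, pattern="*.txt"):
    """Return True only if at least one file name matches the wildcard pattern."""
    return any(fnmatchcase(name, pattern) for name in file_names)


# Hypothetical folder listings standing in for the folder the dataset points at.
empty_folder = []
mixed_folder = ["report.csv", "notes.txt"]

print(any_file_matches(empty_folder))   # no names at all, so nothing can match
print(any_file_matches(mixed_folder))   # "notes.txt" matches *.txt
```

The point of the sketch is that an existence check on the folder itself and an existence check on "any file matching *.txt" are different questions, which may explain an unexpected exists=true.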

Related

ADF Copy Activity problem with wildcard path

I have a seemingly simple task: integrating multiple JSON files that reside in a Data Lake Gen2.
The problem is that the files that need to be integrated are located in multiple folders. For example, this is a typical structure that I am dealing with:
Folder1\Folder2\Folder3\Folder4\Folder5\2022\Month\Day\Hour\Minute\ <---1 file in Minute Folder
Then the same structure repeats for the year 2023, so to collect all the files I have to go to the bottom of the structure, which is the Minute folder. If I use a wildcard path it looks like this:
Wildcard paths: 'source from dataset'/*.json. It copies everything, including all folders, and I just want the files. I tried to narrow it down and copy only the files for 2022 first, but whatever I do with wildcard paths does not work. Help is much appreciated.
Trying different wildcard combinations did not help; obviously I am doing something wrong.
There is no option to copy files from multiple sub-folders to a single destination folder while keeping their names. Flatten hierarchy as a copy behavior will also produce autogenerated file names in the target (see the Microsoft documentation on copy behavior).
Instead, you can follow the approach below.
To list the file paths in the container, add a Lookup activity and connect it to an XML dataset that uses an HTTP linked service.
Give the Base URL in the HTTP connector as
https://<storage_account_name>.blob.core.windows.net/<container>?restype=directory&comp=list.
[Replace <storage_account_name> and <container> with the appropriate names in the above URL]
The Lookup activity returns the list of folders and files as separate line items.
Add a Filter activity and filter the URLs that end with .json from the Lookup activity output.
Settings of the Filter activity:
items:
@activity('Lookup1').output.value[0].EnumerationResults.Blobs.Blob
condition:
@endswith(item().URL,'.json')
Output of filter activity
Add a ForEach activity after the Filter activity and set its items to @activity('Filter1').output.value
Inside the ForEach activity, add a Copy activity.
Use an HTTP connector with a JSON dataset as the source, and give the base URL as
https://<account-name>.blob.core.windows.net/<container-name>/
Create a parameter for the relative URL and set its value to @item().name
In the sink, give the container name and folder name.
Give the file name as dynamic content:
@split(item().name,'/')[sub(length(split(item().name,'/')),1)]
This expression takes the file name from the relative URL value, that is, the last segment after splitting on '/'.
When the pipeline runs, all files from the multiple folders are copied to a single folder.
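The filter condition and the file-name expression above can be mirrored outside ADF to check the logic. The following is a minimal Python sketch, assuming a blob listing shaped like EnumerationResults/Blobs/Blob with a Name element; the sample XML, the blob names, and the function names are made up for illustration and are not output from a real storage account.

```python
import xml.etree.ElementTree as ET

# Made-up listing in the EnumerationResults/Blobs/Blob shape (assumption).
SAMPLE_LISTING = """<EnumerationResults>
  <Blobs>
    <Blob><Name>Folder1/2022/01/05/10/30</Name></Blob>
    <Blob><Name>Folder1/2022/01/05/10/30/events.json</Name></Blob>
    <Blob><Name>Folder1/2023/02/11/08/15/events2.json</Name></Blob>
  </Blobs>
</EnumerationResults>"""


def json_blob_names(listing_xml):
    """Keep only blob names ending in .json (the role of the Filter activity)."""
    root = ET.fromstring(listing_xml)
    names = [element.text for element in root.findall("Blobs/Blob/Name")]
    return [name for name in names if name.endswith(".json")]


def file_name_from_path(path):
    """Mirror split(path,'/')[sub(length(split(path,'/')),1)]: take the last segment."""
    parts = path.split("/")
    return parts[len(parts) - 1]


for blob_name in json_blob_names(SAMPLE_LISTING):
    # Each pair is (relative URL for the source, file name for the sink).
    print(blob_name, "->", file_name_from_path(blob_name))
```

Folder entries drop out at the filter step because their names do not end in .json, so only the files reach the copy step, each with its original name.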

I want to copy files at the bottom of the folder hierarchy and put them in one folder

In Azure Synapse Analytics, I want to copy files at the bottom of the folder hierarchy and put them in one folder.
The files I want to copy are located in their respective folders.
(There are 21 files in total.)
I tried using the flatten hierarchy option of the "Copy" activity.
However, the file names are then generated on the Synapse side rather than preserved.
I tried to get the name of the bottom-level file with the "Get Metadata" activity, but I could not use wildcards in the file path.
I considered creating and running 21 pipelines that would copy each file, but since the files are updated daily in Blob, it would be impractical to run the pipeline manually every day using 21 folder paths.
Does anyone know of any smart way to do this?
Any help would be appreciated.
Flatten hierarchy does not preserve the existing file names; new file names are generated. The Get Metadata activity does not accept wildcard paths. One option is therefore to use Get Metadata with ForEach to achieve the requirement.
The following are the images of folder structure that I used for this demonstration.
I created a Get Metadata activity first. It retrieves the folder names (21 folders like '20220701122731.zip') inside the Intage Sample folder, using Child items as the field list.
Next, I used a ForEach activity to loop through these folder names by giving the items value as @activity('Get folders level1').output.childItems.
Inside the ForEach I have 3 activities. The first is another Get Metadata activity to get the subfolder names (to get the one folder inside '20220701122731.zip', that is '20220701122731').
For this, while creating the dataset, I passed the name of the parent folder (folder_1 = '20220701122731.zip') to the dataset to use it in the path as
@{concat('unzipped/Intage Sample.zip/Intage Sample/',dataset().folder_1)}
This returns the names of the subfolders (like '20220701122731') inside each parent folder (like '20220701122731.zip', each of which has 1 subfolder). I used a Set variable activity to assign the child items output to a variable (test) using @activity('Get folder inner').output.childItems.
The final step is copy activity to move the required files to one single destination folder. Since there is only one sub-folder inside each of the 21 folders (only one sub-folder like '20220701122731' inside folder like '20220701122731.zip'), we can use the values achieved from above steps directly to complete the copy.
With the help of wildcard paths in this Copy data activity, we can complete the copy. The wildcard directory path will be
@{concat('unzipped/Intage Sample.zip/Intage Sample/',item().name, '/', variables('test')[0].name)}
@item().name gives the parent folder name, in your case '20220701122731.zip'
@variables('test')[0].name gives the sub-folder name, in your case like '20220701122731'
For sink, I have created a dataset pointing to a folder inside my container called output_files. When triggered, the pipeline runs successfully.
The following are the contents of my output_files folder.
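The path built by the concat expression can be traced with a small Python sketch. It only imitates the string handling; the dictionaries below are hypothetical stand-ins for the childItems output of the two Get Metadata activities, and the function name is made up.

```python
BASE = "unzipped/Intage Sample.zip/Intage Sample/"


def wildcard_directory(base, parent_folder, child_items):
    """Mirror concat(base, item().name, '/', variables('test')[0].name)."""
    return base + parent_folder + "/" + child_items[0]["name"]


# Hypothetical output of the outer Get Metadata activity (one of the 21 folders).
level1_items = [{"name": "20220701122731.zip", "type": "Folder"}]

# Hypothetical output of the inner Get Metadata activity, keyed by parent folder.
inner_items = {
    "20220701122731.zip": [{"name": "20220701122731", "type": "Folder"}],
}

for item in level1_items:
    # One iteration of the ForEach: resolve the directory for the copy source.
    print(wildcard_directory(BASE, item["name"], inner_items[item["name"]]))
```

Indexing with [0] relies on each parent folder holding exactly one sub-folder, as in the demonstration; a folder with several sub-folders would need an inner loop instead.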

How to read *.txt files in Azure Data Factory?

I'm trying to load data from a *.txt file into a SQL database using a Data Flow or Copy Data activity in Azure Data Factory, but I have not been able to. Here is my attempt:
File configuration (as you can see, I'm using the CSV option because it is the only way Azure lets me read the file):
Here is what Preview Data shows:
Everything looks fine, but once I use the dataset in a Data Flow, I get the following:
Is it possible to read a *.txt file with Azure? What am I doing wrong?
I tried with a sample text file and was able to get the original data in the Source transformation data preview.
Please check if you have selected the correct source dataset in your source transformation. Sometimes, when the source file is changed, it still shows old projections or incorrect projections and data previews. To reset you can change the output stream name or reconnect the source file.
Below is my source dataset connection and source settings.
Source dataset: text file
Dataflow source:

Newline in sink output data

Why does an Azure Data Factory data flow automatically add a new line to the output file? Can this be removed, or is there a setting to configure? See the screenshot of the first image.
output file
I have only 1 row/record when I preview the data.
sink data preview
Sorry, I had to remove/blur the data.
I tried to reproduce this scenario and you are right. This happens with some file types; I see it with .csv and binary files, for example.
When using a Binary dataset, ADF does not parse the file content but treats it as-is, and you can only copy from a Binary dataset to a Binary dataset.
And Data Preview is a snapshot of your transformed data using row limits and data sampling from data frames in Spark memory. Therefore, the sink drivers are not utilized or tested in this scenario. It shows limited number of rows when previewed and the number of columns shown in preview is adopted from the first row in the file.
I can see it as below:
Output file from sink in ADF preview editor in Storage container:
You can also confirm by looking at the inspect tab
I also tried downloading the output file to local and opening using different editors to confirm the behavior (New line '16' got appended automatically)
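Instead of relying on how different editors render the end of the file, the downloaded bytes can be inspected directly. This is a minimal Python sketch under that assumption; the sample payloads and the file name in the comment are made up.

```python
def trailing_line_ending(data: bytes):
    """Return 'CRLF', 'LF', or None depending on how the byte string ends."""
    if data.endswith(b"\r\n"):
        return "CRLF"
    if data.endswith(b"\n"):
        return "LF"
    return None


# Made-up payloads standing in for a downloaded sink file.
with_newline = b"id,name\n1,alpha\n"
without_newline = b"id,name\n1,alpha"

print(trailing_line_ending(with_newline))
print(trailing_line_ending(without_newline))

# For a real file, read it in binary mode first, for example:
# data = open("downloaded_output.csv", "rb").read()  # hypothetical file name
```

Reading in binary mode matters here, because text mode may translate line endings before the check sees them.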
Workaround: You can try using DelimitedText as the source dataset or Json as the sink dataset instead.
Please share your feedback with product group so that they can look into this.
Similar Feedback: https://feedback.azure.com/forums/217298-storage/suggestions/40268644--preview-file-in-blob-container-vs-edit

How to use Wildcard Filenames in Azure Data Factory SFTP?

I am using Data Factory V2 and have a dataset created that is located in a third-party SFTP. The SFTP uses a SSH key and password. I was successful with creating the connection to the SFTP with the key and password. I can now browse the SFTP within Data Factory, see the only folder on the service and see all the TSV files in that folder.
Naturally, Azure Data Factory asked for the location of the file(s) to import. I use the "Browse" option to select the folder I need, but not the files. I want to use a wildcard for the files.
When I opt to do a *.tsv option after the folder, I get errors on previewing the data. When I go back and specify the file name, I can preview the data. So, I know Azure can connect, read, and preview the data if I don't use a wildcard.
Looking over the documentation from Azure, I see they recommend not specifying the folder or the wildcard in the dataset properties. I skip over that and move right to a new pipeline. Using Copy, I set the copy activity to use the SFTP dataset, specify the wildcard folder name "MyFolder*" and wildcard file name like in the documentation as "*.tsv".
I get errors saying I need to specify the folder and wild card in the dataset when I publish. Thus, I go back to the dataset, specify the folder and *.tsv as the wildcard.
In all cases: this is the error I receive when previewing the data in the pipeline or in the dataset.
Can't find SFTP path '/MyFolder/*.tsv'. Please check if the path exists. If the path you configured does not start with '/', note it is a relative path under the given user's default folder ''. No such file .
Why is this that complicated? What am I missing here? The dataset can connect and see individual files as:
/MyFolder/MyFile_20200104.tsv
But fails when you set it up as
/MyFolder/*.tsv
I use Copy frequently to pull data from SFTP sources. You mentioned in your question that the documentation says to NOT specify the wildcards in the DataSet, but your example does just that. Instead, you should specify them in the Copy Activity Source settings.
In my implementations, the DataSet has no parameters and no values specified in the Directory and File boxes.
In the Copy activity's Source tab, I specify the wildcard values. Those can be text, parameters, variables, or expressions.
Alternatively, in the dataset you can specify the path up to the base folder, and then on the Source tab select Wildcard Path and put the subfolder in the first box (if it is there; in some activities, such as Delete, it is not present) and *.tsv in the second box.
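The difference between a literal path containing an asterisk and separate folder and file wildcards can be illustrated with a short Python sketch. The remote paths and the function name are hypothetical, and Python's fnmatch rules only approximate how the Copy activity evaluates its wildcards.

```python
from fnmatch import fnmatchcase
import posixpath


def matches_wildcards(path, wildcard_folder, wildcard_file):
    """Check the folder part and the file part of a path against separate wildcards."""
    folder, file_name = posixpath.split(path)
    return (fnmatchcase(folder.strip("/"), wildcard_folder)
            and fnmatchcase(file_name, wildcard_file))


# Hypothetical listing of the SFTP server.
remote_paths = [
    "/MyFolder/MyFile_20200104.tsv",
    "/MyFolder/readme.txt",
    "/Other/MyFile_20200104.tsv",
]

# Treated as a literal path, '/MyFolder/*.tsv' names no file on the server.
print("/MyFolder/*.tsv" in remote_paths)

# Treated as wildcard folder + wildcard file name, it selects the .tsv file.
print([p for p in remote_paths if matches_wildcards(p, "MyFolder*", "*.tsv")])
```

This matches the behaviour described in the question: the dataset appears to look for a file literally named *.tsv, while the Source tab settings are where the pattern is actually expanded.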

Resources