Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage - azure

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The Source is set as the folder where the files reside, so it's grabbing each file and loading it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available to use while traversing through the files?
Thanks

This is not possible in a regular Copy activity. Mapping Data Flows has this capability; it's still in preview, but maybe it can help you out. If you check the documentation, you'll find an option to specify a column to store the file name.
It looks like this:
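In the underlying data flow JSON it would look roughly like this sketch (the data flow, source, dataset and column names here are made-up placeholders; the "Column to store file name" option shows up as rowUrlColumn in the generated data flow script):

{
    "name": "ReadCsvWithFileName",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [
                {
                    "name": "BlobCsvSource",
                    "dataset": { "referenceName": "CsvFolder", "type": "DatasetReference" }
                }
            ],
            "script": "source(allowSchemaDrift: true, validateSchema: false, rowUrlColumn: 'fileName') ~> BlobCsvSource"
        }
    }
}

The extra fileName column then flows through to the sink along with the data columns, which is what you're after.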

Related

How to read *.txt files in Azure Data Factory?

I'm trying to load data from a *.txt file into a SQL database using a Data Flow or Copy Data activity in Azure Data Factory, but I haven't been able to do it. Below is my attempt:
File configuration (as you can see, I'm using the CSV option because it's the only way Azure lets me read the file):
Here is what the Data Preview shows:
Everything looks fine, but once I use the dataset in a Data Flow, I get the following:
Is it possible to read a *.txt file with Azure? What am I doing wrong?
I tried with a sample text file and was able to get the original data in the Source transformation data preview.
Please check that you have selected the correct source dataset in your source transformation. Sometimes, when the source file is changed, it still shows old or incorrect projections and data previews. To reset it, you can change the output stream name or reconnect the source file.
Below is my source dataset connection and source settings.
Source dataset: text file
Dataflow source:
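For reference, a delimited-text dataset pointing at a .txt file would look roughly like the sketch below; a .txt file is read with the same DelimitedText dataset type as a .csv, just with whatever columnDelimiter the file actually uses. The names TxtSource, AzureBlobStorage1, input and data.txt are made-up placeholders.

{
    "name": "TxtSource",
    "properties": {
        "description": "Hypothetical dataset: plain .txt read as delimited text",
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "AzureBlobStorage1", "type": "LinkedServiceReference" },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "data.txt"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}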

Copying data using Data Copy into individual files for blob storage

I am entirely new to Azure, so if this is easy please just tell me to RTFM, but I'm not used to the terminology yet so I'm struggling.
I've created a data factory and pipeline to copy data, using a simple query, from my source data. The target data is a .txt file in my blob storage container. This part is all working quite well.
Now, what I'm attempting to do is to store each row that's returned from my query into an individual file in blob storage. This is where I'm getting stuck, and I'm not sure where to look. This seems like something that'll be pretty easy, but as I said I'm new to Azure and so far am not sure where to look.
You can set Max rows per file to 1 in the sink settings and leave the file name unset in the sink dataset. If you need to, you can specify a file name prefix in the File name prefix setting.
Screenshots:
The dataset of sink
Sink setting in the copy data activity
Result:
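In the copy activity JSON, those sink settings correspond roughly to this fragment (the prefix and extension values are placeholders; as far as I know, Max rows per file and File name prefix map to maxRowsPerFile and fileNamePrefix in the delimited-text sink's formatSettings):

"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".txt",
        "maxRowsPerFile": 1,
        "fileNamePrefix": "row"
    }
}

With maxRowsPerFile set to 1, each row from the query lands in its own file, and the service appends a sequence number to the prefix.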

How to Export Multiple files from BLOB to Data lake Parquet format in Azure Synapse Analytics using a parameter file?

I'm trying to export multiple .csv files from Blob Storage to Azure Data Lake Storage in Parquet format, driven by a parameter file, using ADF: a ForEach to iterate over each file in the blob container and a Copy activity to copy from source to sink (I have tried using the Get Metadata and ForEach activities).
As I'm new to Azure, could someone please help me implement a parameter file that will be used in the copy activity?
Thanks a lot
I created a simple test:
I have a paramfile that contains the names of the files to be copied later.
In ADF, we can use a Lookup activity to read the paramfile.
The dataset is as follows:
The output of Lookup activity is as follows:
In the ForEach activity, we should add the dynamic content @activity('Lookup1').output.value. It will iterate over the output array of the Lookup activity.
Inside the ForEach activity, on the Source tab we need to select Wildcard file path and add the dynamic content @item().Prop_0 in the Wildcard paths.
That's all.
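Stitched together, a trimmed-down sketch of the pipeline JSON could look like this (the dataset names ParamFile, BlobCsvSource and LakeParquetSink are placeholders; the Wildcard file path option maps to wildcardFileName in the source's store settings):

{
    "name": "CopyFilesFromParamFile",
    "properties": {
        "activities": [
            {
                "name": "Lookup1",
                "type": "Lookup",
                "description": "Reads the paramfile that lists the files to copy",
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "dataset": { "referenceName": "ParamFile", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "description": "Iterates over the rows returned by Lookup1",
                "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
                    "activities": [
                        {
                            "name": "CopyFile",
                            "type": "Copy",
                            "description": "Copies the file named in the current item to the lake as Parquet",
                            "inputs": [ { "referenceName": "BlobCsvSource", "type": "DatasetReference" } ],
                            "outputs": [ { "referenceName": "LakeParquetSink", "type": "DatasetReference" } ],
                            "typeProperties": {
                                "source": {
                                    "type": "DelimitedTextSource",
                                    "storeSettings": {
                                        "type": "AzureBlobStorageReadSettings",
                                        "wildcardFileName": { "value": "@item().Prop_0", "type": "Expression" }
                                    }
                                },
                                "sink": { "type": "ParquetSink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}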
I think you are asking for an idea of how to loop through multiple files and merge all similar files into one data frame, so you can push it into Azure Synapse. Is that right? You can loop through files in the lake by putting wildcard characters in the path to files that are similar.
The Copy activity will pick up only files that match the defined naming pattern, for example "*2020-02-19.csv" or "???20210219.json".
See the link below for more details.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
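In JSON terms that wildcard filter sits in the copy source's store settings, for example (the folder name incoming is made up; the pattern is the one from the example above):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "wildcardFolderPath": "incoming",
        "wildcardFileName": "*2020-02-19.csv"
    }
}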

azure data factory: iterate over millions of files

Previously I had a problem with merging several JSON files into one single file,
which I was able to resolve with the answer to this question.
At first, I tried with just some files by using wildcards in the file name in the connection section of the input dataset. But when I remove the file name, in theory all of the files in all folders should be loaded recursively, since I checked the Copy recursively option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input dataset, only some of the files get loaded: the task ends successfully but loads only around 400+ files, while each folder has 1M+ files. I want to create big CSV files by merging all the small JSON files in the source (I was already able to create a CSV file by mapping the schemas in the copy activity).
It is probably stopping due to a timeout or out of memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without holding the list of all files in memory at the same time.

Specify the filename of the CSV inside the zip file on Azure Data Factory Copy

I have created a pipeline in Azure Data Factory using the Copy Data functionality.
It is copying a view from Azure SQL to a CSV file in Blob Storage. I have chosen to zip the file and name it Output_{year}{month}{day}.zip.
Everything is working perfectly; however, the zip file contains a CSV with a GUID for a filename. How can I make the filename inside the zip Output_{year}{month}{day}.csv?
For your case, currently there is no option in ADF to customize the file name inside the generated zip file.
One possible trick is to use two copy activities: the first copies to Output_{year}{month}{day}.csv, and the second copies from that file to Output_{year}{month}{day}.zip with "copyBehavior" set to "PreserveHierarchy" (the default).
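Sketched as pipeline JSON, the two chained copy activities could look something like this (the dataset names SqlView, CsvOutput and ZipOutput are placeholders; ZipOutput stands for your existing zipped dataset, the one that produces Output_{year}{month}{day}.zip via ZipDeflate compression, and copyBehavior sits in the second copy's sink store settings):

{
    "activities": [
        {
            "name": "CopyViewToCsv",
            "type": "Copy",
            "description": "First copy: the view from Azure SQL to a plain CSV named the way the file should appear inside the zip",
            "inputs": [ { "referenceName": "SqlView", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "CsvOutput", "type": "DatasetReference" } ],
            "typeProperties": {
                "source": { "type": "AzureSqlSource" },
                "sink": { "type": "DelimitedTextSink" }
            }
        },
        {
            "name": "CopyCsvToZip",
            "type": "Copy",
            "description": "Second copy: the CSV into the zipped dataset, preserving the file name",
            "dependsOn": [ { "activity": "CopyViewToCsv", "dependencyConditions": [ "Succeeded" ] } ],
            "inputs": [ { "referenceName": "CsvOutput", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "ZipOutput", "type": "DatasetReference" } ],
            "typeProperties": {
                "source": { "type": "DelimitedTextSource" },
                "sink": {
                    "type": "DelimitedTextSink",
                    "storeSettings": {
                        "type": "AzureBlobStorageWriteSettings",
                        "copyBehavior": "PreserveHierarchy"
                    }
                }
            }
        }
    ]
}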
