How to read *.txt files in Azure Data Factory?

I'm trying to load data from a *.txt file into a SQL database using a Data Flow or Copy Data activity in Azure Data Factory, but I haven't been able to make it work. Below is my attempt:
File configuration (as you can see, I'm using the CSV option because it's the only way Azure lets me read the file):
Here is what the Data Preview shows:
Everything looks fine, but once I use the dataset in a Data Flow, I get the following:
Is it possible to read a *.txt file with Azure? What am I doing wrong?

I tried with a sample text file and was able to get the original data in the Source transformation data preview.
Please check that you have selected the correct source dataset in your source transformation. Sometimes, when the source file changes, the transformation still shows old or incorrect projections and data previews. To reset it, you can change the output stream name or reconnect the source file.
Below is my source dataset connection and source settings.
Source dataset: text file
Dataflow source:


How to use a tab-delimited UTF-16LE file as a source in a Microsoft Azure Data Factory data flow

I am working for a customer in the medical business (so please excuse the many redactions in the screenshots). I am pretty new here, so please excuse any mistakes I might make.
We are trying to fill a SQL database table with data coming from two different sources (CSV files). Both are delivered to a blob storage account where we have read access.
The first flow I built to do this with Azure Data Factory works perfectly, so I thought I'd just clone that flow and point it at the second source. However, the CSV files from the second source are tab-delimited and UTF-16LE encoded. Luckily, you can set these parameters when you create a dataset:
Dataset Settings
When I verify the dataset using the "Preview Data" option, I see a nice list with data coming from the CSV file, so it appears to work fine!
Now I create a new data flow, and in the source I use the newly created dataset. I left all settings at their defaults.
But when I open Data Preview and click refresh, I get garbage and NULL outputs instead of the nice data I received when testing the dataset. In the first data flow I created, this does produce the expected data from the CSV file, but somehow the data is now scrambled.
Could someone please help me with what I am missing or doing wrong here?
I tried to repro this, and as you can see, if you set the dataset's Encoding to UTF-8 instead of UTF-16, you will be able to preview the data.
Data Preview inside the Dataflow:
Even when I enable UTF-16LE encoding, I run into the same issues:
Hence, for now, you could change the encoding and use the pipeline.
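If changing the dataset encoding isn't an option upstream, one workaround is to re-encode the file to UTF-8 before the data flow reads it. Below is a minimal sketch of that conversion in Python; the function name and the idea of running it as a pre-processing step (for example, in an Azure Function triggered by the blob landing) are assumptions, not part of the original answer.

```python
# Sketch: re-encode a UTF-16LE, tab-delimited CSV to UTF-8 so a data flow
# configured for UTF-8 can read it. The function name is hypothetical.
import csv
import io

def reencode_utf16le_to_utf8(src_bytes: bytes) -> bytes:
    """Decode UTF-16LE (dropping any BOM) and re-emit the text as UTF-8."""
    text = src_bytes.decode("utf-16-le").lstrip("\ufeff")
    return text.encode("utf-8")

# In-memory data standing in for the blob contents:
original = "col1\tcol2\nval1\tval2\n".encode("utf-16-le")
converted = reencode_utf16le_to_utf8(original)
rows = list(csv.reader(io.StringIO(converted.decode("utf-8")), delimiter="\t"))
print(rows)  # [['col1', 'col2'], ['val1', 'val2']]
```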

Newline in sink output data

Why does an Azure Data Factory data flow automatically add a new line to the output file? Can this be removed, or is there a setting to configure it? See the first screenshot.
output file
I have only 1 row/record when I preview the data.
sink data preview
Sorry, I had to remove/blur the data.
I tried to repro this scenario, and you are right. This happens with some file types, such as .CSV and binary files.
I know that when using a Binary dataset, ADF does not parse the file content but treats it as-is, and you can only copy from a Binary dataset to another Binary dataset.
Data Preview is a snapshot of your transformed data using row limits and data sampling from data frames in Spark memory. Therefore, the sink drivers are not utilized or tested in this scenario. It shows a limited number of rows when previewed, and the number of columns shown in the preview is taken from the first row in the file.
I can see it as below:
Output file from sink in ADF preview editor in Storage container:
You can also confirm by looking at the inspect tab
I also tried downloading the output file to my local machine and opening it with different editors to confirm the behavior (new line '16' got appended automatically).
Workaround: you can try using DelimitedText as the source dataset or JSON as the sink dataset instead.
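If the trailing line break only matters to a downstream consumer, another option is to strip it after the fact. This is a minimal post-processing sketch, not an ADF setting; the function name is hypothetical.

```python
# Sketch: remove a trailing newline from downloaded sink output as a
# post-processing step, if the extra line break causes problems downstream.
def strip_trailing_newline(data: bytes) -> bytes:
    """Drop any trailing CR/LF bytes without touching newlines inside the data."""
    return data.rstrip(b"\r\n")

sample = b"only,one,row\n"
print(strip_trailing_newline(sample))  # b'only,one,row'
```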
Please share your feedback with product group so that they can look into this.
Similar Feedback: https://feedback.azure.com/forums/217298-storage/suggestions/40268644--preview-file-in-blob-container-vs-edit

Copying data using Data Copy into individual files for blob storage

I am entirely new to Azure, so if this is easy please just tell me to RTFM, but I'm not used to the terminology yet so I'm struggling.
I've created a data factory and pipeline to copy data, using a simple query, from my source data. The target data is a .txt file in my blob storage container. This part is all working quite well.
Now, what I'm attempting to do is to store each row that's returned from my query into an individual file in blob storage. This is where I'm getting stuck, and I'm not sure where to look. This seems like something that'll be pretty easy, but as I said I'm new to Azure and so far am not sure where to look.
You can set Max rows per file to 1 in the sink settings and leave the file name empty in the sink dataset. If you need to, you can specify a file name prefix in the File name prefix setting.
Screenshots:
The dataset of sink
Sink setting in the copy data activity
Result:
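To illustrate what "Max rows per file = 1" produces, here is a small sketch of the same idea outside ADF: each source row lands in its own file, named with a prefix plus a sequence number. The naming scheme here is hypothetical; ADF generates its own file name suffixes.

```python
# Sketch: emulate "Max rows per file = 1" by writing each row to its own
# file name. The prefix/numbering convention is an assumption for illustration.
def split_rows_to_files(rows, prefix="data"):
    """Map each row to its own file name, mirroring one-row-per-file output."""
    files = {}
    for i, row in enumerate(rows, start=1):
        files[f"{prefix}_{i:05d}.txt"] = row
    return files

result = split_rows_to_files(["row one", "row two"])
print(sorted(result))  # ['data_00001.txt', 'data_00002.txt']
```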

How to export multiple files from Blob to Data Lake in Parquet format in Azure Synapse Analytics using a parameter file?

I'm trying to export multiple .csv files from blob storage to Azure Data Lake Storage in Parquet format, based on a parameter file, using ADF: a ForEach to iterate over each file in the blob and a Copy activity to copy from source to sink (I have tried using the Get Metadata and ForEach activities).
As I'm new to Azure, could someone please help me implement a parameter file to be used in the Copy activity?
Thanks a lot.
I created a simple test:
I have a paramfile contains the file names that will be copied later.
In ADF, we can use Lookup activity to the paramfile.
The dataset is as follows:
The output of Lookup activity is as follows:
In the ForEach activity, we should add the dynamic content @activity('Lookup1').output.value. It will iterate over the output array of the Lookup activity.
Inside the ForEach activity, on the source tab, we need to select Wildcard file path and add the dynamic content @item().Prop_0 in the Wildcard paths.
That's all.
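The control flow above can be sketched in plain Python: the Lookup activity returns an array of rows, and the ForEach iterates it, using each item's Prop_0 as the wildcard path. The sample file names are placeholders.

```python
# Sketch: the shape of the Lookup output and the ForEach iteration.
# @activity('Lookup1').output.value is the "value" array below;
# @item().Prop_0 is each row's first (unnamed) column.
lookup_output = {"value": [{"Prop_0": "file1.csv"}, {"Prop_0": "file2.csv"}]}

wildcard_paths = [item["Prop_0"] for item in lookup_output["value"]]
print(wildcard_paths)  # ['file1.csv', 'file2.csv']
```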
I think you are asking for an idea of how to loop through multiple files and merge all similar files into one data frame so you can push it into SQL Server Synapse. Is that right? You can loop through files in a lake by putting wildcard characters in the path to files that are similar.
The Copy Activity picks up only files that match the defined naming pattern, for example "*2020-02-19.csv" or "???20210219.json".
See the link below for more details.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
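The wildcard patterns in the answer behave like shell globs ('*' matches any run of characters, '?' matches exactly one). Python's fnmatch shows which file names such a filter would pick up; the sample names are made up for illustration.

```python
# Sketch: glob-style matching like the Copy Activity wildcard file filter.
import fnmatch

names = ["sales2020-02-19.csv", "sales2020-02-20.csv", "abc20210219.json"]
print(fnmatch.filter(names, "*2020-02-19.csv"))   # ['sales2020-02-19.csv']
print(fnmatch.filter(names, "???20210219.json"))  # ['abc20210219.json']
```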

Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The source is set to the folder where the files reside, so it's grabbing each file and loading it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available while traversing the files?
Thanks
This is not possible in a regular Copy activity. Mapping Data Flows has this capability; it's still in preview, but maybe it can help you out. If you check the documentation, you'll find an option to specify a column to store the file name.
It looks like this:
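For intuition, here is the same idea outside ADF: while reading each CSV, tag every row with the file it came from, mirroring the "column to store file name" option. The column name source_file and the in-memory files dict are assumptions for illustration.

```python
# Sketch: merge rows from several CSVs, adding a column that records which
# file each row came from (like the data flow's file-name column option).
import csv
import io

def rows_with_filename(files):
    """files: dict of name -> CSV text; yields dict rows tagged with their source."""
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source_file"] = name
            yield row

files = {"a.csv": "id\n1\n", "b.csv": "id\n2\n"}
print([r["source_file"] for r in rows_with_filename(files)])  # ['a.csv', 'b.csv']
```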
