Azure Data Factory - Copy Data activity sink - Max rows per file property

I would like to split my large file into smaller chunks inside blob storage via the ADF Copy Data activity. I am trying to do so using the Max rows per file property in the copy activity sink, but my file is not getting split into smaller files; I still get the same large file as the result. Can anyone share any valuable info here?

I tested it, and it works fine. My setup was as follows (screenshots omitted):
1. Source dataset
2. Source settings
3. Sink dataset
4. Sink settings
Result: the source file was written out as multiple smaller files in the sink.
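For reference, the Max rows per file setting in the UI corresponds to the maxRowsPerFile property under the copy activity sink's formatSettings for delimited text. Below is a minimal sketch of that part of the pipeline definition, written as a Python dict mirroring the JSON; the concrete values (100000 rows, the "part" prefix) are illustrative assumptions.

# Sketch of the copy activity sink section, expressed as a Python dict that
# mirrors the pipeline JSON. "maxRowsPerFile" and "fileNamePrefix" sit under
# formatSettings for a delimited text sink; the values are placeholders.
copy_activity_sink = {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "maxRowsPerFile": 100000,  # each output file holds at most 100000 rows
        "fileNamePrefix": "part"   # output files get this prefix plus a sequence number
    }
}

Note that when splitting output like this, the sink dataset is typically pointed at a folder rather than a fixed file name.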

Related

Azure Data Factory - Copy files using a CSV with filepaths

I am trying to create an ADF pipeline that does the following:
Takes in a CSV with 2 columns, e.g.:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt
Essentially, I want to copy/move each file from its Source path to its Destination path (both directories are in Azure Blob Storage).
I think there is a way to do this using lookups, but lookups are limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?
Thanks in advance.
This is a complex scenario for Azure Data Factory. As you mentioned, there are more than 5000 file path records in your CSV file, which also means the same number of Source and Destination paths. If you build this architecture in ADF, it will go like this:
You would use the Lookup activity to read the Source and Destination paths, but even there you can't read all the paths in one go because of the Lookup activity's row limit.
You would then iterate over the records using a ForEach activity.
You also need to split each path so that you get the container, directory, and file names separately to pass to the datasets created for the Source and Destination locations. Once you split the paths, you need Set variable activities to store the Source and Destination container, directory, and file names; these variables are then passed to the datasets dynamically. This is the tricky part: if even a single record fails to split properly, your pipeline will fail.
If the above steps complete successfully, you don't need to worry about the copy activity; as long as all the parameters receive the expected values on the Source and Sink tabs of the copy activity, it will work properly.
My suggestion is to use a programmatic approach instead. Use Python, for example, to read the CSV file with the pandas module, iterate over each path, and copy the files; this works fine even with 5000+ records (see the sketch below).
You can refer to this SO thread, which will help you implement the same thing programmatically.
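A minimal sketch of that programmatic approach, assuming both the Source and Destination paths live in the same storage account and start with the container name (as in the sample CSV); the connection string, CSV file name, and SAS handling are placeholders:

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Assumptions: the CSV has the two columns from the question (Source, Destination),
# each value is "<container>/<blob path>", and both containers live in the same
# storage account reachable via this connection string (placeholder).
CONN_STR = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(CONN_STR)

df = pd.read_csv("filepaths.csv", skipinitialspace=True)

for _, row in df.iterrows():
    src_container, src_blob = row["Source"].split("/", 1)
    dst_container, dst_blob = row["Destination"].split("/", 1)

    src_client = service.get_blob_client(container=src_container, blob=src_blob)
    dst_client = service.get_blob_client(container=dst_container, blob=dst_blob)

    # Server-side copy. For a private container the source URL usually needs a
    # SAS token appended instead of using src_client.url directly.
    dst_client.start_copy_from_url(src_client.url)
    print(f"Copy started: {row['Source']} -> {row['Destination']}")

To move rather than copy, you could delete each source blob after the copy status reports success.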
First, if you want to maintain a hierarchical pattern in your data, I recommend using ADLS (Azure Data Lake Storage), as this guarantees a proper structure for your data.
Second, if you have a folder in Blob Storage and you would like to copy files to it, use the Copy activity; you should define 2 datasets, one for the source and one for the sink.
Check this link: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview

How to insert file manipulation during copy activity in Data factory

I am using Data Factory to copy a collection from Mongo Atlas to ADLS Gen2.
By default, Data Factory creates one JSON file per collection, but that leaves me with one huge JSON file.
I checked data flows and transformations, but they work on files that are already present in ADLS. Is there a way I can split the data as it comes into ADLS, rather than first getting a huge file and then post-processing and splitting it into smaller files?
If the collection size is 5 GB, is it possible for Data Factory to split it into chunks of 100 MB as the copy runs?
I would suggest using partitioning: set the Partition option on the data flow sink's Optimize tab.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab

Integration Runtime out of memory ADF

I am using a data flow activity to convert MongoDB data to SQL.
As of now, MongoDB/Atlas is not supported as a source in data flows, so I am converting the MongoDB data to a JSON file in Azure Blob Storage and then using that JSON file as the source in the data flow.
For a JSON source file whose size is around/more than 4 GB, whenever I try to import the projection, the Azure Integration Runtime throws an out-of-memory error.
I have already changed the core count to 16+16 and the cluster type to memory optimized.
Is there any other way to import the projection?
Since your source data is one large file that contains lots of rows with possibly complex schemas, you can create a temporary file with a few rows that contains all the columns you want to read, and then do the following:
1. In the data flow source's Debug Settings, point to the sample file and select Import projection to get the complete schema.
2. Next, roll back the Debug Settings to use the original source dataset for the remaining data movement/transformation.
If you want to map data types as well, you can follow the official MS recommendation doc, as mapping data types is not directly supported for a JSON source.
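A minimal sketch of creating that temporary sample file, assuming the exported JSON is line-delimited (one document per line); the file names and the 100-row sample size are illustrative:

from itertools import islice

# Assumption: the Mongo export is line-delimited JSON (one document per line),
# available locally. Take the first 100 lines as a small sample file that still
# contains the columns needed for the projection.
SOURCE_FILE = "mongo_export.json"   # the ~4 GB export, downloaded or mounted locally
SAMPLE_FILE = "mongo_sample.json"

with open(SOURCE_FILE, "r", encoding="utf-8") as src, \
     open(SAMPLE_FILE, "w", encoding="utf-8") as dst:
    for line in islice(src, 100):
        dst.write(line)

If the documents vary in shape, pick sample rows that actually cover every field you want in the schema.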
The workaround for this was:
Instead of pulling all the data from Mongo into a single blob, I pulled small chunks (500 MB-1 GB each) by using the limit and skip options in the Copy Data activity, and stored them in different JSON blobs.
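A rough sketch of that same skip/limit chunking idea done outside ADF with pymongo, just to illustrate the paging; the connection string, database/collection names, and chunk size are all placeholders:

from pymongo import MongoClient
from bson.json_util import dumps

# Assumptions: connection string, database/collection names and chunk size are
# placeholders; adjust them to your Atlas cluster.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["mydb"]["mycollection"]

CHUNK_SIZE = 100_000  # documents per output file (not bytes)
total = collection.count_documents({})

for i, skip in enumerate(range(0, total, CHUNK_SIZE)):
    # skip/limit paging mirrors the skip and limit options used in the
    # Copy Data activity; each page goes to its own JSON-lines file.
    cursor = collection.find({}).sort("_id", 1).skip(skip).limit(CHUNK_SIZE)
    with open(f"chunk_{i:04d}.json", "w", encoding="utf-8") as f:
        for doc in cursor:
            f.write(dumps(doc) + "\n")

Sorting by _id keeps the paging stable while the collection is being read.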

Azure Data Factory - How to read only the latest dataset in a Delta format Parquet built from Databricks?

To be clear about the format, this is how the DataFrame is saved in Databricks:
folderpath = "abfss://container@storage.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderpath)
This produces a set of Parquet files (often in 2-4 chunks) in the main folder, along with a _delta_log folder that contains the files describing the data upload.
The delta log folder dictates which set of Parquet files in the folder should be read.
In Databricks, I would read the latest dataset, for example, by doing the following:
df = spark.read.format("delta").load(folderpath)
How would I do this in Azure Data Factory?
I have chosen Azure Data Lake Gen2 and then the Parquet format, however this doesn't seem to work: I get the entire set of Parquet files read (i.e. all datasets), not just the latest.
How can I set this up properly?
With a Data Factory pipeline this seems hard to achieve, but I have some ideas for you:
Use a Lookup activity to get the content of the _delta_log files. If there are many files, use a Get Metadata activity to get each file's last modified date.
Use an If Condition or Switch activity to filter out the latest data.
Once the data is filtered, pass the Lookup output to the copy activity source (set as a parameter).
The hardest part is that you need to figure out how to filter the latest dataset using the _delta_log. You could try it this way; the whole workflow would look roughly like the steps above, but I can't tell you if it really works since I couldn't test it without the same environment. A minimal sketch of the _delta_log parsing idea follows.
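The sketch below assumes the _delta_log folder has been downloaded or mounted locally (the path is a placeholder); it replays the commit JSON files in order to work out which Parquet files make up the latest version of the table:

import glob
import json
import os

# Assumption: the table's _delta_log folder is available locally at this path.
DELTA_LOG_DIR = "folder/path/_delta_log"

live_files = set()

# Commit files are named 00000000000000000000.json, ...01.json, and so on.
# Replaying them in order gives the current file set: "add" actions register a
# Parquet file, "remove" actions retire it. (Checkpoint files are ignored here
# for simplicity.)
for commit in sorted(glob.glob(os.path.join(DELTA_LOG_DIR, "*.json"))):
    with open(commit, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])

print("Parquet files for the latest version:")
for path in sorted(live_files):
    print(path)

The resulting file list could then drive the copy activity (for example via a Lookup and a parameterized dataset) so that only those Parquet files are read.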
HTH.

Is it possible to load selective data from a file to ADW through an Azure pipeline

I wanted to know whether we would be able to load selective data from a file into ADW through an Azure pipeline.
I will be getting a full file daily and I have to load the data into ADW, but I only need to load the last 2 days of data from the file. The file has a date that identifies which day each row belongs to.
I have gone through the pipeline documentation and couldn't find any way to filter the data directly from the file.
Could anyone please suggest whether this is possible?
Thanks
If the data file is a CSV file (or similar) in Blob Storage or Data Lake Store, you can create an external table in ADW that maps to the data file. Then you can use a SELECT query that retrieves only the last 2 days of data when loading from the external table, as sketched below.
/Magnus
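A minimal sketch of that load step from Python via pyodbc, assuming an external table ext.DailyFile has already been created over the file and exposes a FileDate column; every name and the connection details are illustrative assumptions:

import pyodbc

# Assumptions: ext.DailyFile is an external table already mapped to the daily
# file in Blob Storage / Data Lake Store, it has a FileDate column, and
# dbo.DailyData is the target table with a matching schema. Connection details
# are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<adw-database>;"
    "UID=<user>;PWD=<password>"
)

# Load only the last 2 days of data from the external table into ADW.
sql = """
INSERT INTO dbo.DailyData
SELECT *
FROM ext.DailyFile
WHERE FileDate >= DATEADD(day, -2, CAST(GETDATE() AS date));
"""

cursor = conn.cursor()
cursor.execute(sql)
conn.commit()
conn.close()

The same filtered SELECT could also be wrapped in a stored procedure invoked from the pipeline once the external table is in place.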
