How to insert file manipulation during a Copy activity in Azure Data Factory

I am using Data Factory to copy a collection from Mongo Atlas to ADLS Gen2.
By default, Data Factory creates one JSON file per collection, which leaves me with one huge JSON file.
I looked at data flows and transformations, but they work on files that are already present in ADLS. Is there a way to split the data as it lands in ADLS, rather than first getting one huge file and then post-processing it into smaller files?
If the collection size is 5 GB, can Data Factory split it into chunks of 100 MB as the copy runs?

I would suggest using partitioning: set a Partition option on the sink, under the data flow's Optimize tab, so the output is written as multiple files.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab
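Each sink partition is written out as its own file, so the partition count roughly controls the output file size. A back-of-the-envelope sizing sketch in Python (purely illustrative arithmetic, not ADF configuration; the 5 GB and 100 MB figures come from the question):

```python
# Rough sizing sketch (not ADF code): estimate how many sink partitions are
# needed so each output file lands near the 100 MB target from the question.
import math

collection_size_mb = 5 * 1024   # ~5 GB collection
target_file_mb = 100            # desired chunk size

partitions = math.ceil(collection_size_mb / target_file_mb)
print(f"Set about {partitions} partitions on the sink's Optimize tab")
# -> about 52; each partition is flushed to its own output file
```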

Related

How to use the output of a data flow in the Copy Data activity in Azure Data Factory

I have an Excel file which I transform in an Azure data flow in ADF. I have added some columns and transformed some values. As the next step I want to copy the new data into a Cosmos DB. How can I achieve this? It's not clear how I get the result of the data flow into the Copy Data activity. I have a sink in the data flow which stores the transformed data as CSV. As I understand it, ADF will create multiple files for performance reasons. Or is there a way to make the changes "on the fly" and work with the transformed file further?
Thanks
If your sink is Azure Cosmos DB for NoSQL, there is a direct sink connector available in Azure data flows. After applying your transformations, you can create a dataset and move the data into Cosmos DB directly.
If your sink is not Cosmos DB for NoSQL, then, as you have done, write your data as CSV files to your storage account. If you want to write the data to a single file, choose the 'Output to single file' option in the sink settings and give it a file name.
You can then directly select that file to copy to your sink. But if you already have the data written as multiple files in a folder, you can use the wildcard path option on the Copy activity source instead.
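For illustration, a wildcard path such as transformed/part-*.csv (the folder and pattern here are hypothetical) selects files the same way shell-style globbing does; a minimal sketch:

```python
# Sketch: which blob names a wildcard path like transformed/part-*.csv picks up.
# The folder and file names are illustrative examples, not from the post.
from fnmatch import fnmatch

blob_names = [
    "transformed/part-00000.csv",
    "transformed/part-00001.csv",
    "transformed/run_summary.json",   # not matched by the pattern
]

matched = [name for name in blob_names if fnmatch(name, "transformed/part-*.csv")]
print(matched)   # only the two CSV part files are copied
```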

Incremental data load from Azure Synapse to ADLS using Delta Lake

We have some views created in an Azure Synapse database. We need to query this data incrementally based on a watermark column and load it into the Azure Data Lake container in the Raw layer, and then into the Curated layer. In the Raw layer the file should contain the entire data (full-load data), so basically we need to append this data and export it as a full load. Should we use Databricks Delta Lake tables to handle this requirement? How can we upsert data into the Delta Lake table? We also need to delete a record if it has been deleted from the source. What should the partition column be?
Please look at the syntax for Delta tables - MERGE (upsert). Before the Delta file format, one would have to read the old file, read the new file, and do a set operation on the DataFrames to get the result.
The nice thing about Delta is the ACID properties. I like using DataFrames since the syntax can be more concise. Here is an article for you to read:
https://www.databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
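For the upsert itself, here is a minimal PySpark sketch of a Delta MERGE, assuming a hypothetical key column order_id, a watermark-filtered incremental extract already landed in the Raw layer, and a soft-delete flag is_deleted coming from the source (all names and paths are placeholders):

```python
# Minimal Delta MERGE sketch: upsert the incremental extract into the curated
# Delta table and remove rows that were deleted at the source.
# Paths, column names and the is_deleted flag are illustrative placeholders.
from delta.tables import DeltaTable

incremental_df = (spark.read
                  .format("parquet")
                  .load("/raw/orders_increment"))       # watermark-filtered extract

curated = DeltaTable.forPath(spark, "/curated/orders")  # existing Delta table

(curated.alias("t")
    .merge(incremental_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.is_deleted = true")  # propagate source deletes
    .whenMatchedUpdateAll()                              # update changed rows
    .whenNotMatchedInsertAll()                           # insert new rows
    .execute())
```

As for the partition column, a date-based column (for example the load or business date) is a common choice, but it really depends on how the curated layer is queried.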

Integration Runtime out of memory in ADF

I am using a data flow activity to convert MongoDB data to SQL.
As of now, MongoDB/Atlas is not supported as a source in data flows, so I am converting the MongoDB data to a JSON file in Azure Blob Storage and then using that JSON file as the source in the data flow.
For a JSON source file whose size is around or above 4 GB, whenever I try to import the projection, the Azure Integration Runtime throws an out-of-memory error.
I have already changed the core count to 16+16 and the cluster type to memory optimized.
Is there any other way to import the projection?
Since your source data is one large file that contains lots of rows with possibly complex schemas, you can create a temporary sample file with a few rows that contain all the columns you want to read, and then do the following:
1. In the data flow source, open Debug Settings, point the source at the sample file, and select Import projection to get the complete schema.
2. Next, roll back the Debug Settings to use the original source dataset for the remaining data movement/transformation.
If you also want to map data types, you can follow the official MS recommendation doc, as data type mapping is not directly supported for a JSON source.
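A quick way to produce such a sample file, assuming the large export is newline-delimited JSON and is available locally (file names and the row count are placeholders; make sure the sampled rows actually contain every column you need):

```python
# Sketch: copy the first few rows of a large JSON-lines export into a small
# sample file, so the data flow can import the projection from it instead of
# the multi-GB original. Assumes one JSON document per line.
SAMPLE_ROWS = 50

with open("mongo_export.json", "r", encoding="utf-8") as big, \
     open("mongo_export_sample.json", "w", encoding="utf-8") as sample:
    for i, line in enumerate(big):
        if i >= SAMPLE_ROWS:
            break
        sample.write(line)
```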
The workaround for this was:
Instead of pulling all the data from Mongo into a single blob, I pulled small chunks (500 MB-1 GB each) by using the limit and skip options in the Copy Data activity, and stored them in separate JSON blobs.
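The same chunking can also be scripted outside ADF with pymongo and the Blob SDK; a minimal sketch, where the connection strings, database/collection/container names, and the 50,000-document chunk size are all assumed placeholders:

```python
# Sketch: export a Mongo collection in fixed-size chunks (skip/limit) and write
# each chunk to its own JSON blob. All names and sizes are placeholder values.
from pymongo import MongoClient
from bson import json_util
from azure.storage.blob import BlobServiceClient

CHUNK_DOCS = 50_000   # tune so each blob lands in the 500 MB-1 GB range

mongo = MongoClient("<atlas-connection-string>")
collection = mongo["mydb"]["mycollection"]

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("raw")

skip, part = 0, 0
while True:
    docs = list(collection.find().skip(skip).limit(CHUNK_DOCS))
    if not docs:
        break
    container.upload_blob(f"mongo/part-{part:05d}.json",
                          json_util.dumps(docs),
                          overwrite=True)
    skip += CHUNK_DOCS
    part += 1
```

Note that skip/limit gets slower as the offset grows; for very large collections, a range filter on an indexed field (such as _id) scales better.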

Zero-byte files are getting created by ADF data flow

There is a Conditional Split in my ADF data flow. The success branch writes the rows to a SQL database, and the failure branch collects all the incorrect records and puts them into a sink of type CSV (delimited text).
When only the success condition is hit, an empty 0-byte CSV file still gets created in the failure sink.
How can I stop this?
If you don't want to write output to an external store, you can use a cache sink. It writes data into the Spark cache instead of a data store. In mapping data flows, you can reference this data within the same flow many times using a cache lookup. If you want to store it in a data store later, just reference the data as part of an expression.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types, you don't need to select a dataset or linked service because you aren't writing to an external store.
Note: A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup. A cache sink must also be the first sink written. When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in ordering.
Reference this data within the same flow using a cache lookup, as part of an expression, to store it to a data store. Alternatively, use a cache lookup against the source to select the data and write it to CSV or a log in a different stream or sink.
Refer: Cache sink, Cached lookup
If you still want to delete the empty zero-byte files instead, you can use ADF or a programmatic approach to delete them at the end of execution (the Delete activity in Azure Data Factory).
See: Examples of using the Delete activity
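If you go the programmatic route rather than the Delete activity, a minimal sketch with the azure-storage-blob SDK (connection string, container, and folder prefix are placeholders):

```python
# Sketch: remove 0-byte CSV files left behind by the failure-branch sink.
# Connection string, container and folder prefix are placeholder values.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("dataflow-output")

for blob in container.list_blobs(name_starts_with="errors/"):
    if blob.size == 0 and blob.name.endswith(".csv"):
        container.delete_blob(blob.name)
        print(f"Deleted empty file: {blob.name}")
```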

Azure Data Factory - Copy Data activity sink - Max rows per file property

I would like to split my big file into smaller chunks inside blob storage via the ADF Copy Data activity. I am trying to do so using the 'Max rows per file' property in the Copy activity sink, but my file is not getting split into smaller files; I get the same big file as the result. Can anyone share any useful info here?
I tested it, and it works fine when configured in this order:
1. Source dataset
2. Source settings
3. Sink dataset
4. Sink settings (with Max rows per file set)
Result: the output was written as multiple smaller files.
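For reference, the effect of 'Max rows per file' is essentially a split by row count; a tiny local sketch of the same behaviour (file names and the 100,000-row limit are illustrative placeholders):

```python
# Sketch of what "Max rows per file" does: write at most MAX_ROWS data rows per
# output file, repeating the header row in each part. Names are placeholders.
MAX_ROWS = 100_000

with open("big_export.csv", "r", encoding="utf-8") as src:
    header = src.readline()
    part, rows, out = 0, 0, None
    for line in src:
        if out is None or rows >= MAX_ROWS:
            if out:
                out.close()
            part += 1
            rows = 0
            out = open(f"export_part_{part:03d}.csv", "w", encoding="utf-8")
            out.write(header)
        out.write(line)
        rows += 1
    if out:
        out.close()
```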
