Integration Runtime out of memory in ADF - Azure

I am using a Data Flow activity to convert MongoDB data to SQL.
As of now, MongoDB/Atlas is not supported as a source in data flows, so I am converting the MongoDB data to a JSON file in Azure Blob Storage and then using that JSON file as the source in the data flow.
For a JSON source file of around 4 GB or more, whenever I try to import the projection, the Azure Integration Runtime throws an out-of-memory error.
I have already changed the core count to 16+16 and the compute type to Memory Optimized.
Is there any other way to import the projection?

Since your source data is one large file that contains many rows with a possibly complex schema, you can create a temporary sample file with just a few rows that contain all the columns you want to read, and then do the following:
1. In the data flow source's Debug Settings, point the source at the sample file, then select Import projection to get the complete schema.
2. Next, roll back the Debug Settings to use the original source dataset for the remaining data movement/transformation.
If you want to map data types as well, you can follow the official Microsoft recommendation doc, as data type mapping is not directly supported for a JSON source.
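If building that sample file is the sticking point, a minimal sketch like the one below can carve the first few rows out of the large export before uploading it back to Blob Storage. It assumes the export is line-delimited JSON (one document per line); the file names are placeholders.

```python
# Minimal sketch: copy the first few rows of a large line-delimited JSON export
# into a small sample file that Import projection can scan quickly.
# "big_export.json" and "sample.json" are placeholder names.
SAMPLE_ROWS = 20

with open("big_export.json", "r", encoding="utf-8") as src, \
     open("sample.json", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i >= SAMPLE_ROWS:
            break
        dst.write(line)
```

Upload the sample file to the same container and point the Debug Settings sample file at it; if the export is a single JSON array instead, take a few elements of the array rather than a few lines.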

The workaround for this was:
Instead of pulling all the data from Mongo into a single blob, I pulled small chunks (500 MB-1 GB each) by using the limit and skip options in the Copy Data activity and stored them in separate JSON blobs.
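As a rough illustration of how those chunks can be laid out (the document counts below are made up), the skip/limit pairs for each Copy Data run can be generated like this:

```python
# Sketch: compute (skip, limit) windows for chunked copies out of MongoDB.
# TOTAL_DOCS and CHUNK_DOCS are placeholders - pick a chunk size that lands
# around 500 MB-1 GB of exported JSON for your documents.
TOTAL_DOCS = 1_000_000
CHUNK_DOCS = 100_000

windows = [
    {"skip": skip, "limit": CHUNK_DOCS}
    for skip in range(0, TOTAL_DOCS, CHUNK_DOCS)
]

for i, w in enumerate(windows):
    # Each window becomes one Copy Data run that writes its own blob,
    # e.g. collection_part_0.json, collection_part_1.json, ...
    print(f"collection_part_{i}.json -> skip={w['skip']}, limit={w['limit']}")
```

In a pipeline, a list like this can be fed to a ForEach activity so the Copy Data activity runs once per window.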

Related

How to use the output of a data flow in the Copy Data activity in Azure Data Factory

I have an Excel file which I transform in an Azure data flow in ADF. I have added some columns and transformed some values. As the next step, I want to copy the new data into a Cosmos DB. How can I achieve this? It's not clear how I get the result of the data flow into the Copy Data activity. I have a sink in the data flow which stores the transformed data as a CSV. As I understand it, ADF will create multiple files for performance reasons. Or is there a way to make the changes "on the fly" and work with the transformed file further?
Thanks
If your sink is Azure Cosmos DB for NoSQL, there is a direct sink connector available in Azure data flows. After applying your transformations, you can create a dataset and move the data directly.
If your sink is not Cosmos DB for NoSQL, then, as you have done, write your data as CSV files to your storage account. If you want to write the data to a single file, you can choose the Output to single file option in the sink settings and give a file name.
You can then select this file directly in the Copy Data activity to copy it to your sink. If you already have the data written as multiple files in a folder, you can use the wildcard path option instead:
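For reference, the source side of that Copy activity ends up with wildcard settings roughly like the fragment below (written as a Python dict mirroring the activity JSON; the folder name is a placeholder, and the property names reflect the blob read settings as I understand them, so verify them against the JSON ADF generates):

```python
# Sketch of a Copy activity source that picks up the data flow's CSV output
# with a wildcard path. "dataflow-output" is a placeholder folder name.
copy_source = {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": True,
        "wildcardFolderPath": "dataflow-output",  # folder the data flow sink wrote to
        "wildcardFileName": "*.csv",              # pick up every part file
    },
    "formatSettings": {"type": "DelimitedTextReadSettings"},
}
```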

How to insert file manipulation during a Copy activity in Data Factory

I am using Data Factory to copy a collection from Mongo Atlas to ADLS Gen2.
By default, Data Factory will create one JSON file per collection, but that leaves me with one huge JSON file.
I checked data flows and transformations, but they work on files that are already present in ADLS. Is there a way I can split the data as it comes into ADLS rather than first getting a huge file and then post-processing and splitting it into smaller files?
If the collection size is 5 GB, is it possible for Data Factory to split it into chunks of 100 MB as the copy runs?
I would suggest using the Partition option on the data flow sink's Optimize tab.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab
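To put numbers on the 5 GB / 100 MB example from the question, the partition count to set on the sink's Optimize tab works out roughly like this (a back-of-the-envelope sketch; actual file sizes vary with row width and compression):

```python
# Rough sizing for the sink's partition count: split ~5 GB into ~100 MB files.
import math

collection_size_mb = 5 * 1024   # ~5 GB collection
target_file_mb = 100            # desired size per output file

partitions = math.ceil(collection_size_mb / target_file_mb)
print(partitions)  # 52 -> set roughly 50 partitions on the sink's Optimize tab
```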

How to read a JSON file which is more than 3 GB and has duplicate columns in nested elements

I am working with Azure technologies and want to read a JSON file which is more than 3 GB and has duplicate columns in nested elements.
I tried PySpark, data flows, and pipelines, but with no luck.
Could you please suggest which technique I could use?
Using an Azure Data Factory mapping data flow, you can split the big JSON file into small partitions: define the partitioning in the sink.
Then, using a data flow transformation, you can drop the duplicate values.
Follow https://mssqldude.wordpress.com/2019/03/23/partition-large-files-with-adf-using-mapping-data-flows/ for partitioning and https://tech-tutes.com/2020/10/19/remove-duplicate-data-using-data-flow-in-azure-data-factory/ to drop duplicate values.
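Since the question mentions PySpark, here is a sketch of what those two steps (drop duplicate rows, then write partitioned output) look like in PySpark terms. The paths are placeholders and this only illustrates the idea, not the data flow itself; note that reading the raw file directly with Spark may hit the same duplicate-column problem the question describes, which is why the answer routes it through a mapping data flow instead.

```python
# PySpark sketch of the same idea: read, drop duplicate rows, and write the
# result as many smaller partitioned files. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-and-partition").getOrCreate()

df = spark.read.json("abfss://container@account.dfs.core.windows.net/big.json")

deduped = df.dropDuplicates()   # analog of the data flow's de-duplication step

(deduped
 .repartition(50)               # analog of the sink's partition setting
 .write.mode("overwrite")
 .json("abfss://container@account.dfs.core.windows.net/output/"))
```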

Is there a way to copy files from a local machine to a Dataflow harness instance in Python + Apache Beam

I want to validate the data of each element in a ParDo function against a JSON schema file.
To make this work, I need to copy the JSON schema file from my local machine to the Dataflow harness instance created by the Python Beam Dataflow SDK.
Each individual element represents data for a separate table (there are 26 such tables, and an element can be dumped into any of them based on a key field in the element that holds the table name).
I want this JSON schema file to be copied only once, at the start of the Dataflow job, onto the harness instance, and then validate each element against the already stored JSON schema.
I came across a post saying to use the DoFn.setup() method, but I am not sure how to use it to copy a file from my local machine to the harness machine.
Python 3.6, apache-beam 2.26.0
Any help and/or pointers?
Thanks.
You can build your own SDK worker harness container image which contains your schema file; you can read more here: https://beam.apache.org/documentation/runtime/environments/
For your use case, have you considered storing your schema file in Google Cloud Storage? You should be able to read the file with file IO in your pipeline and feed the schema into your DoFn as a side input.
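A minimal sketch of the side-input approach, assuming the schema file has already been uploaded to a bucket (gs://my-bucket/schemas.json is a placeholder path, and the table-name key field and schema layout are hypothetical); FileSystems is part of the Beam SDK and can read gs:// paths when the GCP extras are installed:

```python
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def load_schemas(path):
    # Read the whole schema file once from GCS (or any Beam-supported filesystem).
    with FileSystems.open(path) as f:
        return json.loads(f.read().decode("utf-8"))


class ValidateElement(beam.DoFn):
    def process(self, element, schemas):
        # 'schemas' is the side input: assumed to be a dict keyed by table name.
        schema = schemas.get(element.get("table_name"), {})  # hypothetical key field
        required = schema.get("required", [])                # assumed schema layout
        if all(field in element for field in required):
            yield element
        else:
            yield beam.pvalue.TaggedOutput("invalid", element)


with beam.Pipeline() as p:
    schemas = (
        p
        | "SchemaPath" >> beam.Create(["gs://my-bucket/schemas.json"])
        | "LoadSchemas" >> beam.Map(load_schemas)
    )
    records = p | "Read" >> beam.Create([{"table_name": "t1", "id": 1}])
    results = records | "Validate" >> beam.ParDo(
        ValidateElement(), schemas=beam.pvalue.AsSingleton(schemas)
    ).with_outputs("invalid", main="valid")
```

If you prefer the DoFn.setup() route from the question, the same load_schemas() call can be made inside setup() and the result cached on the DoFn instance instead of being passed as a side input.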

Azure Data Factory - Copy Data activity sink - Max rows per file property

I would like to split my big file into smaller chunks inside blob storage via the ADF Copy Data activity. I am trying to do so using the Max rows per file property in the Copy activity sink, but my file is not getting split into smaller files; instead I get the same big file as a result. Can anyone share any valuable info here?
I tested it, and it works fine:
1. Source dataset
2. Source settings
3. Sink dataset
4. Sink settings
Result: the copy produces multiple smaller files instead of one big file.
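For comparison, a working sink fragment looks roughly like the dict below (mirroring the activity JSON). This assumes a delimited-text sink, where the Max rows per file box maps to the maxRowsPerFile write setting; the names and numbers are placeholders, so check the JSON your pipeline generates.

```python
# Sketch of a Copy activity sink with "Max rows per file" set.
# Assumes a delimited-text (CSV) sink; values are placeholders.
copy_sink = {
    "type": "DelimitedTextSink",
    "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "maxRowsPerFile": 1000000,   # start a new file every 1,000,000 rows
        "fileNamePrefix": "part",    # output named part_00000.csv, part_00001.csv, ...
    },
}
```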
