Zero-byte files are getting created by ADF data flow - Azure

There is a Conditional Split in my ADF data flow. Rows that meet the success condition are written to a SQL database, and rows that fail the conditions are collected and written to a sink of type CSV (delimited text).
When only the success condition is met, an empty 0-byte CSV file still gets created in that sink.
How can I stop this?

If you don't wish to write output to an external source, you can use a cache sink. It writes data into the Spark cache instead of a data store. In mapping data flows, you can reference this data within the same flow many times using a cache lookup. If you want to store this data later, just reference it as part of an expression.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types, you don't need to select a dataset or linked service because you aren't writing to an external store.
Note: A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup. A cache sink must also be the first sink written.
When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in ordering.
You can then reference this data within the same flow using a cached lookup and, as part of an expression, write it to a data store.
Alternatively, use a cached lookup against the source to select the records you need and write them to a CSV or log in a different stream or sink.
Refer: CacheSink, CachedLookup
If you still want to delete the empty zero-byte files, you can remove them at the end of execution either with ADF (Delete Activity in Azure Data Factory) or programmatically.
Examples of using the Delete activity
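For the programmatic route, a minimal sketch with the azure-storage-blob SDK is shown below; the connection string, container name, and folder prefix are placeholders you would replace with your own values.

    # Minimal sketch: remove zero-byte files left behind by the data flow sink.
    # Assumes connection-string auth; all names below are placeholders.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        "<storage-connection-string>",  # placeholder: your storage account connection string
        container_name="output",        # placeholder: container the data flow writes to
    )

    for blob in container.list_blobs(name_starts_with="errors/"):  # placeholder folder
        if blob.size == 0:
            container.delete_blob(blob.name)
            print(f"Deleted empty file: {blob.name}")

Run this as the last step of the pipeline (for example from an Azure Function or a small script) so only genuinely empty outputs are cleaned up.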

Related

How to use the output of a data flow in the Copy Data activity in Azure Data Factory

I have an Excel file which I transform in an Azure data flow in ADF. I have added some columns and transformed some values. As the next step I want to copy the new data into a Cosmos DB. How can I achieve this? It's not clear how I get the result of the data flow into the Copy Data activity. I have a sink in the data flow which stores the transformed data as a CSV. As I understand it, ADF will create multiple files for performance reasons. Or is there a way to make the changes "on the fly" and work with the transformed file further?
Thanks
If your sink is Azure Cosmos DB for NoSQL, there is a direct sink connector available in Azure data flows. After applying your transformations, you can create a dataset and move the data directly.
If your sink is not Cosmos DB for NoSQL, then, as you have done, write your data as CSV files to your storage account. If you want the data written to a single file, you can choose the "Output to single file" option in the sink settings and give it a filename.
You can then directly select this file to copy to your sink. But if you already have data written as multiple files in a folder, you can use the wildcard path option in the Copy activity source instead.
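If you would rather skip the Copy activity and load the data flow's CSV output into Cosmos DB yourself, a rough sketch using the azure-storage-blob and azure-cosmos SDKs follows; the container, folder, database, and credential names are all assumptions, not values from the question.

    # Rough sketch: read the CSV part files written by the data flow and upsert
    # them into Cosmos DB. All names and credentials below are placeholders.
    import csv
    import io

    from azure.cosmos import CosmosClient
    from azure.storage.blob import ContainerClient

    blobs = ContainerClient.from_connection_string("<storage-connection-string>", "output")
    cosmos = CosmosClient("<cosmos-endpoint>", credential="<cosmos-key>")
    target = cosmos.get_database_client("mydb").get_container_client("items")

    # The data flow may write several part files; treat everything under the folder as one dataset.
    for blob in blobs.list_blobs(name_starts_with="transformed/"):
        if not blob.name.endswith(".csv"):
            continue
        text = blobs.download_blob(blob.name).readall().decode("utf-8")
        for row in csv.DictReader(io.StringIO(text)):
            row.setdefault("id", row.get("Id", ""))  # Cosmos items require an 'id' field
            target.upsert_item(row)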

How to insert file manipulation during copy activity in Data factory

I am using Data factory to copy collection from Mongo Atlas to ADLS Gen 2.
By default data factory will create one json file per collection. But that leaves me with one huge json file.
I checked data flows and transformations, but they work on files that are already present in ADLS. Is there a way I can split the data as it comes into ADLS, rather than first getting a huge file and then post-processing and splitting it into smaller files?
If the collection size is 5GB, is it possible for data factory to split them in chunks of 100MB as the copy runs?
I would suggest setting a partition option on the sink (under the sink's Optimize tab) so the output is written as multiple smaller files.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab

Integration Runtime out of memory ADF

I am using data flow activity to convert MongoDB data to SQL.
As of now, MongoDB/Atlas is not supported as a source in data flows. I am converting the MongoDB data to a JSON file in Azure Blob Storage and then using that JSON file as a source in the data flow.
For a JSON source file whose size is around or more than 4 GB, whenever I try to import the projection, the Azure Integration Runtime throws an out-of-memory error.
I have changed the core count to 16+16 and the cluster type to memory optimized.
Is there any other way to import the projection?
Since your source data is one large file that contains lots of rows with maybe complex schemas, you can create a temporary file with a few rows that contain all the columns you want to read, and then do the following:
1. In the data flow source's Debug Settings, set the sample file, then select Import projection to get the complete schema.
2. Next, roll back the Debug Settings to use the source dataset for the remaining data movement/transformation.
If you want to map data types as well, you can follow the official MS recommendation doc, as data type mapping is not directly supported for a JSON source.
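To produce the small sample file without ever opening the multi-GB export, one option is to read just the head of the blob. The sketch below uses the azure-storage-blob SDK and assumes the export is in "set of objects" format (one JSON document per line); connection string, container, and blob names are placeholders.

    # Sketch: build a small sample file from the first few records of a large
    # JSON-lines export so Import projection only has to read the sample.
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        "<storage-connection-string>",             # placeholder
        container_name="staging",                  # placeholder
        blob_name="mongo-export/collection.json",  # placeholder
    )

    # Download only the first ~1 MB instead of the whole multi-GB file.
    head = blob.download_blob(offset=0, length=1024 * 1024).readall().decode("utf-8")

    lines = head.splitlines()[:-1]   # drop the last line in case the 1 MB cut truncated it
    sample = "\n".join(lines[:50])   # a few complete documents are enough for the schema

    with open("sample.json", "w", encoding="utf-8") as f:
        f.write(sample)

Upload sample.json to your storage account and point the Debug Settings sample file at it.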
The workaround for this was: instead of pulling all the data from Mongo into a single blob, I pulled small chunks (500 MB-1 GB each) using the limit and skip options in the Copy Data activity and stored them in separate JSON blobs.
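The same skip/limit chunking can also be done outside ADF if that is easier to operate. A rough pymongo sketch follows; the connection string, database, collection, and chunk size are all assumptions.

    # Rough sketch of skip/limit chunking done programmatically with pymongo,
    # writing one JSON-lines file per chunk. All names below are placeholders.
    import json

    from pymongo import MongoClient

    client = MongoClient("<mongodb-atlas-connection-string>")  # placeholder
    collection = client["mydb"]["mycollection"]                # placeholder names

    chunk_size = 50_000   # tune so each output file lands in the 500 MB-1 GB range
    total = collection.count_documents({})

    for i, skip in enumerate(range(0, total, chunk_size)):
        docs = collection.find({}).skip(skip).limit(chunk_size)
        with open(f"collection_part_{i:04d}.json", "w", encoding="utf-8") as f:
            for doc in docs:
                # default=str handles ObjectId and datetime values
                f.write(json.dumps(doc, default=str) + "\n")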

Prevent empty file generation using Azure Data Factory Copy Activity

I'm using Azure Data Factory to copy data from Azure Cosmos DB to Azure Data Lake. My pipeline consists of a copy activity which copies data to the Data lake sink.
This is my query on the source dataset:
select * from c
where c.data.timestamp >= '#{formatDateTime(addminutes(pipeline().TriggerTime, -15), 'yyyy-MM-ddTHH:mm:ssZ' )}'
AND c.data.timestamp < '#{formatDateTime(pipeline().TriggerTime, 'yyyy-MM-ddTHH:mm:ssZ' )}'
I'm getting the data for the last 15 minutes before the trigger time.
Now, if there is no data retrieved by the query then the copy activity generates an empty file and stores it in the data lake. I want to prevent that. Is there any way I can achieve this?
You could use a Lookup activity and then an If Condition activity to decide whether you need to run the copy activity.
In the Lookup activity, you could set firstRowOnly to true, since you only want to check whether there is any data.
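Conceptually, the check is "only copy when the windowed query returns at least one row". The sketch below expresses that same logic with the azure-cosmos SDK rather than the ADF activities themselves; the endpoint, key, database, container, and time-window values are placeholders.

    # Conceptual sketch (not the ADF activities): only proceed with the copy
    # when the windowed query returns at least one row. Names are placeholders.
    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-endpoint>", credential="<account-key>")
    container = client.get_database_client("mydb").get_container_client("items")

    query = (
        "SELECT VALUE COUNT(1) FROM c "
        "WHERE c.data.timestamp >= @start AND c.data.timestamp < @end"
    )
    count = list(container.query_items(
        query=query,
        parameters=[
            {"name": "@start", "value": "2024-01-01T00:00:00Z"},  # example window start
            {"name": "@end", "value": "2024-01-01T00:15:00Z"},    # example window end
        ],
        enable_cross_partition_query=True,
    ))[0]

    if count > 0:
        print("Rows found - run the copy")      # in ADF: If Condition True branch -> Copy activity
    else:
        print("No rows - skip the copy, so no empty file is written")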
This is an older thread, but someone might have a more elegant way to handle the issue above, where ADF produces a file even when there are 0 records. Here are my concerns with the Lookup approach and with having a post-process clean up the empty file:
It's inefficient to query the database twice just to check whether there are any rows.
Using the If Condition component is not possible if you are already inside an If or Case component of ADF (this is also an ADF constraint/shortcoming).
Cleaning up the empty file is also inefficient, and it is not an option if you are triggering off the event of the file being created, since it causes a false positive: the file is written before you can clean it up.
I tried checking whether the lookup returns more than 0 rows, and it is working.

Can I upload data from multiple data sources to Azure DW at the same time

Can I load data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but you have to define a copy activity for each data source being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there are two ways to do this. Regardless of the method, the data source does not matter to ADF, since your data movement happens via the copy activity, which looks at the dataset and takes care of firing the query against the related data source.
Method 1:
If all your transformations for a table can be done in a SELECT query on the source system, you can have a set of copy activities specifying SELECT statements. This is the simpler approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy the raw data from the source systems to staging tables in the SQL DW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets which are the output of Step 1 will be the input datasets to Step 2, in order to maintain consistency.
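For Step 2, the stored procedures can be fired from ADF's Stored Procedure activity or from any SQL client. As a rough illustration only, a pyodbc call might look like the sketch below; the server, database, credentials, and the procedure name dbo.usp_Load_DimCustomer are all assumptions.

    # Illustrative sketch of Step 2 only: invoking a transformation stored
    # procedure against the SQL DW instance. All names are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<your-server>.database.windows.net;"    # placeholder
        "DATABASE=<your-dw>;UID=<user>;PWD=<password>"  # placeholders
    )
    cursor = conn.cursor()

    # The (hypothetical) procedure reshapes rows from the staging tables into the target table.
    cursor.execute("EXEC dbo.usp_Load_DimCustomer")
    conn.commit()
    conn.close()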
