Azure Data Factory to create an empty csv file - azure

I have a requirement where I want to check if file exist in a folder. If file exists then I want to skip else I want to create an empty csv file.
One way I have tried using Metadata activity and then using copy activity I am able to move empty file to the destination.
I want to check if there is any better way of creating empty csv file with on the fly without using copy activity?

Yes, there is another approach but it includes coding.
You can create a Azure Function in your preferred coding language and trigger it using Azure Data Factory Azure Function activity.
The Azure Function activity allows you to run Azure Functions in an Azure Data Factory or Synapse pipeline. To run an Azure Function, you must create a linked service connection. Then you can use the linked service with an activity that specifies the Azure Function that you plan to execute.
Learn more here.
In Azure Function, you can access the directory where you want to check the files availability and can also create/delete/update the csv files with schema based.

You can use a Copy Activity. Source: Write a SQL Query in-line that includes all the columns as NULL and a WHERE clause that will produce no data. Note that no tables are involved, and you can issue the query against any SQL source, including Serverless. Target: the csv file, include headers. Use an IF activity to create the file, if the metadata activity shows the file is absent.
Example query--
SELECT
CAST( NULL AS DATE) AS CreatedDate
, CAST( NULL AS INT) AS EmployeeId
, CAST( NULL AS VARCHAR(20)) AS FirstName
, CAST( NULL AS VARCHAR(50)) AS LastName
, CAST( NULL AS CHAR(20)) AS PostalCode
WHERE
1=2;

Related

Copy CSV File with Multiline Attribute with Azure Synapse Pipeline

I have a CSV File in the Following format which want to copy from an external share to my datalake:
Test; Text
"1"; "This is a text
which goes on on a second line
and on on a third line"
"2"; "Another Test"
I do now want to load it with a Copy Data Task in an Azure Synapse Pipeline. The result is the following:
Test; Text
"1";" \"This is a text"
"which goes on on a second line";
"and on on a third line\"";
"2";" \"Another Test\""
So, yo see, it is not handling the Multi-Line Text correct. I also do not see an option to handle multiline text within a Copy Data Task. Unfortunately i'm not able to use a DataFlow Task, because it is not allowing to run with an external Azure Runtime, which i'm forced to use, due to security reasons.
In fact, i'm of course not speaking about this single test file, instead i do have x thousands of files.
My settings for the CSV File look like follows:
CSV Connector Settings
Can someone tell me how to handle this kind of multiline data correctly?
Do I have any other options within Synapse (apart from the Dataflows)?
Thanks a lot for your help
Well turns out this is not possible with a CSV File.
The pragmatic solution is to use "binary" files instead, to transfer the CSV Files and only load and transform them later on with a Python Notebook in Synapse.
You can achieve this in azure data factory by iterating through all lines and check for delimiter in each line. And then, use string manipulation functions with set variable activities to convert multi-line data to a single line.
Look at the following example. I have a set variable activity with empty value (taken from parameter) for req variable.
In lookup, create a dataset with following configuration to the multiline csv:
In foreach, where I iterate each row by giving items value as #range(0,sub(activity('Lookup1').output.count,1)). Inside for each, I have an if activity with following condition:
#contains(activity('Lookup1').output.value[item()]['Prop_0'],';')
If this is true, then I concat the current result to req variable using 2 set variable activities.
temp: #if(contains(activity('Lookup1').output.value[add(item(),1)]['Prop_0'],';'),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],decodeUriComponent('%0D%0A')),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' '))
actual (req variable): #variables('val')
For false, I have handled the concatenation in the following way:
temp1: #concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' ')
actual1 (req variable): #variables('val2')
Now, I have used a final variable to handle last line of the file. I have used the following dynamic content for that:
#if(contains(last(activity('Lookup1').output.value)['Prop_0'],';'),concat(variables('req'),decodeUriComponent('%0D%0A'),last(activity('Lookup1').output.value)['Prop_0']),concat(variables('req'),last(activity('Lookup1').output.value)['Prop_0']))
Finally, I have taken copy data activity with a sample source file with 1 column and 1 row (using this to copy our actual data).
Now, take source file configuration as shown below:
Create an additional column with value as final variable value:
Create a sink with following configuration and select mapping for only above created column:
When I run the pipeline, I get the data as required. The following is an output image for reference.

Azure Data Factory - Azure SQL Managed Services incorrect Output column type

I have decided to try and use Azure Data Factory to replicate data from one SQL Managed Instance Database to another with some trimming of the data in the process.
I have set up two Datasets to each Database / Table imported the schema ok (these are duplicated so identical) created a dataflow with one as the source and updated the schema in the projection, added a simple AlterRow (column != 201) gave it the PK then I add the second dataset as the sink and for some reason in the mapping all the output columns are showing as 'string' but the input columns show correctly.
because of this the mapping fails as it thinks the input and output are not matching? I cant understand why both Schema's in the dataset show correctly and the projection in the dataflow for the source shows correctly but it thinks i am outputting to all string columns?
TIA
Here is an easy way to map a set of unknown incoming fields to a defined database table schema ... Add a Select transformation before your Sink. Paste this into the Script behind for the Select:
select(mapColumn(
each(match(true()))
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> automap
Now, in your Sink, just leave schema drift and automapping on.

DELETE using PreCopy Script in CopData activity

I have simple copy data activity with source and destination as a table in Azure Data Factory, Before inserting I'm having delete script in the pre-copy script option. The Delete should be done on the basis of parameters passed to the pipeline.
I tried this way but getting error.
DELETE FROM [dbo].[StgMetricLoad] where TransactionKey in(pipeline().parameters.TransactionKey)
Per my experience,you can't merge pipeline string parameter into sql string like that directly. This should be configured as dynamic content with #cancat built-in function.
I tested it in the Set Variable Activity:
#concat('DELETE FROM [dbo].[StgMetricLoad] where TransactionKey in(',
pipeline().parameters.keystring,
')')
Test Output:

How to troubleshoot - Azure DataFactory - Copy Data Destination tables have not been properly configured

I'm setting up a SQL Azure Copy Data job using Data Factory. For my source I'm selecting the exact data that I want. For my destination I'm selecting use stored procedure. I cannot move forward from the table mapping page as it reports 'one or more destination tables have been been properly configured'. From what I can tell. Everything looks good as I can manually run the stored procedure from SQL without an issue.
I'm looking for troubleshooting advice on how to solve this problem as the portal doesn't appear to provide any more data then the error itself.
Additional but unrelated question: What is the benefit from me doing a copy job in data factory vs just having data factory call a stored procedure?
I've tried executing the stored procedure on via SQL. I discovered one problem with that as I had LastUpdatedDate in the TypeTable but it isnt actually an input value. After fixing that I'm able to execute the SP without issue.
Select Data from Source
SELECT
p.EmployeeNumber,
p.EmailName,
FROM PersonFeed AS p
Create table Type
CREATE TYPE [person].[PersonSummaryType] AS TABLE(
[EmployeeNumber] [int] NOT NULL,
[EmailName] [nvarchar](30) NULL
)
Create UserDefined Stored procedure
CREATE PROCEDURE spOverwritePersonSummary #PersonSummary [person].[PersonSummaryType] READONLY
AS
BEGIN
MERGE [person].[PersonSummary] [target]
USING #PersonSummary [source]
ON [target].EmployeeNumber = [source].EmployeeNumber
WHEN MATCHED THEN UPDATE SET
[target].EmployeeNumber = [source].EmployeeNumber,
[target].EmailName = [source].EmailName,
[target].LastUpdatedDate = GETUTCDATE()
WHEN NOT MATCHED THEN INSERT (
EmployeeNumber,
EmailName,
LastUpdatedDate)
VALUES(
[source].EmployeeNumber,
[source].EmailName,
GETUTCDATE());
END
Datafactory UI when setting destination on the stored procedure reports "one or more destination tables have been been properly configured"
I believe the UI is broken when using the Copy Data. I was able to map directly to a table to get the copy job created then manually edit the JSON and everything worked fine. Perhaps the UI is new and that explains why all the support docs only refer only to the json? After playing with this more it looks like the UI sees the table type as schema.type, but it drops the schema for some reason. A simple edit in the JSON file corrects it.

Azure Data Factory V2 Copy Activity file filters

I am using Data Factory v2 and I currently have a simple copy activity which copies files from an FTP server to blob storage. The file names on this server are of the following form :
File_{Year}{Month}{Day}.zip
In order to download the most recent file I add this filter to my input dataset json file :
"fileName": {
"value": "#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.zip')",
"type": "Expression"
}
I now want to be able to download yesterday's file which is possible using adddays().
However I would like to be able to do this in the same copy activity and it seems that Data Factory v2 does not allow me to use the following kind of regular expression logic :
#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.zip') || #concat('File_', formatDateTime(adddays(utcnow(), -1), 'yyyyMMdd'), '.zip')
Is this possible or do I need a separate activity ?
It would seem strange to need a second activity since a Copy Activity can only take a single input but if the regex is simple enough, then multiple files are treated as a single input and if not then multiple files are treated as multiple inputs.
The '||' won't work since it will be evaluate as a single string.
But I can provided two solutions for this.
using a tumbling window daily trigger and set the start time as yesterday. So it will trigger two pipeline run.
using Foreach activity + copy activity. The foreach activity iterate an array to pass the yesterday and today to the copy activity.
Btw, you could just use string interpolation expression instead of concat. They are the same.
File_#{formatDateTime(utcnow(), 'yyyyMMdd')}.zip
I would suggest you to read about get metadata activity. This can be helpful in your scenario I think.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
You have itemName property, lastModified property, check it out.

Resources