Data Factory copy CSV to SQL cannot convert empty data - Azure

I encountered the following errors, caused by empty data, when building a very basic Copy Data task from a file share to Azure SQL:
ErrorCode=TypeConversionFailure,Exception occurred when converting
value '' for column name 'EndDate' from type 'String' (precision:,
scale:) to type 'DateTime' (precision:255, scale:255). Additional
info: String was not recognized as a valid DateTime.
And here is another one that I believe has the same cause:
ErrorCode=TypeConversionFailure,Exception occurred when converting
value '' for column name 'ContractID' from type 'String' (precision:,
scale:) to type 'Guid' (precision:255, scale:255). Additional info:
Unrecognized Guid format.
All I need is for empty data to be treated as NULL when copying to the SQL tables. The only option I have found is "Null value" in my CSV dataset, and it is set to nothing by default.
Below is the JSON of the CSV dataset:
{
    "name": "CSV",
    "properties": {
        "linkedServiceName": {
            "referenceName": "CSV",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "FileName": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureFileStorageLocation",
                "fileName": {
                    "value": "@dataset().FileName",
                    "type": "Expression"
                },
                "folderPath": "output"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "quoteChar": "\""
        },
        "schema": []
    }
}
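For reference, the "Null value" option mentioned above corresponds to the nullValue property of a DelimitedText dataset; if it were set explicitly, it would appear among the typeProperties above as, for example, "nullValue": "".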
The CSV file does use double quotation marks as the text qualifier, and the empty data in the source files looks like this:
"b139fe4d-f48a-4158-8196-a43500b3bf02","19601","Bar","2015/02/02","","","","","","","","","","",""

Because the Copy activity can't process the empty values, we need to use a Data Flow to convert those fields to NULL values.
Here is my test using your example:
I created a table in Azure SQL
Create table TestNull(
Column1 UNIQUEIDENTIFIER null,
Column2 varchar(50) null,
Column3 varchar(60) null,
Column4 DateTime null,
Column5 varchar(50) null,
Column6 varchar(50) null
)
In ADF, we can use a Derived Column transformation to convert the empty values to NULL. The expression iifNull(Column_1, toString(null())) checks the field and, if it has no value, replaces it with a NULL value.
In the sink, we should set the mapping.
It will insert NULL value into the table.
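As a rough sketch in data flow script form (column names assumed to match the TestNull table above; the stream names are illustrative), the Derived Column step could look like the following. If the empty CSV fields arrive as empty strings rather than NULLs, compare against '' instead, e.g. iif(Column1 == '', toString(null()), Column1).
source1 derive(Column1 = iifNull(Column1, toString(null())),
    Column4 = iifNull(Column4, toString(null()))) ~> DerivedColumn1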

Related

I have to write all file names from an ADLS folder into a CSV file, but after a successful pipeline run the data is not reflected in the destination CSV file

Let's suppose there are 12 folders in my container, so I have to copy the folder names to a CSV file.
In the first step, I used a Get Metadata activity to get the folder names from the container.
In the second step, I used a ForEach activity and passed @activity('Get Metadata1').output.childItems as the items.
a) Inside the ForEach, I used an Append Variable activity to append @item().name to a variable Filename, as shown in the screenshot. So the Filename variable is of array type and is used to store an array of folder names in the container.
In the third step, I used a Copy activity; it copies the folder names from the Filename variable populated by the Append Variable activity and stores the data in the sink (a CSV file).
a) The source dataset is a dummy CSV file.
b) Then I checked the mapping.
Error:
After this, when I debug the pipeline, I am not able to see any folder names in my storage location.
You have to deselect the "First row as header" option in your source dataset. Also, change the quote character and escape character to none. The data will then be written successfully to your sink file, as shown below.
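For reference, a sketch of how those settings might look in the source dataset's typeProperties after the change (property names as in the dataset JSON shown later in this answer):
"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data"
    },
    "columnDelimiter": ",",
    "escapeChar": "",
    "firstRowAsHeader": false,
    "quoteChar": ""
}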
However, if you want to write all the file names to a single column, you can use the following procedure instead:
I have the following folders in my source:
In the dummy source file, I have the following data:
The following is the source dataset JSON:
{
    "name": "source1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "adls",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "data"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "quoteChar": "\""
        },
        "schema": [
            {
                "type": "String"
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}
Now, after using the Append Variable activity to collect all the folder names in a single array, use the following dynamic content in a new additional column named folder_names.
@join(variables('req'),decodeUriComponent('%0D%0A'))
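In the Copy activity source JSON, that additional column would look something like this (a sketch; the variable name req is taken from the expression above):
"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {
            "name": "folder_names",
            "value": {
                "value": "@join(variables('req'),decodeUriComponent('%0D%0A'))",
                "type": "Expression"
            }
        }
    ]
}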
The following is the sink dataset JSON:
{
    "name": "output",
    "properties": {
        "linkedServiceName": {
            "referenceName": "adls",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "op.csv",
                "fileSystem": "output"
            },
            "columnDelimiter": ",",
            "escapeChar": "",
            "firstRowAsHeader": true,
            "quoteChar": ""
        },
        "schema": []
    }
}
When I run the pipeline, I get the data as follows:

Dynamic Data Flow to separate unique records in Azure Data Factory

I have a requirement to read Parquet files dynamically and extract unique records. Each file can have one or more key columns.
Assuming the files are going to have one key column, I have designed the data flow below with an ID parameter.
In the Aggregate transformation, I am grouping by the ID column,
also allowing all other columns to pass through.
Note: observe that the column is being read as ID and not AddressID.
In the next step, in the Select transformation, I am trying to rename this ID to AddressID (using the parameter value).
The output looks like this:
I have tried hardcoding the value (AddressID) in the name, and it works.
Can someone help me with how to rename this ID to AddressID (the parameter value, which is the key column name) dynamically?
Also, the above scenario works when there is one key column.
Is it possible to use Azure Data Factory to handle a scenario where there is more than one key column and process it dynamically?
Depending on this, we will use ADF or ADB.
Data Flow Code:
{
    "name": "RemoveDuplicateRows",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [
                {
                    "dataset": {
                        "referenceName": "DS_Parquet_DF",
                        "type": "DatasetReference"
                    },
                    "name": "source1"
                }
            ],
            "sinks": [
                {
                    "dataset": {
                        "referenceName": "DS_Parquet_Cleaned",
                        "type": "DatasetReference"
                    },
                    "name": "sink1"
                }
            ],
            "transformations": [
                {
                    "name": "Aggregate1"
                },
                {
                    "name": "Select1"
                }
            ],
            "script": "parameters{\n\tID as string ('AddressID')\n}\nsource(allowSchemaDrift: true,\n\tvalidateSchema: false,\n\tformat: 'parquet',\n\tpartitionBy('roundRobin', 2)) ~> source1\nsource1 aggregate(groupBy(ID = byName($ID)),\n\teach(match(name!=$ID), $$ = first($$))) ~> Aggregate1\nAggregate1 select(mapColumn(\n\t\teach(match(name=='ID'),\n\t\t\t'AddressID' = $$),\n\t\teach(match(name!='ID'))\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> Select1\nSelect1 sink(allowSchemaDrift: true,\n\tvalidateSchema: false,\n\tformat: 'parquet',\n\ttruncate: true,\n\tpartitionBy('roundRobin', 2),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> sink1"
        }
    }
}
DataFlow Script
parameters{
    ID as string ('AddressID')
}
source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    partitionBy('roundRobin', 2)) ~> source1
source1 aggregate(groupBy(ID = byName($ID)),
    each(match(name!=$ID), $$ = first($$))) ~> Aggregate1
Aggregate1 select(mapColumn(
        each(match(name=='ID'),
            'AddressID' = $$),
        each(match(name!='ID'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> Select1
Select1 sink(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    truncate: true,
    partitionBy('roundRobin', 2),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> sink1

Azure Data Factory CSV with double quotes

I have a pipeline that retrieves an FTP-hosted CSV file. It is comma delimited with double quotes as the text qualifier. The issue occurs when a string is encapsulated in double quotes but the string itself contains double quotes.
string example: "Spring Sale" this year.
How it looks in the CSV (preceded and followed by null columns):
"","""Spring Sale"" this year",""
SSIS handles this fine, but Data Factory wants to turn it into an extra column that isn't separated by a comma. If I remove the extra quotes on this line, it works fine.
Is there a way around this besides altering the source?
I got this to work with the Azure Data Factory Copy activity by setting the escape character to a double quote (").
This was based on a file matching your spec:
"","""Spring Sale"" this year",""
and it also worked when inserting into an Azure SQL Database table. The sample dataset JSON:
{
    "name": "DelimitedText1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "linkedService2",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "quotes.txt",
                "container": "someContainer"
            },
            "columnDelimiter": ",",
            "escapeChar": "\"",
            "quoteChar": "\""
        },
        "schema": [
            {
                "name": "Prop_0",
                "type": "String"
            },
            {
                "name": "Prop_1",
                "type": "String"
            },
            {
                "name": "Prop_2",
                "type": "String"
            }
        ]
    }
}
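With both quoteChar and escapeChar set to the double quote, the doubled quotes inside the quoted field are read as literal quote characters, so the sample row should parse into the three columns roughly as follows (illustrative only):
Prop_0: (empty)
Prop_1: "Spring Sale" this year
Prop_2: (empty)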
Maybe the example file is too simple, but it did work for me in this configuration.
Alternatively, just use SSIS and host it in Data Factory.

NULLs in file output are \N and I want them to be empty

I have a data factory that reads from a table and stores the output as a CSV in Blob Storage.
I have noticed that instead of leaving a NULL field blank, it inserts the NULL marker \N. The external system that is ingesting this can't handle \N.
Is there any way in my dataset to say leave NULLs blank?
Below are my dataset properties:
"typeProperties": {
"fileName": "MasterFile-{fileDateNameVariable}.csv",
"folderPath": "master-file-landing",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"firstRowAsHeader": true
},
"partitionedBy": [
{
"name": "fileDateNameVariable",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
}
]
},
Thanks in advance.
You could set the Null value to "" when you define your dataset. Please refer to my test.
Table data:
Output Dataset:
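As a sketch based on the typeProperties above (nullValue is the JSON property behind the "Null value" setting), the output dataset's format section would look something like this:
"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "firstRowAsHeader": true,
    "nullValue": ""
},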
Generated CSV file:
Hope it helps you.

Azure ADF sliceIdentifierColumnName is not populating correctly

I've set up an ADF pipeline using a sliceIdentifierColumnName, which worked well: it populated the field with a GUID as expected. Recently, however, this field stopped being populated; the refresh would work, but the sliceIdentifierColumnName field would have a value of NULL, or occasionally the load would fail as it attempted to populate the field with a value of 1, which causes the slice load to fail.
This change occurred at a specific point in time: before it, the pipeline worked perfectly; after it, it repeatedly failed to populate the field correctly. I'm sure no changes were made to the pipeline that would cause this to suddenly fail. Any pointers on where I should be looking?
Here is an extract of the pipeline source. I'm reading from a table in Amazon Redshift and writing to an Azure SQL table.
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from mytable where eventtime >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and eventtime < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\' ' , SliceStart, SliceEnd)"
},
"sink": {
"type": "SqlSink",
"sliceIdentifierColumnName": "ColumnForADFuseOnly",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftSomeName"
}
],
"outputs": [
{
"name": "AzureSQLDatasetSomeName"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 10,
"style": "StartOfInterval",
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 2
},
"name": "Activity-somename2Hour"
}
],
Also, here is the error output text
Copy activity encountered a user error at Sink:.database.windows.net side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'ColumnForADFuseOnly' contains an invalid value '1'.,Source=Microsoft.DataTransfer.Common,''Type=System.ArgumentException,Message=Type of value has a mismatch with column typeCouldn't store <1> in ColumnForADFuseOnly Column.
Expected type is Byte[].,Source=System.Data,''Type=System.ArgumentException,Message=Type of value has a mismatch with column type,Source=System.Data,'.
Here is part of the source dataset; it's a table with all data types defined as strings.
{
    "name": "AmazonRedshiftsomename_2hourly",
    "properties": {
        "structure": [
            {
                "name": "eventid",
                "type": "String"
            },
            {
                "name": "visitorid",
                "type": "String"
            },
            {
                "name": "eventtime",
                "type": "Datetime"
            }
        ]
    }
}
Finally, the target table is identical to the source table, mapping each column name to its counterpart in Azure, with the exception of the additional column in Azure named
[ColumnForADFuseOnly] binary NULL,
It is this column which is now being populated with either NULLs or 1.
thanks,
You need to define [ColumnForADFuseOnly] as binary(32); binary with no length modifier defaults to a length of 1 and thus truncates your slice identifier.
From the SQL Server documentation for binary and varbinary: "When n is not specified in a data definition or variable declaration statement, the default length is 1. When n is not specified with the CAST function, the default length is 30."
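As a sketch (the table name below is hypothetical), the fix on the Azure SQL side would be along these lines:
-- Hypothetical target table name; widen the slice identifier column so the GUID-based value is not truncated
ALTER TABLE dbo.MyTargetTable
    ALTER COLUMN ColumnForADFuseOnly binary(32) NULL;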
