Azure Data Factory CSV with double quotes

I have a pipeline that retrieves an FTP-hosted CSV file. It is comma delimited with double-quote text qualifiers. The problem occurs when a string is encapsulated in double quotes but the string itself also contains double quotes.
string example: "Spring Sale" this year.
How it looks in the CSV (preceded and followed by two empty columns):
"","""Spring Sale"" this year",""
SSIS handles this fine, but Data Factory splits the value into an extra column that isn't separated by a comma. If I remove the extra quotes from this line, it works fine.
Is there a way around this besides altering the source?

I got this to work by setting the escape character to a double quote (") on the source dataset used by the Azure Data Factory copy activity.
This was based on a file as per your spec:
"","""Spring Sale"" this year",""
and it also worked as an insert into an Azure SQL Database table. The sample dataset JSON:
{
    "name": "DelimitedText1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "linkedService2",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "quotes.txt",
                "container": "someContainer"
            },
            "columnDelimiter": ",",
            "escapeChar": "\"",
            "quoteChar": "\""
        },
        "schema": [
            {
                "name": "Prop_0",
                "type": "String"
            },
            {
                "name": "Prop_1",
                "type": "String"
            },
            {
                "name": "Prop_2",
                "type": "String"
            }
        ]
    }
}
Maybe the example file is too simple, but it did work for me with this configuration.
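This works because the sample file escapes an embedded quote by doubling it, so quoteChar and escapeChar can both be the double quote. If a source file instead escaped quotes with a backslash, a minimal sketch of the dataset typeProperties (a hypothetical variant, not the file from the question) would be:
    "typeProperties": {
        "location": {
            "type": "AzureBlobStorageLocation",
            "fileName": "quotes.txt",
            "container": "someContainer"
        },
        "columnDelimiter": ",",
        "escapeChar": "\\",
        "quoteChar": "\""
    }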
Alternatively, just use SSIS and host it in Data Factory.

Related

I have to write all folder names from an ADLS container into a CSV file, but after a successful pipeline run the data is not reflected in the destination CSV file

Let's suppose there are 12 folders in my container, so I have to copy the folder names to a CSV file.
In the first step I used a Get Metadata activity to get the folder names from the container.
In the second step I used a ForEach activity and passed @activity('Get Metadata1').output.childItems as the items.
a) Inside the ForEach I used an Append Variable activity and appended @item().name to a variable Filename. So the Filename variable is of array type and holds the array of folder names in the container.
In the third step I used a Copy activity; it copies the folder names from the Filename variable populated by the Append Variable activity and writes them to the sink (a CSV file).
a) The source dataset is a dummy CSV file.
b) Then I checked the mapping.
Error
After this, when I debug the pipeline, I am not able to see any folder names in my storage location.
You have to deselect the first row as header option in your source dataset, and change the quote character and escape character to none. The data will then be written successfully to your sink file.
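For reference, a minimal sketch of what those options look like in the DelimitedText source dataset JSON (only the properties mentioned above; the rest of your dataset stays unchanged):
    "typeProperties": {
        "columnDelimiter": ",",
        "firstRowAsHeader": false,
        "quoteChar": "",
        "escapeChar": ""
    }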
However, if you want to write all the file names to a single column, you can use the following procedure instead:
I have the following folders in my source:
In the dummy source file, I have the following data:
The following is the source dataset JSON:
{
    "name": "source1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "adls",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "data"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "quoteChar": "\""
        },
        "schema": [
            {
                "type": "String"
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}
Now, after using the Append Variable activity to get all the folder names into a single array, use the following dynamic content for a new additional column folder_names in the copy activity source.
@join(variables('req'),decodeUriComponent('%0D%0A'))
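In the copy activity JSON, that additional column sits on the source. A rough sketch (assuming the array variable is named req, as in the expression above; store and format settings omitted):
    "source": {
        "type": "DelimitedTextSource",
        "additionalColumns": [
            {
                "name": "folder_names",
                "value": {
                    "value": "@join(variables('req'),decodeUriComponent('%0D%0A'))",
                    "type": "Expression"
                }
            }
        ]
    }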
The following is the sink dataset JSON:
{
    "name": "output",
    "properties": {
        "linkedServiceName": {
            "referenceName": "adls",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "op.csv",
                "fileSystem": "output"
            },
            "columnDelimiter": ",",
            "escapeChar": "",
            "firstRowAsHeader": true,
            "quoteChar": ""
        },
        "schema": []
    }
}
When I run the pipeline, I get data like the following:

Index Out of Range Error When Creating SnowFlake Linked Service in Azure Data Factory

I am passing the required credentials and parameters, but I get the error:
The value of the property 'index' is invalid: 'Index was out of range.
Must be non-negative and less than the size of the collection.
Parameter name: index'. Index was out of range. Must be non-negative
and less than the size of the collection. Parameter name: index
Activity ID: 36a4265d-3607-4472-8641-332f5656661d.
I had the same issue; the password contained a ' and that was causing the trouble. I changed the password to one with no symbols and it works like a charm.
It seems the UI doesn't generate the linked service correctly. Using the Microsoft Docs example JSON, I received the same index error when attempting to create the linked service. If I remove the password from the connection string and add it as a separate property, I am able to successfully generate the linked service.
Microsoft Docs Example (Doesn't Work)
{
    "name": "SnowflakeLinkedService",
    "properties": {
        "type": "Snowflake",
        "typeProperties": {
            "connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=<username>&password=<password>&db=<database>&warehouse=<warehouse>&role=<myRole>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
Working Example
{
    "name": "SnowflakeLinkedService",
    "properties": {
        "type": "Snowflake",
        "typeProperties": {
            "connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=<username>&db=<database>&warehouse=<warehouse>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
We hit this same issue today; it was because our password had an ampersand (&) at the end. This seemed to mess up the connection string, as it then contained this:
&password=abc123&&role=MyRole
Changing the password to one without an ampersand fixed it.
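If the password has to keep special characters such as & or ', another option is to keep it out of the connection string entirely and reference it from Azure Key Vault, in the same spirit as the separate-password example above. A hedged sketch, assuming a Key Vault linked service and secret already exist:
    "typeProperties": {
        "connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=<username>&db=<database>&warehouse=<warehouse>",
        "password": {
            "type": "AzureKeyVaultSecret",
            "store": {
                "referenceName": "<Key Vault linked service name>",
                "type": "LinkedServiceReference"
            },
            "secretName": "<secret name>"
        }
    }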

Azure Data Factory - Dynamic Account information - Parameterization of Connection

The documentation demonstrates how to create a parameter for a linked service, but not how to actually pass in that parameter from a dataset or activity. Basically, the connection string is coming from a Lookup/ForEach loop and I want to connect to a storage table.
The linked service looks like this. The connection test works when a correct parameter is passed in:
{
    "name": "StatsStorage",
    "properties": {
        "type": "AzureTableStorage",
        "parameters": {
            "connectionString": {
                "type": "String"
            }
        },
        "annotations": [],
        "typeProperties": {
            "connectionString": "@{linkedService().connectionString}"
        }
    }
}
The dataset is the following. I'm struggling to determine how to set the connectionString parameter for the connection. The dataset has two parameters: the connection string from the DB and the table name that it needs to connect to:
{
    "name": "TestTable",
    "properties": {
        "linkedServiceName": {
            "referenceName": "StatsStorage",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "ConnectionString": {
                "type": "string"
            },
            "TableName": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "AzureTable",
        "schema": [],
        "typeProperties": {
            "tableName": {
                "value": "@dataset().TableName",
                "type": "Expression"
            }
        }
    }
}
How do I set the connection string on the connection?
First, you can't make the whole connection string an expression. You need to provide the accountName and accountKey separately. Refer to this post about how to do it: How to provide connection string dynamically for azure table storage/blob storage in Azure data factory Linked service.
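For illustration, a sketch of a linked service parameterized on the account name and key rather than the whole connection string (the accountName and accountKey parameter names here are hypothetical):
{
    "name": "StatsStorage",
    "properties": {
        "type": "AzureTableStorage",
        "parameters": {
            "accountName": {
                "type": "String"
            },
            "accountKey": {
                "type": "String"
            }
        },
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=@{linkedService().accountName};AccountKey=@{linkedService().accountKey}"
        }
    }
}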
Then, if you are using the ADF UI, it will guide you through providing values for the linked service parameters. For example, if you have two dataset parameters, you could specify them as follows.
If you want to see the JSON code, you can click the code icon in the top left corner.
I am using azure blob as an example, but the azure table is almost the same.
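If you open that code view, the dataset's linked service reference carries the parameter values. A rough sketch, assuming the parameterized linked service above and hypothetical AccountName/AccountKey dataset parameters in place of the original ConnectionString parameter:
    "linkedServiceName": {
        "referenceName": "StatsStorage",
        "type": "LinkedServiceReference",
        "parameters": {
            "accountName": {
                "value": "@dataset().AccountName",
                "type": "Expression"
            },
            "accountKey": {
                "value": "@dataset().AccountKey",
                "type": "Expression"
            }
        }
    }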
Hope it could help.

How to insert data to bigquery table with custom fields with NodeJS?

I'm using the npm BigQuery module for inserting data into BigQuery. I have a custom field, say params, which is of type RECORD and accepts any int, float, or string value as a key-value pair. How can I insert into such fields?
I looked into this, but could not find anything useful: https://cloud.google.com/nodejs/docs/reference/bigquery/1.3.x/Table#insert
If I understand correctly, you are asking for a map with an ANY TYPE value, which is not supported in BigQuery.
You can instead have a map whose value carries its type information, using a repeated record like the schema below.
Your insert code then needs to pick the correct *_value field to set.
{
    "name": "map_field",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
        {
            "name": "key",
            "type": "STRING"
        },
        {
            "name": "int_value",
            "type": "INTEGER"
        },
        {
            "name": "string_value",
            "type": "STRING"
        },
        {
            "name": "float_value",
            "type": "FLOAT"
        }
    ]
}
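For illustration, a row passed to the Node.js client's table.insert() could then look like this (a sketch assuming the schema above; each entry sets only the *_value field that matches its type):
{
    "map_field": [
        { "key": "campaign", "string_value": "Spring Sale" },
        { "key": "discount", "float_value": 0.15 },
        { "key": "clicks", "int_value": 42 }
    ]
}
The *_value fields that are not set for a given entry are simply left NULL.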

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script which I am trying to execute in the pipeline.
OutputDataset:
{
    "name": "OutputDataLakeTable",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "LinkedServiceDestination",
        "typeProperties": {
            "folderPath": "scripts/"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
Pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "SearchLogProcessing.txt",
                    "scriptPath": "scripts\\",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/demo/SearchLog.txt",
                        "out": "/scripts/Result.txt"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataLakeTable"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataLakeTable"
                    }
                ],
                "policy": {
                    "timeout": "06:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "CopybyU-SQL",
                "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
            }
        ],
        "start": "2017-01-03T12:01:05.53Z",
        "end": "2017-01-03T13:01:05.53Z",
        "isPaused": false,
        "hubName": "denojaidbfactory_hub",
        "pipelineMode": "Scheduled"
    }
}
Here is my U-SQL script which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Text(delimiter:'|');
@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your activity is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully, so you also need an AzureStorage linked service, for example:
{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
        }
    }
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "adlascripts\\SearchLogProcessing.txt",
                    "scriptLinkedService": "StorageLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/input/SearchLog.tsv",
                        "out": "/output/Result.tsv"
                    }
                },
    ...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found it by googling, but there should be a link on the webpage. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
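Applied to the pipeline in the question, the activity's typeProperties would then look roughly like this (a sketch keeping the original parameter values; add scriptLinkedService as in the previous answer if the script lives in Blob storage):
    "typeProperties": {
        "scriptPath": "scripts\\SearchLogProcessing.txt",
        "degreeOfParallelism": 3,
        "priority": 100,
        "parameters": {
            "in": "/demo/SearchLog.txt",
            "out": "/scripts/Result.txt"
        }
    }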
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM @in
        USING Extractors.Text(delimiter:'|');
    @rs1 =
        SELECT Start, Region, Duration
        FROM @searchlog
        WHERE Region == "kota";
    OUTPUT @rs1
    TO @out
    USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
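In the pipeline from the question, that means the activity's typeProperties carry an inline script instead of scriptPath, roughly (a sketch reusing the original parameter values):
    "typeProperties": {
        "script": "master.dbo.sp_test()",
        "degreeOfParallelism": 3,
        "priority": 100,
        "parameters": {
            "in": "/demo/SearchLog.txt",
            "out": "/scripts/Result.txt"
        }
    }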
