I am trying to do the following...
With Azure Data Factory, a pipeline copies a string from a JSON file in Blob Storage to Azure SQL.
I am facing the problem below...
The string copied to Azure SQL is displayed as "???" while the original string is "圃場1" (non-ASCII characters).
How do I properly copy the original string to Azure SQL? (Maybe I need to set up the encoding format within the LinkedService file.)
You have to set the correct encoding in the input dataset of your pipeline. You can do this in the format property, with type TextFormat and encodingName. Read more about these properties here: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage#dataset-properties
Your linked service is working fine, since you can already get data from your blob storage, so there is no need to change that.
Your format json would look something like this:
"format": {
"type": "TextFormat",
"encodingName": "gb2312"
}
In this example I used gb2312 because those characters look Chinese to me, but I'm not really sure; pick whichever encoding the source file was actually saved with. You can check the other supported encodings here: https://msdn.microsoft.com/library/system.text.encoding.aspx
Also reading this might be useful, to get to know a bit more about other text format properties: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#text-format
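For context, here is a minimal sketch of where that format block sits in a blob input dataset definition (the dataset, linked service, container and file names below are placeholders, not taken from your pipeline):
{
    "name": "InputJsonBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "LS_MyBlobStorage",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "fileName": "myfile.json",
            "format": {
                "type": "TextFormat",
                "encodingName": "gb2312"
            }
        }
    }
}
Swap the encodingName value for whatever encoding the source file was actually saved with; if it doesn't match, the characters are decoded incorrectly and end up as "???" in SQL.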
Hope this helped! :)
We need to read a file and post an XML payload to an HTTP endpoint via Azure Data Factory (ADF). We have the XML file in our blob storage. We are using a Lookup activity to read it, and we plan to put a Web activity after that to post it to the HTTP endpoint. However, the Lookup activity does not support XML output. Is there a way to read a file and send it in XML format to the next activity in Azure Data Factory?
You can use the xml() function, which is among the Conversion functions supported in ADF.
Check out the MS docs: xml function
Return the XML version for a string that contains a JSON object.
xml('<value>')
Parameter:
The string with the JSON object to convert. The JSON object must
have only one root property, which can't be an array. Use the
backslash character (\) as an escape character for the double
quotation mark (").
Example:
Sample source XML file stored in blob storage:
Solution:
@string(xml(json(string(activity('Lookup1').output.value[0]))))
Now, you can either use this as a string to store in a variable, or directly use the following dynamically in a Web activity payload:
@xml(json(string(activity('Lookup1').output.value[0])))
Check out the MS docs for more: xml function examples
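As a rough illustration (the JSON shape and names below are invented for the example, not taken from your file): if the first row returned by the lookup were
{"note": {"to": "Alice", "body": "Hello"}}
then
@xml(json(string(activity('Lookup1').output.value[0])))
would produce the equivalent of
<note><to>Alice</to><body>Hello</body></note>
which you can drop straight into the Web activity's Body as dynamic content (remember to set a Content-Type header such as application/xml on the Web activity).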
I have created a pipeline with a Copy activity, say, activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination). Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and an activity3 that loads the zip file (referenced by that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say, myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please NOTE that the JSON file name (MyJsonFile.json) is constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy data activity.
Output of lookup activity:
I have used string functions (substring, indexof, replace) in the Set Variable activity to extract the URL from the lookup output:
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Output value of the Set Variable activity:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
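Once myURLVar holds the URL, activity3 can pick it up through ordinary dynamic content. For example (the dataset name DS_HttpZipFile and its parameter fileUrl below are hypothetical), a parameterized HTTP/binary source dataset could receive the value from the Copy activity like this:
"inputs": [
    {
        "referenceName": "DS_HttpZipFile",
        "type": "DatasetReference",
        "parameters": {
            "fileUrl": "@variables('myURLVar')"
        }
    }
]
The dataset (or its HTTP linked service) would then use its fileUrl parameter to build the request URL for the zip file.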
I've been going round in circles trying to get what I thought would be a relatively trivial pipeline working in Azure Data Factory. I have a CSV file with a schema like this:
Id, Name, Color
1, Apple, Green
2, Lemon, Yellow
I need to transform the CSV into a JSON file that looks like this:
{"fruits":[{"Id":"1","Name":"Apple","Color":"Green"},{"Id":"2","Name":"Lemon","Color":"Yellow"}]
I can't find a simple example that helps me understand how to do this in ADF. I've tried a Copy activity and a data flow, but the furthest I've got is JSON output like this:
{"fruits":{"Id":"1","Name":"Apple","Color":"Green"}}
{"fruits":{"Id":"2","Name":"Lemon","Color":"Yellow"}}
Surely this is simple to achieve. I'd be very grateful if anyone has any suggestions. Thanks!
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#tabularhierarchical-source-to-hierarchical-sink
"When copying data from tabular source to hierarchical sink, writing to array inside object is not supported"
But if you set the File pattern under the Sink properties to 'Array of objects', you can get as far as this:
[{"Id":"1","Name":" Apple","Color":" Green"}
,{"Id":"2","Name":" Lemon","Color":" Yellow"}
]
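For reference, the 'Array of objects' pattern is a setting on the JSON sink. With the current JSON connector, the relevant part of the Copy activity sink looks roughly like this (only the format settings are shown; store settings omitted):
"sink": {
    "type": "JsonSink",
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}
That gets you the bare array shown above; the outer {"fruits": ...} wrapper is exactly the part the Copy activity cannot add, per the limitation quoted from the docs.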
Trying to load CSV files from the data lake (Gen2) to Azure Synapse by using Azure Data Factory. The source file has " (double quote) as an escape character. This falls outside the data limitations of connecting PolyBase directly to the Data Lake, so I set up the staged copy by following the documentation:
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "LS_StagedCopy",
"type": "LinkedServiceReference"
},
"path": "myContainer/myPath",
"enableCompression": false
}
After debugging the pipeline, I am still getting:
{Class=16,Number=107090,State=1,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line.,},],
I do see ADF creating a temporary folder in the path I supplied for the staged copy, but it looks like it is not performing the required transformation to load the data. Am I missing anything?
Link to the doc: Copy and transform data in Azure SQL Data Warehouse by using Azure Data Factory
Most likely the problem is your data. Check your delimiter, and hope it's not "," or something equally common. It's a frequent problem: when one column contains text with many "," characters, ADF will interpret each one as a new column.
Test it with a smaller, clean CSV and go from there.
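If those commas sit inside quoted fields, it also helps to make the quote and escape characters explicit on the source DelimitedText dataset so the values are not split. A minimal sketch of the relevant typeProperties (the file system, folder and file names are placeholders):
"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "myfilesystem",
        "folderPath": "myfolder",
        "fileName": "myfile.csv"
    },
    "columnDelimiter": ",",
    "quoteChar": "\"",
    "escapeChar": "\"",
    "firstRowAsHeader": true
}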
First time poster, long time reader.
A third-party provider is uploading CSV files once a day to a shared Azure Blob Storage. The files have a certain prefix with a timestamp in the filename and reside in the same directory, e.g. "dw_palkkatekijat_20170320T021". Every file will have all the data the previous one had, plus the newly added data from the previous day. I would like to import all the rows from all the files into a SQL table in an Azure SQL DB. This I can do.
The problem I have is that I don't know how to add the filename into a separate column in the table, so I can tell which file each row came from and only use the newest rows. I need to import all the files' contents and store all "versions" of the files. Is there a way I can send the filename as a parameter to a SQL stored procedure? Or is there any alternate way to handle this problem?
Thank you for your help.
In the current situation you've described you won't be able to get the exact file name. ADF isn't a data transformation service, so it doesn't give you this level of functionality... I wish it did!
However, there are a couple of options to get the file name, or something similar to use. None of them, I admit, are perfect!
Option 1 (Best option, I think!)
As you asked: pass a parameter to the SQL DB stored procedure. This is certainly possible using the ADF activity parameter attribute.
What to pass as a param?...
Well, if your source files in blob storage have a nicely defined date and time in the file name (which is what you already use in the input dataset definition), then pass that to the proc and store it in the SQL DB table. You can then work out when each file was loaded, which period it covers, and where the overlap with the previous file lies. Maybe?
You can access the time slice start for the dataset in the activity. Example JSON...
"activities": [
{
"name": "StoredProcedureActivityTemplate",
"type": "SqlServerStoredProcedure",
"inputs": [
{
"name": "BlobFile"
}
],
"outputs": [
{
"name": "RelationalTable"
}
],
"typeProperties": {
"storedProcedureName": "[dbo].[usp_LoadMyBlobs]",
"storedProcedureParameters": {
//like this:
"ExactParamName": "$$Text.Format('{0:yyyyMMdd}', Time.AddMinutes(SliceStart, 0))" //tweak the date format
}
}, //etc ....
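For completeness, the dated file name that the parameter mirrors would typically be declared on the BlobFile input dataset with partitionedBy, roughly like this (ADF v1 syntax; the folder path and exact file name pattern are guesses based on your example name):
"typeProperties": {
    "folderPath": "sharedcontainer/incoming",
    "fileName": "dw_palkkatekijat_{SliceDate}T021.csv",
    "partitionedBy": [
        {
            "name": "SliceDate",
            "value": {
                "type": "DateTime",
                "date": "SliceStart",
                "format": "yyyyMMdd"
            }
        }
    ],
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    }
}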
Option 2 (Loads of effort)
Create yourself a middle-man ADF custom activity that reads the file, plus the file name, and adds the value as a column.
Custom activities in ADF basically give you the extensibility to do anything, since you have to craft the data transformation behaviour yourself in C#.
I would recommend learning what's involved in using custom activities if you want to go down this route. It's a lot more effort, and an Azure Batch service will be required.
Option 3 (Total overkill)
Use an Azure Data Lake Analytics service! Taking the same approach as option 2, use U-SQL in Data Lake to parse the file and include the file name in the output dataset. In U-SQL you can pass a wildcard for the file name as part of the extractor and use it within the output dataset.
I brand this option as overkill because bolting on a complete data lake service just to read a filename is excessive. In reality, Data Lake could probably replace your SQL DB layer and give you the file name transformation for free.
By the way, you wouldn't need to use Azure Data Lake Storage to store your source files. You could give the analytics service access to the existing shared blob storage account, but you would only need it to support the analytics service.
Option 4
Have a rethink and use Azure Data Lake instead of Azure SQL DB?????
Hope this helps