Reading and writing an XML file using ADF lookup activity - azure

We need to read a file and post an XML payload to an HTTP endpoint via Azure Data Factory (ADF). We have the XML file in our blob storage and are using a Lookup activity to read it, and we plan to put a Web activity after that to post it to the HTTP endpoint. But the Lookup activity does not support XML output. Is there a way to read a file and send it in XML format to the next activity in Azure Data Factory?

You can use the xml() function, which is among the conversion functions supported in ADF.
Check out the MS docs: xml function
Return the XML version for a string that contains a JSON object.
xml('<value>')
Parameter:
The string with the JSON object to convert. The JSON object must have only one root property, which can't be an array. Use the backslash character (\) as an escape character for the double quotation mark (").
Example:
Sample source XML file stored in blob storage:
Solution:
@string(xml(json(string(activity('Lookup1').output.value[0]))))
Now, you can either use this as a string to store in a variable or use the expression below directly in a Web activity payload:
@xml(json(string(activity('Lookup1').output.value[0])))
Check out the MS docs for more: xml function examples
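For intuition, here is a small Python sketch of what that JSON-to-XML conversion conceptually does (the lookup row and tag names are made up for illustration; this is not the ADF implementation):

import xml.etree.ElementTree as ET

# Hypothetical single-row Lookup output: a JSON object with exactly one root property,
# which is what xml(json(string(...))) expects.
lookup_row = {"root": {"name": "item1", "value": "42"}}

def to_xml(tag, obj):
    """Build an XML element from a JSON-like structure of dicts and scalars."""
    element = ET.Element(tag)
    if isinstance(obj, dict):
        for key, value in obj.items():
            element.append(to_xml(key, value))
    else:
        element.text = str(obj)
    return element

root_tag, body = next(iter(lookup_row.items()))
print(ET.tostring(to_xml(root_tag, body), encoding="unicode"))
# <root><name>item1</name><value>42</value></root>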

Related

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination). Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and an activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please note the JSON file name (MyJsonFile.json) is a constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy data activity.
Output of the Lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:
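For illustration, here is roughly what that expression does, written as a Python sketch (the lookup payload below is a hypothetical stand-in for the real Lookup output):

import json

# Hypothetical Lookup output, shaped like activity('Lookup1').output.value
lookup_value = [{"file_url": "https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}]

# Same steps as the dynamic expression: flatten to a string, strip ", } and {,
# take the substring from the first occurrence of 'http' to the end, then drop the trailing ']'
flattened = json.dumps(lookup_value)
stripped = flattened.replace('"', '').replace('}', '').replace('{', '')
url = stripped[stripped.index('http'):].replace(']', '')
print(url)  # https://files.testwebsite.com/Downloads/TimeStampFileName.zip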
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
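Conceptually, because the Lookup output is already structured JSON, this is nothing more than a property lookup; in Python terms (with a hypothetical Lookup payload):

# Hypothetical shape of the Lookup output when "First row only" is enabled
lookup_output = {
    "firstRow": {
        "file_url": "https://files.testwebsite.com/Downloads/TimeStampFileName.zip"
    }
}

# The dynamic expression above is the ADF equivalent of this property access
file_url = lookup_output["firstRow"]["file_url"]
print(file_url)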

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names

I have a copy activity in Azure Data Factory with a Google BigQuery source.
I need to import the whole table (which contains nested fields - Records in BigQuery).
Nested fields get imported as follows (a string containing only data values):
"{\"v\":{\"f\":[{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":\"0\"}]}}"
Expected output would be something like:
{"nestedColName" : [{"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": null}, ...] }
I think this is a connector issue from Data Factory's side but am not sure how to proceed.
Have considered using Databricks to import data from GBQ directly and then saving the DataFrame to sink.
Have also considered querying for a subset of columns and using UNNEST where required but would rather not do this as Parquet handles both Array and Map types.
Anyone encountered this before / what did you do?
Solution used:
Databricks (Spark) connector for Google BigQuery:
https://docs.databricks.com/data/data-sources/google/bigquery.html
This preserves schemas and nested field names.
Preferring the simpler setup of the ADF BigQuery connector to Databricks's BigQuery support, I opted for a solution where I extract the data as JSON and 'massage' it into Parquet using Databricks:
Use a Copy activity to get data from BigQuery with all the data packed into a single JSON string field. Output format can be Parquet or JSON (I'm using Parquet). Use a BigQuery query like this:
select TO_JSON_STRING(t) as value from `<your BigQuery table>` as t
NOTE: The name of the field must be value. The df.write.text() text file writer writes the contents of the value column into each row of the text file, which is a JSON string in this case.
Run a Databricks Notebook activity with code like this:
# Read data and write it out as a text file to get the JSON. (Compression is optional.)
dfInput = spark.read.parquet(inputpath)
dfInput.write.mode("overwrite").option("compression", "gzip").text(tmppath)

# Read back as JSON to extract the correct schema.
dfTemp = spark.read.json(tmppath)
dfTemp.write.mode("overwrite").parquet(outputpath)
Use the output as is, or use a Copy activity to copy it to where you like.

Data Factory Data Flow sink file name

I have a data flow that merges multiple pipe-delimited files into one file and stores it in an Azure Blob container. I'm using a file pattern for the output file name: concat('myFile' + toString(currentDate('PST')), '.txt').
How can I grab the file name that's generated after the dataflow is completed? I have other activities to log the file name into a database, but not able to figure out how to get the file name.
I tried @{activity('Data flow1').output.filePattern} but it didn't help.
Thank you
You can use a Get Metadata activity to get the file name that is generated after the data flow.

Copy String by Azure Data Factory Failed

I am trying to do the following:
With Azure Data Factory, a pipeline copies a string from a JSON file in Blob Storage to Azure SQL.
I am facing the problem below:
The string copied to Azure SQL is displayed as "???" while the original string is "圃場1" (non-ASCII characters).
How do I properly copy the original string to Azure SQL? (Maybe I need to set up the encoding format within the linked service file.)
You have to set the correct encoding in the input dataset of your pipeline. You can do this in the format property, with type TextFormat and encodingName. Read more about these properties here: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage#dataset-properties
Your linked service is working fine, as you can get data from your blob storage, so no need to change that.
Your format JSON would look something like this:
"format": {
"type": "TextFormat",
"encodingName": "gb2312"
}
In this example I used gb2312 because I think those characters are Chinese, but I'm not really sure. You can check other encodings here: https://msdn.microsoft.com/library/system.text.encoding.aspx
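As a rough illustration of the symptom, here is a Python sketch (treating the text as Japanese and Shift-JIS as the source encoding is an assumption here, not something confirmed in the question):

# Pushing the text through an encoding that cannot represent it replaces the characters with '?',
# which is what ends up in the SQL column.
original = "圃場1"
print(original.encode("ascii", errors="replace"))        # b'??1'

# Round-tripping with an encoding that covers the characters keeps them intact.
# Match encodingName in the dataset to whatever encoding the source file was actually written with.
print(original.encode("shift_jis").decode("shift_jis"))  # 圃場1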
Also reading this might be useful, to get to know a bit more about other text format properties: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#text-format
Hope this helped! :)

How to convert an array of JSON to BLOB type in node.js

I have an array containing JSON which I have to insert into a Cassandra database table that has a column of data type blob.
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/blob_r.html
The above link says: for example, bigintAsBlob(3) is 0x0000000000000003 and blobAsBigint(0x0000000000000003) is 3.
But I can't make it work for my scenario.
I am using the helenus driver.
Convert the JSON to a string and save it as a blob using the textAsBlob(content) function in Cassandra.
To load, just read that blob back to a string using blobAsText(content), and then decode it as JSON.
To clarify, BLOB stands for Binary Large Object. You can also alter the schema to use a varchar or text datatype instead of blob.
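For reference, a minimal sketch of that round trip using the DataStax Python driver rather than the helenus driver from the question (the keyspace, table, and column names are assumptions, and the string-to-bytes conversion is done client-side instead of with textAsBlob):

import json
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical table: CREATE TABLE demo.events (id int PRIMARY KEY, payload blob);
session = Cluster(["127.0.0.1"]).connect("demo")

records = [{"event": "login", "user": 1}, {"event": "logout", "user": 1}]

# Serialize the JSON array to a string and store its UTF-8 bytes in the blob column.
session.execute(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    (1, json.dumps(records).encode("utf-8")),
)

# Read the blob back as bytes, decode to a string, and parse the JSON again.
row = session.execute("SELECT payload FROM events WHERE id = %s", (1,)).one()
restored = json.loads(bytes(row.payload).decode("utf-8"))
print(restored)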
