Is it possible to use U-SQL managed tables as output datasets in Azure Data Factory?

I have a small ADF pipeline that copies a series of files from an Azure Storage Account to an Azure Data Lake account. As a final activity in the pipeline I want to run a U-SQL script that uses the copied files as inputs and outputs the result to a U-SQL managed table.
The U-SQL script basically extracts the data from the copied files, applies some transformations and then INSERTs the result into an existing U-SQL managed table.
How (if possible) can I add the U-SQL table as an output dataset in Azure Data Factory?

You cannot currently add a U-SQL internal table as an output dataset in Azure Data Factory (ADF). A similar question came up recently here and the answer from Michael Rys (the "father" of U-SQL) was "I know that the ADF team has a work item to do this for you."
You could, however, use Azure Data Factory to run a parameterised U-SQL script, where the input parameter is the file path. This would achieve a similar result.
Example pipeline from a recent question:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "adlascripts\\SearchLogProcessing.txt",
                    "scriptLinkedService": "StorageLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/input/SearchLog.tsv",
                        "out": "/output/Result.tsv"
                    }
                },
...
Basically the U-SQL script goes from:
@searchlog =
    EXTRACT ...
    FROM @in
    USING Extractors.Tsv();
to:
@searchlog =
    EXTRACT ...
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();
which I think achieves the same thing you want.
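For completeness, here is a rough sketch of what the parameterised script itself could look like when it writes into the managed table. The column names and the table name dbo.SearchLog are assumptions for illustration, not taken from the original script:
// The "in" parameter passed from the ADF activity overrides this default value.
DECLARE EXTERNAL @in string = "/input/SearchLog.tsv";

// Extract from the copied file (the schema here is assumed).
@searchlog =
    EXTRACT UserId int,
            Query  string
    FROM @in
    USING Extractors.Tsv();

// Apply the transformation and insert into the existing managed table.
INSERT INTO dbo.SearchLog
SELECT UserId,
       Query
FROM @searchlog;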

Related

Azure Data Factory Pipeline - Store single-value source query output as a variable to then use in Copy Data activity

I am looking to implement an incremental table loading pipeline in ADF. I want to execute a query to get the latest timestamp from the table in an Azure SQL database. Then, store this value as a variable in ADF so I can then reference it in the "Source" query of a Copy Data activity.
The goal is to only request data from an API with a timestamp greater than the latest timestamp in the SQL table.
Is this functionality possible within ADF pipelines, or do I need to look at Azure Functions or Data Flows?
This is definitely possible with Data Factory. You could use the Lookup activity or a stored procedure, but the team just released the new Script activity, which can run your MAX(timestamp) query against the Azure SQL database.
It will return results like so:
{
    "resultSetCount": 1,
    "recordsAffected": 0,
    "resultSets": [
        {
            "rowCount": 1,
            "rows": [
                {
                    "MaxDate": "2018-03-20"
                }
            ]
...
}
Here is the expression to read this into a variable:
@activity('Script1').output.resultSets[0].rows[0].MaxDate
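If you want to hold that value in a pipeline variable for later activities, a minimal sketch of a Set Variable activity would look something like the following (the activity name SetMaxDate and the variable name MaxDate are placeholders):
{
    "name": "SetMaxDate",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "Script1",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "MaxDate",
        "value": {
            "value": "@activity('Script1').output.resultSets[0].rows[0].MaxDate",
            "type": "Expression"
        }
    }
}
The Copy activity source query (or the API's relative URL) can then reference @variables('MaxDate') to request only rows with a newer timestamp.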

Can I pull data from an existing Azure Storage Account table using ARM Templates?

I have an existing Azure Storage Account which has a table. This table has a few details that I need to use in my mainTemplate.json ARM file. Can I pull these values directly in the ARM template?
[concat(reference(resourceId('Microsoft.Storage/storageAccounts',parameters('storageAccountName'))).primaryEndpoints.table, parameters('tableName'))]
I have been using the above statement in the outputs section and it returns the table URI. Is there any way to get the values inside that table?
As suggested by Silent, by referring to this link, try using DeploymentScriptOutputs.
The script takes one parameter and outputs the parameter value. DeploymentScriptOutputs is used for storing the outputs.
Example:
"outputs": {
"result": {
"value": "[reference('runPowerShellInlineWithOutput').outputs.text]",
"type": "string"
}
}
In the outputs section, the value line shows how to access the stored values. Write-Output is used for debugging purposes. To learn how to access the output file, see Monitor and troubleshoot deployment scripts.
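For reference, a stripped-down sketch of the deploymentScripts resource behind that output is shown below. It follows the shape of the documentation sample; the API version, PowerShell version and inline script are illustrative rather than exact:
{
    "type": "Microsoft.Resources/deploymentScripts",
    "apiVersion": "2020-10-01",
    "name": "runPowerShellInlineWithOutput",
    "location": "[resourceGroup().location]",
    "kind": "AzurePowerShell",
    "properties": {
        "azPowerShellVersion": "8.3",
        "arguments": "[format(' -name {0}', parameters('name'))]",
        "scriptContent": "param([string] $name)\n$output = 'Hello {0}' -f $name\nWrite-Output $output\n$DeploymentScriptOutputs = @{}\n$DeploymentScriptOutputs['text'] = $output",
        "retentionInterval": "P1D"
    }
}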
Thank you @silent for your suggestion.

Query from Azure Comos DB and save to Azure Table Storage using Data Factory

I want to save C._ts+C.ttl as one entity in my Azure Table Storage. I do the following query in my Copy Activity:
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "#concat('SELECT (C.ts+C.ttl) FROM C WHERE (C.ttl+C._ts)<= ', string(pipeline().parameters.bufferdays))",
"type": "Expression"
},
"nestingSeparator": "."
},
I don't want to copy all the fields from my source (Cosmos DB) to my sink (Table Storage). I just want to store the result of this query as one value. How can I do that?
Based on my test, I presume you are getting a null value because the collection-level ttl affects each document but does not generate a ttl property within the document.
So when you execute SELECT c.ttl, c._ts FROM c, the ttl value comes back empty.
Document-level ttl is not defined; the documents simply follow the collection-level ttl.
You need to bulk-add a ttl property to each document so that you can transfer the _ts + ttl calculation results.
Your Copy Activity settings look good; just add an alias in the SQL, or set the name of the field via column mapping.
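For example, a sketch of the source query with an alias added (the alias name ttlExpiry is made up; the sink column mapping should use the same name):
"query": {
    "value": "@concat('SELECT (C._ts + C.ttl) AS ttlExpiry FROM C WHERE (C.ttl + C._ts) <= ', string(pipeline().parameters.bufferdays))",
    "type": "Expression"
}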
Hope it helps you.

Azure Data Factory specify custom output filename when copying to Blob Storage

I'm currently using ADF to copy files from an SFTP server to Blob Storage on a scheduled basis.
The filename structure is AAAAAA_BBBBBB_CCCCCC.txt.
Is it possible to rename the file before copying to Blob Storage so that I end up with a folder-like structure like below?
AAAAAA/BBBBBB/CCCCCC.txt
Here is what worked for me:
I created 3 parameters in my Blob Storage dataset: FileName, FileExtension and Timestamp.
I specified the name of my file and added the file extension; you can put anything in Timestamp just to satisfy the ADF requirement that a parameter can't be empty.
Next, click on the Connection tab and add the following expression in the file name box: @concat(dataset().FileName, dataset().Timestamp, dataset().FileExtension). This basically concatenates all the parameters so you end up with something like FileName_Timestamp_FileExtension.
Next, click on your pipeline, then select your Copy Data activity. Click on the Sink tab, find the Timestamp parameter under Dataset properties and set it to this expression: @pipeline().TriggerTime.
Finally, publish your pipeline and run/debug it. If it worked for me then I am sure it will work for you as well :)
With ADF V2, you can do this. First, use a Lookup activity to get all the filenames from your source.
Then chain a ForEach activity to iterate over the source file names. The ForEach activity contains a Copy activity. Both the source dataset and the sink dataset of the copy activity have parameters for filename and folder path.
You could use split and replace functions to generate the sink folder path and filename based on your source file names.
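For instance, with source filenames like AAAAAA_BBBBBB_CCCCCC.txt, the sink dataset parameters inside the ForEach could be set with expressions along these lines (the parameter names folderPath and fileName are assumptions):
"folderPath": {
    "value": "@concat(split(item().name, '_')[0], '/', split(item().name, '_')[1])",
    "type": "Expression"
},
"fileName": {
    "value": "@last(split(item().name, '_'))",
    "type": "Expression"
}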
First you have to get the filenames with a Get Metadata activity. You can then use them as a parameter in a Copy activity and rename the files.
As mentioned in the previous answer, you can use a replace function to do this:
{
    "name": "TgtBooksBlob",
    "properties": {
        "linkedServiceName": {
            "referenceName": "Destination-BlobStorage-data",
            "type": "LinkedServiceReference"
        },
        "folder": {
            "name": "Target"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "fileName": {
                "value": "@replace(item().name, '_', '\\')",
                "type": "Expression"
            },
            "folderPath": "data"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

Populate Azure Data Factory dataset from query

Cannot find an answer via Google, MSDN (and other Microsoft documentation), or SO.
In Azure Data Factory you can get data from a dataset by using a Copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but don't know how to get around the "tablename" property.
If there is a way, what would that method be?
Input is on-premises SQL Server; output is Azure SQL Database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
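As an illustration, here is a cut-down sketch of such a source (the table and column names are invented), where the Dataset still names a single table but the query joins two:
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT b.BuildId, b.StartTime, e.Message FROM dbo.tbl_Build AS b INNER JOIN dbo.tbl_BuildError AS e ON e.BuildId = b.BuildId"
}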
There's a way to move data from on-premises SQL Server to Azure SQL Database using Data Factory.
You can use the Copy Activity; check the code sample for your case, specifically the GitHub link to the ADF activity source.
Basically you need to create a Copy Activity whose typeProperties contain a SqlSource and a SqlSink, which look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only SELECTs from tables or views; table-valued functions will work as well.
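For example (the function name is hypothetical):
"SqlReaderQuery": "SELECT * FROM dbo.ufn_GetRecentBuilds(30)"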
