My goal is to pass data from one Azure SQL database (UserDB) to another Azure SQL database (DataWarehouse) through a stored procedure.
I have created two linked services, one for each database, and two datasets that I have doubts about.
The stored procedure in question collects data from a table, joins it with several other tables, and returns a result that should be stored in a table in the DataWarehouse.
The SP looks like this:
ALTER PROCEDURE [DataWarehouse].[Item_init]
AS
BEGIN
    SET NOCOUNT ON;

    SELECT Id, a.Name, Code, f.Name, s.Name, g.Name
    FROM Item.Item a
    JOIN Item.[Group] g ON g.idGroup = a.idGroup
    JOIN Item.Subfam s ON s.idSubfam = g.idSubfam
    JOIN Item.Fam f ON f.idFam = s.idFam
END
The dataset that collects data from the UserDB (which I suspect is not correct) looks like this:
{
    "name": "ds_SProcItem_init",
    "properties": {
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "UserTable",
        "typeProperties": {
            "tableName": "Item.Item"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
The other dataset:
{
    "name": "ds_DWItemOutput",
    "properties": {
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "DataWareHouse",
        "typeProperties": {
            "tableName": "Item"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
The pipeline that connects the two datasets is as follows:
{
    "name": "SprocItem_InitPipeline",
    "properties": {
        "activities": [
            {
                "type": "SqlServerStoredProcedure",
                "typeProperties": {
                    "storedProcedureName": "DataWarehouse.Item_init"
                },
                "inputs": [
                    {
                        "name": "ds_SProcItem_init"
                    }
                ],
                "outputs": [
                    {
                        "name": "ds_DWItemOutput"
                    }
                ],
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "SprocItem_Init"
            }
        ],
        "start": "2016-08-02T00:00:00Z",
        "end": "2016-08-02T05:00:00Z",
        "isPaused": false,
        "hubName": "pruebasaas_hub",
        "pipelineMode": "Scheduled"
    }
}
Could someone who knows this subject please help me?
Thanks!
Given the limits of Azure SQL DB, I suggest you use a copy activity here as well as the stored procedure. You need to handle this within the confines of how ADF wants to work. Remember this isn't SSIS :-)
If I were building the data factory, these are the steps I'd take...
1. For completeness, define datasets for each of the tables used by the stored procedure.
2. First pipeline: have an activity that calls the stored procedure, which does the joins of the input datasets and outputs to a new staging table (do a SQL INSERT INTO ... SELECT here; a sketch follows below) on the first Azure SQL DB instance.
3. Have an output dataset defined in ADF for the staging table (the proc result).
4. Second pipeline: have a copy activity with the staging table from point 3 as the input, outputting to the table on the second Azure SQL DB instance.
5. Again for completeness, an ADF dataset for the final destination table.
The copy activity bridges the gap where cross-database queries aren't possible and SQL Server linked servers don't exist.
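A minimal sketch of the reworked stored procedure, assuming a staging table named Item_staging exists in the same (UserDB) database; the table name, its column names, and the truncate-then-load pattern are assumptions:

ALTER PROCEDURE [DataWarehouse].[Item_init]
AS
BEGIN
    SET NOCOUNT ON;

    -- Clear the previous load so the staging table only holds the latest result.
    TRUNCATE TABLE [DataWarehouse].[Item_staging];

    -- Persist the joined result so ADF has a physical table to copy from.
    INSERT INTO [DataWarehouse].[Item_staging] (Id, ItemName, Code, FamName, SubfamName, GroupName)
    SELECT a.Id, a.Name, a.Code, f.Name, s.Name, g.Name
    FROM Item.Item a
    JOIN Item.[Group] g ON g.idGroup = a.idGroup
    JOIN Item.Subfam s ON s.idSubfam = g.idSubfam
    JOIN Item.Fam f ON f.idFam = s.idFam;
END

The copy activity in the second pipeline then reads Item_staging as its input and writes to the Item table in the DataWarehouse database.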
Picture to help...
(Please forgive the poor paint skills)
Make sense? :-)
Good, crack on.
With Data Factory V2 I'm trying to implement a data copy flow from one Azure SQL database to another.
I have mapped all the columns of the source table to the sink table, but the sink table has an additional empty column where I would like to store the pipeline run time.
Does anyone know how to fill this column in the sink table without it being present in the source table?
Below is the code of my copy pipeline:
{
    "name": "FLD_Item_base",
    "properties": {
        "activities": [
            {
                "name": "Copy_Team",
                "description": "copytable",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "preCopyScript": "TRUNCATE TABLE Team_new"
                    },
                    "enableStaging": false,
                    "dataIntegrationUnits": 0,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Code": "Code",
                            "Name": "Name"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "Team",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "Team_new",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}
In my sink table I already have the column data_load where I would like to insert the pipeline execution date, but I have not currently mapped it.
Based on your situation, please configure a SQL Server stored procedure in your SQL sink as a workaround.
Please follow these steps from the documentation:
Step 1: Configure your Sink dataset to point at the target table.
Step 2: Configure the Sink section of the copy activity as follows:
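Roughly like this (a sketch only; the procedure and table type names come from steps 3 and 4 below):

"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "convertCsv",
    "sqlWriterTableType": "testType"
}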
Step 3: In your database, define the table type with the same name as sqlWriterTableType. Notice that the schema of the table type should be the same as the schema returned by your input data.
CREATE TYPE [dbo].[testType] AS TABLE(
    [ID] [varchar](256) NOT NULL,
    [EXECUTE_TIME] [datetime] NOT NULL
)
GO
Step 4: In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It handles input data from your specified source and merges it into the output table. Notice that the parameter name of the stored procedure should be the same as the "tableName" defined in the dataset.
CREATE PROCEDURE convertCsv @ctest [dbo].[testType] READONLY
AS
BEGIN
    MERGE [dbo].[adf] AS target
    USING @ctest AS source
    ON (1 = 1)
    WHEN NOT MATCHED THEN
        INSERT (id, executeTime)
        VALUES (source.ID, GETDATE());
END
You can consider using a stored procedure on the sink side to apply the source data to the sink table by specifying "sqlWriterStoredProcedureName" on the SqlSink. Pass the pipeline run time to the stored procedure as a parameter and insert it into the sink table.
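For example, a rough sketch of such a sink; the procedure name spInsertTeam, the table type TeamType, and the parameter name data_load are hypothetical, while @pipeline().TriggerTime is the Data Factory V2 system variable for the run's trigger time:

"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "spInsertTeam",
    "sqlWriterTableType": "TeamType",
    "storedProcedureParameters": {
        "data_load": {
            "value": "@pipeline().TriggerTime",
            "type": "DateTime"
        }
    }
}

Inside spInsertTeam, insert the rows from the table type parameter into Team_new together with the @data_load value.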
I'm using Azure Data Factory to periodically import data from MySQL to Azure SQL Data Warehouse.
The data goes through a staging blob storage on an Azure storage account, but when I run the pipeline it fails because it can't separate the blob text back to columns. Each row that the pipeline tries to insert into the destination becomes a long string which contains all the column values delimited by a "⯑" character.
I used Data Factory before, without trying the incremental mechanism, and it worked fine. I don't see a reason it would cause such a behavior, but I'm probably missing something.
I'm attaching the JSON that describes the pipeline, with some minor naming changes; please let me know if you see anything that could explain this.
Thanks!
EDIT: Adding exception message:
Failed execution Database operation failed. Error message from database execution:
ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/f4ae80d1-4560-4af9-9e74-05de941725ac/Data.8665812f-fba1-407a-9e04-2ee5f3ca5a7e.txt)
Column ordinal: 27, Expected data type: VARCHAR(45) collate SQL_Latin1_General_CP1_CI_AS, Offending value: ROW OF VALUES (Tokenization failed), Error: Not enough columns in this line.,},],'.
{
    "name": "CopyPipeline-move_incremental_test",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from [table] where InsertTime >= \\'{0:yyyy-MM-dd HH:mm}\\' AND InsertTime < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "sqlWriterCleanupScript": "$$Text.Format('delete [schema].[table] where [InsertTime] >= \\'{0:yyyy-MM-dd HH:mm}\\' AND [InsertTime] <\\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
                        "allowPolyBase": true,
                        "polyBaseSettings": {
                            "rejectType": "Value",
                            "rejectValue": 0,
                            "useTypeDefault": true
                        },
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "column1:column1,column2:column2,column3:column3"
                    },
                    "enableStaging": true,
                    "stagingSettings": {
                        "linkedServiceName": "StagingStorage-somename",
                        "path": "somepath"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-input"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-output"
                    }
                ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 10,
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "Activity-0-_Custom query_->[schema]_[table]"
            }
        ],
        "start": "2017-06-01T05:29:12.567Z",
        "end": "2099-12-30T22:00:00Z",
        "isPaused": false,
        "hubName": "datafactory_hub",
        "pipelineMode": "Scheduled"
    }
}
It sounds like what you're doing is right, but the data is poorly formed (a common problem: non-UTF-8 encoding), so ADF can't parse the structure as you require. When I encounter this I often have to add a custom activity to the pipeline that cleans and prepares the data so it can then be used in a structured way by downstream activities. Unfortunately this is a big overhead in the development of the solution and will require you to write a C# class to deal with the data transformation.
Also remember ADF has no compute of its own; it only invokes other services, so you'll also need an Azure Batch service to execute the compiled code.
Sadly there is no magic fix here. ADF is great at extracting and loading your perfectly structured data, but in the real world we need other services to do the transform or cleaning, meaning we need a pipeline that can do ETL, or as I prefer, ECTL.
Here's a link on creating ADF custom activities to get you started: https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Hope this helps.
I've been struggling with much the same message when importing from an Azure SQL DB to Azure SQL DW using Data Factory v2 with staging (which implies PolyBase). I've learned that PolyBase will fail with error messages related to incorrect data types and the like. The message I received is very similar to the one mentioned here, even though I'm not using PolyBase directly from SQL but via Data Factory.
Anyway, the solution for me was to avoid NULL values for columns of decimal or numeric type, e.g. ISNULL(mynumericCol, 0) AS mynumericCol.
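In practice that means replacing a bare select * in the copy activity's source query with an explicit column list that wraps the nullable decimal/numeric columns; the column names below are hypothetical:

-- Hypothetical column names; wrap each nullable decimal/numeric column.
SELECT
    Id,
    ISNULL(Amount, 0) AS Amount,       -- decimal column: replace NULL so PolyBase accepts it
    ISNULL(Quantity, 0) AS Quantity,   -- numeric column: replace NULL so PolyBase accepts it
    InsertTime
FROM [table]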
I have a U-SQL script stored in my ADL store and I am trying to execute it. The script file is quite big, about 250 MB.
So far I have a Data Factory, I have created a linked service, and I am trying to create a Data Lake Analytics U-SQL activity.
The code for my U-SQL Activity looks like this:
{
    "name": "RunUSQLScript1",
    "properties": {
        "description": "Runs the USQL Script",
        "activities": [
            {
                "name": "DataLakeAnalyticsUSqlActivityTemplate",
                "type": "DataLakeAnalyticsU-SQL",
                "linkedServiceName": "AzureDataLakeStoreLinkedService",
                "typeProperties": {
                    "scriptPath": "/Output/dynamic.usql",
                    "scriptLinkedService": "AzureDataLakeStoreLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 1000
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 3,
                    "timeout": "01:00:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        ],
        "start": "2017-05-02T00:00:00Z",
        "end": "2017-05-02T00:00:00Z"
    }
}
However, I get the following error:
Error
Activity 'DataLakeAnalyticsUSqlActivityTemplate' from pipeline 'RunUSQLScript1' has no output(s) and no schedule. Please add an output dataset or define activity schedule.
What I would like is to have this activity run on demand, i.e. I do not want it scheduled at all, and I also do not understand what inputs and outputs are in my case. The U-SQL script I am trying to run operates on millions of files in my ADL storage and saves them back after some modification of the content.
Currently ADF does not support running a U-SQL script stored in ADLS for a U-SQL activity, i.e. the "scriptLinkedService" under "typeProperties" has to be an Azure Blob Storage linked service. We will update the documentation for the U-SQL activity to make this clearer.
Support for running U-SQL scripts stored in ADLS is on our product backlog, but we don't have a committed date for this yet.
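A sketch of how the activity's typeProperties might look once the script is uploaded to blob storage (the linked service name and script path below are assumptions):

"typeProperties": {
    "scriptPath": "usqlscripts/dynamic.usql",
    "scriptLinkedService": "AzureStorageLinkedService",
    "degreeOfParallelism": 3,
    "priority": 1000
}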
Shirley Wang
Currently ADF does not support executing the activity on demand; it needs to be configured with a schedule. You will need at least one output to drive the scheduled execution of the activity. The output can be a dummy Azure Storage one that doesn't actually write data out; ADF just leverages the availability properties to drive the scheduled execution. For example:
{
    "name": "OutputDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "dummyoutput.txt",
            "folderPath": "adf/output",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": "\t"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
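The U-SQL activity would then reference this dummy dataset in its outputs, so the dataset's availability drives the schedule (a sketch):

"outputs": [
    {
        "name": "OutputDataset"
    }
]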
I am trying to make a periodic copy of all the data returning from an OData query into a documentDB collection, on a daily basis.
The copy works fine using the copy wizard, which is A REALLY GREAT option for simple tasks. Thanks for that.
What isn't working for me though: the copy just adds data each time, and I have NO WAY that I can SEE with a DocumentDB sink to "pre-delete" the data in the collection (compare to the SQL sink, which has sqlWriterCleanupScript, which I could set to something like DELETE FROM 'table').
I know I can create an Azure Batch and do what I need, but at this point, I'm not sure that it isn't better to do a function and forego the Azure Data Factory (ADF) for this move. I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point, I'd like to just use DocumentDB but I don't see a way to do it given the way my data works.
Here's a look at my pipeline:
{
    "name": "R-------ProjectToDocDB",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": " "
                    },
                    "sink": {
                        "type": "DocumentDbCollectionSink",
                        "nestingSeparator": ".",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                        /// this is where a cleanup script would be great.
                    },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "ProjectId:ProjectId,.....:CostClassification"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-shc"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-shc"
                    }
                ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Activity-0-_Custom query_->---Project"
            }
        ],
        "start": "2017-04-26T20:13:27.683Z",
        "end": "2099-12-31T05:00:00Z",
        "isPaused": false,
        "hubName": "r-----datafactory01_hub",
        "pipelineMode": "Scheduled"
    }
}
Perhaps there's an update in the pipeline that will create parity between the SQL sink and DocumentDB?
Azure Data Factory does not support a cleanup script for DocumentDB today; it's something on our backlog. If you could describe the end-to-end scenario in a bit more detail, it would help us prioritize. For example, why does appending to the same collection not work? Is it because there's no way to identify the incremental records after each run? For the cleanup requirement, will it always be delete-everything, or might it be based on a timestamp, etc.? Thanks. Until support for a cleanup script is there, a custom activity is unfortunately the only workaround for now.
You could use a Logic App that runs on a Timer Trigger.
I have two datasets, a "FileShare" one (DS1) and a "BlobSource" one (DS2). I define a pipeline with one copy activity, which needs to copy the files from DS1 to DS3 (BlobSource), with the dependency specified as DS2. The activity is specified below:
{
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "FileShare"
        },
        "sink": {
            "type": "BlobSource"
        }
    },
    "inputs": [
        {
            "name": "FoodGroupDescriptionsFileSystem"
        },
        {
            "name": "FoodGroupDescriptionsInputBlob"
        }
    ],
    "outputs": [
        {
            "name": "FoodGroupDescriptionsAzureBlob"
        }
    ],
    "policy": {
        "timeout": "01:00:00",
        "concurrency": 1,
        "executionPriorityOrder": "NewestFirst"
    },
    "scheduler": {
        "frequency": "Minute",
        "interval": 15
    },
    "name": "FoodGroupDescriptions",
    "description": "#1 Bulk Import FoodGroupDescriptions"
}
How can I specify multiple source types here (both FileShare and BlobSource)? It throws an error when I try to pass them as a list.
The copy activity doesn't like multiple inputs or outputs. It can only perform a 1-to-1 copy... It won't even change the filename for you in the output dataset, never mind merge files!
This is probably intentional so Microsoft can charge you more for additional activities. But let's not digress into that one.
I suggest having one pipeline that copies both files into some sort of Azure storage using separate activities (one per file). Then have a second downstream pipeline with a custom activity that reads and merges/concatenates the files to produce a single output.
Remember that ADF isn't an ETL tool like SSIS. It's just there to invoke other Azure services. Copying is about as complex as it gets.