I'm trying to create a dynamic data flow that gets data using a dynamic query passed in as a parameter, and then I want to upsert and delete in a dynamically defined sink table. I'm having trouble making the sink dynamic. Is it possible? How can I do it? I see no option to have a dynamic target table. What I was thinking was to add another parameter with the sink table name and use it.
High-level view of the data flow:
[Source gets data given sql query ] (OK)
[transformations in the middle] (OK)
[Sink to a dynamic table, that needs to be parameterized but I cannot find a way] (NOT OK)
You can create a parameter in the sink dataset and then pass the table name as a parameter from the Data Flow activity to the sink dataset.
The steps are sketched below.
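For illustration, here is a minimal sketch of a parameterized Azure SQL sink dataset; the dataset, linked service, and parameter names (SinkTableDS, AzureSqlLS, sinkTableName) are placeholders I've assumed, not names from the original setup:

{
    "name": "SinkTableDS",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "sinkTableName": { "type": "string" }
        },
        "typeProperties": {
            "schema": "dbo",
            "table": {
                "value": "@dataset().sinkTableName",
                "type": "Expression"
            }
        }
    }
}

Select this dataset in the sink transformation; when the data flow runs from a pipeline, the Execute Data Flow activity prompts for sinkTableName (it appears with the sink's dataset parameters), so you can pass your pipeline's table-name parameter through to the sink at run time.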
I have my pipeline designed with a Data Flow activity in it. The data flow gives me sink data like this:
{"BaseObject":"ABCD","OHY":"AAS"}
{"BaseObject":"DEFG","OHY":"LOI"}
{"BaseObject":"POIU","OHY":"JJI"}
Now I need to take each value of BaseObject and pass it as a parameter to a Web activity one by one, like a ForEach loop where one value of BaseObject is taken at a time and passed to the Web activity as a parameter, which in turn gives me my final JSON.
How can I do this?
Once the Data Flow activity is executed, the data will be loaded into the sink dataset. To get the sink results of the Data Flow activity, use another activity (Lookup) and connect it to the sink dataset.
In the pipeline, connect the Lookup activity after the Data Flow activity and read the sink dataset to get the loaded data.
(Screenshots in the original answer show the data flow, sink dataset, sink settings, pipeline, and the Lookup activity output.)
Connect the Lookup output to a ForEach activity to loop over the BaseObject values:
@activity('Lookup1').output.value
You can then use the current item (@item().BaseObject) in the activities inside the ForEach activity.
Example (sketched below):
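A rough sketch of the ForEach and Web activity pieces, assuming the Lookup is named Lookup1; the activity names, the URL, and the request-body shape are placeholders for illustration:

{
    "name": "ForEachBaseObject",
    "type": "ForEach",
    "dependsOn": [
        { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Lookup1').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CallWebApi",
                "type": "WebActivity",
                "typeProperties": {
                    "url": "https://example.com/api/process",
                    "method": "POST",
                    "body": {
                        "value": "{\"baseObject\":\"@{item().BaseObject}\"}",
                        "type": "Expression"
                    }
                }
            }
        ]
    }
}

Each iteration posts one BaseObject value to the web endpoint; the Web activity's output is then available inside the loop if you need to collect the final JSON.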
There is a Conditional Split in my ADF data flow. The success condition puts the rows into a SQL database, and the failure condition collects all the incorrect records and puts them into a sink of type CSV (delimited text).
When only the success condition is met, an empty CSV file of 0 bytes still gets created in the failure sink.
How can I stop this?
If you don't wish to write output to an external store, you can use a cache sink. It writes data into the Spark cache instead of a data store. In mapping data flows, you can reference this data within the same flow many times using a cache lookup. If you want to store this data later, just reference it as part of an expression.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types, you don't need to select a dataset or linked service because you aren't writing to an external store.
Note: A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup. A cache sink must also be the first sink written. When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in the ordering.
Reference this data within the same flow using a cache lookup, as part of an expression, when you want to persist it to a data store. Alternatively, use a cached lookup against the source to select the failed records and write them to CSV or to a log in a different stream or sink.
Refer: CacheSink, CachedLookup
If you still want to delete the empty zero-byte files, you can use ADF or a programmatic approach to delete them at the end of execution (Delete Activity in Azure Data Factory).
Examples of using the Delete activity
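For example, a Delete activity placed after the data flow could remove the empty file; the activity name, dataset name, and store settings type below are assumptions, not part of the original answer:

{
    "name": "DeleteEmptyErrorFile",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "ErrorCsvDataset",
            "type": "DatasetReference"
        },
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": false
        },
        "enableLogging": false
    }
}

If you only want to remove the file when it is empty, a Get Metadata activity checking the size property can gate the Delete with an If Condition.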
Recently Microsoft launched the Snowflake connector for data flows in ADF. Is there any way to turn on pushdown optimization in ADF so that, if my source and target are both Snowflake, then instead of pulling the data out of the Snowflake environment it triggers a query in Snowflake to do the task, like a normal ELT process instead of ETL?
Let me know if you need some more clarification.
As I understand it, the intent here is to fire a query from ADF on Snowflake data so that the data can possibly be scrubbed (or something similar). I see that the Lookup activity also supports Snowflake, and that should probably help you. My knowledge of Snowflake is limited, but I know that you can call a proc/query from the Lookup activity, and that should help.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
"Lookup activity reads and returns the content of a configuration file or table. It also returns the result of executing a query or stored procedure. The output from Lookup activity can be used in a subsequent copy or transformation activity if it's a singleton value. The output can be used in a ForEach activity if it's an array of attributes."
We are creating an Azure Data Factory pipeline using the .NET API. We are providing the input data source using sqlReaderQuery, which means the query can use multiple tables.
The problem is that we can't extract any single table from this query to supply as the tableName typeProperty in the dataset, as shown below:
"typeProperties": {
"tableName": "?"
}
While creating the dataset it throws an exception because tableName is mandatory. We don't want to provide tableName in this case. Is there any alternative way of doing this?
We are also providing the structure in the dataset.
Unfortunately you can't do that natively. You need to deploy a dataset for each table. Azure Data Factory produces slices for every activity ahead of execution time. Without knowing the table name, Data Factory would fail when producing these input slices.
If you want to read from multiple tables, then use a stored procedure as the input to the data set. Do your joins and input shaping in the stored procedure.
You could also get around this by building a dynamic custom activity that operates, say, at the database level. When doing this you would use a dummy input dataset and a generic output data set and control most of the process yourself.
It is a bit of a nuisance that this property is mandatory, particularly if you have provided a ...ReaderQuery. For Oracle copies I have used sys.dual as the table name; this is a sort of built-in dummy table in Oracle. In SQL Server you could use one of the system views or set up a dummy table, as sketched below.
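A sketch of the workaround for SQL Server sources (the system view, column names, and query here are illustrative, not from the original answer): point the dataset's mandatory property at a harmless placeholder,

"typeProperties": {
    "tableName": "INFORMATION_SCHEMA.TABLES"
}

and keep the real multi-table query on the copy activity's source:

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT a.Col1, b.Col2 FROM TableA a JOIN TableB b ON a.Id = b.Id"
}

When sqlReaderQuery is specified, the copy activity uses the query rather than the dataset's tableName; the placeholder only satisfies the dataset validation.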
Can I retrieve data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but you have to define a Copy activity for each of the data sources being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there are two ways to do this. Regardless of the method, the data source does not matter to ADF, since your data movement happens via the Copy activity, which looks at the dataset and takes care of firing the query on the related data source.
Method 1:
If all your transformations for a table can be done in a SELECT query on the source system, you can have a set of Copy activities specifying SELECT statements. This is the simpler approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy over the raw data from the source systems to staging tables in the SQLDW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets that are the output from Step 1 will be the input datasets to Step 2 in order to maintain consistency. A sketch of the Step 2 stored procedure call follows.
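As a minimal sketch of Step 2 (the activity name, linked service name, and procedure name are assumptions; in ADF v2 this is a Stored Procedure activity pointing at the SQL DW linked service):

{
    "name": "TransformStagingToTarget",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlDWLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "etl.usp_LoadTargetFromStaging"
    }
}

Chaining this after the Step 1 copy activities with dependsOn (or via dataset dependencies in ADF v1) preserves the Step 1 to Step 2 ordering.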