Azure Data Factory: Data Flow Sink Mapping

In an Azure Data Factory Mapping Data Flow, the Mapping tab of the Sink shows this warning:
"At least one incoming column is mapped to a column in the sink dataset schema with a conflicting type, which can cause NULL values or runtime errors."
Is there some way of actually finding out which column is the offending one?
Visual inspection confirms that the outgoing transformation's types and the Snowflake table's types are identical.

If you are loading from ADF into Snowflake, I would recommend loading all of the data into a single VARIANT column in Snowflake. Then, once the data has landed safely, parse the VARIANT column into multiple typed columns (complete with data quality checks and transformations).
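For example, a rough Snowflake SQL sketch of that pattern (the table and field names here are hypothetical placeholders, not taken from the question):

-- Hypothetical landing table: one VARIANT column holding the raw record.
CREATE TABLE IF NOT EXISTS raw_landing (v VARIANT);

-- Parse the VARIANT into typed columns; TRY_TO_DECIMAL returns NULL instead of
-- failing when a value cannot be converted, so bad rows do not abort the load.
INSERT INTO curated_orders (order_id, customer_name, order_total)
SELECT
    v:order_id::NUMBER,
    v:customer_name::STRING,
    TRY_TO_DECIMAL(v:order_total::STRING, 18, 2)
FROM raw_landing;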

Related

Exception Handling for Copy Activity in Azure Data Factory

I was using the copy activity to update rows in Azure Table Storage. Currently the pipeline fails if there are any errors in updating any of the rows/batches.
Is there a way to gracefully handle the failed rows and continue the copy activity for the rest of the data?
I already tried the Fault Tolerance option that the copy activity provides (see the Fault Tolerance page), but that does not solve this case.
I have reproduced the same scenario and got the same error when mapping a column containing special-character data to the RowKey column in Table Storage.
(Screenshots: source dataset, fault tolerance settings, and the resulting error message.)
In the copy activity, it is not possible to skip incompatible rows other than by using fault tolerance. The workaround is to use a Data Flow activity to separate the compatible rows from the incompatible rows, and then copy the compatible data with a copy activity. Below is the approach.
The source data is taken as shown in the screenshot below.
Since col4 needs to be checked before loading to Table Storage, a Conditional Split transformation is added after the source transformation and a condition is applied to the col4 data. The condition is given as:
False stream:
like(col4,'%#%')||like(col4,'%$%')||like(col4,'%/%')||like(col4,'%\\%')
(Only sample special characters are shown in the condition above.)
The true stream will contain the rows that do not match the condition.
The false and true streams are routed to sink1 and sink2 respectively to copy the data to Blob Storage.
(Output screenshots: false stream data and true stream data.)
Once the compatible data has been copied to Blob Storage, it can be copied to Table Storage using a copy activity.

Azure Data Factory V2 - Process One Array Field on Row as a string

I created an Azure Data Factory pipeline that uses a REST data source to pull data from a REST API and copy it to an Azure SQL database. Each row in the REST data source contains approximately 8 fields, but one of those fields contains an array of values. I'm using a Copy Data activity. How do I get all the values from that field to map into one of my database fields, possibly as a string? I've tried clicking "Collection Reference" for that field, but if the array field has 5 values, it creates 5 different records in my SQL table for the one source row. If I don't select "Collection Reference", it only grabs the first value in the array.
I looked into using a Mapping Data Flow instead, but that doesn't seem to support a REST API dataset as a source.
Please help.
You can store the output of the REST API as a JSON file in Azure Blob Storage with a Copy Data activity, then use that file as the source and do the transformation in a Data Flow. Alternatively, you can use a Lookup activity to get the JSON data and invoke a stored procedure to store the data in Azure SQL Database (this way is cheaper and performs better).
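As a rough illustration of the Lookup-plus-stored-procedure route (the procedure, table, and JSON field names below are hypothetical), the procedure can use OPENJSON to flatten the array field into a single comma-separated string:

-- Hypothetical stored procedure: receives the raw JSON payload from the Lookup
-- activity and flattens the array field into one string column per source row.
CREATE PROCEDURE dbo.usp_LoadRestRows
    @json NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.TargetTable (Id, Name, Tags)
    SELECT
        r.Id,
        r.Name,
        -- Collapse the JSON array into a single comma-separated string.
        (SELECT STRING_AGG(t.value, ',') FROM OPENJSON(r.TagsJson) AS t)
    FROM OPENJSON(@json)
         WITH (
             Id       INT           '$.id',
             Name     NVARCHAR(100) '$.name',
             TagsJson NVARCHAR(MAX) '$.tags' AS JSON
         ) AS r;
END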

Cache Lookup Properties in Azure Data Factory

I have a requirement where a source file contains the name(s) of tables to be processed in a Mapping Data Flow. Based on each table name in the file, a dynamic query needs to retrieve column metadata, along with some other properties, from the data dictionary tables and insert it into a different sink table. The table name from the file would be used as the WHERE condition filter.
Since there can be multiple tables listed in the input file (let's assume it is a CSV with only one column containing the table names), if we decide to use a cache sink for the file:
Is it possible to use the results of that cache sink in the source transformation query of the same mapping data flow, as a lookup from which the column metadata is retrieved, and if yes, how?
What would be the best way to restrict the data from the metadata table query based on this table name?
I thought of alternatively achieving this with a pipeline that uses a ForEach activity to pass the table name as a parameter to the data flow, but in that case, if there are 100 tables in the file, there would be 100 iterations and the cluster would need to be spun up 100 times. Please advise if this is wrong or if there are better ways to achieve this.
You would need to use your third option: loop through the table names and pass each one in as a parameter to the data flow to set the table name in the dataset.
ADF handles the cluster creation and teardown. All you have to worry about is whether you want to execute each iteration sequentially or in parallel, and how many at a time. There are concurrency limits in ADF, so you should consider a batch count of 20 if you run in parallel.
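For reference, the per-table metadata query that the parameterized source runs could look roughly like this, assuming SQL Server-style catalog views (the table name value is substituted from the data flow parameter on each iteration):

-- Hypothetical per-table metadata query: the table name arrives as a
-- dataset/data flow parameter and is used as the WHERE filter.
SELECT
    c.TABLE_SCHEMA,
    c.TABLE_NAME,
    c.COLUMN_NAME,
    c.DATA_TYPE,
    c.CHARACTER_MAXIMUM_LENGTH,
    c.IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME = 'TableNameFromParameter';  -- placeholder for the parameter value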

how to convert currency to decimal in Azure Data Factory's copy data activity mapping

In Azure Data Factory I have a pipeline, and the pipeline has one copy data activity with a REST API as its source and a SQL DB table as its destination.
In the mapping of this copy activity I specify which columns from the REST dataset (on the left) map to which columns in the SQL dataset (on the right).
There is a JSON property in the REST payload, "totalBalance", that is supposed to be mapped to the "Balance" field in the DB table.
The JSON has "totalBalance" as a string, for example "$36,970,267.07", so how do I convert this into a decimal so that I can map it to the database table?
Do I need to somehow use a Mapping Data Flow instead of the copy activity, or can the copy activity do that on its own?
Finally, what worked for me was having a copy activity and a Mapping Data Flow.
The copy activity copies the data from REST into a staging SQL table where all the columns are VARCHAR, and from that all-string table a Mapping Data Flow sinks the data into the actual destination SQL table.
Between the source and the sink I added a Derived Column transformation for each source property I want to convert, and in the expression of that derived column I use an expression like this:
toDecimal(replace(replace(totalAccountReceivable, '$', ''),',',''))
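If you would rather do the conversion on the database side instead of in a Derived Column, a rough T-SQL equivalent over the all-VARCHAR staging table might look like this (the table and column names are placeholders):

-- Hypothetical staging-to-destination conversion in T-SQL; TRY_CAST returns
-- NULL instead of failing when a value cannot be converted.
INSERT INTO dbo.Balances (Balance)
SELECT TRY_CAST(REPLACE(REPLACE(totalBalance, '$', ''), ',', '') AS DECIMAL(18, 2))
FROM dbo.Staging_AllVarchar;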
The copy activity cannot do that directly.
There are two ways I think you can do this:
First: change the decimal column to varchar in the DB table.
Second: add a Lookup activity before the copy activity to remove the '$' from the 'totalBalance' column, then add an additional column like this:
Finally, use this additional column to map to the 'Balance' column.
Hope this can help you.

Ingesting a CSV file thru Polybase without knowing the sequence of columns

I am trying to ingest a few CSV files from Azure Data Lake into Azure Synapse using Polybase.
There is a fixed set of columns in each CSV file and the column names are given on the first line. However, the columns can appear in a different order from file to file.
In Polybase, I need to declare an external table, which requires knowing the exact sequence of columns at design time, so I cannot create the external table. Are there other ways to ingest the CSV files?
I don't believe you can do this directly with Polybase because, as you noted, the CREATE EXTERNAL TABLE statement requires the column declarations. At runtime, the CSV data is then mapped to those column names.
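For illustration, a minimal external table sketch (the names, data source, and file format below are placeholders) showing why the column order is fixed at design time; the declared columns are matched to the CSV file positionally rather than by header name:

-- Hypothetical external table: the column list is fixed at design time,
-- so a CSV whose columns arrive in a different order breaks the mapping.
CREATE EXTERNAL TABLE ext.SalesStaging
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',            -- folder in the data lake
    DATA_SOURCE = MyDataLakeSource,  -- placeholder external data source
    FILE_FORMAT = MyCsvFormat        -- placeholder delimited-text file format
);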
You could accomplish this easily with Azure Data Factory and a Data Flow (which uses Polybase under the covers to move the data into Synapse) by allowing the Data Flow to generate the table. This works because the table is generated after the data has been read, rather than before as with an external table.
For the sink dataset, create it with a parameterized table name (and optionally schema).
In the Sink, specify "Recreate table" as the table action.
Pass the desired table name to the sink dataset from the pipeline.
Be aware that all string-based columns will be defined as VARCHAR(MAX).
