How to get the source data when an ingestion failure occurs in Kusto (ADX) - Azure

I have a base table in ADX Kusto DB.
.create table base (info:dynamic)
I have written a function that parses the dynamic column of the base table, extracts a few columns, and stores them in another table whenever the base table receives data (from Event Hub). Below are the function and its update policy:
.create function extractBase()
{
base
| evaluate bag_unpack(info)
| project tostring(column1), toreal(column2), toint(column3), todynamic(column4)
}
.alter table target_table policy update
#'[{"IsEnabled": true, "Source": "base", "Query": "extractBase()", "IsTransactional": false, "PropagateIngestionProperties": true}]'
Suppose the base table does not contain the expected column, so an ingestion error happens. How do I get the source row that caused the failure?
When using .show ingestion failures, the failure message is displayed and there is a column called IngestionSourcePath, but when I browse to that URL I get a Resource Not Found exception.
If an ingestion failure happens, I need to store the particular row of the base table in an IngestionFailure table for further investigation.

In this case, your source data cannot "not have" a column defined by its schema.
If no value was ingested for some column in some row, a null value will be present there and the update policy will not fail.
Here the update policy will break if the original table row does not contain enough columns. Currently the source data for such errors is not emitted as part of the failure message.
In general, the source URI is only useful when you are ingesting data from blobs. In other cases the URI shown in the failed ingestion info is a URI on an internal blob that was created on the fly and no one has access to.
However, there is a command, currently missing from the documentation (we will make sure to update it), that lets you duplicate (dump to a storage container you provide) the source data of the next failed ingestion into a specific table.
The syntax is:
.dup-next-failed-ingest into TableName to h@'Path to Azure blob container'
Here the path to the Azure Blob container must include a writable SAS.
The required permission to run this command is DB admin.
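A hedged usage sketch, assuming the failing ingestion you want to capture is the one into target_table from the question; the storage account, container, and SAS token below are placeholders you would replace with your own writable container:
.dup-next-failed-ingest into target_table to h@'https://mystorageaccount.blob.core.windows.net/ingest-dumps?<writable-SAS-token>'
The h prefix marks the string literal as obfuscated, so the SAS is not echoed back in command logs or diagnostics.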

Related

Azure Stream Analytics always gets the same 'OutputDataConversionError.TypeConversionError', even after removing the datetime column in the Synapse DW SQL pool

I always get the same 'OutputDataConversionError.TypeConversionError', even after I remove the datetime column from the output in the Synapse DW SQL pool, and I get the same error after deleting and recreating the Stream Analytics job.
The stream input is an Event Hub carrying diagnostic logs from an Azure SQL database. It tested OK.
The stream output is a table in an Azure Synapse Analytics DW SQL pool. It also tested OK.
The query is:
SELECT
Records.ArrayValue.count as [count],
Records.ArrayValue.total as [total],
Records.ArrayValue.minimum as [minimum],
Records.ArrayValue.maximum as [maximum],
Records.ArrayValue.resourceId as [resourceId],
CAST(Records.ArrayValue.time AS datetime) as [time],
Records.ArrayValue.metricName as [metricName],
Records.ArrayValue.timeGrain as [timeGrain],
Records.ArrayValue.average as [average]
INTO
OrderSynapse
FROM
dbhub d
CROSS APPLY GetArrayElements(d.records) AS Records
The query passed the test run, but the stream job went into a degraded state with this error:
Source 'dblog' had 1 occurrences of kind 'OutputDataConversionError.TypeConversionError' between processing times '2021-11-12T05:28:08.7922407Z' and '2021-11-12T05:28:08.7922407Z'.
But even after I deleted the stream job, dropped the [time] column in the output table, removed "CAST(Records.ArrayValue.time AS datetime) as [time]," from the query, and recreated a new stream job, I still got the same error.
Part of the Activity log:
"ErrorCategory": "Diagnostic",
"ErrorCode": "DiagnosticMessage",
"Message": "First Occurred: 11/12/2021 7:39:12 AM | Resource Name: dblog | Message: Source 'dblog' had 1 occurrences of kind 'OutputDataConversionError.TypeConversionError' between processing times '2021-11-12T07:39:12.8681135Z' and '2021-11-12T07:39:12.8681135Z'. ",
"Type": "DiagnosticMessage",
Why? Is there a hidden cache I cannot clear?
It looks like a bug in the output adapter is causing this issue. While the fix rolls out, you can re-order the field list to match the column order in the destination table.
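For illustration, if the destination table happened to be defined with its columns in the order resourceId, metricName, time, timeGrain, count, total, minimum, maximum, average (a hypothetical order), the SELECT list would be rearranged to match it:
SELECT
Records.ArrayValue.resourceId as [resourceId],
Records.ArrayValue.metricName as [metricName],
CAST(Records.ArrayValue.time AS datetime) as [time],
Records.ArrayValue.timeGrain as [timeGrain],
Records.ArrayValue.count as [count],
Records.ArrayValue.total as [total],
Records.ArrayValue.minimum as [minimum],
Records.ArrayValue.maximum as [maximum],
Records.ArrayValue.average as [average]
INTO
OrderSynapse
FROM
dbhub d
CROSS APPLY GetArrayElements(d.records) AS Records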

Unable to get the scalar value of a query on Cosmos DB in Azure Data Factory

I am trying to get the count of all records present in Cosmos DB in a Lookup activity of Azure Data Factory. I need this value to compare with other activity outputs.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after inserting this query I get an error saying
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the screenshot of my Azure Lookup activity settings.
Could someone help me resolve this error? And if this is a limitation of Azure Data Factory, how can I get the count of all rows in the Cosmos DB container some other way inside Azure Data Factory?
I reproduced your issue on my side exactly.
I think the count result can't be mapped as a normal JSON object. As a workaround, you could use an Azure Function activity (inside the Azure Function you can use the SDK to execute any SQL you want) to output your desired result, e.g. {"number":10}, and then chain the Azure Function activity with the other activities in ADF.
Here is the contradiction right now:
The SQL query outputs a scalar array, not something like a JSON object or even a JSON string.
However, the ADF Lookup activity only accepts a JObject, not a JValue. I can't use any built-in convert function here because the SQL query needs to be produced with correct syntax anyway. I already submitted a ticket to the MS support team, but had no luck with this limitation.
I also tried select count(1) as num from c, which works in the Cosmos DB portal, but it still has a limitation because the query crosses partitions.
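To illustrate the shape difference (the value 10 is just a sample):
SELECT VALUE count(1) FROM c      returns a bare scalar array:   [ 10 ]
SELECT count(1) AS num FROM c     returns an array of objects:   [ { "num": 10 } ]
Only the second shape is something the Lookup activity can map to a JObject, but as noted above it runs into the cross-partition limitation.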
So, all I can do here is try to explain the root cause of the issue; I can't change the product behaviour.
Two rough ideas:
1. Try a non-partitioned collection to execute the above SQL and produce JSON output.
2. If the count is not large, try querying the columns from the DB and looping over the result with a ForEach activity.
You can use:
select top 1 column from c order by column desc

How to drop duplicates in a source dataset (JSON) and load data into Azure SQL DB in Azure Data Factory

I have a table in SQL DB with primary key fields. I am using a Copy activity in Azure Data Factory with a source dataset (JSON).
We are writing this data into a sink dataset (SQL DB), but the pipeline is failing with the error below:
"message": "'Type=System.Data.SqlClient.SqlException,Message=Violation of
PRIMARY KEY constraint 'PK__field__399771B9251AD6D4'. Cannot
insert duplicate key in object 'dbo.crop_original_new'. The
duplicate key value is (9161, en).\r\nThe statement has been
terminated.,Source=.Net SqlClient Data Provider,SqlErrorNumber=2627,Class=14,ErrorCode=-2146232060,State=1,Errors=
[{Class=14,Number=2627,State=1,Message=Violation of PRIMARY KEY
constraint 'PK__field__399771B9251AD6D4'. Cannot insert
duplicate key in object 'Table'. The duplicate key value is
(9161, en).,},{Class=0,Number=3621,State=0,Message=The statement has
been terminated.,},],'",
You can use the fault tolerance setting provided in the Copy activity to skip incompatible rows.
(Screenshot: fault tolerance setting in the Copy activity.)
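A minimal sketch of what that setting looks like inside the Copy activity's typeProperties JSON; the linked service reference and path for redirecting the skipped rows are placeholders, and the redirect section is optional:
"typeProperties": {
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
"path": "skipped-rows"
}
}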
The cleanest solution would be:
1. Create a staging table in your SQL environment, e.g. stg_table (this table should have a different key policy).
2. Load the data from the JSON source into stg_table.
3. Write a stored procedure that removes duplicates from the staged data and loads it into your destination table (see the sketch below).
Or, if you are familiar with Mapping Data Flows in ADF, you can check this article by Mark Kromer.
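A rough sketch of step 3, assuming a hypothetical staging table dbo.stg_crop whose key columns are id and lang (matching the (9161, en) duplicate in the error) and the destination dbo.crop_original_new; adjust all names and columns to your schema:
CREATE PROCEDURE dbo.usp_LoadCropFromStaging
AS
BEGIN
    -- Keep a single row per (id, lang) from the staging table and
    -- insert only keys that are not already in the destination.
    ;WITH deduped AS
    (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY s.id, s.lang ORDER BY (SELECT NULL)) AS rn
        FROM dbo.stg_crop AS s
    )
    INSERT INTO dbo.crop_original_new (id, lang, name)
    SELECT d.id, d.lang, d.name
    FROM deduped AS d
    WHERE d.rn = 1
      AND NOT EXISTS (SELECT 1
                      FROM dbo.crop_original_new AS t
                      WHERE t.id = d.id AND t.lang = d.lang);
END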

How to set up output path while copying data from Azure Cosmos DB to ADLS Gen 2 via Azure Data Factory

I have a cosmos DB collection in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now, since deviceid, year, month, and day are all in the payload itself, I can't find a way to use them except to create a Lookup activity and use its output in the Copy activity.
And this is how I set the output folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?
I think your way works, but it's not the cleanest. What I'd do is create a separate pipeline variable for each part: version, code, deviceid, etc. Then, after the Lookup, you can assign the variables and finally run the Copy activity referencing those pipeline variables.
It may look somewhat redundant, but think of someone (or you, two years from now) having to modify the pipeline: if you are not around (or have forgotten), this approach makes it clear how it works and what should be modified.
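For example (the variable names are hypothetical), the folder path of the sink dataset could then be built from those pipeline variables with an expression like:
@concat('outerfolder/', variables('version'), '/', variables('code'), '/', variables('deviceid'), '/', variables('year'), '/', variables('month'), '/', variables('day'))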
Hope this helped!!

Azure Data Factory copy activity fails mapping strings (from CSV) to an Azure SQL table sink uniqueidentifier field

I have an Azure Data Factory (ADF) pipeline that consists of a Copy activity. The Copy activity uses the HTTP connector as the source to invoke a REST endpoint and returns a CSV stream that sinks to an Azure SQL Database table.
The copy fails when the CSV contains strings (such as 40f52caf-e616-4321-8ea3-12ea3cbc54e9) that are mapped to a uniqueidentifier field in the target table, with the error message The given value of type String from the data source cannot be converted to type uniqueidentifier of the specified target column.
I have tried wrapping the source string with {}, such as {40f52caf-e616-4321-8ea3-12ea3cbc54e9}, with no success.
The Copy activity works if I change the target table field from uniqueidentifier to nvarchar(100).
I reproduced your issue on my side.
The reason is that the data types of the source and sink are mismatched. You can check the data type mapping for SQL Server.
Your source data type is string, which is mapped to nvarchar or varchar, while a uniqueidentifier column in the SQL database needs the GUID type in Azure Data Factory.
So, as a workaround, configure a SQL Server stored procedure in your SQL Server sink.
Please follow the steps from this doc:
Step 1: Configure your Sink dataset:
Step 2: Configure the Sink section in the copy activity as follows:
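Since the original screenshots are not reproduced here, a rough sketch of what the sink section might look like in JSON; the property names follow the copy-activity SQL sink, and the convertCsv / CsvType / ctest names come from the steps below:
"sink": {
"type": "AzureSqlSink",
"sqlWriterStoredProcedureName": "convertCsv",
"sqlWriterTableType": "CsvType",
"storedProcedureTableTypeParameterName": "ctest"
}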
Step 3: In your database, define the table type with the same name as sqlWriterTableType. Notice that the schema of the table type should be the same as the schema returned by your input data.
CREATE TYPE [dbo].[CsvType] AS TABLE(
[ID] [varchar](256) NOT NULL
)
Step 4: In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It handles input data from your specified source and merges it into the output table. Notice that the parameter name of the stored procedure should be the same as the "tableName" defined in the dataset.
CREATE PROCEDURE convertCsv @ctest [dbo].[CsvType] READONLY
AS
BEGIN
    -- Merge the table-valued parameter into the target table,
    -- converting the varchar ID into a uniqueidentifier on insert.
    MERGE [dbo].[adf] AS target
    USING @ctest AS source
    ON (1 = 1)
    WHEN NOT MATCHED THEN
        INSERT (id)
        VALUES (CONVERT(uniqueidentifier, source.ID));
END
Output:
Hope it helps you. If you have any concerns, please feel free to let me know.
There is a way to fix the GUID conversion into the uniqueidentifier SQL column type properly via JSON configuration.
Edit the Copy activity via the Code {} button in the top-right toolbar.
Put:
"translator": {
"type": "TabularTranslator",
"typeConversion": true
}
into the typeProperties block of the Copy activity. This also works if the mapping schema is unspecified/dynamic.
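For context, a rough sketch of where that block sits in the Copy activity JSON; the source and sink types here are assumptions for a CSV-over-HTTP to Azure SQL copy:
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": { "type": "AzureSqlSink" },
"translator": {
"type": "TabularTranslator",
"typeConversion": true
}
}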
