How can I ingest data from Apache Avro into Azure Data Explorer?

For several days I have been trying to ingest Apache Avro formatted data from a blob storage into Azure Data Explorer.
I'm able to reference the top-level JSON keys like $.Body (see the red underlined example in the screenshot below), but when it comes to the nested JSON keys, Azure fails to parse them properly and displays nothing (as seen in the green column: I would expect $.Body.entityId to reference the key "entityId" inside the Body JSON).
Many thanks in advance for any help!
Here is a screenshot of the Azure Data Explorer web interface.
Edit 1
I already tried increasing the "Nested levels" option to 2, but all I got was this error message with no further details. The error message won't even disappear when I decrease the level back to 1; I have to cancel and start the process all over again.
I also just noticed that the auto-generated columns have some strange types. It seems like they all end up as type string... This seems a little odd to me, too.
Edit 2
Here is some kql-Code.
This is the schema of my input .avro file, which I get from my Event Hub capture:
{
    SequenceNumber: ...,
    Offset: ...,
    EnqueuedTimeUTC: ...,
    SystemProperties: ...,
    Properties: ...,
    Body: {
        entityId: ...,
        eventTime: ...,
        messageId: ...,
        data: ...
    }
}, ...
And with these ingestion commands I can't reference the inner JSON keys. The top-level keys work perfectly fine.
// Create table command
////////////////////////////////////////////////////////////
.create table ['test_table'] (['Body']:dynamic, ['entityId']:string)
// Create mapping command
////////////////////////////////////////////////////////////
.create table ['test_table'] ingestion apacheavro mapping 'test_table_mapping' '[{"column":"Body", "Properties":{"Path":"$.Body"}},{"column":"entityId", "Properties":{"Path":"$.Body.entityId"}}]'
// Ingest data into table command
///////////////////////////////////////////////////////////
.ingest async into table ['test_table'] (h'[SAS URL]') with (format='apacheavro',ingestionMappingReference='test_table_mapping',ingestionMappingType='apacheavro',tags="['503a2cfb-5b81-4c07-8658-639009870862']")
I would love to ingest the inner data fields into separate columns, instead of building any workaround with update policies.

For those having the same issue, here is the workaround we currently use:
First, assume that we want to ingest the contents of the Body field from the avro file to the table avro_destination.
Step 1: Create an ingestion table
.create table avro_ingest(
    Body: dynamic
    // optional other columns, if you want...
)
Step 2: Create an update policy
.create-or-alter function
with (docstring = 'Convert avro_ingest to avro_destination', folder='ingest')
convert_avro_ingest() {
    avro_ingest
    | extend entityId = tostring(Body.entityId)
    | extend messageId = tostring(Body.messageId)
    | extend eventTime = todatetime(Body.eventTime)
    | extend data = Body.data
    | project entityId, messageId, eventTime, data
}
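Note that avro_destination must already exist before the update policy below is attached; a minimal sketch of the destination table, with column types assumed from the function above, could be:
.create table avro_destination (
    entityId: string,
    messageId: string,
    eventTime: datetime,
    data: dynamic
)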
.alter table avro_destination policy update
#'[{ "IsEnabled": true, "Source": "avro_ingest", "Query": "convert_avro_ingest()", "IsTransactional": false, "PropagateIngestionProperties": true}]'
Step 3: Ingest the .avro files into the avro_ingest table
...as seen in the question, with one column containing the whole Body JSON per entry.

Following the OP's updates:
Here is the Avro schema of an Event Hubs capture.
As you can see, Body is of type bytes, so there is practically nothing you can do with it in this form, other than ingesting it as is (as dynamic).
{
    "type": "record",
    "name": "EventData",
    "namespace": "Microsoft.ServiceBus.Messaging",
    "fields": [
        {"name": "SequenceNumber", "type": "long"},
        {"name": "Offset", "type": "string"},
        {"name": "EnqueuedTimeUtc", "type": "string"},
        {"name": "SystemProperties", "type": {"type": "map", "values": ["long", "double", "string", "bytes"]}},
        {"name": "Properties", "type": {"type": "map", "values": ["long", "double", "string", "bytes"]}},
        {"name": "Body", "type": ["null", "bytes"]}
    ]
}
If you take a look at the ingested data, you'll see that the content of Body is an array of integers.
Those integers are the decimal values of the characters that make up Body.
capture
| project Body
| take 3
Body
[123,34,105,100,34,58,32,34,56,49,55,98,50,99,100,57,45,97,98,48,49,45,52,100,51,53,45,57,48,51,54,45,100,57,55,50,51,55,55,98,54,56,50,57,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,48,44,32,34,109,121,105,110,116,34,58,32,50,48,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,51,57,56,53,52,52,56,55,52,53,57,56,57,48,55,57,55,125]
[123,34,105,100,34,58,32,34,57,53,100,52,100,55,56,48,45,97,99,100,55,45,52,52,57,50,45,98,97,54,100,45,52,56,49,54,97,51,56,100,52,56,56,51,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,49,44,32,34,109,121,105,110,116,34,58,32,56,56,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,54,53,53,51,55,51,51,56,49,57,54,53,50,52,52,49,125]
[123,34,105,100,34,58,32,34,53,50,100,49,102,54,54,53,45,102,57,102,54,45,52,49,50,49,45,97,50,57,99,45,55,55,56,48,102,101,57,53,53,55,48,56,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,50,44,32,34,109,121,105,110,116,34,58,32,49,57,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,52,53,57,54,49,56,54,51,49,51,49,50,50,52,50,50,51,125]
Body can be converted to text using make_string() and then parsed to JSON using todynamic():
capture
| project BodyJSON = todynamic(make_string(Body))
| take 3
BodyJSON
{"id":"817b2cd9-ab01-4d35-9036-d972377b6829","dt":"2021-08-12T16:48:32.5962540Z","i":0,"myint":20,"myfloat":"0.398544874598908"}
{"id":"95d4d780-acd7-4492-ba6d-4816a38d4883","dt":"2021-08-12T16:48:32.5962540Z","i":1,"myint":88,"myfloat":"0.65537338196524408"}
{"id":"52d1f665-f9f6-4121-a29c-7780fe955708","dt":"2021-08-12T16:48:32.5962540Z","i":2,"myint":19,"myfloat":"0.45961863131224223"}

Simply increase "Nested levels" to 2.

Related

Is it possible for CQL to parse a JSON object to insert data?

From what I've seen so far, it seems impossible with Cassandra. But I thought I'd give it a shot:
How can I select the value of a JSON property, parsed from a JSON object string, and use it as part of an update/insert statement in Cassandra?
For example, I'm given the JSON object:
{
    "id": 123,
    "some_string": "hello there",
    "mytimestamp": "2019-09-02T22:02:24.355Z"
}
And this is the table definition:
CREATE TABLE IF NOT EXISTS myspace.mytable (
    id text,
    data blob,
    PRIMARY KEY (id)
);
Now, the thing to know at this point is that, for a given reason, the data field will be set to the JSON string. In other words, there is no 1:1 mapping between the given JSON and the table columns; instead, the data field contains the whole JSON object as a kind of blob value.
... Is it possible to parse the timestamp value of the given json object as part of an insert statement?
Pseudo-code example of what I mean, which obviously doesn't work ($myJson is a placeholder for the JSON object string above):
INSERT INTO myspace.mytable (id, data)
VALUES (123, $myJson)
USING timestamp toTimeStamp($myJson.mytimestamp)
The quick answer is no, it's not possible to do that with CQL.
The norm is to parse the elements of the JSON object within your application to extract the corresponding values to construct the CQL statement.
As a side note, I would discourage using the CQL blob type due to possible performance issues should the blob size exceed 1 MB. If it's JSON, consider storing it as the CQL text type instead. Cheers!
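For example, the same table with the payload stored as text instead of blob (a sketch of the suggested alternative):
CREATE TABLE IF NOT EXISTS myspace.mytable (
    id text,
    data text,   -- JSON payload stored as text rather than blob
    PRIMARY KEY (id)
);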
Worth mentioning: CQL can do a limited amount of JSON parsing on its own, albeit not as detailed as you're asking for here (e.g., USING timestamp).
But something like this works:
> CREATE TABLE myjsontable (
... id TEXT,
... some_string TEXT,
... PRIMARY KEY (id));
> INSERT INTO myjsontable JSON '{"id":"123","some_string":"hello there"}';
> SELECT * FROM myjsontable WHERE id='123';
id | some_string
-----+-------------
123 | hello there
(1 rows)
In your case you'd either have to redesign the table or the JSON payload so that they match. But as Erick and Cédrick have mentioned, the USING timestamp part would have to happen client-side.
What you detailed is doable with Cassandra.
Timestamp
To insert a timestamp in a query, it should be formatted as an ISO 8601 string. Sample examples can be found here. In your code, you might have to convert your incoming value to the expected type and format.
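For illustration, here is a hypothetical table with a timestamp column, just to show the ISO 8601 literal format (the table and column names are not from the question):
CREATE TABLE IF NOT EXISTS myspace.events (
    id text PRIMARY KEY,
    created_at timestamp
);
INSERT INTO myspace.events (id, created_at) VALUES ('123', '2019-09-02T22:02:24.355Z');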
Blob:
A blob is meant to store binary data; as such, it cannot be put ad hoc as a string into a CQL query. (You can use the TEXT type instead if you want to encode it as base64.)
When you need to insert binary data, you need to provide the proper type as well. For instance, if you are working with JavaScript, you need to provide a Buffer, as described in the documentation. Then, when you execute your query, you externalize your parameters:
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({ contactPoints: ['127.0.0.1'], localDataCenter: 'datacenter1' });
const sampleId = '123';                                // the id column is of type text
const sampleData = Buffer.from('hello world', 'utf8'); // blob columns expect a Buffer
const sampleTimestampMicros = Date.now() * 1000;       // USING TIMESTAMP takes microseconds since epoch
client.execute('INSERT INTO myspace.mytable (id, data) VALUES (?, ?) USING TIMESTAMP ?',
    [sampleId, sampleData, sampleTimestampMicros], { prepare: true });

Kusto/Azure Data Explorer - How can I partition an external table using a timespan field?

Hoping someone can help..
I am new to Kusto and have to get an external table working that reads data from an Azure Blob storage account, but the one table I have is unique in that the data for the timestamp column is split into two separate columns, i.e. LogDate and LogTime (see script below).
My data is stored in the following structure in the Azure Storage account container (container is named "employeedata", for example):
{employeename}/{year}/{month}/{day}/{hour}/{minute}.csv, in a simple CSV format.
I know the CSV is good because if I import it into a normal Kusto table, it works perfectly.
My KQL script for the external table creation looks as follows:
.create-or-alter external table EmpLogs (Employee: string, LogDate: datetime, LogTime:timestamp)
kind=blob
partition by (EmployeeName:string = Employee, yyyy:datetime = startofday(LogDate), MM:datetime = startofday(LogDate), dd:datetime = startofday(LogDate), HH:datetime = todatetime(LogTime), mm:datetime = todatetime(LogTime))
pathformat = (EmployeeName "/" datetime_pattern("yyyy", yyyy) "/" datetime_pattern("MM", MM) "/" datetime_pattern("dd", dd) "/" substring(HH, 0, 2) "/" substring(mm, 3, 2) ".csv")
dataformat=csv
(
h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
I am constantly getting the error below, which is not very helpful (redacted from the full error; it basically comes down to the fact that there is a syntax error somewhere):
Syntax error: Query could not be parsed: {
"error": {
"code": "BadRequest_SyntaxError",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.Data.Exceptions.SyntaxException",
"#message": "Syntax error: Query could not be parsed: . Query: '.create-or-alter external table ........
I know the todatetime() function works on timespans; I tested it with another table and it created a date similar to the following: 0001-01-01 20:18:00.0000000.
I have tried using the bin() function on the timestamp/LogTime columns, but I get the same error as above. I even tried importing the time value as a string and doing some string manipulation on it, but no luck; I still get the same syntax error.
Any help/guidance would be greatly appreciated.
Thank you!!
Currently, there's no way to define an external table partition based on more than one column. If your dataset's timestamp is split between two columns, LogDate:datetime and LogTime:timespan, then the best you can do is use a virtual column for the partition by time:
.create-or-alter external table EmpLogs(Employee: string, LogDate:datetime, LogTime:timespan)
kind=blob
partition by (EmployeeName:string = Employee, PartitionDate:datetime)
pathformat = (EmployeeName "/" datetime_pattern("yyyy/MM/dd/HH/mm", PartitionDate))
dataformat=csv
(
//h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
Now you can filter by the virtual column and fine-tune using LogTime:
external_table("EmpLogs")
| where Employee in ("John Doe", ...)
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime ...
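For example, if LogTime holds the time of day as a timespan (an assumption, for illustration), the final filter could look like this:
external_table("EmpLogs")
| where Employee in ("John Doe")
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime between (time(10:15:00) .. time(10:30:00))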

Using the output of a Lookup activity to query the DB and write to a CSV file in a storage account using ADF

My requirement is to use ADF to read data (columnA) from an xlsx/csv file in the storage account, use that (columnA) to query my DB, and write the output of my query, which includes (columnA), to a file in the storage account.
I was able to read the data from the storage account, but I am getting it as a table. I need to use it as an individual entry, like select * from table where id=columnA.
The next task, once I'm able to read each entry, is how to write it to a file.
I used a Lookup activity to read the data from Excel; below is the sample output. I need to use only the SKU number for my query next, but I'm not able to proceed with this. Kindly suggest a solution.
I set a variable to the output of the lookup as suggested here https://www.mssqltips.com/sqlservertip/6185/azure-data-factory-lookup-activity-example/ and tried to use that variable in my query, but I get an exception (a bad template error) when I trigger it.
Please try this:
I created a sample like yours, and there is no need to use Set Variable.
Details:
Below is the Lookup output:
{
    "count": 3,
    "value": [
        { "SKU": "aaaa" },
        { "SKU": "bbbb" },
        { "SKU": "ccc" }
    ]
}
Settings of the Copy Data activity:
Query SQL:
select * from data_source_table where Name = '@{activity('Lookup1').output.value[0].SKU}'
You can also use this SQL, if you need it:
select * from data_source_table where Name in ('@{activity('Lookup1').output.value[0].SKU}','@{activity('Lookup1').output.value[1].SKU}','@{activity('Lookup1').output.value[2].SKU}')
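If the number of rows returned by the Lookup is not fixed, iterating with a ForEach activity over the Lookup output is a more general pattern (a sketch; the activity name Lookup1 is taken from above, everything else is an assumption):
// ForEach activity "Items" setting:
@activity('Lookup1').output.value
// Query used inside the ForEach (for example, in a Copy activity source):
select * from data_source_table where Name = '@{item().SKU}'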
This is my test data in my SQL database:
Here is the result:
1,"aaaa",0,2017-09-01 00:56:00.0000000
2,"bbbb",0,2017-09-02 05:23:00.0000000
Hope this can help you.
Update:
You can try to use a Data Flow.
source1 is your CSV file, source2 is your SQL database.
This is the setting of the lookup:
Filter condition: !isNull(PersonID) (PersonID is a column in your SQL database).
Then, use a Select transformation to delete the SKU column.
Finally, output to a single file.

How to store Cassandra maps in an array?

I want to store data in the following structure:
"id" : 100, -- primary key
"data" : [
{
"imei" : 862304021502870,
"details" : [
{
"start" : "2018-07-24 12:34:50",
"end" : "2018-07-24 12:44:34"
},
{
"start" : "2018-07-24 12:54:50",
"end" : "2018-07-24 12:56:34"
}
]
}
]
So how do I create a table schema in Cassandra for this?
Thanks in advance.
There are several approaches to this, depending on the requirements regarding data access/modification - for example, do you need to modify individual fields, or do you update everything at once?
Declare the map of imei/details as a user-defined type (UDT), and then declare the table like this:
create table tbl (
    id int primary key,
    data set<frozen<details_udt>>);
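For completeness, the referenced UDTs might be declared like this (a sketch; the type and field names are illustrative, adapted from the JSON above):
CREATE TYPE IF NOT EXISTS period_udt (
    start_time timestamp,
    end_time timestamp
);
CREATE TYPE IF NOT EXISTS details_udt (
    imei bigint,
    periods list<frozen<period_udt>>
);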
But this is relatively hard to support in the long term, especially if you add more nested objects with different types. Plus, you can't really update fields of the frozen records that you must use in the case of nested collections/UDTs - with this table structure you need to replace the complete record inside the set.
Another approach is to do explicit serialization/deserialization of data into/from JSON or another format, and have a table structure like this:
create table tbl(
    id int primary key,
    data text);
The type of the data field depends on what format you'll use - you can use blob as well to store binary data. But in this case you'll need to update/fetch the complete field. You can simplify things if you use the Java driver's custom codecs, which will take care of the conversion between your data structure in Java and the desired format. See the example in the documentation for conversion to/from JSON.
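For example, with the text-based table, the serialized payload is simply inlined (or bound) as a string; a minimal sketch:
INSERT INTO tbl (id, data)
VALUES (100, '{"imei": 862304021502870, "details": [{"start": "2018-07-24 12:34:50", "end": "2018-07-24 12:44:34"}]}');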

Azure stream analytics - Joining on a csv file returns 0 rows

I have the following query:
SELECT
[VanList].deviceId
,[VanList].[VanName]
events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [timestamp]
join [VanList] on events.DeviceId = [VanList].deviceId
where iot-EventHub is my Event Hub and VanList is a reference list (CSV file) that has been uploaded to Azure storage.
I have tried uploading sample data to test the query, but it always returns 0 rows.
Below is a sample of the JSON captured by my Event Hub Input
[
    {
        "DeviceId": 1,
        "Timestamp": "2015-06-29T12:15:18.0000000",
        "ExternalTemp": 9,
        "InternalTemp": 8,
        "Humidity": 43,
        "Latitude": 51.3854942,
        "Longitude": -1.12774682,
        "EventProcessedUtcTime": "2015-06-29T12:25:46.0932317Z",
        "PartitionId": 1,
        "EventEnqueuedUtcTime": "2015-06-29T12:15:18.5990000Z"
    }
]
Below is a sample of my CSV reference data.
deviceId,VanName
1,VAN 1
2,VAN 2
3,Standby Van
Both lists contain a device id of 1, so I am expecting my query to be able to join the two together.
I have tried using both "inner join" and "join" in my query syntax, but neither results in a successful join.
What is wrong with my Stream Analytics query?
Try adding a CAST function in the join. I'm not sure why that works while adding a CREATE TABLE clause for the VanList reference data input doesn't accomplish the same thing, but I think this works.
SELECT
[VanList].deviceId
,[VanList].[VanName]
,events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [Timestamp]
join [VanList] on events.DeviceId = cast([VanList].deviceId as bigint)
The only thing I can see is that you are missing a comma in your original query (after [VanList].[VanName]); otherwise it looks correct. I would try recreating the Stream Analytics job. Here is another example that worked for me.
SELECT
countryref.CountryName as Geography,
input.GeographyId as GeographyId
into [country-out]
FROM input timestamp by [TransactionDateTime]
Join countryref
on countryref.GeographyID = input.GeographyId
Input data example
{"pageid":801,"firstname":"Gertrude","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"2ggnC"}
{"pageid":801,"firstname":"Venice","geographyid":1,"itemid":10,"itemprice":169.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"xLyOp"}
{"pageid":801,"firstname":"Christinia","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"VuycQ"}
{"pageid":801,"firstname":"Dorethea","geographyid":4,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"tgvQP"}
{"pageid":801,"firstname":"Dwain","geographyid":4,"itemid":4,"itemprice":129.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"O5TwV"}
Country ref data
[
    { "GeographyID": 1, "CountryName": "USA" },
    { "GeographyID": 2, "CountryName": "China" },
    { "GeographyID": 3, "CountryName": "Brazil" },
    { "GeographyID": 4, "CountryName": "Andrews country" },
    { "GeographyID": 5, "CountryName": "Chile" }
]
