How to fix format error defining Trino/Kafka Joda Date Format - presto

I have built a Trino schema for a Kafka stream that needs to parse a date in the following format: 2022-09-19T00:00:00.000+02:00. (The date is midnight UTC+2 on 19 Sept 2022.)
The schema is here, with the section being rejected highlighted: https://github.com/os-climate/markets-pricing-service/blob/main/config/ecb-trino-schema.yaml#L50-L61
{
  "name": "starttime-period",
  "mapping": "starttime-period",
  "type": "TIMESTAMP",
  "dataFormat": "custom-date-time",
  "formatHint": "yyyy-mm-dTHH:mm:ss.SSS+HH:mm"
},
{
  "name": "end-time-period",
  "mapping": "end-time-period",
  "type": "TIMESTAMP",
  "dataFormat": "custom-date-time",
  "formatHint": "yyyy-mm-dTHH:mm:ss.SSS+HH:mm"
},
The error is:
SQL Error: Query failed (#20220930_190035_00011_kpnkf): invalid Joda Time pattern 'yyyy-mm-dTHH:mm:ss.SSS+HH:mm' passed as format hint for column 'starttime-period'
I have been through the documentation here: https://trino.io/docs/current/connector/kafka.html and cannot see what I could be doing wrong.

HH:mm represents the hour and minute of the day, and the +02:00 at the end is a time zone offset. In Joda-Time patterns, month is MM (lowercase mm is minute), day of month is dd, literal text such as the T must be single-quoted, and the offset is matched with Z (ZZ when it is written with a colon), so try: "formatHint": "yyyy-MM-dd'T'HH:mm:ss.SSSZZ" (Joda-Time docs).
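For example, the rejected columns could then look like this (a sketch of the corrected hint only; it assumes the offset is always written with a colon, as in +02:00):

{
  "name": "starttime-period",
  "mapping": "starttime-period",
  "type": "TIMESTAMP",
  "dataFormat": "custom-date-time",
  "formatHint": "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"
},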

Related

CloudWatch query does not work if #timestamp is set to a previous date

I am trying to push some old logs (about 10 days old) to CloudWatch using boto3 put_log_events. If I set the timestamp field to the current time in milliseconds, I am able to run CloudWatch queries on the log streams. If it is set to the actual log DateTime, CloudWatch responds with
"No data found for this time range"
Sample Log:
{"Namespace": "AWS/ECS", "Metric": "CPUUtilization", "Dimensions": {"Average": 0.08905141301220283, "Sum": 0.08905141301220283, "Unit": "Percent"}, "Timestamp": "2021-01-16T22:19:00+00:00"}
import datetime
import time
from dateutil.tz import tzutc

# 'message' holds the serialized log line (defined elsewhere in the script).
s = datetime.datetime(2021, 1, 15, 3, 17, tzinfo=tzutc())
time_data = int(round(s.timestamp() * 1000))  # CloudWatch search does not work if this is set.
time_data = int(round(time.time() * 1000))    # CloudWatch search works if this timestamp is set.
msg = [{"timestamp": time_data,
        "message": message}]
CloudWatch query:
filter #logStream = 'daily-2021-01-27'
| fields #timestamp, #message
| filter Namespace = 'AWS/ECS'
From the AWS docs:
-- timestamp -> (long)
The time the event occurred, expressed as the number of milliseconds after Jan 1, 1970 00:00:00 UTC.
-- None of the log events in the batch can be older than 14 days or older than the retention period of the log group.
Not sure what I am missing.
(Data that is timestamped 24 hours or more in the past may take in excess of 48 hours to become available from submission time using GetMetricStatistics.)
Is this related to this behaviour?
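For reference, a minimal Python sketch of the two constraints quoted above (timestamps in epoch milliseconds, and the 14-day limit for put_log_events); this is an illustration of the rules, not a fix:

import datetime
from dateutil.tz import tzutc

# put_log_events expects 'timestamp' in milliseconds since the Unix epoch (UTC)
# and rejects events older than 14 days (or older than the log group's retention).
event_time = datetime.datetime(2021, 1, 15, 3, 17, tzinfo=tzutc())
timestamp_ms = int(event_time.timestamp() * 1000)

age = datetime.datetime.now(tz=tzutc()) - event_time
if age > datetime.timedelta(days=14):
    print("Event is older than 14 days; put_log_events will reject it.")
else:
    print("Timestamp to send:", timestamp_ms)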

Change datetime format generated with make-series operation in Kusto

Introduction:
In Azure Data Explorer there is a make-series operator which allows us to create series of specified aggregated values along a specified axis.
The problem:
The operator works well except for the change in the timestamp format.
For example:
let resolution = 1d;
let timeframe = 3d;
let start_ts = datetime_add('second', offset, ago(timeframe));
let end_ts = datetime_add('second', offset, now());
Table
| make-series max(value) default=0 on timestamp from start_ts to end_ts step resolution by col_1, col_2
Current results:
I get a result containing the timestamp in UTC, like the following:
"max_value": [
-2.69,
-2.79,
-2.69
],
"timestamp": [
"2020-03-29T18:01:08.0552135Z",
"2020-03-30T18:01:08.0552135Z",
"2020-03-31T18:01:08.0552135Z"
],
Expected result:
The result should be like the following:
"max_value": [
-2.69,
-2.79,
-2.69
],
"timestamp": [
"2020-03-29 18:01:08",
"2020-03-30 18:01:08",
"2020-03-31 18:01:08"
],
Question:
Is there any way to change the datetime format generated by the make-series operation in Kusto so that it is NOT in UTC format?
It's not clear what you define as "UTC format". Kusto/ADX uses the ISO 8601 standard, and timestamps are always UTC. You can see that format used in your original message, e.g. 2020-03-29T18:01:08.0552135Z.
If, for whatever reason, you want to present datetime values in a different format inside a dynamic column (array or property bag), you could achieve that using mv-apply and format_datetime():
print arr = dynamic(
[
"2020-03-29T18:01:08.0552135Z",
"2020-03-30T18:01:08.0552135Z",
"2020-03-31T18:01:08.0552135Z"
])
| mv-apply arr on (
summarize make_list(format_datetime(todatetime(arr), "yyyy-MM-dd HH:mm:ss"))
)

How to extract and flatten a JSON array as well as specify an Array Value for 'TIMESTAMP BY' in Stream Analytics Query

I get the following input stream data into Stream Analytics.
[
{
"timestamp": 1559529369274,
"values": [
{
"id": "SimCh01.Device01.Ramp1",
"v": 39,
"q": 1,
"t": 1559529359833
},
{
"id": "SimCh01.Device01.Ramp2",
"v": 183.5,
"q": 1,
"t": 1559529359833
}
],
"EventProcessedUtcTime": "2019-06-03T02:37:29.5824231Z",
"PartitionId": 3,
"EventEnqueuedUtcTime": "2019-06-03T02:37:29.4390000Z",
"IoTHub": {
"MessageId": null,
"CorrelationId": null,
"ConnectionDeviceId": "ew-IoT-01-KepServer",
"ConnectionDeviceGenerationId": "636948080712635859",
"EnqueuedTime": "2019-06-03T02:37:29.4260000Z",
"StreamId": null
}
}
]
I am trying to extract the "values" array and use the "t" field within each array element for TIMESTAMP BY.
I was able to read the input and route it to the output with a simple SAQL statement in Stream Analytics; however, I am only interested in the "values" array above.
This is my first attempt. It does not like my 'TIMESTAMP BY' clause when I try to restart the Stream Analytics job:
SELECT
KepValues.ArrayValue.id,
KepValues.ArrayValue.v,
KepValues.ArrayValue.q,
KepValues.ArrayValue.t
INTO
[PowerBI-DS]
FROM
[IoTHub-Input] as event
CROSS APPLY GetArrayElements(event.[values]) as KepValues
TIMESTAMP BY KepValues.ArrayValue.t
==============================================================================
This is my 2nd attempt. It still does not like my 'TIMESTAMP BY' statement.
With [PowerBI-Modified-DS] As (
SELECT
KepValues.ArrayValue.id as ID,
KepValues.ArrayValue.v as V,
KepValues.ArrayValue.q as Q,
KepValues.ArrayValue.t as T
FROM
[IoTHub-Input] as event
CROSS APPLY GetArrayElements(event.[values]) as KepValues
)
SELECT
ID, V, Q, T
INTO
[PowerBI-DS]
FROM
[PowerBI-Modified-DS] TIMESTAMP BY T
After extraction, this is what I expect: a table with columns "id", "v", "q", "t", where each row holds a single array element, e.g.:
"SimCh01.Device01.Ramp1", 39, 1, 1559529359833
"SimCh01.Device01.Ramp2", 183.5, 1, 1559529359833
Added
Since then, I have modified the query as below to create a DateTime by converting the Unix time t into a DateTime:
With [PowerBI-Modified-DS] As (
SELECT
arrayElement.ArrayValue.id as ID,
arrayElement.ArrayValue.v as V,
arrayElement.ArrayValue.q as Q,
arrayElement.ArrayValue.t as TT
FROM
[IoTHub-Input] as iothubAlias
CROSS APPLY GetArrayElements(iothubAlias.data) as arrayElement
)
SELECT
ID, V, Q, DATEADD(millisecond, TT, '1970-01-01T00:00:00Z') as T
INTO
[SAJ-01-PowerBI]
FROM
[PowerBI-Modified-DS]
I managed to add DATEADD() to convert the Unix time into a DateTime and alias it as T. Now how can I add 'TIMESTAMP BY'? I did try to add it after [PowerBI-Modified-DS], but the editor complains that the insert is invalid. What else can I do, or is this the best I can do? I understand I need to set 'TIMESTAMP BY' so Power BI understands this is streaming data.
The TIMESTAMP BY clause in Stream Analytics requires a value of type DATETIME. String values conforming to ISO 8601 formats are supported. In your example, the value of 't' does not conform to this standard.
To use the TIMESTAMP BY clause in your case, you will have to pre-process the data before sending it to Stream Analytics (a rough sketch of this option is shown below), or change the source to create the event (specifically the field 't') using this format.
Stream Analytics assigns the timestamp before the query is executed, so the TIMESTAMP BY expression can only refer to fields in the input payload. You have two options.
You can have two ASA jobs: the first does the CROSS APPLY and the second does the TIMESTAMP BY.
You can implement a custom deserializer in C# (sign up for preview access). This way you can have one job that uses your implementation to read incoming events. Your deserializer will convert the Unix time to a DateTime, and this field can then be used in your TIMESTAMP BY clause.
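For the pre-processing route, here is a minimal sketch of the idea in Python, not a definitive implementation; it assumes you can rewrite each event before it reaches the input hub, and it follows the field names from the sample payload above:

import json
from datetime import datetime, timezone

def to_iso8601(unix_ms):
    # Convert epoch milliseconds to an ISO 8601 string, which TIMESTAMP BY accepts.
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc).isoformat()

def preprocess(event):
    # Rewrite each array element's 't' from epoch milliseconds to ISO 8601.
    for item in event["values"]:
        item["t"] = to_iso8601(item["t"])
    return event

event = json.loads('{"timestamp": 1559529369274, "values": [{"id": "SimCh01.Device01.Ramp1", "v": 39, "q": 1, "t": 1559529359833}]}')
print(json.dumps(preprocess(event), indent=2))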

Syntax error on U-SQL Script to get data from JSON complex type

This is my input JSON file.
[
{
"Tag": "STACK007",
"data": [
{
"item": "UNIFY109",
"timestamp": "2018-08-27T17:28:51.8490000Z",
"jsonVersion": 1,
"messageType": 1,
"velocity": 709
}
],
"EventProcessedUtcTime": "2018-08-27T17:36:17.5519639Z",
"EventEnqueuedUtcTime": "2018-08-27T17:28:52.0010000Z"
}
]
I'm trying to convert this input JSON file to CSV. Here's my U-SQL Script
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
DECLARE @input string = @"/demodata/logs/2018/08/input.json";
@json =
    EXTRACT
        Tag string,
        EventProcessedUtcTime DateTime,
        EventEnqueuedUtcTime DateTime,
        JsonFunctions.JsonTuple(data) AS data
    FROM @input
    USING new JsonExtractor();
@result =
    SELECT Tag,
           address["velocity"] AS Velocity,
           address["messageType"] AS MessageType,
           address["timestamp"] AS Timestamp
    FROM @json;
OUTPUT @result
TO "/output/demooutput.csv"
USING Outputters.Csv();
This script is giving me a syntax error with the message "syntax error. Expected one of: '.' "
How do I fix this?
I found that this had been answered previously:
@resultset =
    EXTRACT
        item string,
        jsonversion int,
        messageType int,
        velocity float
    FROM @input
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("data[*]");
This is from an answer by Michael Rys to this Stack Overflow question:
U-SQL Unable to extract data from JSON file
"Actually the JSONExtractor supports the rowpath parameter expressed in JSONPath that gives you the ability to identify the JSON object or JSON array items that you want to map into rows. So you can extract your data with a single statement from your JSON document"

Azure Data factory v2 copy activity source value is null sink value is not allowed null

My question is about how and where to perform the type conversion during the copy activity.
I have an Azure Data Factory pipeline which imports data from a TSV file in Data Lake Gen1 into a SQL Server database table.
The schema of the TSV file is: {QueryDate, Count1, Count2}
Count1 and Count2 may have no value.
example data in TSV file:
20180717 10
20180717 5 5
20180717 7 1
20180717 7
The schema of the SQL Server table is
{QueryDate(datetime2(7)), UserNumber(int), ActiveNumber(int)}
Both UserNumber and ActiveNumber have a NOT NULL constraint.
When I use a copy activity in my pipeline to copy the TSV data to the table, I get an error like this:
"errorCode": "2200", "message":
"'Type=System.InvalidOperationException, does not allow
DBNull.Value...
When Count1 or Count2 has no value, I want to use 0 to replace the null value, and I think that would avoid the error.
But I don't know where and how to perform this conversion.
Should the conversion be in the source dataset, the sink dataset, or the copy activity?
I also can't figure out the correct syntax for the conversion.
I tried setting the nullValue of the source dataset's format settings
to values such as null, 0, and NULL, but none of them work; I still get the error.
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"nullValue": "NULL",
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": false},
...
}
I also saw the question Azure Data Factory - can't convert from "null" to datetime field,
but it still does not solve my problem.
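For illustration only, a minimal pre-processing sketch in Python that replaces empty Count1/Count2 fields with 0 before the copy activity runs; it assumes the TSV can be rewritten first, and the file names input.tsv and output.tsv are hypothetical:

import csv

# Replace empty Count1/Count2 fields with "0" so the sink's NOT NULL columns accept them.
with open("input.tsv", newline="") as src, open("output.tsv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        # Pad short rows to three columns, then default the count columns to "0".
        row = row + [""] * (3 - len(row))
        query_date, count1, count2 = row[0], row[1] or "0", row[2] or "0"
        writer.writerow([query_date, count1, count2])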
