Kusto/Azure Data Explorer - How can I partition an external table using a timespan field?

Hoping someone can help.
I am new to Kusto and have to get an external table working that reads data from an Azure Blob storage account, but the one table I have is unique in that the data for the timestamp column is split into two separate columns, i.e. LogDate and LogTime (see script below).
My data is stored in the following structure in the Azure Storage account container (container is named "employeedata", for example):
{employeename}/{year}/{month}/{day}/{hour}/{minute}.csv, in a simple CSV format.
I know the CSV is good because if I import it into a normal Kusto table, it works perfectly.
My KQL script for the external table creation looks as follows:
.create-or-alter external table EmpLogs (Employee: string, LogDate: datetime, LogTime:timestamp)
kind=blob
partition by (EmployeeName:string = Employee, yyyy:datetime = startofday(LogDate), MM:datetime = startofday(LogDate), dd:datetime = startofday(LogDate), HH:datetime = todatetime(LogTime), mm:datetime = todatetime(LogTime))
pathformat = (EmployeeName "/" datetime_pattern("yyyy", yyyy) "/" datetime_pattern("MM", MM) "/" datetime_pattern("dd", dd) "/" substring(HH, 0, 2) "/" substring(mm, 3, 2) ".csv")
dataformat=csv
(
h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
I am constantly getting the error below, which is not very helpful (redacted from the full error; it basically comes down to a syntax error somewhere):
Syntax error: Query could not be parsed: {
"error": {
"code": "BadRequest_SyntaxError",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.Data.Exceptions.SyntaxException",
"#message": "Syntax error: Query could not be parsed: . Query: '.create-or-alter external table ........
I know the todatetime() function works on timespans; I tested it with another table and it created a date similar to the following: 0001-01-01 20:18:00.0000000.
I have tried using the bin() function on the timestamp/LogTime columns, but got the same error as above. I even tried importing the time value as a string and doing some string manipulation on it, with no luck: I get the same syntax error.
Any help/guidance would be greatly appreciated.
Thank you!!

Currently, there's no way to define an external table partition based on more than one column. If your dataset's timestamp is split between two columns, LogDate:datetime and LogTime:timespan, then the best you can do is use a virtual column for the partitioning by time:
.create-or-alter external table EmpLogs(Employee: string, LogDate:datetime, LogTime:timespan)
kind=blob
partition by (EmployeeName:string = Employee, PartitionDate:datetime)
pathformat = (EmployeeName "/" datetime_pattern("yyyy/MM/dd/HH/mm", PartitionDate))
dataformat=csv
(
//h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
Now you can filter by the virtual column and fine-tune using LogTime:
external_table("EmpLogs")
| where Employee in ("John Doe", ...)
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime ...
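If you are querying this from Python, here is a minimal, hedged sketch using the azure-kusto-data client; the cluster URI, database name, authentication method, and the concrete LogTime filter are placeholders rather than details from the question:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI and database; Azure CLI authentication is just one option.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

query = '''
external_table("EmpLogs")
| where Employee == "John Doe"
| where PartitionDate between (datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime > 10h   // illustrative LogTime filter
'''

response = client.execute("<database>", query)   # <database> is a placeholder
for row in response.primary_results[0]:
    print(row)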

Related

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in Snowflake as:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we are trying to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake. Here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also think about defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED.
Educated guess: when performing the insert with
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
that is string interpolation. The ct is provided as a string representation of a datetime, which does not match the TIMESTAMP data type, hence the error.
I would suggest using proper variable binding instead:
ctx.cursor().execute("INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
"(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
"VALUES(:1, :2, :3)",
(accountId,
risk_score,
("TIMESTAMP_LTZ", ct)
)
);
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
"INSERT INTO testtable(col1, col2) "
"VALUES({col1}, '{col2}')".format(
col1=789,
col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
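For comparison, here is a minimal, safe sketch of qmark binding with the Snowflake connector; it reuses the table and values from the unsafe example above and assumes con is an open connection:

import snowflake.connector
snowflake.connector.paramstyle = 'qmark'   # must be set before connect() is called

# con = snowflake.connector.connect(...)   # connection details omitted
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) VALUES (?, ?)",
    (789, 'test string3')
)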
You forgot to place the quotes before and after the {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

I want to use the SQL_VARIANT datatype in an Azure SQL external table and I get an "Index was out of range" error

I have two SQL Azure databases - DatabaseA and DatabaseB on a server hosted in Azure.
I need to access a view on DatabaseA from DatabaseB - namely, I need the sys.identity_columns in DatabaseA to be available to me on DatabaseB. So I am creating an external table on DatabaseB that links to this information like this (I didn't include all the columns, but I included the one causing the problem):
CREATE EXTERNAL TABLE [SOURCE_SYS].[identity_columns](
[object_id] int not null
,[name] nvarchar(128) null
,[column_id] int not null
,[system_type_id] tinyint not null
,[seed_value] sql_variant null
)
WITH
(
DATA_SOURCE = MyElasticDBQueryDataSrc,
SCHEMA_NAME = 'sys',
OBJECT_NAME = 'identity_columns'
);
When I run this - it works. But when I try to use the result - select * from [SOURCE_SYS].[identity_columns] - I get this error:
Msg 46823, Level 16, State 1, Line 50
Error retrieving data from MyServer.database.windows.net.DatabaseA. The underlying error message received was: 'Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index'.
If I comment out the fields in this table that have the sql_variant datatypes - it works fine but I do need the information in that field and the other two sql_variant fields that exist in the same table. MyElasticDBQueryDataSrc works fine on other similar tables without the sql_variant type.
Can anyone suggest what I might be doing wrong? Or suggest a workaround? I tried using bigints as it is mostly seed values that are either integers or null but that didn't work because it told me it wasn't the same datatype.
Any help much appreciated.
Well - after a weekend of sleep I figured out the answer!
If you use nvarchar(30) in the external table definition, you can then convert it to a bigint in any query you use it in:
CREATE EXTERNAL TABLE [SOURCE_SYS].[identity_columns](
[object_id] int not null
,[name] nvarchar(128) null
,[column_id] int not null
,[system_type_id] tinyint not null
,[seed_value] nvarchar(30) null
)
WITH
(
DATA_SOURCE = MyElasticDBQueryDataSrc,
SCHEMA_NAME = 'sys',
OBJECT_NAME = 'identity_columns'
);
Now I can access the value like this:
select cast(isnull([seed_value], 0) as bigint) from SOURCE_SYS.identity_columns
Beware that if you do a select * from the table, you will need to handle the sql_variant columns separately from the rest of the query, or you'll get this error:
Msg 46825, Level 16, State 1, Line 58
The data type of the column 'seed_value' in the external table is different than the column's data type in the underlying standalone or sharded table present on the external source.
Hope this is helpful to someone!

Data Factory V2 Query Azure Table Storage but use a lookup Value

I have a SQL watermark table which contains the last date in my destination table.
My source data is coming from an Azure Storage Table, and the date time is a string.
I set up the date time in the watermark table to match the format in the Azure Table storage.
I create a lookup and a copy task.
If I hard-code the date into the source query and run it, this works fine: CreatedAt ge '2019-03-06T14:03:11.000Z'
But obviously I don't want to hard-code this value; I want to use the date from the lookup.
But when I replace the hard-coded date with the lookup value
CreatedAt ge 'activity('LookupWatermarkOld').output'
I get an error
{
"errorCode": "2200",
"message":"ErrorCode=FailedStorageOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A
storage operation failed with the following error 'The remote server returned an error: (400) Bad Request.'.,Source=,
''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=Syntax
error at position 42 in 'CreatedAt ge 'activity('LookupWatermarkOld').output''.\nRequestId:8c65ced9-b002-0051-79d9-d41d49000000\nTime:2019-03-07T11:35:39.0640233Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,'",
"failureType": "UserError",
"target": "CopyMentions"
}
Can anyone help me with this? How do you use the Lookup value in an Azure Table query?
Check this out:
1) Lookup activity. Query field:
SELECT MAX(WatermarkColumnName) as LastId FROM TableName;
Also, make sure that you checked the "First row only" option.
2) In the Copy Data activity, use a query. Query field:
#concat('SELECT * FROM TableName as s WHERE s.WatermarkColumnName > ''', activity('LookupActivity').output.firstRow.LastID, '''')
Finally I got some help on this and it works with
CreatedAt gt '#{activity('LookupWatermarkOld').output.firstRow.WaterMarkValue}'
WaterMarkValue is the column name from the SQL lookup table.
The Lookup returns an array, so you have to specify firstRow from this array,
and wrap it in '' so it's used as a string value.
For recent ADF v2:
Use the watermark/lookup output value in a parameter.
Example: ParamUserCount = #{activity('LookupActivity').output.count}
(or another output property, depending on what the lookup returns)
You can then use it in the query, for example:
"select * from userDetails where usercount = {$ParamUserCount}"
Make sure you enclose the query in " " so it is treated as a string, and that the parameter inside the query is enclosed in { }.
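Whichever syntax you use, the expression only has to produce a plain filter string by the time it reaches Table storage. A small, purely illustrative Python sketch of the difference between the broken and the working filter (the watermark value is made up):

# Illustrative only: the filter string that ultimately reaches Azure Table storage.
watermark = "2019-03-06T14:03:11.000Z"   # value returned by the Lookup activity

# Broken: the expression text itself ends up inside the quotes (this is what the 400 error shows)
broken_filter = "CreatedAt ge 'activity('LookupWatermarkOld').output'"

# Working: the lookup value is interpolated first, then wrapped in single quotes
working_filter = f"CreatedAt gt '{watermark}'"
print(working_filter)   # CreatedAt gt '2019-03-06T14:03:11.000Z'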

Polybase - maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed

[Question from customer]
I have following data in a text file. Delimited by |
A | null , ZZ
C | D
When I run this query using HDInsight:
CREATE EXTERNAL TABLE myfiledata(
col1 string,
col2 string
)
row format delimited fields terminated by '|' STORED AS TEXTFILE LOCATION 'wasb://.....';
I get the following result as expected:
A null , ZZ
C D
But when I run the same query using SQL DW Polybase, it throws an error:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
How do I fix this?
Here's my script in SQL DW:
-- Creating external data source (Azure Blob Storage)
CREATE EXTERNAL DATA SOURCE azure_storage1
WITH
(
TYPE = HADOOP
, LOCATION ='wasbs://....blob.core.windows.net'
, CREDENTIAL = ASBSecret
)
;
-- Creating external file format (delimited text file)
CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR ='|'
, USE_TYPE_DEFAULT = TRUE
)
)
;
-- Creating external table pointing to file stored in Azure Storage
CREATE EXTERNAL TABLE [Myfile]
(
Col1 varchar(5),
Col2 varchar(5)
)
WITH
(
LOCATION = '/myfile.txt'
, DATA_SOURCE = azure_storage1
, FILE_FORMAT = text_file_format
)
;
We’re currently working on a way to bubble up the reason for reject to the user.
In the meantime, here's what's happening:
The default number of rows allowed to fail schema matching is 0. This means that if at least one of the rows you're loading from /myfile.txt doesn't match the schema, the query is aborted. In Hive, strings can accommodate an arbitrary number of characters, but varchars cannot. In this case it's failing on the varchar(5) for "null , ZZ" because that is more than 5 characters.
If you'd like to let the other row through, change the REJECT_VALUE in the CREATE EXTERNAL TABLE call; more info can be found here: https://msdn.microsoft.com/library/dn935021(v=sql.130).aspx
It's due to a dirty record for the respective file format. For example, in the case of Parquet, if a column contains '' (an empty string), it won't work and will throw "Query aborted -- the maximum reject threshold...".
Note: A query on an external table can fail with the error "Query aborted -- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if its actual data types or number of columns do not match the column definitions of the external table, or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and that your external data conforms to these definitions. If a subset of the external data records is dirty, you can choose to reject these records for your queries by using the reject options in the CREATE EXTERNAL TABLE DDL.

Azure stream analytics - Joining on a csv file returns 0 rows

I have the following query:
SELECT
[VanList].deviceId
,[VanList].[VanName]
events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [timestamp]
join [VanList] on events.DeviceId = [VanList].deviceId
where iot-EventHub is my event hub and VanList is a reference list (CSV file) that has been uploaded to Azure storage.
I have tried uploading sample data to test the query, but it always returns 0 rows.
Below is a sample of the JSON captured by my Event Hub Input
[
{
"DeviceId":1,
"Timestamp":"2015-06-29T12:15:18.0000000",
"ExternalTemp":9,
"InternalTemp":8,
"Humidity":43,
"Latitude":51.3854942,
"Longitude":-1.12774682,
"EventProcessedUtcTime":"2015-06-29T12:25:46.0932317Z",
"PartitionId":1,
"EventEnqueuedUtcTime":"2015-06-29T12:15:18.5990000Z"
} ]
Below is a sample of my CSV reference data.
deviceId,VanName
1,VAN 1
2,VAN 2
3,Standby Van
Both lists contain a device id of 1, so I am expecting my query to be able to join the two together.
I have tried using both "inner join" and "join" in my query syntax, but neither results in a successful join.
What is wrong with my Stream Analytics query?
Try adding a CAST function in the join. I'm not sure why that works while adding a CREATE TABLE clause for the VanList reference data input doesn't accomplish the same thing, but I think this works:
SELECT
[VanList].deviceId
,[VanList].[VanName]
,events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [Timestamp]
join [VanList] on events.DeviceId = cast([VanList].deviceId as bigint)
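To see why the cast matters, here is a small, purely illustrative Python sketch of the type mismatch (plain Python, not the Stream Analytics engine): the event's DeviceId arrives as a JSON number, while the CSV reference column is read as text, so a strict equality join finds no matches until one side is converted.

# Illustrative only; not the ASA engine.
events = [{"DeviceId": 1, "ExternalTemp": 9}]          # JSON numbers arrive as integers
van_list = [{"deviceId": "1", "VanName": "VAN 1"}]     # CSV reference data is read as strings

no_cast = [(e, v) for e in events for v in van_list if e["DeviceId"] == v["deviceId"]]
print(len(no_cast))    # 0 -> the join never matches 1 == "1"

with_cast = [(e, v) for e in events for v in van_list if e["DeviceId"] == int(v["deviceId"])]
print(len(with_cast))  # 1 -> matches once the reference column is cast to a number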
The only thing I can see is that you are missing a comma in your original query; otherwise it looks correct. I would try recreating the Stream Analytics job. Here is another example that worked for me.
SELECT
countryref.CountryName as Geography,
input.GeographyId as GeographyId
into [country-out]
FROM input timestamp by [TransactionDateTime]
Join countryref
on countryref.GeographyID = input.GeographyId
Input data example
{"pageid":801,"firstname":"Gertrude","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"2ggnC"}
{"pageid":801,"firstname":"Venice","geographyid":1,"itemid":10,"itemprice":169.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"xLyOp"}
{"pageid":801,"firstname":"Christinia","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"VuycQ"}
{"pageid":801,"firstname":"Dorethea","geographyid":4,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"tgvQP"}
{"pageid":801,"firstname":"Dwain","geographyid":4,"itemid":4,"itemprice":129.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"O5TwV"}
Country ref data
[
{
"GeographyID":1,
"CountryName":"USA"
},
{
"GeographyID":2,
"CountryName":"China"
},
{
"GeographyID":3,
"CountryName":"Brazil"
},
{
"GeographyID":4,
"CountryName":"Andrews country"
},
{
"GeographyID":5,
"CountryName":"Chile"
}
]
