Kusto: number of overlapping intervals in a time range (Azure)

I'm trying to write a Kusto query that counts how many intervals overlap for a certain date range. This is what my table looks like:
| userID | interval1       | interval2       |
|--------|-----------------|-----------------|
| 24     | 21.1.2012 10:40 | 21.1.2012 11:00 |
| 25     | 21.1.2012 9:55  | 21.1.2012 10:50 |
I would like to consider the time range given by [min(interval1), max(interval2)] with a 1s step, and for each timestamp in this range I would like to know how many intervals from the table above overlap it. For example, at 21.1.2012 10:00 only one interval is active, but at 10:45 two intervals overlap.
Thank you

Every interval1 indicates an additional user session start (+1).
Every interval2 indicates a user session end (-1).
The accumulated sum indicates the number of active sessions.
Solution 1 (Rendering level)
render timechart with (accumulate=True)
let t = (datatable (userID:int,interval1:datetime,interval2:datetime)
[
24 ,datetime(2012-01-21 10:40) ,datetime(2012-01-21 11:00)
,25 ,datetime(2012-01-21 09:55) ,datetime(2012-01-21 10:50)
]);
let from_dttm = datetime(2012-01-21 09:30);
let to_dttm = datetime(2012-01-21 11:30);
let sessions_starts = (t | project delta = 1, dttm = interval1);
let sessions_ends = (t | project delta = -1, dttm = interval2);
union sessions_starts, sessions_ends
| make-series delta = sum(delta) on dttm from from_dttm to to_dttm step 1s
| render timechart with (accumulate=True)
Solution 2 (Data level)
mv-apply + row_cumsum
let t = (datatable (userID:int,interval1:datetime,interval2:datetime)
[
24 ,datetime(2012-01-21 10:40) ,datetime(2012-01-21 11:00)
,25 ,datetime(2012-01-21 09:55) ,datetime(2012-01-21 10:50)
]);
let from_dttm = datetime(2012-01-21 09:30);
let to_dttm = datetime(2012-01-21 11:30);
let sessions_starts = (t | project delta = 1, dttm = interval1);
let sessions_ends = (t | project delta = -1, dttm = interval2);
union sessions_starts, sessions_ends
| make-series delta = sum(delta) on dttm from from_dttm to to_dttm step 1s
| mv-apply delta to typeof(long), dttm to typeof(datetime) on (project active_users = row_cumsum(delta), dttm)
| render timechart with (xcolumn=dttm, ycolumns=active_users)

Take a look at this sample from the Kusto docs:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/samples?pivots=azuredataexplorer#chart-concurrent-sessions-over-time
X
| mv-expand samples = range(bin(interval1, 1m), interval2, 1m)
| summarize count_userID = count() by bin(todatetime(samples), 1m)
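For reference, here is a minimal sketch of that docs sample applied to the table from the question (X stands for your source table; 1-minute granularity is used here, replace 1m with 1s for per-second counts):
let t = datatable (userID:int, interval1:datetime, interval2:datetime)
[
    24, datetime(2012-01-21 10:40), datetime(2012-01-21 11:00),
    25, datetime(2012-01-21 09:55), datetime(2012-01-21 10:50),
];
t
// expand each session into one sample per minute between interval1 and interval2
| mv-expand samples = range(bin(interval1, 1m), interval2, 1m)
// count how many sessions cover each minute
| summarize count_userID = count() by bin(todatetime(samples), 1m)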

Related

Sphinx Results Take Huge Time To Show (Slow Index)

I'm new to Sphinx. I have a simple table tbl_urls with two columns (domain_id, url).
I created my index as below to get the domain id and the number of URLs for any given keyword:
source src2
{
type = mysql
sql_host = 0.0.0.0
sql_user = spnx
sql_pass = 123
sql_db = db_spnx
sql_port = 3306 # optional, default is 3306
sql_query = select id,domain_id,url from tbl_domain_urls
sql_attr_uint = domain_id
sql_field_string = url
}
index url_tbl
{
source = src2
path =/var/lib/sphinx/data/url_tbl
}
indexer
{
mem_limit = 2047M
}
searchd
{
listen = 0.0.0.0:9312
listen = 0.0.0.0:9306:mysql41
listen = /home/charlie/sphinx-3.4.1/bin/searchd.sock:sphinx
log = /var/log/sphinx/sphinx.log
query_log = /var/log/sphinx/query.log
read_timeout = 5
max_children = 30
pid_file = /var/run/sphinx/sphinx.pid
max_filter_values = 20000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
workers = threads # for RT indexes to work
binlog_path = /var/lib/sphinx/data
max_batch_queries = 128
}
The problem is that the time taken to show results is over a minute.
SELECT domain_id,count(*) as url_counter
FROM url_tbl WHERE MATCH('games')
group by domain_id limit 1000000 OPTION max_matches=1000000;show meta;
+-----------+-------------+
| domain_id | url_counter |
+-----------+-------------+
|      9900 |         444 |
|     41309 |          48 |
|     62308 |         491 |
|     85798 |         401 |
|       595 |        4851 |
13545 rows in set (3 min 22.56 sec)
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| total | 13545 |
| total_found | 13545 |
| time | 1.406 |
| keyword[0] | games |
| docs[0] | 456667 |
| hits[0] | 514718 |
+---------------+--------+
The table tbl_domain_urls has 100,821,614 rows.
Dedicated server: HP ProLiant, 2x L5420, 16 GB RAM, 2x 1 TB HDD.
I need your support to optimize my query or config settings; I need the results in the lowest time possible and would really appreciate any new ideas to test.
Note:
I tried a distributed index to use multiple cores for processing, without any noticeable results.

Best way to show today vs yesterday vs week in KQL (Azure Monitor)

I am trying to show the count for today (rolling 24 hours) vs yesterday (again rolling) vs the weekly average.
I've got the code to work, but I am getting an error as well.
The error is (Query succeeded with warnings: There were some errors when processing your query.: "Partial query failure: Unspecified error (message: 'shard: 5eeb9282-0854-4569-a674-10f8daef9f7d, source: (Error { details: Rest(404, "HEAD qh1kustorageoiprdweu16.blob.core.windows.net/jgtb64673c4e98a07fa116b4e49211-0d2a81b5bf3540e087ff2cc0e4e57c98/13da174e-3951-4b54-9a45-1f9cbe5759b4/426a5a10-4e91-4...")
The code
let Yes_End = ago(24h);
let Yes_Start = ago(48h);
let N = ago(1m);
let LW_end = ago(14d);
let Lw_start = ago(7d);
let Curr = customMetrics
|extend Dec_Reasion = tostring(customDimensions["DeclineReason"])
|extend Type = tostring(customDimensions["AcquiringInstitutionId"])
|extend dw = dayofweek(timestamp)
|where name =='TransactionsDeclined'
|where timestamp between (Yes_End..N)
|summarize CurrentVal=sum(valueCount) by tostring(Dec_Reasion);
let Trend = customMetrics
|extend Dec_Reasion = tostring(customDimensions["DeclineReason"])
|extend Type = tostring(customDimensions["AcquiringInstitutionId"])
|where timestamp between (Yes_Start .. Yes_End)
|where name =='TransactionsDeclined'
|summarize Yesterday_total=sum(valueCount) by tostring(Dec_Reasion);
let weekTrend =customMetrics
|extend Dec_Reasion = tostring(customDimensions["DeclineReason"])
|extend Type = tostring(customDimensions["AcquiringInstitutionId"])
|extend dw = dayofweek(timestamp)
|where toint(dw) <6
|where timestamp between (LW_end .. Lw_start)
|where name =='TransactionsDeclined'
|summarize Week_Avg=sum(valueCount)/5 by tostring(Dec_Reasion) ;
Curr
|join kind=leftouter Trend on Dec_Reasion
|join kind=leftouter weekTrend on Dec_Reasion
|project Dec_Reasion,CurrentVal,Yesterday_total,Week_Avg
This query can be written in a way that does not require joins.
You might want to give it a try.
let Yes_End = ago(24h);
let Yes_Start = ago(48h);
let N = ago(1m);
let LW_end = ago(14d);
let Lw_start = ago(7d);
customMetrics
| where timestamp between (LW_end .. Lw_start)
or timestamp between (Yes_Start .. N)
| where name == 'TransactionsDeclined'
| extend Dec_Reasion = tostring(customDimensions["DeclineReason"])
,Type = tostring(customDimensions["AcquiringInstitutionId"])
| summarize CurrentVal = sumif(valueCount, timestamp between (Yes_End .. N))
,Yesterday_total = sumif(valueCount, timestamp between (Yes_Start .. Yes_End))
,Week_Avg = sumif(valueCount, timestamp between (LW_end .. Lw_start) and toint(dayofweek(timestamp)) < 6) / 5
by Dec_Reasion

How to use series_divide() in Kusto?

I am not able to correctly divide one time series by another.
I get data from my TestTable, which results in the following view:
| TagId | sdata               |
|-------|---------------------|
| 8862  | [0,0,0,0,2,2,2,3,4] |
| 6304  | [0,0,0,0,2,2,2,3,2] |
I want to divide the sdata series for TagId 8862 by the series for TagId 6304.
I expect the following result:
[NaN,NaN,NaN,NaN,1,1,1,1,2]
When I try the below code, I only get two empty ddata rows in my S2 results
TestTable
| where TagId in (8862,6304)
| make-series sdata = avg(todouble(Value)) default=0 on TimeStamp in range (datetime(2019-06-27), datetime(2019-06-29), 1m) by TagId
| as S1;
S1 | project ddata = series_divide(sdata[0].['sdata'], sdata[1].['sdata'])
| as S2
What am I doing wrong?
Both arguments to series_divide() can't come from two separate rows in the dataset.
Here's an example of how you could achieve that (based on the limited, and perhaps not fully representative of your real use case, data shown in your question):
let T =
datatable(tag_id:long, sdata:dynamic)
[
8862, dynamic([0,0,0,0,2,2,2,3,4]),
6304, dynamic([0,0,0,0,2,2,2,3,2]),
]
;
let get_value_from_T = (_tag_id:long)
{
toscalar(
T
| where tag_id == _tag_id
| take 1
| project sdata
)
};
print sdata_1 = get_value_from_T(8862), sdata_2 = get_value_from_T(6304)
| extend result = series_divide(sdata_1, sdata_2)
which returns:
| sdata_1             | sdata_2             | result                                        |
|---------------------|---------------------|-----------------------------------------------|
| [0,0,0,0,2,2,2,3,4] | [0,0,0,0,2,2,2,3,2] | ["NaN","NaN","NaN","NaN",1.0,1.0,1.0,1.0,2.0] |

Spark out of memory with a large number of window functions (lag, lead)

I need to calculate additional features from a dataset using multiple leads and lags. The high number of leads and lags causes an out-of-memory error.
Data frame:
|----------+----------------+---------+---------+-----+---------|
| DeviceID | Timestamp | Sensor1 | Sensor2 | ... | Sensor9 |
|----------+----------------+---------+---------+-----+---------|
| | | | | | |
| Long | Unix timestamp | Double | Double | | Double |
| | | | | | |
|----------+----------------+---------+---------+-----+---------|
Window definition:
// Each window contains about 600 rows
val w = Window.partitionBy("DeviceID").orderBy("Timestamp")
Compute extra features:
var res = df
val sensors = (1 to 9).map(i => s"Sensor$i")
for (i <- 1 to 5) {
for (s <- sensors) {
// column names below are illustrative; withColumn needs an explicit output name
res = res.withColumn(s"${s}_lag_$i", lag(s, i).over(w))
.withColumn(s"${s}_lead_$i", lead(s, i).over(w))
}
// Compute features from all the lag's and lead's
[...]
}
System info:
RAM: 16G
JVM heap: 11G
The code gives correct results with small datasets, but gives an out-of-memory error with 10GB of input data.
I think the culprit is the high number of window functions because the DAG shows a very long sequence of
Window -> WholeStageCodeGen -> Window -> WholeStageCodeGen ...
Is there any way to calculate the same features more efficiently?
For example, is it possible to get lag(Sensor1, 1), lag(Sensor2, 1), ..., lag(Sensor9, 1) without calling lag(..., 1) nine times?
If the answer to the previous question is no, then how can I avoid out-of-memory? I have already tried increasing the number of partitions.
You could try something like
res = res.select(col("*"), lag("Sensor1", 1).over(w), lag("Sensor1", 2).over(w), ...)
That is, write everything in a single select instead of many withColumn calls.
Then there will be only one Window node in the plan. Maybe it helps with the performance.
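As a rough sketch (assuming the df, w and sensors definitions from the question; the output column names like Sensor1_lag_1 are just illustrative), the single-select approach could look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

val w = Window.partitionBy("DeviceID").orderBy("Timestamp")
val sensors = (1 to 9).map(i => s"Sensor$i")

// Build every lag/lead column expression up front ...
val shifted = for {
  s <- sensors
  i <- 1 to 5
  c <- Seq(lag(s, i).over(w).as(s"${s}_lag_$i"),
           lead(s, i).over(w).as(s"${s}_lead_$i"))
} yield c

// ... and add them all in one select; since every expression shares the same
// window spec, the plan should contain a single Window node
val res = df.select((col("*") +: shifted): _*)
Because all the expressions use the same window spec, Spark can sort each partition once and evaluate all the shifted columns in one pass.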

How to calculate hours active between two timestamps

If I have a DataFrame with two Timestamps, called 'start' and 'end', how can I calculate a list of all the hours between 'start' and 'end'?
Another way to say this might be: "during which hours was the record active"?
For example:
// Input
| start| end|
|2017-06-01 09:30:00|2017-06-01 11:30:00|
|2017-06-01 14:00:00|2017-06-01 14:30:00|
// Result
| start| end|hours_active|
|2017-06-01 09:30:00|2017-06-01 11:30:00| (9,10,11)|
|2017-06-01 14:00:00|2017-06-01 14:30:00| (14)|
Thanks
If the difference between the start and end is always less than 24 hours, you can use the following UDF. Assuming the type of the columns is Timestamp:
val getActiveHours = udf((s: Long, e: Long) => {
if (e >= s) {
val diff = e - s
(s to (s+diff)).toSeq
} else {
// the end is in the next day
(s to 23).toSeq ++ (0L to e).toSeq
}
})
df.withColumn("hours_active", getActiveHours(hour($"start"), hour($"end")))
Using the example data in the question gives:
+---------------------+---------------------+------------+
|start |end |hours_active|
+---------------------+---------------------+------------+
|2017-06-01 09:30:00.0|2017-06-01 11:30:00.0|[9, 10, 11] |
|2017-06-01 14:00:00.0|2017-06-01 14:30:00.0|[14] |
+---------------------+---------------------+------------+
Note: For larger differences between the timestamps the above code can be adjusted to take that into account. It would then be necessary to look at other fields in addition to the hour, e.g. day/month/year.
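A hedged sketch of that adjustment (not from the original answer): instead of passing only the hour of day, pass the full timestamps and emit one value per hour bucket the interval touches, which handles spans longer than 24 hours without any wrap-around logic. The helper name getActiveHourBuckets is hypothetical.
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

// Emit one timestamp per hour boundary covered by [start, end].
val getActiveHourBuckets = udf((s: Timestamp, e: Timestamp) => {
  val hourMs = 3600L * 1000L
  val first = s.getTime / hourMs   // hour bucket containing 'start'
  val last  = e.getTime / hourMs   // hour bucket containing 'end'
  (first to last).map(h => new Timestamp(h * hourMs)).toSeq
})

df.withColumn("hours_active", getActiveHourBuckets($"start", $"end"))
This returns hour-bucket timestamps (e.g. 2017-06-01 09:00:00) rather than bare hour numbers, so multi-day ranges stay unambiguous.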
