Difference between 2 consecutive values in Kusto - azure

I have the following script:
let StartTime = datetime(2022-02-18 10:10:00 AM);
let EndTime = datetime(2022-02-18 10:15:00 AM);
MachineEvents
| where Timestamp between (StartTime .. EndTime)
| where Id == "00112233" and Name == "Higher"
| top 2 by Timestamp
| project Timestamp, Value
I got the following result:
What I am trying to achieve after that is to check if the last Value received (in this case for example it is 15451.433) is less than 30,000. If the condition is true, then I should check again the difference between the last two consecutive values (in this case : 15451.433 - 15457.083). If the difference is < 0 then I should return the Value as true, else it should return as false (by other words the Value should give a boolean value instead of double as shown in the figure)

datatable(Timestamp:datetime, Value:double)
[
datetime(2022-02-18 10:15:00 AM), 15457.083,
datetime(2022-02-18 10:14:00 AM), 15451.433,
datetime(2022-02-18 10:13:00 AM), 15433.333,
datetime(2022-02-18 10:12:00 AM), 15411.111
]
| top 2 by Timestamp
| project Timestamp, Value
| extend nextValue=next(Value)
| extend finalResult = iff(Value < 30000, nextValue - Value < 0, false)
| top 1 by Timestamp
| project finalResult
Output:
finalResult
1

You can use the prev() function (or next()) to process the values in the other rows.
...
| extend previous = prev(value)
| extend diff = value - previous
| extend isPositive = diff > 0
You might need to use serialize if you don't have something like top that already does that for you.

Related

Azure Kusto Query Language Count two row values as one

A value in my application logs changed a few weeks ago and now when I query the logs, I receive two different values in my count. I'm using Azure Logs for graphing so this is rather painful.
How I can query one row value as the other (or query the two row values together)?
For example
I'm counting the fruit column values ('PinkLadyApple', 'Orange', 'Banana'),
The PinkLadyApple is now a JazzApple and when I count the results for 90 days I receive
Name | Sum
PinkLadyApple | 100
JazzApple | 20
Orange | 150
Banana | 80
I'm looking for a solution that either renames the value in the response to just Apple or the two values can be combined under JazzApple
I will then render the results in a graph.
My query looks like
fruits
| where timestamp > ago(90d)
| where fruitname in ('PinkLadyApple', 'Orange', 'Banana')
| summarize FruitCount = sum(OtherColumn) by name
You can change the old value to the new one, using the iff() function, like this:
fruits
| where timestamp > ago(90d)
| where fruitname in ('JazzApple', 'PinkLadyApple', 'Orange', 'Banana')
| extend fruitname = iff(fruitname == 'JazzApple', 'PinkLadyApple', fruitname)
| summarize FruitCount = sum(OtherColumn) by name
Note that for best performance, I filter for the old and new value (which uses indexes), and only then changing the old value to the new one, just before the summarize operator.
And if you have multiple renamed values, then you can use the case() function instead, like this:
fruits
| where timestamp > ago(90d)
| where fruitname in ('JazzApple', 'PinkLadyApple', 'Orange', 'Banana', 'OldValue2', 'OldValue3')
| extend fruitname = case(
fruitname == 'JazzApple', 'PinkLadyApple',
fruitname == 'OldValue2', 'NewValue2',
fruitname == 'OldValue3', 'NewValue3',
fruitname)
| summarize FruitCount = sum(OtherColumn) by name

Extract a substring new column based on a substring based on conditions ideally with Pandas

I got a data set (Excel) with hundreds of entries. In one string column there is most of the information. The information is divided by '_' and typed in by humans. Therefore, it is not possible to work with index positions.
To create a usable data basis it's mandatory to extract information from this column in another column.
The search pattern = '*v*' is alone not enough. But combined with the condition that the first item has to be a digit it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more. But the best solution I received with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
'#Split the string in substrings
i = i.split('_')
for c in i:
if c.find('x') == 1:
if c[0].isdigit():
# print(c)
test.append(c)
else:
'#To be able to fix a few rows manually
test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+--+
| col | targetcol | |
+-----------+-----------+--+
| as_22v3 | 22v3 | |
| 33v55_bdd | 33v55 | |
| Ave_4v2 | 4v2 | |
+-----------+-----------+--+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message besides of course that it is not possible to assign the list to a DataFrame series since it has not the same size.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
You say it stops after 36 values? You say it is Excel file you are processing? One thing you could try is to save data set to .csv file and try to read this file in with pd.read_csv function. There are sometimes some extra characters in Excel file that are not easily visible.

COGNOS Report: COUNT IF

I am not sure how to go about creating a custom field to count instances given a condition.
I have a field, ID, that exists in two formats:
A#####
B#####
I would like to create two columns (one for A and one for B) and count instances by month. Something like COUNTIF ID STARTS WITH A for the first column resulting in something like below. Right now I can only create a table with the total count.
+-------+------+------+
| Month | ID A | ID B |
+-------+------+------+
| Jan | 100 | 10 |
+-------+------+------+
| Feb | 130 | 13 |
+-------+------+------+
| Mar | 90 | 12 |
+-------+------+------+
Define ID A as...
CASE
WHEN ID LIKE 'A%' THEN 1
ELSE 0
END
...and set the Default aggregation property to Total.
Do the same for ID B.
Apologies if I misunderstood the requirement, but you maybe able to spin the list into crosstab using the section off the toolbar, your measure value would be count(ID).
Try this
Query 1 to count A , filtering by substring(ID,1,1) = 'A'
Query 2 to count B , filtering by substring(ID,1,1) = 'B'
Join Query 1 and Query 2 by Year/Month
List by Month with Count A and Count B

Spark: How to join RDDs by time range

I have a delicate Spark problem, where i just can't wrap my head around.
We have two RDDs ( coming from Cassandra ). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is the two tables have an N:N relation ship. Actions contains multiple rows with the same id and so does Historic. Here are some example date from both tables.
Actions time is actually a timestamp
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic set_at is actually a timestamp
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables in a way, that we get a result like this
1 | 100 # 500 - 400 for Actions#1 with time 12:05 because Historic was in that time at 400
1 | 50 # 500 - 450 for Actions#2 with time 12:30 because H. was in that time at 450
2 | 50 # 125 - 75 for Actions#3 with time 12:30 because H. was in that time at 75
I can't come up with a good solution that feels right, without making a lot of iterations over huge datasets. I always have to think about making a range from the Historic set and then somehow check if the Actions fits in the range e.g (11:00 - 12:15) to make the calculation. But that seems to pretty slow to me. Is there any more efficient way to do that? Seems to me, that this kind of problem could be popular, but i couldn't find any hints on this yet. How would you solve this problem in spark?
My current attempts so far ( in half way done code )
case class Historic(id: String, set_at: Long, valueY: Int)
val historicRDD = sc.cassandraTable[Historic](...)
historicRDD
.map( row => ( row.id, row ) )
.reduceByKey(...)
// transforming to another case which results in something like this; code not finished yet
// (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
// From here we could join with Actions
// And then some .filter maybe to select the right Lists tuple
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes for Action(id, time, x) and Historic(id, time, y)
Join the actions with the history (this might be heavy)
filter all historic data not relevant for a given action
key the results by (id,time) - differentiate same key at different times
reduce the history by action to the max value, leaving us with relevant historical record for the given action
In Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
val filteredActionByidTime = actionByHistory.collect{ case (k,(action,historic)) if (action.time>historic.t) => ((action.id, action.time),(action,historic))}
val topHistoricByAction = filteredActionByidTime.reduceByKey{ case ((a1:Action,h1:Historic),(a2:Action, h2:Historic)) => (a1, if (h1.t>h2.t) h1 else h2)}
// we are done, let's produce a report now
val report = topHistoricByAction.map{case ((id,time),(action,historic)) => (id,time,action.X -historic.y)}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
After a few hours of thinking, trying and failing I came up with this solution. I am not sure if it is any good, but due the lack of other options, this is my solution.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
val set_at_map = new java.util.TreeMap[Long, Int]() // as it seems Scala doesn't provides something like this with similar operations we'll need a few lines later
set_at_map.put(0, valueY) // Means from the beginning of Epoch ...
set_at_map.put(set_at, valueY) // .. to the set_at date
// This is the fun part. With .getHistoricValue we can pass any timestamp and we will get the a value of the key back that contains the passed date. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
def getHistoricValue(date: Long) : Option[Int] = {
var e = set_at_map.floorEntry(date)
if (e != null && e.getValue == null) {
e = set_at_map.lowerEntry(date)
}
if ( e == null ) None else e.getValue()
}
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
.map( row => ( row.id, row ) )
.reduceByKey( (row1, row2) => {
row1.set_at_map.put(row2.set_at, row2.valueY) // we add the historic Events up to each id
row1
})
// Now we load the Actions and map it by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
.map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
.map( row => {
( row._1.id,
(
row._2._1.id,
row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // returns valueY for that timestamp
)
)
})
I am totally new to Scala, so please let me know if we could improve this code on some place.
I know that this question has been answered but I want to add another solution that worked for me -
your data -
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Union Actions and Historic
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse through the records per partition.
If it is a Historical record, add it to a map, or update the map if it already has that id - keep track of the latest valueY per id using a map per partition.
If it is a Action record, get the valueY value from the map and subtract it from valueX
A map M
Partition-1 traversal in order
M={ 1 -> 400} // A new entry in map M
1 | 100 // M(1) = 400; 500-400
M={1 -> 450} // update M, because key already exists
1 | 50 // M(1)
Partition-2 traversal in order
M={ 2 -> 50} // A new entry in M
M={ 2 -> 75} // update M, because key already exists
2 | 50 // M(2) = 75; 125-75
You could try to partition and sort by time, but you need to merge the partitions later. And that could add to some complexity.
This, I found it preferable to the many-to-many join that we usually get when using time ranges to join.

Detect overlapping ranges and correct then in oracle

Googling it a bit I found this to be an interesting question. Would like you guys shots.
Having my table
USER | MAP | STARTDAY | ENDDAY
1 | A | 20110101 | 20110105
1 | B | 20110106 | 20110110
2 | A | 20110101 | 20110107
2 | B | 20110105 | 20110110
Whant I want is to fix user's 2 case, where maps A and B overlaps by a couple days (from 20110105 until 20110107).
I wish I was able to query that table in a way that it never return overlapping ranges. My input data is falky already, so I don't have to worry with the conflict treatment, I just want to be able to get a single value for any given BETWEEN these dates.
Possible outputs for the query I'm trying to build would be like
USER | MAP | STARTDAY | ENDDAY
2 | B | 20110108 | 20110110 -- pushed overlapping days ahead..
2 | A | 20110101 | 20110104 -- shrunk overlapping range
It doesn't even matter if the algorithm causes "invalid ranges", e.g. Start = 20110105, End = 20110103, I'll just put null when I get to these cases.
What would you guys say? Any straight forward way to get this done?
Thanks!
f.
Analytic functions could help:
select userid, map
, case when prevend >= startday then prevend+1 else startday end newstart
, endday
from
( select userid, map, startday, endday
, lag(endday) over (partition by userid order by startday) prevend
from mytable
)
order by userid, startday
Gives:
USERID MAP NEWSTART ENDDAY
1 A 01/01/2011 01/05/2011
1 B 01/06/2011 01/10/2011
2 A 01/01/2011 01/07/2011
2 B 01/08/2011 01/10/2011

Resources