I am not able to correctly divide one time series by another.
I get data from my TestTable, which results in the following view:
TagId, sdata
8862, [0,0,0,0,2,2,2,3,4]
6304, [0,0,0,0,2,2,2,3,2]
I want to divide the sdata series for TagId 8862 by the series for TagId 6304.
I expect the following result:
[NaN,NaN,NaN,NaN,1,1,1,1,2]
When I try the code below, I only get two empty ddata rows in my S2 result:
TestTable
| where TagId in (8862,6304)
| make-series sdata = avg(todouble(Value)) default=0 on TimeStamp in range (datetime(2019-06-27), datetime(2019-06-29), 1m) by TagId
| as S1;
S1 | project ddata = series_divide(sdata[0].['sdata'], sdata[1].['sdata'])
| as S2
What am I doing wrong?
Both arguments to series_divide() can't come from two separate rows in the dataset.
Here's an example of how you could achieve that (based on the limited, and perhaps not fully representative, use case shown in your question):
let T =
    datatable(tag_id:long, sdata:dynamic)
    [
        8862, dynamic([0,0,0,0,2,2,2,3,4]),
        6304, dynamic([0,0,0,0,2,2,2,3,2]),
    ]
;
let get_value_from_T = (_tag_id:long)
{
    toscalar(
        T
        | where tag_id == _tag_id
        | take 1
        | project sdata
    )
};
print sdata_1 = get_value_from_T(8862), sdata_2 = get_value_from_T(6304)
| extend result = series_divide(sdata_1, sdata_2)
which returns:
|sdata_1 | sdata_2 | result |
|--------------------|---------------------|---------------------------------------------|
|[0,0,0,0,2,2,2,3,4] | [0,0,0,0,2,2,2,3,2] |["NaN","NaN","NaN","NaN",1.0,1.0,1.0,1.0,2.0]|
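Applying the same pattern to your original query could look roughly like this (a sketch only; get_series is a made-up helper name, and TestTable, Value and TimeStamp are taken from your question):

let S1 = TestTable
| where TagId in (8862, 6304)
| make-series sdata = avg(todouble(Value)) default=0 on TimeStamp in range(datetime(2019-06-27), datetime(2019-06-29), 1m) by TagId;
let get_series = (_tag_id:long)
{
    toscalar(
        S1
        | where TagId == _tag_id
        | take 1
        | project sdata
    )
};
print ddata = series_divide(get_series(8862), get_series(6304))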
I have two lists of different types, and I want to compare them and update the 'Check' value in list 2 if the 'Brand' from list 2 is found in list 1.
-------------------- --------------------
| Name | Brand | | Brand | Check |
-------------------- --------------------
| vga x | Asus | | MSI | X |
| vga b | Asus | | ASUS | - |
| mobo x | MSI | | KINGSTON | - |
| memory | Kingston| | SAMSUNG | - |
-------------------- --------------------
Usually I just do:
for (x in list1) {
    for (y in list2) {
        if (y.brand == x.brand) {
            y.check = true
        }
    }
}
Is there any simpler solution for that?
Since you're mutating the objects, it doesn't really get any cleaner than what you have. It can be done using any like this, but in my opinion it is not any clearer to read:
list2.forEach { bar ->
bar.check = bar.check || list1.any { it.brand == bar.brand }
}
The above is slightly more efficient than what you have since it inverts the iteration of the two lists so you don't have to check every element of list1 unless it's necessary. The same could be done with yours like this:
for (x in list2) {
    for (y in list1) {
        if (y.brand == x.brand) {
            x.check = true
            break
        }
    }
}
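If the lists get large, one further tweak (a sketch, assuming a mutable check property on the list 2 items, as in the snippets above) is to build the set of brands from list1 once, so each lookup is O(1) instead of a scan of list1:

// One pass over list1 to build the set, then one O(1) lookup per item in list2.
val brands = list1.map { it.brand }.toHashSet()
list2.forEach { it.check = it.check || it.brand in brands }

Note that, like all the snippets here, the comparison is case-sensitive, so "Asus" and "ASUS" won't match.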
data class Item(val name: String, val brand: String)

fun main() {
    val list1 = listOf(
        Item("vga_x", "Asus"),
        Item("vga_b", "Asus"),
        Item("mobo_x", "MSI"),
        Item("memory", "Kingston")
    )
    val list2 = listOf(
        Item("", "MSI"),
        Item("", "ASUS"),
        Item("", "KINGSTON"),
        Item("", "SAMSUNG")
    )

    // Get intersections
    val intersections = list1.map { it.brand }.intersect(list2.map { it.brand })
    println(intersections)
    // Returns => [MSI]

    // Has any intersections
    val intersected = list1.map { it.brand }.any { it in list2.map { it.brand } }
    println(intersected)
    // Returns => true
}
UPDATE: I just saw that this isn't a solution for your problem, but I'll leave it here.
I would like to use Spark to parse network messages and group them into logical entities in a stateful manner.
Problem Description
Let's assume each message is in one row of an input dataframe, depicted below.
| row | time | raw payload |
+-------+------+---------------+
| 1 | 10 | TEXT1; |
| 2 | 20 | TEXT2;TEXT3; |
| 3 | 30 | LONG- |
| 4 | 40 | TEXT1; |
| 5 | 50 | TEXT4;TEXT5;L |
| 6 | 60 | ONG |
| 7 | 70 | -TEX |
| 8 | 80 | T2; |
The task is to parse the logical messages in the raw payload, and provide them in a new output dataframe. In the example each logical message in the payload ends with a semicolon (delimiter).
The desired output dataframe could then look as follows:
| row | time | message |
+-------+------+---------------+
| 1 | 10 | TEXT1; |
| 2 | 20 | TEXT2; |
| 3 | 20 | TEXT3; |
| 4 | 30 | LONG-TEXT1; |
| 5 | 50 | TEXT4; |
| 6 | 50 | TEXT5; |
| 7 | 50 | LONG-TEXT2; |
Note that some input rows do not yield a new row in the result (e.g. rows 4, 6, 7, 8), and some yield even multiple rows (e.g. rows 2, 5).
My questions:
Is this a use case for a UDAF? If so, how for example should I implement the merge function? I have no idea what its purpose is.
Since the message ordering matters (I cannot process LONG-TEXT1 and LONG-TEXT2 properly without respecting the message order), can I tell Spark to parallelize perhaps on a higher level (e.g. per calendar day of messages) but not parallelize within a day (e.g. events at time 50, 60, 70, 80 need to be processed in order)?
Follow-up question: is it conceivable that the solution will be usable not just in traditional Spark, but also in Spark Structured Streaming? Or does the latter require its own kind of stateful processing method?
Generally, you can run arbitrary stateful aggregations on Spark Streaming by using mapGroupsWithState or flatMapGroupsWithState. You can find some examples here. None of those, though, will guarantee that the processing of the stream will be ordered by event time.
If you need to enforce data ordering, you should try to use window operations on event time. In that case, you need to run stateless operations instead, but if the number of elements in each window group is small enough, you can use collect_list for instance and then apply a UDF (where you can manage the state for each window group) on each list.
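If each window group is small enough, that could look roughly like this (a sketch only, based on the df from the question; it assumes the window is sized so that no logical message spans two windows, because anything left unterminated at a window's end is dropped):

import org.apache.spark.sql.functions._

// Hypothetical helper: concatenate the payload fragments of one window group
// in time order, then split at ';'. split(..., -1) keeps a trailing (possibly
// empty) unterminated segment, which dropRight(1) discards.
val parseMessages = udf { payloads: Seq[String] =>
  payloads.mkString.split(";", -1).dropRight(1).map(_ + ";")
}

val result = df
  .groupBy(window(col("time").cast("timestamp"), "1 day"))
  .agg(sort_array(collect_list(struct(col("time"), col("payload")))).as("rows"))
  .withColumn("messages", parseMessages(col("rows.payload")))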
OK, in the meantime I figured out how to do this with a UDAF.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class TagParser extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)

  // buffer(0): messages completed by the current row; buffer(1): the unfinished rest
  override def bufferSchema: StructType = StructType(
    StructField("parsed", ArrayType(StringType)) ::
    StructField("rest", StringType) :: Nil)

  override def dataType: DataType = ArrayType(StringType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = IndexedSeq[String]()
    buffer(1) = null
  }

  def doParse(str: String, buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = IndexedSeq[String]() // reset: emit only messages completed by this row
    val prevRest = buffer(1)
    var idx = -1
    val strToParse = if (prevRest != null) prevRest + str else str
    do {
      val oldIdx = idx
      idx = strToParse.indexOf(';', oldIdx + 1)
      if (idx == -1) {
        buffer(1) = strToParse.substring(oldIdx + 1) // keep the unfinished tail for the next row
      } else {
        val newlyParsed = strToParse.substring(oldIdx + 1, idx)
        buffer(0) = buffer(0).asInstanceOf[IndexedSeq[String]] :+ newlyParsed
        buffer(1) = null
      }
    } while (idx != -1)
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (input.isNullAt(0)) {
      return
    }
    doParse(input.getAs[String](0), buffer)
  }

  // merge combines partial aggregates; it is never called when the function is
  // applied over an ordered window, so it stays unsupported here
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    throw new UnsupportedOperationException

  override def evaluate(buffer: Row): Any = buffer(0)
}
Here is a demo app that uses the above UDAF to solve the problem from above:
import org.apache.spark.sql.expressions.Window

case class Packet(time: Int, payload: String)

object TagParserApp extends App {

  val spark, sc = ... // kept out for brevity

  val df = sc.parallelize(List(
    Packet(10, "TEXT1;"),
    Packet(20, "TEXT2;TEXT3;"),
    Packet(30, "LONG-"),
    Packet(40, "TEXT1;"),
    Packet(50, "TEXT4;TEXT5;L"),
    Packet(60, "ONG"),
    Packet(70, "-TEX"),
    Packet(80, "T2;")
  )).toDF()

  val tp = new TagParser
  val window = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
  val df2 = df.withColumn("msg", tp.apply(df.col("payload")).over(window))
  df2.show()
}
This yields:
+----+-------------+--------------+
|time| payload| msg|
+----+-------------+--------------+
| 10| TEXT1;| [TEXT1]|
| 20| TEXT2;TEXT3;|[TEXT2, TEXT3]|
| 30| LONG-| []|
| 40| TEXT1;| [LONG-TEXT1]|
| 50|TEXT4;TEXT5;L|[TEXT4, TEXT5]|
| 60| ONG| []|
| 70| -TEX| []|
| 80| T2;| [LONG-TEXT2]|
+----+-------------+--------------+
The main issue for me was to figure out how to actually apply this UDAF, namely using this:
df.withColumn("msg", tp.apply(df.col("payload")).over(window))
The only thing I still need to figure out is the parallelization aspect (which I only want to happen where we do not rely on ordering), but that's a separate issue for me.
I need to calculate additional features from a dataset using multiple leads and lags. The high number of leads and lags causes an out-of-memory error.
Data frame:
|----------+----------------+---------+---------+-----+---------|
| DeviceID | Timestamp | Sensor1 | Sensor2 | ... | Sensor9 |
|----------+----------------+---------+---------+-----+---------|
| | | | | | |
| Long | Unix timestamp | Double | Double | | Double |
| | | | | | |
|----------+----------------+---------+---------+-----+---------|
Window definition:
// Each window contains about 600 rows
val w = Window.partitionBy("DeviceID").orderBy("Timestamp")
Compute extra features:
var res = df
val sensors = (1 to 9).map(i => s"Sensor$i")
for (i <- 1 to 5) {
  for (s <- sensors) {
    res = res
      .withColumn(s"${s}_lag_$i", lag(s, i).over(w))
      .withColumn(s"${s}_lead_$i", lead(s, i).over(w))
  }
  // Compute features from all the lags and leads
  [...]
}
System info:
RAM: 16G
JVM heap: 11G
The code gives correct results with small datasets, but gives an out-of-memory error with 10GB of input data.
I think the culprit is the high number of window functions because the DAG shows a very long sequence of
Window -> WholeStageCodeGen -> Window -> WholeStageCodeGen ...
Is there any way to calculate the same features more efficiently?
For example, is it possible to get lag(Sensor1, 1), lag(Sensor2, 1), ..., lag(Sensor9, 1) without calling lag(..., 1) nine times?
If the answer to the previous question is no, then how can I avoid out-of-memory? I have already tried increasing the number of partitions.
You could try something like
res = res.select(col("*"), lag("Sensor1", 1).over(w), lag("Sensor1", 2).over(w), ...)
That is, write everything in a single select instead of many withColumn calls.
Then there will be only one Window in the plan. Maybe it helps with the performance.
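Expanding on that idea, you could build all 90 expressions programmatically and pass them to one select (a sketch based on the df and w from the question; the _lag_/_lead_ column names are made up):

import org.apache.spark.sql.functions._

// Build every lag/lead column expression first, then add them in a single
// select, so the plan contains one Window operator rather than one per withColumn.
val exprs = for {
  i <- 1 to 5
  s <- (1 to 9).map(n => s"Sensor$n")
  e <- Seq(lag(s, i).over(w).as(s"${s}_lag_$i"),
           lead(s, i).over(w).as(s"${s}_lead_$i"))
} yield e

val res = df.select(col("*") +: exprs: _*)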
I am trying to check whether an array is a subset of another, and to use that comparison in another query. I could get the comparison method working. However, if I use the compare method in another query I get an error saying "Left and right side of the relational operator must be scalars". This hints that comparearrays is not returning a scalar. Any ideas?
let x = parsejson('["a", "b", "c"]');
let y = parsejson('["a", "b", "c"]');
let z = parsejson('["b","a"]');
let comparearrays = (arr1:dynamic, arr2:dynamic)
{
    let arr1Length = arraylength(arr1);
    let total =
        range s from 0 to arr1Length-1 step 1
        | project dat = iff(arr1[s] in (arr2), true, false)
        | where dat == true
        | count;
    total | extend isEqual = iff(Count == arr1Length, 'true', 'false') | project tostring(isEqual)
};
//comparearrays(z, x)
datatable (i:int) [4] | project i | where comparearrays(x,y) == 'true'
You are correct in your understanding - the current implementation returns a table with a single row and single column, but have no fear - toscalar to the rescue:
let x = parsejson('["a", "b", "c"]');
let y = parsejson('["a", "b", "c"]');
let z = parsejson('["b","a"]');
let comparearrays = (arr1:dynamic, arr2:dynamic)
{
    let arr1Length = arraylength(arr1);
    let result =
        range s from 0 to arr1Length-1 step 1
        | project dat = iff(arr1[s] in (arr2), true, false)
        | where dat == true
        | count
        | extend isEqual = iff(Count == arr1Length, 'true', 'false')
        | project tostring(isEqual);
    toscalar(result)
};
//comparearrays(z, x)
datatable (i:int) [4] | project i | where comparearrays(x,y) == 'true'
You do have a bug in the comparearrays function, though: comparearrays(z, x) returns true, which is not correct. The function only checks that every element of arr1 appears in arr2, so it reports equality even when arr2 has extra elements...
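As an aside, if a subset check (rather than an equality check) is what you're really after, the built-in set functions can replace the helper entirely, assuming they are available on your cluster (a sketch; note that set_difference works on distinct values, so duplicates are ignored):

let x = parsejson('["a", "b", "c"]');
let z = parsejson('["b","a"]');
print isSubset = array_length(set_difference(z, x)) == 0 // true: every element of z appears in x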
I need to add two values together to create a third value with CQL. Is there any way to do this? My table has the columns number_of_x and number_of_y, and I am trying to create total. I did an update on the table with a SET command as follows:
UPDATE my_table
SET total = number_of_x + number_of_y ;
When I run that I get the message back saying:
no viable alternative at input ';'.
Per the docs, an assignment is one of:
column_name = value
set_or_list_item = set_or_list_item ( + | - ) ...
map_name = map_name ( + | - ) ...
map_name = map_name ( + | - ) { map_key : map_value, ... }
column_name [ term ] = value
counter_column_name = counter_column_name ( + | - ) integer
And you cannot mix counter and non-counter columns in the same table, so what you are describing is impossible in a single statement. But you can do a read before write:
CREATE TABLE my_table (total int, x int, y int, key text PRIMARY KEY);
INSERT INTO my_table (key, x, y) VALUES ('CUST_1', 1, 1);
SELECT * FROM my_table WHERE key = 'CUST_1';
key | total | x | y
--------+-------+---+---
CUST_1 | null | 1 | 1
UPDATE my_table SET total = 2 WHERE key = 'CUST_1' IF x = 1 AND y = 1;
[applied]
-----------
True
SELECT * FROM my_table WHERE key = 'CUST_1';
key | total | x | y
--------+-------+---+---
CUST_1 | 2 | 1 | 1
The IF clause will handle concurrency issues if x or y was updated since the SELECT. You can then retry if [applied] comes back False.
My recommendation in this scenario, however, is for your application to just read both x and y and do the addition locally, as it will perform MUCH better.
If you really want C* to do the addition for you, there is a sum aggregate function in 2.2+ but it will require updating your schema a little:
CREATE TABLE table_for_aggregate (key text, type text, value int, PRIMARY KEY (key, type));
INSERT INTO table_for_aggregate (key, type, value) VALUES ('CUST_1', 'X', 1);
INSERT INTO table_for_aggregate (key, type, value) VALUES ('CUST_1', 'Y', 1);
SELECT sum(value) from table_for_aggregate WHERE key = 'CUST_1';
system.sum(value)
-------------------
2
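If total only ever needs to be incremented rather than recomputed, the counter assignment form listed above is another option, though counters must live in a table of their own (a sketch; my_counters is a made-up name):

CREATE TABLE my_counters (key text PRIMARY KEY, total counter);

UPDATE my_counters SET total = total + 1 WHERE key = 'CUST_1';
UPDATE my_counters SET total = total + 1 WHERE key = 'CUST_1';

SELECT total FROM my_counters WHERE key = 'CUST_1'; -- total is now 2

Each such UPDATE is an atomic increment, so no read-before-write or IF clause is needed.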