I have the following two queries, which I am running in the Azure Kusto online query editor
(available at this link: https://dataexplorer.azure.com/clusters/help/databases/Samples)
//okay this is lag related code.
//but we need to serialize first in some way, in other words sort it
StormEvents
| order by StartTime
| extend LaggedOutput = next(State, 2, "NOTHING FOUND")
| project State, LaggedOutput;
//let's try coalesce
//next() inside the coalesce returns an empty string, and that is replaced with our replacement.
//note: I think we can forgo coalesce completely because next()
//already has a default value.
StormEvents
| order by StartTime
| project coalesce(next(State, 2, ""), "COALESCE");
So, my question is: why bother with coalesce at all, when next() already provides a default value that I can apply in this scenario?
In the scenario you provided, the answer is yes: you can remove the coalesce from the 2nd query and just use the next() function with a default value, like below:
StormEvents
| order by StartTime
| project next(State, 2, "COALESCE")
The output is the same as using project coalesce(next(State, 2, ""), "COALESCE").
But for other scenarios, like wanting to get the first non-null value from several values, for example:
print result=coalesce(tolong("not a number"), tolong("42"), 33)
Here we can only use the coalesce() function to get the first non-null value => 42. This is a scenario that the next() function cannot cover.
I have recently started working with Kusto. I am stuck with a use case where I need to confirm that the approach I am taking is right.
I have data in the following format
In the above example, if the status is 1 and the time frame is equal to 15 seconds, then I need to count it as one occurrence.
So in this case, 2 occurrences of the status.
My approach was
if the current and next rows' status is equal to 1, then take the time difference, do row_cumsum, and break it if next(STATUS) != 0.
Even though the approach gives me the correct output, I assume performance can degrade once the data size increases.
I am looking for an alternative approach, if any. I am also adding the complete scenario to reproduce this with sample data.
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeTrue() {
range LoopTime from ago(365d) to now() step 6s
| project TIME=LoopTime,STATUS=toint(1)
}
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeFalse() {
range LoopTime from ago(365d) to now() step 29s
| project TIME=LoopTime,STATUS=toint(0)
}
.set-or-append FAKEDATA <| InsertFakeTrue();
.set-or-append FAKEDATA <| InsertFakeFalse();
FAKEDATA
| order by TIME asc
| serialize
| extend cstatus=STATUS
| extend nstatus=next(STATUS)
| extend WindowRowSum=row_cumsum(iff(nstatus ==1 and cstatus ==1, datetime_diff('second',next(TIME),TIME),0),cstatus !=1)
| extend windowCount=iff(nstatus !=1 or isnull(next(TIME)), iff(WindowRowSum ==15, 1,iff(WindowRowSum >15,(WindowRowSum/15)+((WindowRowSum%15)/15),0)),0 )
| summarize IDLE_COUNT=sum(windowCount)
The approach in the question is the way to achieve such calculations in Kusto, and given that the logic requires sorting, it is also efficient (as long as the sorted data can reside on a single machine).
Regarding the union operator: it runs in parallel by default, and you can control the concurrency and spread using hints; see: union operator.
I know the count action can be expensive in Spark, so to improve performance I'd like a different way to just check whether a query returns any results.
Here is what I did
var df = spark.sql("select * from table_name where condition = 'blah' limit 1");
var dfEmpty = df.head(1).isEmpty;
Is it a valid solution, or is there any potential uncaught error if I use the above solution to check the query result? It is a lot faster, though.
isEmpty is essentially taking the head of the data. It is quite reasonable for checking whether the result is empty or not; it is provided by the Spark API and is optimized, hence I'd prefer this.
Also, in the query I think limit 1 is not required.
/**
 * Returns true if the `Dataset` is empty.
 *
 * @group basic
 * @since 2.4.0
 */
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
I think this is OK. I guess you could also omit the limit(1), because this is also part of the implementation of df.isEmpty. See also How to check if spark dataframe is empty?.
Note that the solution with df.isEmpty may not evaluate all columns. E.g. if you have a UDF for one column, it will probably not execute and could still throw an exception on a real query. df.head(1).isEmpty, on the other hand, will evaluate all columns for one row.
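For reference, a rough PySpark sketch of the same checks (assuming spark is an active SparkSession and the table/filter from the question):

# Rough PySpark equivalent; `table_name` and `condition` come from the question.
df = spark.sql("select * from table_name where condition = 'blah'")

# head(1) returns a Python list with at most one Row and evaluates all columns:
has_rows = len(df.head(1)) > 0

# isEmpty() (PySpark 3.3+) delegates to the same limit(1)-based implementation shown above:
is_empty = df.isEmpty()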
My Dataset looks like below; I want to fetch the 1st row, 1st column value (A1 in this case):
+-------+---+--------------+----------+
|account|ccy|count(account)|sum_amount|
+-------+---+--------------+----------+
| A1|USD| 2| 500.24|
| A2|SGD| 1| 200.24|
| A2|USD| 1| 300.36|
+-------+---+--------------+----------+
I can do this as below:
Dataset finalDS = dataset.groupBy("account", "ccy")
        .agg(count("account"), sum("amount").alias("sum_amount"))
        .orderBy("account", "ccy");

Object[] items = (Object[]) finalDS.filter(functions.col("sum_amount").equalTo(300.36))
        .collect();

String accountNo = (String) ((GenericRowWithSchema) items[0]).get(0);
2 questions:
Any other/more efficient way to do this? I am aware of Dataframe/JavaRDD queries.
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought that this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
Any other/more efficient way to do this? I am aware of Dataframe/JavaRDD queries.
You'd better use the Dataset.head function (javadocs) in order to avoid passing all the data to the driver process. This will limit you to loading only the 1st row into driver RAM instead of the entire dataset. You can also consider using the take function to obtain the first N rows.
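For illustration, a rough PySpark sketch of that suggestion (shown in PySpark for brevity; `dataset` and the column names are taken from the question):

from pyspark.sql import functions as F

final_ds = (dataset.groupBy("account", "ccy")
            .agg(F.count("account"), F.sum("amount").alias("sum_amount"))
            .orderBy("account", "ccy"))

# head() brings at most one row to the driver instead of collect()'s full result:
first_row = final_ds.filter(F.col("sum_amount") == 300.36).head()
account_no = first_row["account"] if first_row else None

# take(n) similarly returns just the first n rows as a list of Rows:
first_three = final_ds.take(3)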
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought that this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
It depends on how your dataset is typed. In the case of a DataFrame (which is Dataset[Row]), you'll get an Array[Row] on a call to collect. It's worth mentioning the signature of the collect function:
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
Is it possible to perform a set of operations on a dataframe (adding new columns, replacing some existing values, etc.) and not fail fast on the first failed row, but instead perform the full transformation and separately return the rows that were processed with errors?
Example:
It's more like pseudocode, but the idea should be clear:
df.withColumn('PRICE_AS_NUM', to_num(df["PRICE_AS_STR"]))
to_num is my custom function that converts a string to a number.
Assuming I have some records where the price can't be cast to a number, I want to get those records in a separate dataframe.
I see an approach, but it will make the code a little ugly (and not very efficient):
do a filter with try/catch: if an exception happens, filter those records into a separate df.
What if I have many such transformations... Is there any better way?
I think one approach would be to wrap your transformation with a try/except function that returns a boolean, register that wrapper as a UDF (so it is evaluated per row rather than on the Column object), and then use when() and otherwise() to branch on the boolean. For example, assuming to_num is your UDF that converts the string to a number:

from pyspark.sql.functions import udf, when
from pyspark.sql.types import BooleanType

# Boolean UDF: True only if the value can actually be converted.
@udf(BooleanType())
def to_num_wrapper(value):
    try:
        float(value)  # same conversion your to_num performs
        return True
    except (TypeError, ValueError):
        return False

df2 = df.withColumn(
    'PRICE_AS_NUM',
    when(to_num_wrapper(df["PRICE_AS_STR"]), to_num(df["PRICE_AS_STR"]))
    .otherwise('FAILED')
)
Then you can filter df2 for the rows where the value is 'FAILED'.
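For example (a sketch reusing the hypothetical column names above; note that mixing numbers with the 'FAILED' marker turns the column into a string column):

from pyspark.sql.functions import col

failed_df = df2.filter(col('PRICE_AS_NUM') == 'FAILED')  # rows that could not be converted
ok_df = df2.filter(col('PRICE_AS_NUM') != 'FAILED')      # rows converted successfully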
Preferred option
Always prefer built-in SQL functions over UDFs. They're safe to execute and much faster than a Python UDF. As a bonus, they follow SQL semantics: if there is a problem with a value, the output is NULL (undefined).
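For example, a plain cast already behaves this way; a minimal sketch using the column names from the question:

from pyspark.sql.functions import col

prices_df = spark.createDataFrame([("1.123",), ("foo",)], ["PRICE_AS_STR"])
prices_df.withColumn("PRICE_AS_NUM", col("PRICE_AS_STR").cast("double")).show()
# +------------+------------+
# |PRICE_AS_STR|PRICE_AS_NUM|
# +------------+------------+
# |       1.123|       1.123|
# |         foo|        null|
# +------------+------------+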
If you go with UDF
Follow the same approach as built-in functions.
from pyspark.sql.functions import udf

def safe_udf(f, dtype):
    def _(*args):
        try:
            return f(*args)
        except:
            pass  # swallow the error; the UDF then returns NULL, like the built-ins
    return udf(_, dtype)
to_num_wrapper = safe_udf(lambda x: float(x), "float")
df = spark.createDataFrame([("1.123", ), ("foo", )], ["str"])
df.withColumn("num", to_num_wrapper("str")).show()
# +-----+-----+
# | str| num|
# +-----+-----+
# |1.123|1.123|
# | foo| null|
# +-----+-----+
While swallowing exceptions might be counter-intuitive, it is just a matter of following SQL conventions.
No matter which one you choose:
Once you adjust your data with one of the above, handling malformed records is just a matter of applying DataFrameNaFunctions (.na.drop, .na.replace).
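For instance, continuing the safe_udf example above, the malformed rows can be split into their own dataframe (a sketch):

converted = df.withColumn("num", to_num_wrapper("str"))

good = converted.filter(converted["num"].isNotNull())  # rows that converted cleanly
bad = converted.filter(converted["num"].isNull())      # rows that failed conversion

# or simply drop the failures:
cleaned = converted.na.drop(subset=["num"])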
I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.
Each entry of data can be simplified as (start time, end time, value). I will later need to retrieve all the entries that fall inside an (x, y) range. In SQL, the query would be something like:
SELECT value FROM data WHERE starttime <= x AND endtime >= y
Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?
This is an oddly difficult thing to model efficiently.
I think using Cassandra's secondary indexes (along with a dummy indexed value, which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
In pycassa, a Python Cassandra client, your query would look like this:
from pycassa.index import *

start_time = 12312312000
end_time = 12312312300

start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)

for result in entries.get_indexed_slices(clause):
    pass  # do stuff with result
There should be something similar in other clients.
The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.