PromQL: join a metric using group_left with another metric restricted by a boolean condition

I need to join a metric m1 with a metric m2 where m2 is filtered, something like
m1 > on(...) group_left m2 > xxxx
The above query returns the m1 time series joined with m2, but "> xxxx" is applied to the value of m1 instead
of reducing the set of m2 time series.
The syntax could be something like
m1 > on(...) group_left (m2 > xxxx) but, obviously, I get a parse error.

Just use parentheses to set the precedence of the second > operation explicitly:
(m1 > on(...) group_left m2) > xxxx


Categorical assignments using pd.apply based on multiple conditions on Pandas Columns Part 2

This builds on Categorical assignments using pd.apply based on multiple conditions on Pandas Columns.
I have a dataframe that contains Item ID numbers with multiple tasks and completion dates for those tasks. I am trying to assign categories in a separate column, based on which tasks are complete or incomplete.
My dataframe looks like this:
Item ID    Task 1 Comp Date   Task 2 Comp Date   Task 3 Comp Date   Solution
12781463   NaT                NaT                NaT                Solution X
10547725   6/6/2019           7/30/2019          8/1/2019           Solution Y
12847251   5/31/2019          6/12/2019          NaT                Solution Y
12734403   5/31/2019          NaT                NaT                Solution Y
To test my approach, I took a subset of my data set and wrote a portion of the function that will be used with df.apply(). Below is some sample code for that function:
def gating(row):
    if pd.notnull(row['Task 3 Comp Date']):
        return "Complete"
    elif (row['Solution'] == 'Solution X' & pd.isnull(row['Task 3 Comp Date'])):
        return "Pending Solution X"

df['Gating'] = df.apply(gating, axis = 1)
I expected to see Pending Solution X for Item ID 12781463 but got the error message below. It seems like .apply() doesn't play well with what I am trying to do.
("unsupported operand type(s) for &: 'str' and 'bool'", 'occurred at index 9')
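The error comes from operator precedence rather than from .apply() itself: in Python, & binds more tightly than ==, so the elif condition is evaluated as row['Solution'] == ('Solution X' & pd.isnull(...)), and 'Solution X' & True is the str & bool operation the message complains about. Below is a minimal sketch of one way around it, using `and` for scalar values. To stay self-contained it uses plain dicts as rows and a hypothetical is_missing helper (with None standing in for NaT) instead of pd.isnull.

```python
def is_missing(value):
    # Hypothetical stand-in for pd.isnull: None plays the role of NaT here.
    return value is None


def gating(row):
    if not is_missing(row['Task 3 Comp Date']):
        return "Complete"
    # `and` works on scalars and has lower precedence than `==`,
    # so the comparison is evaluated before the two tests are combined.
    elif row['Solution'] == 'Solution X' and is_missing(row['Task 3 Comp Date']):
        return "Pending Solution X"
    return None  # rows matching neither rule are left uncategorized


rows = [
    {'Item ID': 12781463, 'Task 3 Comp Date': None, 'Solution': 'Solution X'},
    {'Item ID': 10547725, 'Task 3 Comp Date': '8/1/2019', 'Solution': 'Solution Y'},
]
labels = [gating(row) for row in rows]
```

With pandas, the same function body should carry over with pd.isnull and df.apply(gating, axis=1). Keeping & instead of `and` would need parentheses around the == comparison.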

how to filter a dataframe based on another dataframe?

I have two DataFrames in PySpark:
d1: (x, y, value) and d2: (k, v, value). The entries in d1 are unique (you can consider that column x alone is a unique key, and column y alone is a key too).
x y value
a b 0.2
c d 0.4
e f 0.8
d2 is the following format:
k v value
a c 0.7
k k 0.3
j h 0.8
e p 0.1
a b 0.1
I need to filter d2 according to the co-occurrence in d1, i.e., a c 0.7 and e p 0.1 should be deleted, as a can occur only with b, and similarly for e.
I tried to select the x and y columns from d1.
sourceList = df1.select("x").collect()
sourceList = [row.x for row in sourceList]
sourceList_b = sc.broadcast(sourceList)
then
check_id_isin = sf.udf(lambda x: x in sourceList , BooleanType())
d2 = d2.where(~d2.k.isin(sourceList_b.value))
For small datasets it works well, but for large ones the collect causes an exception. I want to know if there is a better way to compute this step.
One way is to left-join d1 onto d2, fill the missing values in column y from column v using coalesce, and then keep only the rows where y and v are equal:
import pyspark.sql.functions as F

(d2.join(d1.select('x', 'y').withColumnRenamed('x', 'k'),  # rename x to k for an easier join
         on=['k'], how='left')                              # left join so every d2 row is kept
   .withColumn('y', F.coalesce('y', 'v'))  # fill missing values in y with the ones from v
   .filter(F.col('v') == F.col('y'))       # keep only rows where v equals y
   .drop('y')                              # column y is no longer needed
   .show())
and you get:
+---+---+-----+
| k| v|value|
+---+---+-----+
| k| k| 0.3|
| j| h| 0.8|
+---+---+-----+
This should also keep any row whose pair (k, v) matches a pair (x, y) in d1.
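To make the logic of that left join concrete, here is a rough plain-Python model of it on the sample data from the question. This is not Spark code; the function name filter_d2 and the tuple layout are illustrative assumptions. A row of d2 survives if its k has no match in d1.x, or if its v equals the y paired with that x.

```python
d1 = [('a', 'b', 0.2), ('c', 'd', 0.4), ('e', 'f', 0.8)]
d2 = [('a', 'c', 0.7), ('k', 'k', 0.3), ('j', 'h', 0.8),
      ('e', 'p', 0.1), ('a', 'b', 0.1)]


def filter_d2(d1_rows, d2_rows):
    # x is unique in d1, so a dict models the lookup done by the left join.
    y_for_x = {x: y for x, y, _ in d1_rows}
    kept = []
    for k, v, value in d2_rows:
        # Mirrors coalesce(y, v): fall back to v when k has no partner in d1.
        y = y_for_x.get(k, v)
        if v == y:
            kept.append((k, v, value))
    return kept


result = filter_d2(d1, d2)
```

On this sample the model drops a c 0.7 and e p 0.1 and keeps the other three rows, including a b 0.1, whose pair matches d1.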
So you have two problems here:
Logic for joining these two tables:
This can be done by performing an inner join on two columns instead of one. This is the code for that:
# Create an expression wherein you do an inner join on two cols
joinExpr = ((d1.x == d2.k) & (d1.y == d2.v))
joinDF = d1.join(d2, joinExpr)
The second problem is speed. There are multiple ways of fixing it. Here are my top two:
a. If one of the dataframes is significantly smaller (usually under 2 GB) than the other dataframe, then you can use the broadcast join. It essentially copies the smaller dataframe to all the workers so that there is no need to shuffle while joining. Here is an example:
from pyspark.sql.functions import broadcast
joinExpr = ((d1.x == d2.k) & (d1.y == d2.v))
joinDF = d1.join(broadcast(d2), joinExpr)
b. Try adding more workers and increasing the memory.
What you probably want to do is think of this in relational terms. Join d1 and d2 on d1.x = d2.k AND d1.y = d2.v. An inner join will drop any records from d2 that don't have a corresponding pair in d1. By using a join, Spark will do a cluster-wide shuffle of the data, allowing for much greater parallelism and scalability compared to a broadcast exchange, which generally caps out at about 10 MB of data (the cut-over point Spark uses by default between a shuffle join and a broadcast join).
Also, as an FYI, WHERE (a, b) IN (...) gets translated into a join in most cases unless the (...) is a small set of data.
https://github.com/vaquarkhan/vaquarkhan/wiki/Apache-Spark--Shuffle-hash-join-vs--Broadcast-hash-join
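Note that the two-column inner join keeps only the d2 rows whose (k, v) pair also occurs as an (x, y) pair in d1, which is stricter than the left-join filter shown earlier. A small plain-Python sketch of that semantics follows; it is a model rather than Spark code, and the name matching_pairs is an illustrative assumption.

```python
d1 = [('a', 'b', 0.2), ('c', 'd', 0.4), ('e', 'f', 0.8)]
d2 = [('a', 'c', 0.7), ('k', 'k', 0.3), ('j', 'h', 0.8),
      ('e', 'p', 0.1), ('a', 'b', 0.1)]


def matching_pairs(d1_rows, d2_rows):
    # Build the set of (x, y) pairs once, much like the build side of a hash join.
    pairs = {(x, y) for x, y, _ in d1_rows}
    # Probe with each (k, v) pair; only exact pair matches survive.
    return [(k, v, value) for k, v, value in d2_rows if (k, v) in pairs]


result = matching_pairs(d1, d2)
```

Rows such as k k 0.3, whose key does not appear in d1 at all, are dropped by this variant, so which of the two approaches fits depends on whether those rows should be kept.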

Calculating contrast values on Excel

I am currently studying experimental designs in statistics and I am calculating values pertaining to 2^3 factorial designs.
The question that I have is particularly with the calculations of the "contrasts".
My goal with this question is to learn how to use the tables "Coded Factors" and "Total" to get the "Contrast" values using the IF THEN function in Excel.
For example, Contrast A is calculated as : x - y . Where
x = sum of the values in the Total, where the Coded Factor A is + .
And y= sum of the values in the Total, where the Coded Factor A is - .
This would be rather simple, but for the interactions it is a bit more complex.
For example, contrast AC is obtained as : x - y . Where
x = sum of the values in the Total, where the product of Coded Factor A and that of C becomes + .
And y = sum of the values in the Total, where the product of Coded Factor A and that of C becomes - .
I would really appreciate your help.
Edited:
Considering the way how IF statements work, I thought that it might be a good idea to convert the + into 1 and - into -1 to make the calculation straight forward.
Convert all +/- to 1/-1. Use some cells as helpers.
Put in these formulas :
J2 --> =LEFT(J1)
K2 --> =MID(J1,2,1)
L2 --> =MID(J1,3,1)
Put
J3 --> =IF(J$2="",1,INDEX($B3:$D3,MATCH(J$2,$B$2:$D$2,0)))
and drag to L10. Then
M3 --> =J3*K3*L3*G3
and drag to M10. Lastly,
M1 --> =SUM(M3:M10)
How to use : Input the Factor comb in cell J1 and the result will be in M1.
Idea : separate the factor text > load the multiplier > multiply Total values with multiplier > get sum.
Hope it helps.
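The same idea (multiply each Total by the product of the relevant 1/-1 codes, then sum) can be sketched outside Excel. The run order and the totals below are made-up illustrative values, not the data from the question, and the names contrast and FACTOR_INDEX are assumptions.

```python
# Coded factors (A, B, C) for the 8 runs of a 2^3 design, as +1/-1.
runs = [
    (-1, -1, -1), (+1, -1, -1), (-1, +1, -1), (+1, +1, -1),
    (-1, -1, +1), (+1, -1, +1), (-1, +1, +1), (+1, +1, +1),
]
# Hypothetical totals, one per run (not taken from the question).
totals = [10, 20, 30, 40, 50, 60, 70, 85]

FACTOR_INDEX = {'A': 0, 'B': 1, 'C': 2}


def contrast(effect, runs, totals):
    result = 0
    for codes, total in zip(runs, totals):
        sign = 1
        for letter in effect:  # e.g. "AC" -> code of A times code of C
            sign *= codes[FACTOR_INDEX[letter]]
        # Totals with a + product are added, those with a - product subtracted.
        result += sign * total
    return result


contrast_a = contrast('A', runs, totals)
contrast_ac = contrast('AC', runs, totals)
```

Here the effect string plays the role of cell J1 in the formulas above, and the sign variable corresponds to the product of the helper columns J to L.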

Only output events for a condition when at least a single event matching the condition has been found, else output the input

My input has a field 'condition' with only two values. Let's assume the only values are 'A' and 'B'.
When at least one event with condition=A is found within a tumbling window, only the events with condition=A should be output. However, when no events for A are found, only the events with B should be output for that window.
Given the following input with a tumbling window of 4 ticks:
Condition Time
----------- ------
A T1
B T2
A T3
B T5
B T6
B T7
B T8
B T10
A T11
A T12
A T13
A T14
A T15
The output should be as follows:
Condition Time (Window)
----------- ------ ----------
A T1 T1-3
A T3 T1-3
B T5 T5-8
B T6 T5-8
B T7 T5-8
B T8 T5-8
A T11 T9-12
A T12 T9-12
A T13 T13-16
A T14 T13-16
A T15 T13-16
How can I set up my steps so that this output is produced from my input?
I tried several options using grouping but was unsuccessful.
This is an interesting problem.
First allow me to correct your definition of a window. Windows of 4 ticks for the time range from 0 to 16 are:
( 0 - 4]
( 4 - 8]
( 8 - 12]
(12 - 16]
where the start time is excluded and the end time is included. The end time is the timestamp of the result of the computation over that window.
Now here is the query that computes your answer.
WITH
count_as AS (
    SELECT
        cnt = SUM(CASE cond WHEN 'A' THEN 1 ELSE 0 END)
    FROM input TIMESTAMP BY time
    GROUP BY TumblingWindow(second, 4)
)
SELECT
    input.cond, input.time
FROM
    count_as a
JOIN
    input TIMESTAMP BY time
    ON DATEDIFF(second, input, a) >= 0 AND DATEDIFF(second, input, a) < 4
WHERE
    (a.cnt > 0 AND input.cond = 'A')
    OR
    (a.cnt = 0 AND input.cond = 'B')
The count_as step computes number of A's in a window. This will produce an event at each end of the window (4, 8, 12, and 16 seconds in this example) with the count of A's seen in the last 4 sec.
Then we just join it back with input, but only last 4 seconds of it.
We need to be careful when defining the time bounds (aka wiggle room) so that they align with the window boundaries, hence >= 0 and < 4 instead of, say, BETWEEN.
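For readers who want to check the expected output offline, the same rule can be modelled in a few lines of plain Python. This is only a sketch of the windowing logic, not a streaming query: ticks are plain integers, windows are (start, end] as described above, and the function name filter_by_window is an assumption.

```python
from collections import defaultdict


def filter_by_window(events, size=4):
    # Group events by the end of their (start, end] tumbling window.
    windows = defaultdict(list)
    for cond, tick in events:
        window_end = -(-tick // size) * size  # ceiling to a multiple of size
        windows[window_end].append((cond, tick))

    output = []
    for window_end in sorted(windows):
        batch = windows[window_end]
        # If the window holds at least one A, keep only A; otherwise keep B.
        wanted = 'A' if any(cond == 'A' for cond, _ in batch) else 'B'
        output.extend(event for event in batch if event[0] == wanted)
    return output


events = [('A', 1), ('B', 2), ('A', 3), ('B', 5), ('B', 6), ('B', 7), ('B', 8),
          ('B', 10), ('A', 11), ('A', 12), ('A', 13), ('A', 14), ('A', 15)]
result = filter_by_window(events)
```

Grouping by the window end mirrors the way the count_as step emits one result per window, stamped with the end of that window.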

How to perform FST (Finite State Transducer) composition

Consider the following FSTs :
T1
0 1 a : b
0 2 b : b
2 3 b : b
0 0 a : a
1 3 b : a
T2
0 1 b : a
1 2 b : a
1 1 a : d
1 2 a : c
How do I perform the composition operation on these two FSTs (i.e. T1 o T2)
I saw some algorithms but couldn't understand much. If anyone could explain it in an easy way, it would be a major help.
Please note that this is NOT a homework. The example is taken from the lecture slides where the solution is given but I couldn't figure out how to get to it.
Since you didn't specify the input format, I'm assuming that 0 is the initial state, any integers that appear in the second column but not the first are accepting states (3 for T1 and 2 for T2), and each row is an element of the transition relation, giving the previous state, the next state, the input letter and the output letter.
Any operation on FSTs needs to produce a new FST, so we need states, an input alphabet, an output alphabet, initial states, final states and a transition relation (the specifications of the FSTs A, B and W below are given in this order). Suppose our FSTs are:
A = (Q, Σ, Γ, Q0, QF, α)
B = (P, Γ, Δ, P0, PF, β)
and we want to find
W = (R, Σ, Δ, R0, RF, ω) = A ∘ B
Note that we don't need to determine the alphabets of W; the definition of composition does that.
Imagine running A and B in series, with A's output tape fed as B's input tape. The state of the combined FST is simply the combined states of A and B. In other words, the states of the composition are in the cross product of the states of the individual FSTs.
R = Q × P
In your example, the states of W would be pairs of integers:
R = {(0,0), (0,1), ... (3, 2)}
though we could renumber these and get (for example):
R = {00, 01, 02, 10, 11, 12, 20, 21, 22, 30, 31, 32}
Similarly, the initial and accepting states of the composed FST are the cross products of those in the component FSTs. In particular, W accepts an input string iff A accepts it and B accepts the output that A produces for it.
R0 = Q0 × P0
RF = QF × PF
In the example, R0 = {00} and RF = {32}.
All that remains is to determine the transition relationship ω. For this, combine each transition rule for A with every transition rule for B that might apply. That is, combine each transition rule of A (qi, σ) → (qj, γ) with every rule of B that has a "γ" as the input character.
ω = { ((qi, ph), σ) → ((qj, pk), δ) : (qi, σ) → (qj, γ) ∈ α, (ph, γ) → (pk, δ) ∈ β }
In the example, this means combining (e.g.) 0 1 a : b of T1 with 0 1 b : a and 1 2 b : a of T2 to get:
00 11 a : a
01 12 a : a
Similarly, you'd combine 0 2 b : b of T1 with those same 0 1 b : a and 1 2 b : a of T2, 0 0 a : a of T1 with 1 1 a : d and 1 2 a : c of T2 &c.
Note that you might have unreachable states (those that never appear as a "next" state) and transitions that will never occur (those from unreachable states). As an optimization step, you can remove those states and transitions. However, leaving them in will not affect the correctness of the construction; it's simply an optimization.
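The construction above can be sketched in a few lines of Python for the two example transducers. This is a minimal illustration under the same assumptions as the answer (state 0 is initial, no epsilon transitions, no weights). The tuple layout (source, target, input, output) and the function names compose and prune are my own choices, and pruning here uses breadth-first reachability from the initial pair, which is a slightly stronger test than "never appears as a next state".

```python
from collections import deque

# Transitions as (source state, target state, input symbol, output symbol).
T1 = [(0, 1, 'a', 'b'), (0, 2, 'b', 'b'), (2, 3, 'b', 'b'),
      (0, 0, 'a', 'a'), (1, 3, 'b', 'a')]
T2 = [(0, 1, 'b', 'a'), (1, 2, 'b', 'a'), (1, 1, 'a', 'd'), (1, 2, 'a', 'c')]


def compose(first, second):
    # Pair every transition of `first` with every transition of `second`
    # whose input symbol equals the output symbol of the first one.
    return [((q, p), (q_next, p_next), sym_in, sym_out)
            for q, q_next, sym_in, middle in first
            for p, p_next, middle2, sym_out in second
            if middle == middle2]


def prune(transitions, start):
    # Keep only transitions that leave a state reachable from `start`.
    reachable = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for source, target, _, _ in transitions:
            if source == state and target not in reachable:
                reachable.add(target)
                queue.append(target)
    return [t for t in transitions if t[0] in reachable]


composed = compose(T1, T2)
pruned = prune(composed, (0, 0))
```

In this sketch the full product has 10 transitions, and the pruned list keeps the 5 whose source is reachable from (0, 0), including the 00 11 a : a transition worked out above.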
If you are more amenable to graphical explanations, the following set of slides provides incremental, graphical examples of the composition algorithm in practice, and also includes discussion of epsilon transitions in the component transducers. Epsilon transitions complicate the composition process, and the algorithm described in outis's answer may not generate the correct result in this case, depending on the semiring being used.
See slides 10~35 for some graphical examples:
http://www.gavo.t.u-tokyo.ac.jp/~novakj/wfst-algorithms.pdf
[Figure: T1 and T2]
[Figure: Composition of T1 and T2]
The states of the composition T are pairs of a T1 state and a T2 state. T satisfies the following conditions:
Its initial state is the pair of the initial state of T1 and the initial state of T2.
Its final states are pairs of a final state of T1 and a final state of T2.
There is a transition t from (q1, q2) to (r1, r2) for each pair of transitions t1 from q1 to r1 (in T1) and t2 from q2 to r2 (in T2) such that the output label of t1 matches the input label of t2. The transition t takes its input label from t1, its output label from t2, and its weight is the combination of the weights of t1 and t2, done with the same operation that combines weights along a path.
Since there are no weights in your example, we can ignore the part about weights. The description above is taken verbatim from the following paper. Link here
