Difference from first value in pivot Spark SQL - apache-spark

I have the following data:
val df = Seq(("Central" , "Copy Paper" , "Benjamin Ross" , "$15.58" , "$3.91" , "126"),
| ("East" , "Copy Paper" , "Catherine Rose" , "$12.21" , "$0.08" ,"412"),
| ("West" ,"Copy Paper" , "Patrick O'Brill" , "$2,756.66" , "$1,629.98" ,"490"),
| ("Central" , "Business Envelopes" , "John Britto" , "$212.74" , "$109.66" , "745"),
| ("East" , "Business Envelopes" , "xyz" , "$621" , "$721" ,"812")).toDF("Region" , "Product" , "Customer" , "Sales", "Cost" , "Autonumber")
df.show()
+-------+------------------+---------------+---------+---------+----------+
| Region| Product| Customer| Sales| Cost|Autonumber|
+-------+------------------+---------------+---------+---------+----------+
|Central| Copy Paper| Benjamin Ross| $15.58| $3.91| 126|
| East| Copy Paper| Catherine Rose| $12.21| $0.08| 412|
| West| Copy Paper|Patrick O'Brill|$2,756.66|$1,629.98| 490|
|Central|Business Envelopes| John Britto| $212.74| $109.66| 745|
| East|Business Envelopes| xyz| $621| $721| 812|
+-------+------------------+---------------+---------+---------+----------+
You can see that for the Business Envelopes product there is no data for West. If there were data for West, the result would not be null. Because that combination is missing, pivoting on region produces null values, which I want to be 0 so that it can be subtracted from the first(sum(Autonumber)) and a value obtained; instead it now returns null. If I could somehow get data for Central into the group-by query, things would be much simpler.
I tried the following query:
spark.sql("SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) - first(sum(Autonumber)) over ( partition by product order by product , region) as new from test1 group by r , p order by p,r) test1 pivot (sum(new) for r in ('Central' Central , 'East' East, 'West' West))").show
This was the data that I got
+------------------+-------+-----+-----+
| p|Central| East| West|
+------------------+-------+-----+-----+
|Business Envelopes| 0.0| 67.0| null|
| Copy Paper| 0.0|286.0|364.0|
+------------------+-------+-----+-----+
Whereas I expected it to be like this:
+------------------+-------+-----+------+
| p|Central| East| West|
+------------------+-------+-----+------+
|Business Envelopes| | 67.0|-745.0|
| Copy Paper| |286.0| 364.0|
+------------------+-------+-----+------+
This is nothing but a pivot on region with sum(Autonumber) and then subtracting the first value.
Any suggestions on what can be done to get -745 instead of null?

I suppose that's not possible this way.
Instead, I pivoted the dataset and then subtracted the first value:
spark.sql("select p , coalesce(Central , 0) - null as Central , coalesce(East,0) - coalesce(central,0) as East , coalesce(West , 0) - coalesce(central,0) as West from (SELECT * FROM (SELECT region r, product as p, SUM(Autonumber) as new from test group by r , p order by p) test pivot (sum(new) for r in ('Central' Central ,'East' East, 'West' West)))").show

Related

Avg of multiple columns in project operator

We have 2 custom logs in Log Analytics. I am able to get the average of each one, and I need to merge those two into one, i.e. the average of vpn + url.
workspace(name).vpn_CL
| extend healty=iff(Status_s == 'Connected', 100, 0)
| summarize vpn = avg(healty) by EnvName_s, ClientName_s
| join (
    workspace(name).url_CL
    | extend Availability=iff(StatusDescription_s == 'OK', 100, 0)
    | summarize URL=avg(Availability) by EnvName_s, ClientName_s
) on ClientName_s
| project Client=ClientName_s, Environment=EnvName_s, vpn, URL
Based on my understanding, when the number of healty entities equals the number of Availability entities, the average of vpn + url is simply the two values combined with equal weight. Otherwise, if their entity counts are not equal, the combined average is an expected value weighted by each label's share of the entities.
Then:
workspace(name).vpn_CL
| extend healty=iff(Status_s == 'Connected', 100, 0)
| summarize m = count(), vpn = avg(healty) by EnvName_s, ClientName_s
| join (
    workspace(name).url_CL
    | extend Availability=iff(StatusDescription_s == 'OK', 100, 0)
    | summarize n = count(), URL=avg(Availability) by EnvName_s, ClientName_s
) on ClientName_s
| project Client=ClientName_s, Environment=EnvName_s, vpn, URL, avgOfVpnUrl = vpn*m/(m+n) + URL*n/(m+n)
Hope it helps.
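To see why the counts matter, here is a quick numeric check of that weighting (my own example values, not data from the question): with m = 3 healty entries averaging 100 and n = 1 Availability entry at 50, the weighted combination matches the average of all four raw values.
# Hypothetical raw values: three vpn entries and one url entry.
vpn_values = [100, 100, 100]
url_values = [50]

m, n = len(vpn_values), len(url_values)
vpn = sum(vpn_values) / m   # 100.0
url = sum(url_values) / n   # 50.0

# Weighted combination used in the query above.
avg_of_vpn_url = vpn * m / (m + n) + url * n / (m + n)

# Same as averaging all raw values together: 87.5
assert avg_of_vpn_url == sum(vpn_values + url_values) / (m + n)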

Pyspark window function with condition

Suppose I have a DataFrame of events with the time difference between each row. The main rule is that a visit is counted only if the event is within 5 minutes of the previous or next event:
+--------+-------------------+--------+
|userid |eventtime |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60 |
|37397e29|2017-06-04 03:01:00|60 |
|37397e29|2017-06-04 03:02:00|60 |
|37397e29|2017-06-04 03:03:00|180 |
|37397e29|2017-06-04 03:06:00|60 |
|37397e29|2017-06-04 03:07:00|420 |
|37397e29|2017-06-04 03:14:00|60 |
|37397e29|2017-06-04 03:15:00|1140 |
|37397e29|2017-06-04 03:34:00|540 |
|37397e29|2017-06-04 03:53:00|540 |
+--------+-------------------+--------+
The challenge is to group these into visits with a start_time and an end_time, where end_time is the latest eventtime that still satisfies the 5-minute condition. The output should be like this table:
+--------+-------------------+--------------------+-----------+
|userid |start_time |end_time |events |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6 |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2 |
+--------+-------------------+--------------------+-----------+
So far I have used window lag functions and some conditions; however, I do not know where to go from here:
%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
windowSpecDesc = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"].desc())
# The windows are between the current row and following row. e.g: 3:00pm and 3:03pm
nextEventTime = F.lag(col("eventtime"), -1).over(windowSpec)
# The windows are between the current row and previous row. e.g: 3:01pm and 3:00pm
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
diffEventTime = nextEventTime - col("eventtime")
nextTimeDiff = F.coalesce((F.unix_timestamp(nextEventTime)
- F.unix_timestamp('eventtime')), F.lit(0))
previousTimeDiff = F.coalesce((F.unix_timestamp('eventtime') -F.unix_timestamp(previousEventTime)), F.lit(0))
# Check if the next POI is equal to the current POI and has a time difference of less than 5 minutes.
validation = F.coalesce(( (nextTimeDiff < 300) | (previousTimeDiff < 300) ), F.lit(False))
# Change True to 1
visitCheck = F.coalesce((validation == True).cast("int"), F.lit(1))
result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")
My questions: Is this a viable approach, and if so, how can I "go forward" and look at the maximum eventtime that fulfils the 5-minute condition? To my knowledge, is it even possible to iterate through the values of a Spark SQL Column, and wouldn't that be too expensive? Is there another way to achieve this result?
Result of the solution suggested by Aku:
+--------+--------+---------------------+---------------------+------+
|userid  |subgroup|start_time           |end_time             |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0       |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5     |
|37397e29|1       |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2     |
|37397e29|2       |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1     |
|37397e29|3       |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2     |
+--------+--------+---------------------+---------------------+------+
It doesn't give the expected result: 03:07-03:14 and 03:34-03:43 are being counted as ranges within 5 minutes, which shouldn't happen. Also, 03:07 should be the end_time of the first row, as it is within 5 minutes of the previous row (03:06).
You'll need one extra window function and a groupby to achieve this.
What we want is for every line with timeDiff greater than 300 to be the end of a group and the start of a new one. Aku's solution should work; the only issue is that the indicators mark the start of a group instead of the end. To change this, you'll have to do a cumulative sum up to n-1 instead of n (n being your current line):
# Imports as in the answer below.
from pyspark.sql.window import Window
import pyspark.sql.functions as func

w = Window.partitionBy("userid").orderBy("eventtime")
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
DF = DF.groupBy("subgroup").agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count("*").alias("events")
)
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
| 2|2017-06-04 03:34:00|2017-06-04 03:34:00| 1|
| 3|2017-06-04 03:53:00|2017-06-04 03:53:00| 1|
+--------+-------------------+-------------------+------+
It seems that you also filter out lines with only one event, hence:
DF = DF.filter("events != 1")
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
+--------+-------------------+-------------------+------+
So if I understand this correctly you essentially want to end each group when TimeDiff > 300? This seems relatively straightforward with rolling window functions:
First some imports
from pyspark.sql.window import Window
import pyspark.sql.functions as func
Then setting windows, I assumed you would partition by userid
w = Window.partitionBy("userid").orderBy("eventtime")
Then figuring out what subgroup each observation falls into, by first marking the first member of each group, then summing the column.
indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")
Then some aggregation functions and you should be done
DF = DF.select("*", subgroup)\
.groupBy("subgroup")\
.agg(
func.min("eventtime").alias("start_time"),
func.max("eventtime").alias("end_time"),
func.count(func.lit(1)).alias("events")
)
Another approach is to group the dataframe based on your timeline criteria: identify the rows that break the 5-minute timeline. Those rows are the criteria for grouping the records, and they set the start time and end time for each group. Then find the count and the max timestamp (end time) for each group, as sketched below.
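A minimal PySpark sketch of that idea follows. It is my own illustration of this answer's description, not code from the answer itself; it assumes the question's DataFrame is named events and has the columns userid, eventtime and timeDiff.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("userid").orderBy("eventtime")

# A row whose previous gap exceeds 5 minutes breaks the timeline and starts a new group.
events = events.withColumn(
    "breaks", (F.lag(F.col("timeDiff")).over(w) > 300).cast("int"))
events = events.withColumn(
    "group_id", F.sum(F.coalesce(F.col("breaks"), F.lit(0))).over(w))

# The boundary rows set the start time and end time of each group.
visits = events.groupBy("userid", "group_id").agg(
    F.min("eventtime").alias("start_time"),
    F.max("eventtime").alias("end_time"),
    F.count("*").alias("events"),
)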

SSAS OLAP Cube Dynamic Security. Many dimensions in one role

After setting up a cube I was asked to add dynamic security using a table of users and the data they can see.
The problem is that I have to take into account 3 different dimensions.
I've decided to use the fact table with the NonEmpty function on a count:
NonEmpty([Dimension].[Hierarchy].members,
([Measures].[Allowed Count],
[Users].[User].&[UserName]
)
)
After setting up the role I got a result like:
Dim1 | Dim2 | Dim3
1 | A | 300
1 | A | 320
1 | A | 340
1 | B | 300
1 | B | 320
1 | B | 340
Where it should be:
Dim1 | Dim2 | Dim3
1 | A | 300
1 | A | 320
1 | B | 340
Data for allowed user access is stored in a table like:
UserName | Dim1Key | Dim2Key | Dim3Key
The hierarchy is like this:
Each Dim1 contains every type of Dim2, which contains every type of Dim3.
And a user can only access a given member of Dim3 within Dim2 within Dim1.
Is there a way to connect these dimensions in MDX so that, in the end, each Dim has only its respective values?
UPDATE:
After some research I've got this query:
SELECT [Measures].[CC Count] ON 0,
NonEmpty(
  (
    NonEmpty((Dim1.children),
      ([Measures].[CC Count], [Users].[User].&[userName])),
    NonEmpty((Dim2.children),
      ([Measures].[CC Count], [Users].[User].&[userName])),
    NonEmpty((Dim3.children),
      ([Measures].[CC Count], [Users].[User].&[userName]))
  ),
  ([Measures].[CC Count], [Users].[User].&[userName])
) ON 1
FROM [Cost Center]
That gives me the wanted results, but I can't place it into Dimension Data in the role. Is there a way to change that?
Please try creating a new hidden dimension where the key attribute has a composite key of key1, key2 and key3. You will have to pick some NameColumn but it doesn't matter. So pick key1 as the name. You don't need anything on the dimension except the dimension key.
In the Dimension Usage of your cube designer make sure this new dimension is joined to all fact tables and to the security measure group which provided the CC Count measure.
Then create role based security just on that dimension. The users will be able to see all members of all dimensions but this new composite key dimension will ensure they can't see fact rows they are not supposed to. And this should perform much better than the alternative which is cell security.

Excel: Get the most frequent value for each group

I have a table (Excel) with two columns (Time 'hh:mm:ss', Value) and I want to get the most frequent value for each group of rows.
For example, I have:
Time | Value
4:35:49 | 122
4:35:49 | 122
4:35:50 | 121
4:35:50 | 121
4:35:50 | 111
4:35:51 | 122
4:35:51 | 111
4:35:51 | 111
4:35:51 | 132
4:35:51 | 132
And I want to get the most frequent value for each Time:
Time | Value
4:35:49 | 122
4:35:50 | 121
4:35:51 | 132
Thanks in advance
UPDATE
The first answer by Scott, with the helper column, is the correct one.
See the pic
You could use a helper column.
First, it needs a helper column, so in C I put:
=COUNTIFS($A$2:$A$11,A2,$B$2:$B$11,B2)
Then in F2 I put the following Array Formula:
=INDEX($B$2:$B$11,MATCH(MAX(IF($A$2:$A$11=E2,IF($C$2:$C$11 = MAX(IF($A$2:$A$11=E2,$C$2:$C$11)),$B$2:$B$11))),$B$2:$B$11,0))
It is an array formula and must be confirmed with Ctrl-Shift-Enter, then copied down.
I set it up like this:
Here is one way to do this in MS Access:
select tv.*
from (select time, value, count(*) as cnt
      from t
      group by time, value
     ) as tv
where exists (select 1
              from (select top 1 time, value, count(*) as cnt
                    from t as t2
                    where t.time = t2.time
                    group by time, value
                    order by count(*) desc, value desc
                   ) as x
              where x.time = tv.time and x.value = tv.value
             );
MS Access doesn't support features such as window functions or CTEs that make this type of query easier in other databases.
Would that work? I haven't tried it; I got inspired here:
;WITH t3 AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY time ORDER BY c DESC, value DESC) AS rn
    FROM (SELECT COUNT(*) AS c, time, value FROM t GROUP BY time, value) AS t2
)
SELECT *
FROM t3
WHERE rn = 1

Detect overlapping ranges and correct them in Oracle

Googling it a bit, I found this to be an interesting question. I'd like your shots at it.
Here is my table:
USER | MAP | STARTDAY | ENDDAY
1 | A | 20110101 | 20110105
1 | B | 20110106 | 20110110
2 | A | 20110101 | 20110107
2 | B | 20110105 | 20110110
What I want is to fix user 2's case, where maps A and B overlap by a couple of days (from 20110105 until 20110107).
I wish I were able to query that table in a way that never returns overlapping ranges. My input data is flaky already, so I don't have to worry about conflict treatment; I just want to be able to get a single value for any given BETWEEN over these dates.
Possible outputs for the query I'm trying to build would be like:
USER | MAP | STARTDAY | ENDDAY
2 | B | 20110108 | 20110110 -- pushed overlapping days ahead..
2 | A | 20110101 | 20110104 -- shrunk overlapping range
It doesn't even matter if the algorithm causes "invalid ranges", e.g. Start = 20110105, End = 20110103; I'll just put null when I get to those cases.
What would you guys say? Any straightforward way to get this done?
Thanks!
f.
Analytic functions could help:
select userid, map
, case when prevend >= startday then prevend+1 else startday end newstart
, endday
from
( select userid, map, startday, endday
, lag(endday) over (partition by userid order by startday) prevend
from mytable
)
order by userid, startday
Gives:
USERID MAP NEWSTART ENDDAY
1 A 01/01/2011 01/05/2011
1 B 01/06/2011 01/10/2011
2 A 01/01/2011 01/07/2011
2 B 01/08/2011 01/10/2011
