Identifying decreases in values in Spark (outliers) - apache-spark

I have a large data set with millions of records which is something like
Movie Likes Comments Shares Views
A 100 10 20 30
A 102 11 22 35
A 104 12 25 45
A *103* 13 *24* 50
B 200 10 20 30
B 205 *9* 21 35
B *203* 12 29 42
B 210 13 *23* *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of them for a movie, then it is bad data that needs to be identified.
My initial thought was to group by movie and then sort within each group. I am using DataFrames in Spark 1.6 for processing, and this does not seem achievable since there is no sorting within grouped data in a DataFrame.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!!

You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._   // for the 'symbol column syntax (already in scope in spark-shell)

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
  .show
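From there, flagging the bad rows is just a filter on the lagged columns. A minimal sketch continuing the code above (only Likes and Comments shown; the lag is null on each movie's first row, so those rows simply aren't flagged):

import org.apache.spark.sql.functions.col

// Bring the previous values alongside the current ones.
val withLags = dataset
  .withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)

// Keep only the rows where a running total went down compared to the previous row.
withLags
  .where(col("Likes") < col("lag_likes") || col("Comments") < col("lag_comments"))
  .show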
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row so you can do the comparison.
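A rough sketch of that approach, assuming the same column names as the question and an ordering column that represents time:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows per movie in temporal order (the ordering column is an assumption).
val w = Window.partitionBy(col("Movie")).orderBy(col("maybesometemporalfield"))
val numbered = dataset.withColumn("rn", row_number() over w)

// Join each row to its predecessor within the same movie and keep the drops.
val drops = numbered.as("cur")
  .join(numbered.as("prev"),
    col("cur.Movie") === col("prev.Movie") && col("cur.rn") === col("prev.rn") + 1)
  .where(col("cur.Likes") < col("prev.Likes") || col("cur.Comments") < col("prev.Comments"))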
HTH

Related

How to estimate a Spark DataFrame's row count after joining two or more tables?

I'm developing a feature that supports dynamic SQL as input and then uses that input to submit a Spark job. But the inputs are unpredictable; some inputs may exceed the limit, which is a danger for me. I want to check the SQL's cost before submitting the job. Is there a way I can estimate the cost accurately?
My Spark conf is:
Spark Version: 3.3.1
conf:
spark.sql.cbo.enabled: true
spark.sql.statistics.histogram.enabled: true
example:
I have a dataFrame df1 like this
n x y z
'A' 1 2 3
'A' 4 5 6
'A' 7 8 9
'A' 10 11 12
'A' 13 14 15
'A' 16 17 18
'A' 19 20 21
'A' 22 23 24
'A' 25 26 27
'A' 28 29 30
row count of df1.join(df1,"n","left").join(df1,"n","left") should be 1000
row count of df1.join(df1,"n","left").join(df1,"n","left") should be 10
but the result of dataFrame.queryExecution.optimizedPlan.stats is always 1000 for the examples above.
I've tried a few things:
dataFrame.queryExecution.optimizedPlan.stats (see the sketch after this list), but the estimated rows are much bigger than the actual rows, especially when a join operation exists.
Use dataFrame.rdd.countApprox. The problem is that it needs a lot of time to get the actual result when the DataFrame is big.
I also tried org.apache.spark.sql.execution.command.CommandUtils#calculateMultipleLocationSizesInParallel; it's better than dataFrame.rdd.countApprox, but in some extreme scenarios it also costs more than tens of minutes.
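For reference, this is roughly how I read the estimate (a minimal sketch mirroring the example above; rowCount may be empty unless CBO statistics have been collected):

// Read Catalyst's estimated statistics for the optimized plan.
val joined = df1.join(df1, Seq("n"), "left").join(df1, Seq("n"), "left")
val stats = joined.queryExecution.optimizedPlan.stats
println(s"estimated size in bytes: ${stats.sizeInBytes}")
println(s"estimated row count: ${stats.rowCount.getOrElse("unknown")}")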
First, let's calculate the number of rows in each table:
df1_count = df1.count()
df2_count = df2.count()
df3_count = df3.count()
Then use cogroup to create a DataFrame containing the row counts from each table:
counts_df = df1.cogroup(df2, df3)
Add up the row counts to get the estimated total number of rows in the joined DataFrame:
estimated_row_count = counts_df.sum()
Finally, when you actually join, you can get the exact row count like this:
joined_df = df1.join(df2, on=..., how=...).join(df3, on=..., how=...)
exact_row_count = joined_df.count()

Giving weight to a delta

I have to analyze a delta impact for different categories, but we need to focus only on the ones with the highest priority. Example:
Col.A (dollar value) Col.B (Forecast) Col.C (Actual) Col.D (delta)
2000 50 30 20
60 40 10 50
Is SUMPRODUCT the only/best method? Also, there are different attributes by which I need to look at the data.

Compare values of two dataframes based on certain filter conditions and then get count

I am new to Spark. I am writing PySpark code where I have two dataframes such that:
DATAFRAME-1:
NAME BATCH MARKS
A 1 44
B 15 50
C 45 99
D 2 18
DATAFRAME-2:
NAME MARKS
A 36
B 100
C 23
D 67
I want my output as a comparison between these two dataframes such that I can store the counts as my variables.
for instance,
improvedStudents = 1 (since D belongs to batches 1-15 and has improved his score)
badPerformance = 2 (A and B have performed badly, since they belong to batches 1-15 and their marks are lower than before)
neutralPerformance = 1 (C, because even though his marks went down, he belongs to batch 45, which we don't want to consider)
This is just an example out of a complex problem I'm trying to solve.
Thanks
If the data is as in your example, why don't you just join the dataframes and create new columns for every metric that you have:
import org.apache.spark.sql.functions.{col, lit, when}

val df = df1.withColumnRenamed("MARKS", "PRE_MARKS")
  .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), Seq("NAME"))
  .withColumn("Evaluation",
    when(col("BATCH") > 15, lit("neutral"))
      .when(col("PRE_MARKS") gt col("POST_MARKS"), lit("bad"))
      .when(col("POST_MARKS") gt col("PRE_MARKS"), lit("improved"))
      .otherwise(lit("neutral")))
  .groupBy("Evaluation")
  .count
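If you then need the counts back as plain variables (as in your example), one way is to collect the small aggregated result to the driver; the labels here follow the when clauses above:

// Collect the per-Evaluation counts into a Map keyed by the label.
val counts: Map[String, Long] = df.collect()
  .map(row => row.getAs[String]("Evaluation") -> row.getAs[Long]("count"))
  .toMap

val improvedStudents = counts.getOrElse("improved", 0L)
val badPerformance = counts.getOrElse("bad", 0L)
val neutralPerformance = counts.getOrElse("neutral", 0L)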

Divide Excel column to N equal groups

I have a column with ordinal values. I want to have another column that ranks them in equal groups (relative to their value).
Example: If I have a score and I want to divide to 5 equal groups:
Score
100
90
80
70
60
50
40
30
20
10
What function do I use in the new column to get this eventually:
Score Group
100 5
90 5
80 4
70 4
60 3
50 3
40 2
30 2
20 1
10 1
Thanks! (I'm guessing the solution involves MOD, ROW and COUNT, but I couldn't find a good solution for this specific problem.)
If you don't care about how the groups are split when the number of values isn't evenly divisible, you can use this formula and drag it down as far as necessary:
= FLOOR(5*(COUNTA(A:A)-COUNTA(INDEX(A:A,1):INDEX(A:A,ROW())))/COUNTA(A:A),1)+1
Possibly a more efficient solution exists, but this is the first way I thought to do it.
Obviously you'll have to change the references to the A column if you want it in a different column.

PowerPivot: How to identify Max Value per Group in a Calculated Column

I am building a data model within Power Pivot for Excel 2013 and need to be able to identify the max value within a column for a particular group. Unfortunately, what I thought would work, and what I had found in previous searches, either gave me an error or wasn't applicable (there was a similar question that dealt with calculated measures rather than calculated columns, and to the best of my knowledge it wasn't replicable in the Power Pivot data view).
I have included an indication of what I am trying to achieve below; in this case I am trying to calculate the Max % uptake column.
Group | % uptake | Max % uptake
A 40 45
A 22 45
A 45 45
B 12 33
B 18 33
B 33 33
C 3 16
C 16 16
C 9 16
Many thanks
Use
=CALCULATE(MAX([UPTAKE]),FILTER(Table1,[GROUP]=EARLIER([GROUP])))
Alternatively, use this formula in cell C2:
=MAX(INDIRECT(CONCATENATE("B",MATCH(A2,$A$1:$A$10,0),":B",SUMPRODUCT(MAX(($A$1:$A$10=A2)*(ROW($A$1:$A$10)))))))
