Restructuring pyspark dataframe - python-3.x

I am solving a regression problem. For that I have clustered the data first and applied a regression model on each cluster. Now I want to implement another regression model which will take the predicted output of each cluster as a feature and output the aggregated predicted value.
I have already implemented the clustering and regression model in pyspark.
But I am not able to extract the output of each cluster as a feature for input to the second regression model.
How can this conversion be achieved efficiently in pyspark (preferably) or pandas?
Current dataframe:
date        cluster  predVal  actual
31-03-2019  0        14       13
31-03-2019  1        24       15
31-03-2019  2        13       10
30-03-2019  0        14       13
30-03-2019  1        24       15
30-03-2019  2        13       10
Required dataframe:
date        predVal0  predVal1  predVal2  actual
31-03-2019  14        24        13        38    // 13+15+10
30-03-2019  14        24        13        38    // 13+15+10

You want to do a pivot in pyspark and then add a column with the summed actual values. You can proceed in three steps.
First, apply a pivot. Your index is date, the column to pivot is cluster, and the value column is predVal:
df_pivot = df.groupBy('date').pivot('cluster').agg(first('predVal'))
Then, sum the actual column per date:
df_actual = df.groupBy('date').sum('actual')
Finally, join the summed actual column with the pivoted data on the date column:
df_final = df_pivot.join(df_actual, ['date'])
This link answers your question pretty well:
- https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
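Putting the three steps together, here is a minimal runnable sketch. Column names are taken from the question; renaming the pivoted columns to predVal0/predVal1/predVal2 to match the required output is an extra step added here:
from pyspark.sql import functions as F

# Step 1: pivot predVal per cluster; the new columns are named after the cluster values ('0', '1', '2')
df_pivot = df.groupBy('date').pivot('cluster').agg(F.first('predVal'))
for c in df_pivot.columns:
    if c != 'date':
        df_pivot = df_pivot.withColumnRenamed(c, 'predVal' + c)

# Step 2: sum the actual values per date
df_actual = df.groupBy('date').agg(F.sum('actual').alias('actual'))

# Step 3: join the two results on date
df_final = df_pivot.join(df_actual, ['date'])
df_final.show()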

Related

How to estimate a Spark DataFrame's row count after joining two or more tables?

I'm developing a feature that accepts dynamic SQL as input and then uses that input to submit a Spark job. But the inputs are unpredictable; some inputs may exceed the limit, which is a danger for me. I want to check the SQL's cost before submitting the job. Is there a way I can estimate the cost accurately?
My Spark conf is:
Spark Version: 3.3.1
conf:
spark.sql.cbo.enabled: true
spark.sql.statistics.histogram.enabled: true
Example:
I have a dataFrame df1 like this:
n    x   y   z
'A'  1   2   3
'A'  4   5   6
'A'  7   8   9
'A'  10  11  12
'A'  13  14  15
'A'  16  17  18
'A'  19  20  21
'A'  22  23  24
'A'  25  26  27
'A'  28  29  30
row count of df1.join(df1,"n","left").join(df1,"n","left") should be 1000
row count of df1.join(df1,"x","left").join(df1,"x","left") should be 10
but the result of dataFrame.queryExecution.optimizedPlan.stats is always 1000 for the examples above.
I've tried a few approaches:
dataFrame.queryExecution.optimizedPlan.stats, but the estimated row count is much bigger than the actual row count, especially when a join operation exists.
dataFrame.rdd.countApprox. The problem is that it needs a lot of time to get the actual result when the dataFrame is big.
I also tried org.apache.spark.sql.execution.command.CommandUtils#calculateMultipleLocationSizesInParallel; it's better than dataFrame.rdd.countApprox, but in some extreme scenarios it also costs tens of minutes.
First, let's calculate the number of rows in each table:
df1_count = df1.count()
df2_count = df2.count()
df3_count = df3.count()
Then add up the row counts to get a rough estimate of the total number of rows going into the join:
estimated_row_count = df1_count + df2_count + df3_count
Finally, when you actually perform the join, you can get the exact count:
joined_df = df1.join(df2, on=..., how=...).join(df3, on=..., how=...)
exact_row_count = joined_df.count()
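Since the question already looks at dataFrame.queryExecution.optimizedPlan.stats, the same Catalyst estimate can be read from PySpark through the internal _jdf handle. This is a sketch only, relying on internal APIs that may change between Spark versions:
# _jdf is PySpark's internal reference to the JVM Dataset, not a public API
stats = joined_df._jdf.queryExecution().optimizedPlan().stats()
print("estimated size in bytes:", stats.sizeInBytes())
# rowCount() is an Option[BigInt]; it is typically only populated when
# spark.sql.cbo.enabled is true and ANALYZE TABLE ... COMPUTE STATISTICS has been run
print("estimated row count:", stats.rowCount())
As the question notes, these estimates can be far off after joins, so they are only useful as a coarse guard rather than an exact cost.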

compare values of two dataframes based on certain filter conditions and then get count

I am new to Spark. I am writing pyspark code where I have two dataframes such that:
DATAFRAME-1:
NAME  BATCH  MARKS
A     1      44
B     15     50
C     45     99
D     2      18
DATAFRAME-2:
NAME  MARKS
A     36
B     100
C     23
D     67
I want my output as a comparison between these two dataframes such that I can store the counts as my variables.
for instance,
improvedStudents = 1 (since D belongs to batch 1-15 and has improved his score)
badPerformance = 2 (A and B have performed badly since they belong to batch 1-15 and their marks are lower than before)
neutralPerformance = 1 (C, because even though his marks went down, he belongs to batch 45 which we don't want to consider)
This is just an example out of a complex problem I'm trying to solve.
Thanks
If the data is as in your example, why don't you just join them, create a new column for the evaluation, and count per category:
import org.apache.spark.sql.functions.{col, lit, when}

val df = df1.withColumnRenamed("MARKS", "PRE_MARKS")
  .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), Seq("NAME"))
  .withColumn("Evaluation",
    when(col("BATCH") > 15, lit("neutral"))
      .when(col("PRE_MARKS") gt col("POST_MARKS"), lit("bad"))
      .when(col("POST_MARKS") gt col("PRE_MARKS"), lit("improved"))
      .otherwise(lit("neutral")))
  .groupBy("Evaluation")
  .count
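Since the question is about pyspark, here is a rough PySpark equivalent of the Scala snippet above, with the counts pulled back into plain Python variables (column names assumed from the question):
from pyspark.sql import functions as F

df = (df1.withColumnRenamed("MARKS", "PRE_MARKS")
      .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), ["NAME"])
      .withColumn("Evaluation",
                  F.when(F.col("BATCH") > 15, F.lit("neutral"))
                   .when(F.col("PRE_MARKS") > F.col("POST_MARKS"), F.lit("bad"))
                   .when(F.col("POST_MARKS") > F.col("PRE_MARKS"), F.lit("improved"))
                   .otherwise(F.lit("neutral")))
      .groupBy("Evaluation")
      .count())

# Collect the counts into plain Python variables, defaulting to 0 when a category is absent
counts = {row["Evaluation"]: row["count"] for row in df.collect()}
improvedStudents = counts.get("improved", 0)
badPerformance = counts.get("bad", 0)
neutralPerformance = counts.get("neutral", 0)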

Pyspark: How do I get today's score and 30 day avg score in a single row

I have a use-case where I want to get the rank for today as well as the 30-day average as a column. The data has 30 days of data for a particular Id and Type. The data looks like:
Id  Type      checkInDate  avgrank
1   ALONE     2019-04-24   1.333333
1   ALONE     2019-03-31   34.057471
2   ALONE     2019-04-17   1.660842
1   TOGETHER  2019-04-13   19.500000
1   TOGETHER  2019-04-08   5.481203
2   ALONE     2019-03-29   122.449156
3   ALONE     2019-04-07   3.375000
1   TOGETHER  2019-04-01   49.179719
5   TOGETHER  2019-04-17   1.391753
2   ALONE     2019-04-22   3.916667
1   ALONE     2019-04-15   2.459151
As my result I want to have output like:
Id  Type      TodayAvg  30DayAvg
1   ALONE     30.0      9.333333
1   TOGETHER  1.0       34.057471
2   ALONE     7.8       99.660842
2   TOGETHER  3         19.500000
...
The way I think I can achieve it is by having 2 dataframes, one filtering on today's date and the second averaging over 30 days, and then joining the two on Id and Type:
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtert_rank = Filter.apply(frame=rank, f=lambda x: (x["checkInDate"] == curr_dt))
rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg")
rank_avg_f = rank_avg.groupBy("id", "type").agg(F.mean("avgrank"))
rank_join = filtert_rank.join(rank_avg, ["id", "type"], how='inner')
Is there a simpler way to do it i.e. without reading the dataframe twice?
You can convert the dynamic frame to an Apache Spark data frame and perform regular Spark SQL / DataFrame operations.
Check the documentation for toDF() and Spark SQL.
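A minimal sketch of that idea, assuming the rank DynamicFrame and the curr_dt variable from the question: convert once with toDF() and compute both values in a single groupBy using conditional aggregation, which avoids reading the data twice and avoids the join.
from pyspark.sql import functions as F

rank_df = rank.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame

result = (rank_df.groupBy("id", "type")
          .agg(F.avg(F.when(F.col("checkInDate") == curr_dt, F.col("avgrank"))).alias("TodayAvg"),
               F.avg("avgrank").alias("30DayAvg")))
result.show()
The when() without otherwise() leaves non-today rows as null, which avg() ignores, so TodayAvg only averages today's rows while 30DayAvg averages everything.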

Extracting array column from spark dataframe

My Spark dataframe has an array column. I have to generate new columns by extracting data from this single array column. Are there any methods available for this?
id  Amount
10  [Tax:10,Total:30,excludingTax:20]
11  [Total:30]
12  [Tax:05,Total:35,excludingTax:30]
I have to generate this dataframe.
ID  Tax  Total
10  10   30
11  0    30
12  05   35
If you know for sure that [Tax:10,Total:30,excludingTax:20] are the only fields, in the same order, you can always map over the entire dataframe and extract them as Amount[0], Amount[1], and so on.
Then assign them as an instance of a case class and finally convert back to a dataframe.
The only thing you have to be careful about is not calling Amount[3] if Amount has only 2 values. That is easily achievable by checking the array length.
Alternatively, if you don't know the order, the best way is to use a JSON RDD: loop through the JSON objects, parse them, and create a new row. Finally convert that to a dataframe.
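If the Amount column is (or can be parsed into) a MapType column keyed by field name, a simpler PySpark sketch is to pull out the keys directly with getItem and default missing ones to 0 (column names assumed from the question):
from pyspark.sql import functions as F

result = df.select(
    F.col("id").alias("ID"),
    F.coalesce(F.col("Amount").getItem("Tax"), F.lit(0)).alias("Tax"),
    F.col("Amount").getItem("Total").alias("Total"))
result.show()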

identifying decrease in values in spark (outliers)

I have a large data set with millions of records which is something like:
Movie  Likes  Comments  Shares  Views
A      100    10        20      30
A      102    11        22      35
A      104    12        25      45
A      *103*  13        *24*    50
B      200    10        20      30
B      205    *9*       21      35
B      *203*  12        29      42
B      210    13        *23*    *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of these for a movie, then it is bad data that needs to be identified.
My initial thought was to group by movie and then sort within the group. I am using dataframes in Spark 1.6 for processing, and it does not seem achievable as there is no sorting within grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks !!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
  .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row, to allow you to do the comparison.
HTH
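For completeness, here is a hedged PySpark sketch of the same lag idea, flagging rows where any rolling total decreased compared with the previous row. An ordering column is assumed (called ts here) since the sample data does not show one:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Movie").orderBy("ts")  # "ts" is an assumed ordering column

checked = df
for c in ["Likes", "Comments", "Shares", "Views"]:
    checked = checked.withColumn(c + "_decreased",
                                 F.col(c) < F.lag(c, 1).over(w))

# Rows where any metric dropped compared to the previous row are the bad records
bad_rows = checked.filter(
    F.col("Likes_decreased") | F.col("Comments_decreased") |
    F.col("Shares_decreased") | F.col("Views_decreased"))
bad_rows.show()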
