I'm trying to force Spark to apply a window function only on a specified subset of a dataframe, while the actual window has access to rows outside of this subset. Let me go through an example:
I have a Spark dataframe that has been saved to HDFS. The dataframe contains events, so each row has a timestamp, an id and an integer feature. There is also a column that I calculate with a cumulative sum window function, like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("some_table_in_hdfs")
w = Window.partitionBy("id").orderBy("date")
df = df.withColumn("feat_int_sum", F.sum("feat_int").over(w))
df.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
+----------+--------+---+------------+
When I load new data from a different source, I would like to append it to the above dataframe in HDFS. To do this, I have to apply the window function to the new data as well, and I would like to union the two dataframes so that the window function has access to the "old" feat_int values when computing the sum.
df_new = spark.table("some_new_table")
df_new.show()
+----------+--------+---+
| date|feat_int| id|
+----------+--------+---+
|2018-08-20| 65| 0|
|2018-08-23| 3| 0|
|2018-08-24| 4| 0|
|2018-08-21| 69| 1|
|2018-08-25| 37| 1|
|2018-08-26| 3| 1|
+----------+--------+---+
df_union = df.union(df_new.withColumn("feat_int_sum", F.lit(None)))
df_union.show()
+----------+--------+---+------------+
| date|feat_int| id|feat_int_sum|
+----------+--------+---+------------+
|2018-08-10| 5| 0| 5|
|2018-08-12| 27| 0| 32|
|2018-08-14| 3| 0| 35|
|2018-08-20| 65| 0| null|
|2018-08-23| 3| 0| null|
|2018-08-24| 4| 0| null|
|2018-08-11| 32| 1| 32|
|2018-08-12| 552| 1| 584|
|2018-08-16| 2| 1| 586|
|2018-08-21| 69| 1| null|
|2018-08-25| 37| 1| null|
|2018-08-26| 3| 1| null|
+----------+--------+---+------------+
The problem is that I would like to apply the sum window function on df_union, but only on the rows with null in feat_int_sum. The reason is that I don't want to recalculate the window function for all the values that are already calculated in df. So the desired result would be something like this:
+----------+--------+---+------------+-----------------+
| date|feat_int| id|feat_int_sum|feat_int_sum_temp|
+----------+--------+---+------------+-----------------+
|2018-08-10| 5| 0| 5| null|
|2018-08-12| 27| 0| 32| null|
|2018-08-14| 3| 0| 35| null|
|2018-08-20| 65| 0| null| 100|
|2018-08-23| 3| 0| null| 103|
|2018-08-24| 4| 0| null| 107|
|2018-08-11| 32| 1| 32| null|
|2018-08-12| 552| 1| 584| null|
|2018-08-16| 2| 1| 586| null|
|2018-08-21| 69| 1| null| 655|
|2018-08-25| 37| 1| null| 692|
|2018-08-26| 3| 1| null| 695|
+----------+--------+---+------------+-----------------+
I tried wrapping the window function in a when statement like this:
df_union.withColumn("feat_int_sum_temp", F.when(F.col('date') > '2018-08-16', F.sum('feat_int').over(w)))
But looking at the Spark explain plan, it seems that it will run the window function for all rows and only afterwards apply the when condition.
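(For reference, the plan mentioned here can be inspected directly; this is just the expression above with .explain() appended.)
# Print the physical plan of the attempted expression; the Window operator
# appears before the CASE WHEN projection, i.e. the sum is computed for every row.
(df_union
 .withColumn("feat_int_sum_temp",
             F.when(F.col('date') > '2018-08-16',
                    F.sum('feat_int').over(w)))
 .explain())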
The whole reason I don't want to run the window function on the old rows is that I'm dealing with some really big tables, and I don't want to waste computational resources recalculating values that will not be used.
After this step I would coalesce the feat_int_sum and feat_int_sum_temp columns and append only the new part of the data to HDFS.
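For illustration, once feat_int_sum_temp has been computed for the new rows (however that is achieved), the coalesce-and-append step described above could look roughly like this; the date cutoff separating the new batch is an assumption taken from the when condition above.
# Merge the pre-computed sums with the newly computed ones, then keep only
# the rows from the new batch before appending them back to the table in HDFS.
df_result = (
    df_union
    .withColumn("feat_int_sum",
                F.coalesce(F.col("feat_int_sum"), F.col("feat_int_sum_temp")))
    .drop("feat_int_sum_temp")
    .filter(F.col("date") > "2018-08-16")  # assumed boundary of the new batch
)
df_result.write.mode("append").saveAsTable("some_table_in_hdfs")  # one possible write path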
I would appreciate any hints on how to force Spark to apply the window function only on the specified subset, while the actual window has access to rows outside of this subset.
Related
I am trying to test the usage of F.count(F.col().isNotNull()) in a window function. Please see the following code:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
w = Window.partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w)).show()
When I run this, the first two rows in the xyz1 column get 2. My understanding is that F.count(F.col("xyz")) should count the non-null items from xyz = -infinity up to xyz = null, so how does the appended isNotNull() affect this, and why does it return 2 for the first two rows?
F.count counts every non-null value, and the Boolean expression F.col("xyz").isNotNull() is never null (it is always either True or False), so you end up counting all the rows in the window frame, regardless of whether xyz is null or not. Because the window has an orderBy, the default frame is a running range frame, and the two null rows are ties in that ordering, so each of them sees a frame of two rows, which is why xyz1 is 2 for both.
What you could do instead is sum the isNotNull Boolean rather than count it:
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
I'm learning Apache Spark through PySpark and having issues.
I have the following DF:
+----------+------------+-----------+----------------+
| game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003| home| 90| 3|
|2010091912| home| 95| 0|
|2010112106| home| 75| 0|
|2010121213| home| 85| 3|
|2009092011| null| 9| null|
|2010110703| null| 2| null|
|2010112111| null| 6| null|
|2011100909| home| 102| 3|
|2011120800| home| 72| 2|
|2012010110| home| 74| 6|
|2012110410| home| 68| 1|
|2012120911| away| 91| 2|
|2011103008| null| 6| null|
|2012111100| null| 3| null|
|2013092212| home| 86| 6|
|2013112407| home| 73| 4|
|2013120106| home| 99| 3|
|2014090705| home| 94| 3|
|2014101203| home| 77| 4|
|2014102611| home| 107| 6|
+----------+------------+-----------+----------------+
I'm attempting to find the average number of plays it takes to score a TD, i.e. sum(total_plays)/sum(total_touchdowns).
I figured out the code to get the sums but can't figure out how to get the total average:
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum as a variable, but I don't see how to combine them other than manually pulling out each value and dividing.
Try the code below:
Example:
df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#| 90| 3|
#| 95| 0|
#| 9| null|
#+-----------+----------------+
from pyspark.sql.functions import *
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
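For readability, the same computation can also be written with explicit column expressions instead of the dict-style agg; a sketch, assuming the same DataFrame as above:
from pyspark.sql import functions as F

# Global aggregation (no groupBy needed); the alias is only for clarity.
total_avg = (
    df.agg((F.sum("total_plays") / F.sum("total_touchdowns")).alias("plays_per_td"))
      .collect()[0]["plays_per_td"]
)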
I have a column Quantity in a dataframe. I want to add a new column to this dataframe in which each record has min("Quantity"). I am trying to use lit() in PySpark, something like below:
df.withColumn("min_quant", lit(min(col("Quantity")))).show().
This results in the error below:
grouping expressions sequence is empty, and `InvoiceNo` is not an aggregate function.
Wrap (min(`Quantity`) AS `min_quant`) in windowing function(s) or wrap
This is working:
df.withColumn("min_quant", lit(2)).show().
But, in place of 2 here, I want min(Quantity). Am I missing something?
Please try using a window function, since min() is an aggregate function and needs either a grouping or a window:
val windowSpec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", min("Quantity") over(windowSpec)).show()
Sample Result:
+---------+----+--------+---------+
|InvoiceNo|name|Quantity|min_quant|
+---------+----+--------+---------+
| 1| ABC| 19| 1|
| 1| ABC| 1| 1|
| 1| ABC| 8| 1|
| 1| ABC| 389| 1|
| 1| ABC| 196| 1|
| 2| CBD| 10| 1|
| 2| CBD| 946| 1|
| 3| XYZ| 3| 1|
+---------+----+--------+---------+
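Since the question uses PySpark, a rough equivalent of the snippet above might look like this (a sketch, assuming the same column names):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# With only an orderBy, the default frame is a running range frame, so each
# row gets the minimum Quantity over all rows with InvoiceNo <= its own
# (which is the global minimum, 1, in the sample data above).
window_spec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", F.min("Quantity").over(window_spec)).show()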
I have this dataset:
+----+-----+-------+-----+
|code|code2|machine|value|
+----+-----+-------+-----+
| 1| 2| A| 42|
| 2| 1| A| 11|
| 1| 4| A| 55|
| 1| 1| B| 2|
| 3| 3| B| 34|
| 3| 2| B| 111|
+----+-----+-------+-----+
For each machine I want a kind of matrix like the following: code and code2 are the row and column indices, and at their intersection I want to fill in the value.
Machine A
+----+----+----+----+----+
| A| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0| 11| 0| 0|
| 2| 42| 0| 0| 0|
| 3| 0| 0| 0| 0|
| 4| 55| 0| 0| 0|
+----+----+----+----+----+
Machine B
+----+----+----+----+----+
| B| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 2| 0| 0| 0|
| 2| 0| 0| 111| 0|
| 3| 0| 0| 34| 0|
| 4| 0| 0| 0| 0|
+----+----+----+----+----+
I have multiple machines there (an unknown number), and the codes can only be 0-255.
So my problem is how to achieve that matrix...
My first naive idea was to make a hashmap with the machine name as key and a 256x256 2D array as value. But I don't think that would be efficient, and I also don't know how to achieve it.
Or should I perhaps have a dataset for each machine?
If someone has an idea, I would like to hear it.
Btw I'm using Scala.
For maximum coding flexibility, you could switch to the RDD API. One possible solution gives you an RDD that maps each machine to its matrix, represented as a Scala two-dimensional array. Note that Array.ofDim[Int](n, m) creates a two-dimensional array of size n*m filled with zeros.
df.rdd
  .map(x => x.getAs[String]("machine") -> (x.getAs[Int]("code"), x.getAs[Int]("code2"), x.getAs[Int]("value")))
  .groupByKey
  .mapValues { seq =>
    // one 256x256 matrix per machine, initialised to zeros
    val result = Array.ofDim[Int](256, 256)
    seq.foreach { case (i, j, value) => result(i)(j) = value }
    result
  }
I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
| 1| 1| 0|
| 1| 2| 0|
| 1| 3| 0|
| 1| 4| 100|
| 1| 5| 0|
| 1| 6| 0|
| 1| 7| 200|
| 1| 8| 0|
| 1| 9| 10|
+-------+--------+-------+
Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row.
As a workaround, I have also tried to introduce a purchase identifier purch_id which is incremented each time a purchase was made. So this is listed just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
| 1| 1| 0| 100| 1|
| 1| 2| 0| 100| 1|
| 1| 3| 0| 100| 1|
| 1| 4| 100| 100| 1|
| 1| 5| 0| 100| 2|
| 1| 6| 0| 100| 2|
| 1| 7| 200| 100| 2|
| 1| 8| 0| 100| 3|
| 1| 9| 10| 100| 3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead function like this:
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0, fn.col("revenue"))\
    .otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0 and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
Is there a way to apply this recursively until revenue > 0?
Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do that but struggled to find one.
Window functions don't support recursion, but it is not required here. This type of sessionization can easily be handled with a cumulative sum:
from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("visit_id")
purch_id = sum(
    lag(when(col("revenue") > 0, 1).otherwise(0), 1, 0).over(w)
).over(w) + 1
df.withColumn("purch_id", purch_id).show()
+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
| 1| 1| 0| 1|
| 1| 2| 0| 1|
| 1| 3| 0| 1|
| 1| 4| 100| 1|
| 1| 5| 0| 2|
| 1| 6| 0| 2|
| 1| 7| 200| 2|
| 1| 8| 0| 3|
| 1| 9| 10| 3|
+-------+--------+-------+--------+
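Building on this, the purch_revenue column the question ultimately asks for can be filled in with one more window over (user_id, purch_id); this is a sketch that reuses the purch_id expression defined above:
from pyspark.sql.functions import max as max_
from pyspark.sql.window import Window

# Within each purchase group, revenue is non-zero only on the purchase row
# itself, so the group maximum broadcasts that value to every visit in it.
w_purch = Window.partitionBy("user_id", "purch_id")
(df.withColumn("purch_id", purch_id)
   .withColumn("purch_revenue", max_("revenue").over(w_purch))
   .show())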