pyspark find monthly re-engaged user - apache-spark

I have a large dataframe that looks like the one below, and I need to find the number of monthly re-engaged users, i.e. users who did not visit last month but came back this month.
If I only had to compare two months it would be easy; how can I do this month over month more efficiently?
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        ("2020-05-06", "1"),
        ("2020-05-07", "1"),
        ("2020-05-08", "2"),
        ("2020-05-10", "3"),
        ("2020-05-07", "3"),
        ("2020-05-07", "1"),
        ("2020-05-20", "4"),
        ("2020-05-30", "2"),
        ("2020-05-03", "1"),
        ("2020-06-06", "1"),
        ("2020-06-07", "1"),
        ("2020-06-08", "5"),
        ("2020-06-10", "3"),
        ("2020-06-07", "3"),
        ("2020-06-07", "1"),
        ("2020-06-20", "3"),
        ("2020-06-30", "5"),
        ("2020-07-03", "2"),
        ("2020-07-06", "4"),
        ("2020-07-07", "4"),
        ("2020-07-08", "2"),
        ("2020-07-10", "3"),
        ("2020-07-07", "3"),
        ("2020-07-07", "4"),
        ("2020-07-20", "3"),
        ("2020-07-30", "2"),
        ("2020-08-03", "1"),
        ("2020-08-03", "2"),
        ("2020-08-06", "5"),
        ("2020-08-07", "4"),
        ("2020-08-08", "2"),
        ("2020-08-10", "3"),
        ("2020-08-07", "3"),
        ("2020-08-07", "4"),
        ("2020-08-20", "3"),
        ("2020-08-30", "2"),
        ("2020-08-03", "1"),
    ],
    ["visit_date", "userId"],
)

df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
    "first_day_last_month", F.expr("add_months(first_day_month, -1)")
)

s5 = df.where(F.col("first_day_month") == "2020-05-01")
s6 = df.where(F.col("first_day_month") == "2020-06-01").withColumnRenamed(
    "userId", "userId_right"
)
ss = s5.join(s6, s5.userId == s6.userId_right, how="right")
ss.select("userId_right").where(F.col("userId").isNull()).show()
Spark array manipulation also seems worth trying, but it needs a row-by-row array_intersect calculation that I'm not familiar with yet, and I'm not sure it would be efficient to run it that way (a sketch of this idea follows the expected results below).
dd = (
    df.groupby("first_day_month")
    .agg(F.collect_set("userId").alias("users_current_month"))
    .orderBy("first_day_month")
)
dd.show()
+---------------+-------------------+
|first_day_month|users_current_month|
+---------------+-------------------+
| 2020-05-01| [1, 2, 3, 4]|
| 2020-06-01| [1, 3, 5]|
| 2020-07-01| [2, 3, 4]|
| 2020-08-01| [1, 2, 3, 4, 5]|
+---------------+-------------------+
Any idea?
Expected results:
first_day_month  reengaged_user_count
2020-06-01       1
2020-07-01       2
2020-08-01       2
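A minimal sketch of the array idea mentioned in the question (untested; it assumes Spark 2.4+ for array_except and reuses df and F from the snippets above): collect each month's distinct users, pull the previous month's set with lag, and count the users that were absent last month.
from pyspark.sql import Window

monthly = df.groupby("first_day_month").agg(
    F.collect_set("userId").alias("users_current_month")
)

# A single unpartitioned window is acceptable here: there is only one row per month.
w = Window.orderBy("first_day_month")

reengaged = (
    monthly.withColumn("users_prev_month", F.lag("users_current_month").over(w))
    .where(F.col("users_prev_month").isNotNull())
    .select(
        "first_day_month",
        F.size(
            F.array_except("users_current_month", "users_prev_month")
        ).alias("reengaged_user_count"),
    )
    .orderBy("first_day_month")
)
reengaged.show()
Note that this also counts first-time visitors (such as user 5 in June) as re-engaged, which is what makes 2020-06-01 come out as 1 in the expected results; if a stricter "returned after an absence" definition is needed, earlier months would have to be checked as well.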

Using analytic functions, we can do something like this:
from pyspark.sql import Window

df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
    "first_day_last_month",
    F.lag("first_day_month").over(Window.partitionBy("userId").orderBy("visit_date")),
)
ss = df.where(F.months_between("first_day_month", "first_day_last_month") > 1)
ss.show()
+----------+------+---------------+--------------------+
|visit_date|userId|first_day_month|first_day_last_month|
+----------+------+---------------+--------------------+
|2020-08-06| 5| 2020-08-01| 2020-06-01|
|2020-08-03| 1| 2020-08-01| 2020-06-01|
|2020-07-06| 4| 2020-07-01| 2020-05-01|
|2020-07-03| 2| 2020-07-01| 2020-05-01|
+----------+------+---------------+--------------------+
ss.groupBy("first_day_month").agg(F.collect_set("UserId")).show()
+---------------+-------------------+
|first_day_month|collect_set(UserId)|
+---------------+-------------------+
| 2020-08-01| [1, 5]|
| 2020-07-01| [2, 4]|
+---------------+-------------------+
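To get the counts in the shape of the expected results, the same ss dataframe can be aggregated with countDistinct instead of collect_set; a minimal sketch reusing ss and F from above:
(
    ss.groupBy("first_day_month")
    .agg(F.countDistinct("userId").alias("reengaged_user_count"))
    .orderBy("first_day_month")
    .show()
)
With this lag-based approach, a month whose only "returning" visitor is actually a first-time user (2020-06-01 with user 5) does not appear at all, because lag is null for a user's first visit.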

Related

Append a value after every element in PySpark list Dataframe

I have a dataframe like this:
Data       ID
[1,2,3,4]  22
I want to create a new column in which each element of Data is suffixed with the ID via a | symbol and the results are joined with ~, like below:
Data       ID  New_Column
[1,2,3,4]  22  [1|22~2|22~3|22~4|22]
Note: the array size in the Data field is not fixed; it may be empty or contain any number of entries.
Can anyone please help me solve this?
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (22, Seq(1, 2, 3, 4)),
    (23, Seq(1, 2, 3, 4, 5, 6, 7, 8)),
    (24, Seq())
  ).toDF("ID", "Data")

  val arrUDF = udf((id: Long, array: Seq[Long]) => {
    val r = array.size match {
      case 0 => ""
      case _ => array.map(x => s"$x|$id").mkString("~")
    }
    s"[$r]"
  })

  val resDF = df.withColumn("New_Column", lit(arrUDF('ID, 'Data)))

  resDF.show(false)
  //+---+------------------------+-----------------------------------------+
  //|ID |Data                    |New_Column                               |
  //+---+------------------------+-----------------------------------------+
  //|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
  //|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
  //|24 |[]                      |[]                                       |
  //+---+------------------------+-----------------------------------------+
}
Spark 2.4+
The PySpark equivalent goes like this:
from pyspark.sql import functions as f

df = spark.createDataFrame([(22, [1,2,3,4]), (23, [1,2,3,4,5,6,7,8]), (24, [])], ['Id', 'Data'])
df.show()
+---+--------------------+
| Id| Data|
+---+--------------------+
| 22| [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24| []|
+---+--------------------+
df.withColumn('ff', f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))'''))).show(20,False)
+---+------------------------+---------------------------------------+
|Id |Data |ff |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4] |1|22~2|22~3|22~4|22 |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[] | |
+---+------------------------+---------------------------------------+
If you want the final output as an array:
df.withColumn('ff',f.array(f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))''')))).show(20,False)
+---+------------------------+-----------------------------------------+
|Id |Data |ff |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[] |[] |
+---+------------------------+-----------------------------------------+
Hope this helps
A udf can help:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def func(array, suffix):
    return '~'.join([str(x) + '|' + str(suffix) for x in array])

my_udf = F.udf(func, StringType())
df.withColumn("New_Column", my_udf("Data", "ID")).show()
prints
+------------+---+-------------------+
|        Data| ID|         New_Column|
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+

Spark graphx issue

I am trying to follow the example in
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
However, when I change some of the criteria, the result is not as expected.
Please see the steps below:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *

vertices = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)], ["id", "name", "age"])

edges = sqlContext.createDataFrame([
    ("a", "b", "follow"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "follow"),
    ("d", "a", "follow"),
    ("a", "e", "follow")
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
One change I have made: in the "relationship" column, all values are "follow" instead of "friend".
The query below runs fine:
g.bfs(fromExpr="name = 'Alice'", toExpr="age < 32", edgeFilter="relationship != 'friend'", maxPathLength=10).show()
+--------------+--------------+---------------+--------------+----------------+
| from| e0| v1| e1| to|
+--------------+--------------+---------------+--------------+----------------+
|[a, Alice, 34]|[a, e, follow]|[e, Esther, 32]|[e, d, follow]| [d, David, 29]|
|[a, Alice, 34]|[a, b, follow]| [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
+--------------+--------------+---------------+--------------+----------------+
But if I change the filter criterion from 32 to 35, the wrong result is fetched:
g.bfs(fromExpr="name = 'Alice'", toExpr="age < 35", edgeFilter="relationship != 'friend'", maxPathLength=10).show()
+--------------+--------------+
| from| to|
+--------------+--------------+
|[a, Alice, 34]|[a, Alice, 34]|
+--------------+--------------+
Ideally it should fetch a result similar to the first query, because the filter condition is still satisfied for those rows.
Is there any explanation for this?
bfs() searches for the first result that meets your predicate. Alice's age is 34, which satisfies the toExpr = "age < 35" predicate, so you get a zero-length path starting from Alice. Change toExpr to something more specific; for example, toExpr = "name = 'David' or name = 'Charlie'" should give you exactly the same result as the first query.
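A sketch of that suggestion, reusing g from the question; it should return the two paths shown by the first query:
# Make toExpr specific enough that the start vertex itself cannot satisfy it.
g.bfs(
    fromExpr="name = 'Alice'",
    toExpr="name = 'David' or name = 'Charlie'",
    edgeFilter="relationship != 'friend'",
    maxPathLength=10,
).show()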

Get top values based on compound key for each partition in Spark RDD

I want to use the following RDD:
rdd = sc.parallelize([("K1", "e", 9), ("K1", "aaa", 9), ("K1", "ccc", 3), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)])
to get the output
[('K1', 'aaa', 9),
('K1', 'ddd', 9),
('K1', 'e', 9),
('B1', 'iop', 8),
('B1', 'rty', 7),
('B1', 'qwe', 4)]
I referred to "Get Top 3 values for every key in a RDD in Spark" and used the following code:
from heapq import nlargest

rdd.groupBy(
    lambda x: x[0]
).flatMap(
    lambda g: nlargest(3, g[1], key=lambda x: (x[2], x[1]))
).collect()
However, I can only derive
[('K1', 'e', 9),
('K1', 'ddd', 9),
('K1', 'aaa', 9),
('B1', 'iop', 8),
('B1', 'rty', 7),
('B1', 'qwe', 4)]
How should I do this?
It is actually a sorting problem, but sorting is a computationally expensive process due to shuffling. You can try:
rdd2 = rdd.groupBy(
    lambda x: x[0]
).flatMap(
    lambda g: nlargest(3, g[1], key=lambda x: (x[2], x[1]))
)

rdd2.sortBy(lambda x: (x[1], x[2])).collect()
# [('K1', 'aaa', 9), ('K1', 'ddd', 9), ('K1', 'e', 9), ('B1', 'iop', 8), ('B1', 'qwe', 4), ('B1', 'rty', 7)]
I have sorted it using the first and second values of the tuples.
Also note that q comes before r alphabetically, so your expected output is off and misleading.
If you are open to using a DataFrame, you can use a window function with rank.
Inspired by here:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql import Window

spark = SparkSession.builder.appName('test').master("local[*]").getOrCreate()

df = spark.createDataFrame([
    ("K1", "e", 9),
    ("K1", "aaa", 9),
    ("K1", "ccc", 3),
    ("K1", "ddd", 9),
    ("B1", "qwe", 4),
    ("B1", "rty", 7),
    ("B1", "iop", 8),
    ("B1", "zxc", 1)], ['A', 'B', 'C']
)

w = Window.partitionBy('A').orderBy(df.C.desc())
df.select('*', F.rank().over(w).alias('rank')).filter("rank < 4").drop('rank').show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| B1|iop|  8|
| B1|rty|  7|
| B1|qwe|  4|
| K1|  e|  9|
| K1|aaa|  9|
| K1|ddd|  9|
+---+---+---+
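If exactly three rows per key are required even when values tie, row_number could be swapped in for rank; a small sketch reusing w and F from above:
# row_number breaks ties arbitrarily, so at most three rows survive per key.
df.select('*', F.row_number().over(w).alias('rn')).filter("rn <= 3").drop('rn').show()
rank assigns every tied row the same rank (so rank < 4 can return more than three rows per key when many values tie), while row_number always returns exactly three.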

Filtering a Dataset<Row> if month is in list of integers

I have a Spark Dataset<Row> in Java that looks like this:
+-------+-------------------+---------------+----------+--------------------+-----+
|item_id| date_time|horizon_minutes|last_value| values|label|
+-------+-------------------+---------------+----------+--------------------+-----+
| 8|2019-04-30 09:55:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
| 8|2019-04-30 10:00:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
| 8|2019-04-30 10:05:00| 15| 0.0|[0.0,0.0,0.0,0.0,...| 0.0|
I want to filter the Dataset to keep only those rows whose month is in a list of integers (e.g. 1, 2, 5, 12).
I have tried the string-based filter function:
rowsDS.filter("month(date_time)" ???)
but I don't know how to express the "is in this list of integers" condition there.
I have also tried to filter with a lambda function, with no luck:
rowsDS.filter(row -> listofints.contains(row.getDate(1).getMonth()))
Evaluation failed. Reason(s):
Lambda expressions cannot be used in an evaluation expression
Is there any simple way to do this? I would prefer to use lambda functions, as I am not fond of the string-based Spark SQL filters such as the first example.
For a DataFrame, in Scala:
val result = df.where(month($"date_time").isin(2, 3, 4))
In Java:
Dataset<Row> result = df.where(month(col("date_time")).isin(2, 3, 4));
For get "col" and "month" function in Java:
import static org.apache.spark.sql.functions.*;
You can define a UDF as described here and here.
My example:
val seq1 = Seq(
  ("A", "abc", 0.1, 0.0, 0),
  ("B", "def", 0.15, 0.5, 0),
  ("C", "ghi", 0.2, 0.2, 1),
  ("D", "jkl", 1.1, 0.1, 0),
  ("E", "mno", 0.1, 0.1, 0)
)

val ls = List("A", "B")
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")

def rawFilterFunc(r: String) = ls.contains(r)
ss.udf.register("ff", rawFilterFunc _)

df1.filter(callUDF("ff", df1("cA"))).show()
Gives output:
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| A|abc| 0.1|0.0| 0|
| B|def|0.15|0.5| 0|
+---+---+----+---+---+

How to do Rdd and broadcasted Rdd multiplication in pyspark?

I have two data frames like below.
Data frame 1 (df1):
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
Data frame 2 (df2):
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that I converted the DataFrames to RDDs:
rdd1 = df1.rdd
If I do rdd1.collect() the result is like below:
[Row(id=8, f=[5, 4, 5]), Row(id=9, f=[4, 5, 2])]
rdd2=df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now if I do broadcastedrddif.value, I get:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want to compute the sum of the element-wise products of rdd1 and broadcastedrddif, i.e. it should return output like below:
((8,[(1,(5*1+2*4+5*3)),(2,(5*4+4*5+5*6))]),(9,[(1,(4*1+5*2+2*3)),(2,(4*4+5*5+2*6)]) ))
So my final output should be
((8,[(1,28),(2,70)]),(9,[(1,20),(2,53)]))
where (1, 28) is a tuple, not a float.
Please help me with this.
I did not understand why you used sc.broadcast(), but I used it anyway...
mapValues is very useful on the last RDD here, and I used a list comprehension to execute the operations against the broadcast dictionary.
x1 = sc.parallelize([[8, 5, 4, 5], [9, 4, 5, 2]]).map(lambda x: (x[0], (x[1], x[2], x[3])))
x1.collect()

x2 = sc.parallelize([[1, 1, 2, 3], [2, 4, 5, 6]]).map(lambda x: (x[0], (x[1], x[2], x[3])))
x2.collect()

# I built the RDDs directly because it is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2 = broadcastedrddif.value

def sum_prod(x, y):
    # element-wise multiplication, then sum (a dot product)
    c = 0
    for i in range(0, len(x)):
        c += x[i] * y[i]
    return c

x1.mapValues(lambda x: [(i, sum_prod(list(x), list(d2[i]))) for i in [k for k in d2.keys()]]).collect()
# Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
