Spark DataFrame add column from subquery - apache-spark

Using SQL syntax I can add a new column from a subquery, like this:
import spark.sqlContext.implicits._

List(
  ("a", "1", "2"),
  ("b", "1", "3"),
  ("c", "1", "4"),
  ("d", "1", "5")
).toDF("name", "start", "end")
  .createOrReplaceTempView("base")

List(
  ("a", "1", "2"),
  ("b", "2", "3"),
  ("c", "3", "4"),
  ("d", "4", "5"),
  ("f", "5", "6")
).toDF("name", "number", "_count")
  .createOrReplaceTempView("col")

spark.sql(
  """
    |select a.name,
    |       (select max(_count) from col b where b.number = a.end) -
    |       (select max(_count) from col b where b.number = a.start) as result
    |from base a
    |""".stripMargin)
  .show(false)
How can I do that with the DataFrame API?

I found the syntax:
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions.expr

val b = List(
  ("a", "1", "2"),
  ("b", "1", "3"),
  ("c", "1", "4"),
  ("d", "1", "5")
).toDF("name", "start", "end")

List(
  ("a", "1", "2"),
  ("b", "2", "3"),
  ("c", "3", "4"),
  ("d", "4", "5"),
  ("f", "5", "6")
).toDF("name", "number", "_count")
  .createOrReplaceTempView("ref_table")

b.withColumn(
  "result",
  expr("(select max(_count) from ref_table r where r.number = end) - " +
       "(select max(_count) from ref_table r where r.number = start)")
).show(false)

I think max is not required; we can follow the approach below:
val base = List(
  ("a", "1", "2"),
  ("b", "1", "3"),
  ("c", "1", "4"),
  ("d", "1", "5")
).toDF("name", "start", "end")

val col = List(
  ("a", "1", "2"),
  ("b", "2", "3"),
  ("c", "3", "4"),
  ("d", "4", "5"),
  ("f", "5", "6")
).toDF("name", "number", "_count")

val df = base.join(col, col("number") === base("end"))
  .select(base("name"), col("_count"))
val df1 = base.join(col, col("number") === base("start"))
  .select(base("name").alias("nameDf"), col("_count").alias("count"))

df.join(df1, df("name") === df1("nameDf"))
  .select($"name", ($"_count" - $"count").alias("result"))
  .show(false)

Related

pyspark find monthly re-engaged user

I have a large dataframe that looks like this, and I need to find the number of monthly re-engaged users, i.e. users who did not visit last month but came back this month.
If I only needed to compare two months it would be easy; how can I do this month over month more efficiently?
import pyspark.sql.functions as F

df = spark.createDataFrame(
[
("2020-05-06", "1"),
("2020-05-07", "1"),
("2020-05-08", "2"),
("2020-05-10", "3"),
("2020-05-07", "3"),
("2020-05-07", "1"),
("2020-05-20", "4"),
("2020-05-30", "2"),
("2020-05-03", "1"),
("2020-06-06", "1"),
("2020-06-07", "1"),
("2020-06-08", "5"),
("2020-06-10", "3"),
("2020-06-07", "3"),
("2020-06-07", "1"),
("2020-06-20", "3"),
("2020-06-30", "5"),
("2020-07-03", "2"),
("2020-07-06", "4"),
("2020-07-07", "4"),
("2020-07-08", "2"),
("2020-07-10", "3"),
("2020-07-07", "3"),
("2020-07-07", "4"),
("2020-07-20", "3"),
("2020-07-30", "2"),
("2020-08-03", "1"),
("2020-08-03", "2"),
("2020-08-06", "5"),
("2020-08-07", "4"),
("2020-08-08", "2"),
("2020-08-10", "3"),
("2020-08-07", "3"),
("2020-08-07", "4"),
("2020-08-20", "3"),
("2020-08-30", "2"),
("2020-08-03", "1"),
],
["visit_date", "userId"],
)
df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
"first_day_last_month", F.expr("add_months(first_day_month, -1)")
)
s5 = df.where(F.col("first_day_month") == "2020-05-01")
s6 = df.where(F.col("first_day_month") == "2020-06-01").withColumnRenamed(
"userId", "userId_right"
)
ss = s5.join(s6, s5.userId == s6.userId_right, how="right")
ss.select("userId_right").where(F.col("userId").isNull()).show()
Spark array manipulation also seems worth trying, but it needs a row-by-row array_intersect calculation, which I'm not familiar with yet; I'm also not sure it would be efficient to run this way (one possible shape of that idea is sketched after the expected results below).
dd = (
    df.groupby("first_day_month")
    .agg(F.collect_list("userId").alias("users_current_month"))
    .orderBy("first_day_month")
)
dd.show()
+---------------+-------------------+
|first_day_month|users_current_month|
+---------------+-------------------+
| 2020-05-01| [1, 2, 3, 4]|
| 2020-06-01| [1, 3, 5]|
| 2020-07-01| [2, 3, 4]|
| 2020-08-01| [1, 2, 3, 4, 5]|
+---------------+-------------------+
Any ideas?
Expected results:
first_day_month  reengaged_user_count
2020-06-01       1
2020-07-01       2
2020-08-01       2
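Not from the original thread, but one possible shape of the array idea mentioned above: lag the per-month arrays from dd by one month and count the users present now but absent last month. A hedged sketch; note that array_except deduplicates, and that a global Window.orderBy pulls all rows to one partition, which is fine for a handful of month rows:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Sketch, assuming `dd` from above: users in this month's array that are
# missing from last month's array. This reproduces the expected counts,
# including the brand-new userId 5 in June.
w = Window.orderBy("first_day_month")
(dd
 .withColumn("users_prev_month", F.lag("users_current_month").over(w))
 .where(F.col("users_prev_month").isNotNull())
 .withColumn("reengaged_user_count",
             F.size(F.array_except("users_current_month", "users_prev_month")))
 .select("first_day_month", "reengaged_user_count")
 .show())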
Using analytic (window) functions, we can do something like this:
df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
"first_day_last_month",
F.lag("first_day_month").over(Window.partitionBy("userId").orderBy("visit_date")),
)
ss = df.where(F.months_between("first_day_month", "first_day_last_month") > 1)
ss.show()
+----------+------+---------------+--------------------+
|visit_date|userId|first_day_month|first_day_last_month|
+----------+------+---------------+--------------------+
|2020-08-06| 5| 2020-08-01| 2020-06-01|
|2020-08-03| 1| 2020-08-01| 2020-06-01|
|2020-07-06| 4| 2020-07-01| 2020-05-01|
|2020-07-03| 2| 2020-07-01| 2020-05-01|
+----------+------+---------------+--------------------+
ss.groupBy("first_day_month").agg(F.collect_set("UserId")).show()
+---------------+-------------------+
|first_day_month|collect_set(UserId)|
+---------------+-------------------+
| 2020-08-01| [1, 5]|
| 2020-07-01| [2, 4]|
+---------------+-------------------+
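To turn those sets into counts, a small hedged addition (note that this lag-based approach only flags users with an earlier visit, so a month whose only returner is a brand-new user, like 2020-06-01 above, will not appear):

ss.groupBy("first_day_month") \
  .agg(F.countDistinct("userId").alias("reengaged_user_count")) \
  .orderBy("first_day_month") \
  .show()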

Use zip operation in a more efficient way on a dataset in python

I perform the operation below on this dataset:
import pandas as pd

d = pd.DataFrame({'id': ["007", "007", "001", "001", "008", "008", "007", "007", "009", "007", "000", "001", "009", "009", "009"],
'id_2': ["b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"c", "c", "c", "c"],
'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2", "us1", "us2", "us4", "us1", "us2", "us1"],
"timestamp_date": [1589175687010, 1589188715313, 1589187142475, 1589187315368, 1589187155446, 1589187301028, 1589189765216, 1589190375088,
1589364060781, 1589421612029, 1589363453544, 1589364557808, 1589354347548, 1589356096273, 1589273208050]})
df = d.sort_values('timestamp_date')
df.groupby(['id_2', 'id'], sort=False).apply(
lambda x: list(zip(x['userid'][:-1], x['userid'][1:],
x['timestamp_date'][:-1], x['timestamp_date'][1:]))).reset_index(name='duplicates')
But the problem is that this operation takes very long: to give an idea, for 4 million records it takes around 17 minutes.
I would like to know if there is a more efficient way to do it. From my tests and from what I've read online, I think the zip is the problem, but I couldn't find another way to achieve the same result :(
Thanks
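One common speedup is to replace the per-group slicing with vectorized groupby-shift, so pandas pairs each row with its successor without Python-level loops. A hedged sketch, not from the original thread, assuming the same d as above:

import pandas as pd

# Sketch: compute "next row" columns per (id_2, id) group with shift(-1),
# then collect the tuples with a cheap groupby-agg.
df = d.sort_values("timestamp_date")
g = df.groupby(["id_2", "id"], sort=False)
df["userid_next"] = g["userid"].shift(-1)
df["ts_next"] = g["timestamp_date"].shift(-1)

pairs = df[df["userid_next"].notna()].copy()  # last row of a group has no successor
pairs["ts_next"] = pairs["ts_next"].astype("int64")
pairs["pair"] = list(zip(pairs["userid"], pairs["userid_next"],
                         pairs["timestamp_date"], pairs["ts_next"]))
result = (pairs.groupby(["id_2", "id"], sort=False)["pair"]
          .agg(list)
          .reset_index(name="duplicates"))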

Spark graphx issue

I am trying to follow the example in
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
However, when I change some criteria, the result is not as expected.
Please see the steps below:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *
vertices = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
edges = sqlContext.createDataFrame([
("a", "b", "follow"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "follow"),
("d", "a", "follow"),
("a", "e", "follow")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
One change I have made: in the "relationship" column, all values are "follow" instead of "friend".
Now the query below runs fine:
g.bfs(fromExpr ="name = 'Alice'",toExpr = "age < 32", edgeFilter ="relationship != 'friend'" , maxPathLength = 10).show()
+--------------+--------------+---------------+--------------+----------------+
| from| e0| v1| e1| to|
+--------------+--------------+---------------+--------------+----------------+
|[a, Alice, 34]|[a, e, follow]|[e, Esther, 32]|[e, d, follow]| [d, David, 29]|
|[a, Alice, 34]|[a, b, follow]| [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
+--------------+--------------+---------------+--------------+----------------+
but if I change the filter criterion from 32 to 35, a wrong result is fetched:
>>> g.bfs(fromExpr ="name = 'Alice'",toExpr = "age < 35", edgeFilter ="relationship != 'friend'" , maxPathLength = 10).show()
+--------------+--------------+
| from| to|
+--------------+--------------+
|[a, Alice, 34]|[a, Alice, 34]|
+--------------+--------------+
Ideally it should fetch a result similar to the first query, because the filter condition is still satisfied for those rows.
Is there any explanation for this?
bfs() searches for the first result that meets your predicate. Alice's age is 34, which meets the toExpr = "age < 35" predicate, so you get a zero-length path starting from Alice. Change toExpr to something more specific, for example toExpr = "name = 'David' or name = 'Charlie'", which should give you exactly the same result as the first query.
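For example, with the suggested predicate:

g.bfs(fromExpr="name = 'Alice'",
      toExpr="name = 'David' or name = 'Charlie'",
      edgeFilter="relationship != 'friend'",
      maxPathLength=10).show()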

Get all the children of a parent id using Spark GraphFrame motif search

I am using Apache Spark to create a GraphFrame and run motif queries.
I have created the required edges and vertices and am executing a motif query on a lookup pattern. I need to fetch all children of a particular node, along with their sub-children. For example:
// Vertex DataFrame
val v = sqlContext.createDataFrame(List(
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)
)).toDF("id", "name", "age")
// Edge DataFrame
val e = sqlContext.createDataFrame(List(
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
)).toDF("src", "dst", "relationship")
// Create a GraphFrame
val g = GraphFrame(v, e)
Now, if I click on node a, I should get all children and sub-children of a.
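Not from the original thread, but one way to gather every descendant is to expand a frontier iteratively against the edge DataFrame. A hedged sketch of that idea in the Python GraphFrames API (the question uses Scala, but the DataFrame operations translate directly); descendants is a hypothetical helper and g is the GraphFrame built above:

from pyspark.sql import functions as F

# Hedged sketch: breadth-first expansion over g.edges collecting all
# descendants of a root node; already-seen ids are subtracted so cycles
# (e.g. b -> c -> b) terminate.
def descendants(g, root):
    edges = g.edges.select("src", "dst")
    frontier = (edges.where(F.col("src") == root)
                .select(F.col("dst").alias("id"))
                .distinct())
    seen = frontier
    while frontier.count() > 0:
        frontier = (edges.alias("e")
                    .join(frontier.alias("f"), F.col("e.src") == F.col("f.id"))
                    .select(F.col("e.dst").alias("id"))
                    .distinct()
                    .subtract(seen))
        seen = seen.union(frontier)
    return seen

descendants(g, "a").show()  # all children and sub-children of node a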

Not Enough Arguments For an IF Statement...But There Are

When I run this formula in Excel it tells me that there is only 1 argument for the IF statement. I am not sure why it says that when I have 3 arguments. Within the OR statement I have 2 different AND statements. It works just fine if I get rid of the second AND statement. Did I mess up a parenthesis somewhere? I am not sure what is wrong. Any help would be greatly appreciated. Thanks!
=IF(OR(ARRAYFORMULA(SUM(COUNTIF(B19:O19,{"I","Ip","Ia","It","Ih","A","Aa","Ap","At","Ah","X","R","Rt","Rx","Rp","Rh","K","Kt","E","Et","AL","HL","TV*","FFSL","ADM*"})))=10, AND(ARRAYFORMULA(SUM(COUNTIF(B19:O19,{"R-10","Rx-10*","Rp-10","Rt-10*","Rh-10","I-10","Ia-10","Ip-10","It-10","Ih-10","X-10*","A-10*","At-10"})))=4, ARRAYFORMULA(SUM(COUNTIF(B19:O19,{"I","Ip","Ia","It","Ih","A","Aa","Ap","At","Ah","X","R","Rt","Rx","Rp","Rh","K","Kt","E","Et","AL","HL","TV*","FFSL","ADM*"})))=5),AND(ARRAYFORMULA(SUM(COUNTIF(B19:O19,{"HL-9","X-9","N-9","E-9","J-9","Jh-9","Nh-9","Eh-9"})))=8,ARRAYFORMULA(SUM(COUNTIF(B19:O19,{"I","Ip","Ia","It","Ih","A","Aa","Ap","At","Ah","X","R","Rt","Rx","Rp","Rh","K","Kt","E","Et","AL","HL","TV*","FFSL","ADM*"})))=1) ,"80 Hours","Error"))
This question makes me think "If only there was an online Excel Formula Beautifier".
Oh look, there is.
If you copy-and-paste it into the beautifier you get the code below.
You can now see that your parameters "80 Hours", "Error" are parameters of the first ARRAYFORMULA function, not the IF function.
=IF(
OR(
ARRAYFORMULA(
SUM(
COUNTIF(
B19:O19,
{ "I",
"Ip",
"Ia",
"It",
"Ih",
"A",
"Aa",
"Ap",
"At",
"Ah",
"X",
"R",
"Rt",
"Rx",
"Rp",
"Rh",
"K",
"Kt",
"E",
"Et",
"AL",
"HL",
"TV*",
"FFSL",
"ADM*"
ARRAYROWSTOP)
ARRAYSTOP)
)
)
) = 10,
AND(
ARRAYFORMULA(
SUM(
COUNTIF(
B19:O19,
{ "R-10",
"Rx-10*",
"Rp-10",
"Rt-10*",
"Rh-10",
"I-10",
"Ia-10",
"Ip-10",
"It-10",
"Ih-10",
"X-10*",
"A-10*",
"At-10"
ARRAYROWSTOP)
ARRAYSTOP)
)
)
) = 4,
ARRAYFORMULA(
SUM(
COUNTIF(
B19:O19,
{ "I",
"Ip",
"Ia",
"It",
"Ih",
"A",
"Aa",
"Ap",
"At",
"Ah",
"X",
"R",
"Rt",
"Rx",
"Rp",
"Rh",
"K",
"Kt",
"E",
"Et",
"AL",
"HL",
"TV*",
"FFSL",
"ADM*"
ARRAYROWSTOP)
ARRAYSTOP)
)
)
) = 5
),
AND(
ARRAYFORMULA(
SUM(
COUNTIF(
B19:O19,
{ "HL-9",
"X-9",
"N-9",
"E-9",
"J-9",
"Jh-9",
"Nh-9",
"Eh-9"
ARRAYROWSTOP)
ARRAYSTOP)
)
)
) = 8,
ARRAYFORMULA(
SUM(
COUNTIF(
B19:O19,
{ "I",
"Ip",
"Ia",
"It",
"Ih",
"A",
"Aa",
"Ap",
"At",
"Ah",
"X",
"R",
"Rt",
"Rx",
"Rp",
"Rh",
"K",
"Kt",
"E",
"Et",
"AL",
"HL",
"TV*",
"FFSL",
"ADM*"
ARRAYROWSTOP)
ARRAYSTOP)
)
)
) = 1
),
"80 Hours",
"Error"
)
)
