Spark graphx issue - apache-spark

I am trying to follow the example at
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
However, when I change some of the criteria, the result is not what I expect. Please see the steps below:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *
vertices = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
edges = sqlContext.createDataFrame([
("a", "b", "follow"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "follow"),
("d", "a", "follow"),
("a", "e", "follow")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
One change I have made: in the "relationship" column, all values are "follow" instead of "friend".
The query below runs fine:
g.bfs(fromExpr="name = 'Alice'", toExpr="age < 32", edgeFilter="relationship != 'friend'", maxPathLength=10).show()
+--------------+--------------+---------------+--------------+----------------+
| from| e0| v1| e1| to|
+--------------+--------------+---------------+--------------+----------------+
|[a, Alice, 34]|[a, e, follow]|[e, Esther, 32]|[e, d, follow]| [d, David, 29]|
|[a, Alice, 34]|[a, b, follow]| [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
+--------------+--------------+---------------+--------------+----------------+
But if I change the filter criterion from 32 to 35, the wrong result is fetched:
>>> g.bfs(fromExpr="name = 'Alice'", toExpr="age < 35", edgeFilter="relationship != 'friend'", maxPathLength=10).show()
+--------------+--------------+
| from| to|
+--------------+--------------+
|[a, Alice, 34]|[a, Alice, 34]|
+--------------+--------------+
Ideally it should fetch a result similar to the first query, because the filter condition is still satisfied by those rows.
Is there any explanation for this?

bfs() searches for the first (shortest) result that meets your predicates. Alice's age is 34, which already satisfies the toExpr = "age < 35" predicate, so you get a zero-length path starting (and ending) at Alice. Change toExpr to something more specific, for example toExpr = "name = 'David' or name = 'Charlie'", which should give you exactly the same result as the first query.
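For reference, here is the suggested call spelled out (a sketch based on the answer above; not re-run here):

g.bfs(fromExpr="name = 'Alice'",
      toExpr="name = 'David' or name = 'Charlie'",
      edgeFilter="relationship != 'friend'",
      maxPathLength=10).show()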

Related

pyspark find monthly re-engaged user

I have a large dataframe that looks like the one below, and I need to find the number of monthly re-engaged users, i.e. users who did not visit last month but came back this month.
If I only needed to compare two months it would be easy; how can I do this month over month more efficiently?
import pyspark.sql.functions as F

df = spark.createDataFrame(
[
("2020-05-06", "1"),
("2020-05-07", "1"),
("2020-05-08", "2"),
("2020-05-10", "3"),
("2020-05-07", "3"),
("2020-05-07", "1"),
("2020-05-20", "4"),
("2020-05-30", "2"),
("2020-05-03", "1"),
("2020-06-06", "1"),
("2020-06-07", "1"),
("2020-06-08", "5"),
("2020-06-10", "3"),
("2020-06-07", "3"),
("2020-06-07", "1"),
("2020-06-20", "3"),
("2020-06-30", "5"),
("2020-07-03", "2"),
("2020-07-06", "4"),
("2020-07-07", "4"),
("2020-07-08", "2"),
("2020-07-10", "3"),
("2020-07-07", "3"),
("2020-07-07", "4"),
("2020-07-20", "3"),
("2020-07-30", "2"),
("2020-08-03", "1"),
("2020-08-03", "2"),
("2020-08-06", "5"),
("2020-08-07", "4"),
("2020-08-08", "2"),
("2020-08-10", "3"),
("2020-08-07", "3"),
("2020-08-07", "4"),
("2020-08-20", "3"),
("2020-08-30", "2"),
("2020-08-03", "1"),
],
["visit_date", "userId"],
)
df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
"first_day_last_month", F.expr("add_months(first_day_month, -1)")
)
s5 = df.where(F.col("first_day_month") == "2020-05-01")
s6 = df.where(F.col("first_day_month") == "2020-06-01").withColumnRenamed(
"userId", "userId_right"
)
ss = s5.join(s6, s5.userId == s6.userId_right, how="right")
ss.select("userId_right").where(F.col("userId").isNull()).show()
Spark array manipulation also seems worth trying, but it needs a row-by-row array_intersect calculation, which I'm not familiar with yet; I'm also not sure whether running it that way is efficient (a rough sketch of this idea appears after the expected results below).
dd = (
df.groupby("first_day_month")
.agg(F.collect_list("userId").alias("users_current_month"))
.orderBy("first_day_month")
)
dd.show()
+---------------+-------------------+
|first_day_month|users_current_month|
+---------------+-------------------+
| 2020-05-01| [1, 2, 3, 4]|
| 2020-06-01| [1, 3, 5]|
| 2020-07-01| [2, 3, 4]|
| 2020-08-01| [1, 2, 3, 4, 5]|
+---------------+-------------------+
Any ideas?
Expected results:
first_day_month reengaged_user_count
2020-06-01 1
2020-07-01 2
2020-08-01 2
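For completeness, here is a rough sketch of the array idea mentioned in the question (not part of the original post; it assumes Spark 2.4+ for array_except). It pulls the previous month's user array in with lag() and counts the users present this month but not last month:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("first_day_month")  # single partition; acceptable for a small monthly rollup
reengaged = (
    dd.withColumn("users_last_month", F.lag("users_current_month").over(w))
      .where(F.col("users_last_month").isNotNull())
      .withColumn("reengaged_users",
                  F.array_except("users_current_month", "users_last_month"))
      .select("first_day_month",
              F.size("reengaged_users").alias("reengaged_user_count"))
)
reengaged.show()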
Using analytic functions, we can do something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn("first_day_month", F.trunc("visit_date", "month")).withColumn(
"first_day_last_month",
F.lag("first_day_month").over(Window.partitionBy("userId").orderBy("visit_date")),
)
ss = df.where(F.months_between("first_day_month", "first_day_last_month") > 1)
ss.show()
+----------+------+---------------+--------------------+
|visit_date|userId|first_day_month|first_day_last_month|
+----------+------+---------------+--------------------+
|2020-08-06| 5| 2020-08-01| 2020-06-01|
|2020-08-03| 1| 2020-08-01| 2020-06-01|
|2020-07-06| 4| 2020-07-01| 2020-05-01|
|2020-07-03| 2| 2020-07-01| 2020-05-01|
+----------+------+---------------+--------------------+
ss.groupBy("first_day_month").agg(F.collect_set("UserId")).show()
+---------------+-------------------+
|first_day_month|collect_set(UserId)|
+---------------+-------------------+
| 2020-08-01| [1, 5]|
| 2020-07-01| [2, 4]|
+---------------+-------------------+
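To get the counts in the shape the question asks for, the filtered rows can be aggregated per month (a small follow-up, not part of the original answer). Note that with this lag-based definition a user's first-ever visit does not count as re-engagement, so 2020-06-01 does not appear, unlike in the expected results above:

(ss.groupBy("first_day_month")
   .agg(F.countDistinct("userId").alias("reengaged_user_count"))
   .orderBy("first_day_month")
   .show())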

Use zip operation in a more efficient way on a dataset in python

I perform the operation below on this dataset:
import pandas as pd

d = pd.DataFrame({'id': ["007", "007", "001", "001", "008", "008", "007", "007", "009", "007", "000", "001", "009", "009", "009"],
'id_2': ["b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"c", "c", "c", "c"],
'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2", "us1", "us2", "us4", "us1", "us2", "us1"],
"timestamp_date": [1589175687010, 1589188715313, 1589187142475, 1589187315368, 1589187155446, 1589187301028, 1589189765216, 1589190375088,
1589364060781, 1589421612029, 1589363453544, 1589364557808, 1589354347548, 1589356096273, 1589273208050]})
df = d.sort_values('timestamp_date')
df.groupby(['id_2', 'id'], sort=False).apply(
lambda x: list(zip(x['userid'][:-1], x['userid'][1:],
x['timestamp_date'][:-1], x['timestamp_date'][1:]))).reset_index(name='duplicates')
But the problem is that this operation takes super long. Just to give an idea, for 4 million records it takes around 17 minutes.
I would like to know if there is a more efficient way to do it. From my tests and from what I've read online, I think the zip is the problem, but I couldn't find another way of achieving the same result :(
Thanks
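One possible direction (a sketch only, not from the original post): pair each row with the next row of its ('id_2', 'id') group using the vectorized shift(-1) instead of zipping slices inside apply(), and keep the pairs as ordinary columns rather than per-group lists of tuples. Whether this is actually faster at ~4 million rows would need to be measured:

import pandas as pd

df = d.sort_values('timestamp_date')
grp = df.groupby(['id_2', 'id'], sort=False)
pairs = df.assign(
    next_userid=grp['userid'].shift(-1),
    next_timestamp=grp['timestamp_date'].shift(-1),
).dropna(subset=['next_userid'])  # the last row of each group has no successor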

How to change position of the content in each element of tuple in this list?

I have this list of tuples:
[("a", "b", 7), ("a", "c", 9), ("b", "c", 10)]
I want to add the mirror of each element, like this:
[("a", "b", 7), ("b", "a", 7), ("a", "c", 9), ("c", "a", 9), ("b", "c", 10), ("c", "b", 10)]
Please help me; I would really appreciate your help.
You can unpack the tuples and make new tuples for your output list:
def mirror(tuples):
    result = []
    for first, second, third in tuples:
        result.append((first, second, third))
        result.append((second, first, third))
    return result
print(mirror([("a", "b", 7), ("a", "c", 9),("b", "c", 10)]))
# [('a', 'b', 7), ('b', 'a', 7), ('a', 'c', 9), ('c', 'a', 9), ('b', 'c', 10), ('c', 'b', 10)]
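As a side note (an equivalent variant, not from the original answer), the same mirroring can be written as a comprehension:

pairs = [("a", "b", 7), ("a", "c", 9), ("b", "c", 10)]
mirrored = [t for a, b, w in pairs for t in ((a, b, w), (b, a, w))]
print(mirrored)
# [('a', 'b', 7), ('b', 'a', 7), ('a', 'c', 9), ('c', 'a', 9), ('b', 'c', 10), ('c', 'b', 10)]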

Column data value consistency check PySpark SQL

I have two tables with the same column names, the same data, and the same number of rows, but the ordering of the rows might differ. Now I select column A from table_1 and column A from table_2 and compare the values. How can I achieve this using PySpark SQL? Can I do an sha2/md5 checksum and compare?
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as f
app_name="test"
table1="DB1.department"
table2="DB2.department"
conf = SparkConf().setAppName(app_name)
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
query1="select * from %s" %(table1)
df1 = sqlContext.sql(query1)
query2="select * from %s" %(table2)
df2 = sqlContext.sql(query2)
df3 = sqlContext.sql("""SELECT DB1.departmentid FROM DB1.department a FULL JOIN
DB2.department b ON a.departmentid = b.departmentid WHERE a.departmentid
IS NULL OR b.departmentid IS NULL""")
df5=sqlContext.sql("select md5(departmentid) from department1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'md5(departmentid)'
due to data type mismatch: argument 1 requires binary type, however,
'departmentid' is of bigint type.; line 1 pos 11"
When I tried the md5 checksum, it says it expects BinaryType but departmentid is of bigint type.
Table1:
departmentid departmentname departmentaddress
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
7 G Pune
8 H chennai
Table2:
departmentid departmentname departmentaddress
7 G Pune
8 H chennai
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
Here in table two only the order of the rows has changed, but the data remains the same, so technically these two tables are identical. Unless a new row gets added or values are modified, the two tables stay identical. (The tables are taken as an example for explanation; in reality we deal with big data.)
The simplest solution is:
def is_identical(x, y):
    return (x.count() == y.count()) and (x.subtract(y).count() == 0)
Example data:
df1 = spark.createDataFrame(
[(1, "A", "Newyork"), (2, "B", "Newjersey"),
(3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi"),
(7, "G", "Pune"), (8, "H", "chennai")],
("departmentid", "departmentname", "departmentadd"))
df2 = spark.createDataFrame(
[(7, "G", "Pune"), (8, "H", "chennai"), (1, "A", "Newyork"), (2, "B", "Newjersey"),
(3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi")],
("departmentid", "departmentname", "departmentadd"))
df3 = spark.createDataFrame(
[(1, "A", "New York"), (2, "B", "New Jersey"),
(3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi"),
(7, "G", "Pune"), (8, "H", "chennai")],
("departmentid", "departmentname", "departmentadd"))
df4 = spark.createDataFrame(
[(3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi")],
("departmentid", "departmentname", "departmentadd"))
Checks:
is_identical(df1, df2)
# True
is_identical(df1, df3)
# False
is_identical(df1, df4)
# False
is_identical(df4, df4)
# True
With an outer join:
from pyspark.sql.functions import col, coalesce, lit
from functools import reduce
from operator import and_

def is_identical_(x, y, keys=("departmentid", )):
    def both_null(c):
        return (col("x.{}".format(c)).isNull() &
                col("y.{}".format(c)).isNull())
    def both_equal(c):
        return coalesce((col("x.{}".format(c)) ==
                         col("y.{}".format(c))), lit(False))
    p = reduce(and_, [both_null(c) | both_equal(c) for c in x.columns if c not in keys])
    return (x.alias("x").join(y.alias("y"), list(keys), "full_outer")
            .where(~p).count() == 0)
You'd get the same results:
is_identical_(df1, df2)
# True
is_identical_(df1, df3)
# False
is_identical_(df1, df4)
# False
is_identical_(df4, df4)
# True
md5 is of no use to you here, because it is not an aggregation function; it computes a checksum for a specific value.
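If a per-value checksum is still wanted (a side note, not part of the original answer), md5 needs a binary or string input, so the bigint column has to be cast first:

from pyspark.sql.functions import col, md5

df1.select(md5(col("departmentid").cast("string")).alias("departmentid_md5")).show()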

Get all children of a parent id using Spark GraphFrame motif search

I am using Apache Spark to create a GraphFrame and run motif queries on it.
I have created the required edges and vertices, and after that I execute a motif query on a lookup pattern. I need to fetch all children of a particular node along with their sub-children. For example:
// Vertex DataFrame
val v = sqlContext.createDataFrame(List(
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)
)).toDF("id", "name", "age")
// Edge DataFrame
val e = sqlContext.createDataFrame(List(
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "friend"),
("d", "a", "friend"),
("a", "e", "friend")
)).toDF("src", "dst", "relationship")
// Create a GraphFrame
val g = GraphFrame(v, e)
Now, if I click on node a, I should get all children and sub-children of a.
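One possible direction, as a hedged PySpark sketch (PySpark rather than the question's Scala, to match the rest of this page): collect every descendant of a starting vertex by repeatedly following edges until no new vertices turn up. It assumes an equivalent GraphFrame g built in PySpark and a frontier small enough to collect to the driver; the visited set guards against the cycles in this graph:

from pyspark.sql import functions as F

def descendants(g, start_id):
    visited = {start_id}
    frontier = {start_id}
    while frontier:
        # one hop out of the current frontier
        rows = (g.edges
                 .where(F.col("src").isin(list(frontier)))
                 .select("dst").distinct().collect())
        frontier = {r["dst"] for r in rows} - visited
        visited |= frontier
    # return the reachable vertices, excluding the start vertex itself
    return g.vertices.where(F.col("id").isin(list(visited - {start_id})))

descendants(g, "a").show()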
