Joining Two Dataframes: Lost tasks - apache-spark

I have time series data in a PySpark DataFrame. Each of my signals (value column) should be assigned a unique id. However, the id values are imprecise and need to be extended on both sides. The original DataFrame looks like this:
df_start
+------+----+-------+
| time | id | value |
+------+----+-------+
|    1 |  0 |   1.0 |
|    2 |  1 |   2.0 |
|    3 |  1 |   2.0 |
|    4 |  0 |   1.0 |
|    5 |  0 |   0.0 |
|    6 |  0 |   1.0 |
|    7 |  2 |   2.0 |
|    8 |  2 |   3.0 |
|    9 |  2 |   2.0 |
|   10 |  0 |   1.0 |
|   11 |  0 |   0.0 |
+------+----+-------+
The desired output is:
df_desired
+------+----+-------+
| time | id | value |
+------+----+-------+
|    1 |  1 |   1.0 |
|    2 |  1 |   2.0 |
|    3 |  1 |   2.0 |
|    4 |  1 |   1.0 |
|    6 |  2 |   1.0 |
|    7 |  2 |   2.0 |
|    8 |  2 |   3.0 |
|    9 |  2 |   2.0 |
|   10 |  2 |   1.0 |
|   11 |  2 |   1.0 |
+------+----+-------+
So there are two things happening here:
The id column is not precise enough: each id starts logging a time steps too late (here 1 and 1) and stops b time steps too early (here 1 and 2). I therefore have to replace some zeros with their respective id.
After 'padding' the entries in the id column, I remove all remaining rows with id=0. (Here that is only the row with time=5.)
Luckily I know, for each id, the relative logging delay. Currently, I convert this into the absolute, correct logging times in
df_join
+----+-------+-------+
| id | min_t | max_t |
+----+-------+-------+
|  1 |     1 |     4 |
|  2 |     6 |    11 |
+----+-------+-------+
which I then use to 'filter' the original data with a join:
df_desired = df_join.join(
    df_start,
    df_start.time.between(df_join.min_t, df_join.max_t)
)
which results in the desired output.
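For reference, a minimal, self-contained reproduction of the toy setup above (my sketch; the trailing select, which keeps only time, the corrected id from df_join, and value, is my addition):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_start = spark.createDataFrame(
    [(1, 0, 1.0), (2, 1, 2.0), (3, 1, 2.0), (4, 0, 1.0), (5, 0, 0.0), (6, 0, 1.0),
     (7, 2, 2.0), (8, 2, 3.0), (9, 2, 2.0), (10, 0, 1.0), (11, 0, 0.0)],
    ["time", "id", "value"])
df_join = spark.createDataFrame([(1, 1, 4), (2, 6, 11)], ["id", "min_t", "max_t"])

# Non-equi (range) join: keep every df_start row whose time falls inside one of
# the [min_t, max_t] windows, and take the corrected id from df_join.
df_desired = (df_join.join(df_start,
                           df_start.time.between(df_join.min_t, df_join.max_t))
                     .select(df_start.time, df_join.id, df_start.value))
df_desired.show()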
In reality df_join has at least 400 000 rows and df_start has about 10 billion rows, of which we keep most.
When I run this on our cluster, I am at some point getting warnings like Lost task, ExecutorLostFailure, Container marked as failed, Exit code: 134.
I suspect the executors are running out of memory, but I have not found a solution.

Related

Applying PySpark dropDuplicates method messes up the sorting of the data frame

I'm not sure why this is the behaviour, but when I apply dropDuplicates to a sorted data frame, the sorting order is disrupted. See the following two tables in comparison.
The following table is the output of sorted_df.show(), in which the sorting is in order.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
| 1| 1|
| 8| 5|
| 15| 1|
| 19| 9|
| 20| 7|
| 27| 9|
| 67| 8|
| 91| 9|
| 91| 7|
| 91| 1|
+----------+-----------+
The following table is the output of sorted_df.dropDuplicates().show(), and the sorting is not right anymore, even though it's the same data frame.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
| 27| 9|
| 67| 8|
| 15| 1|
| 91| 7|
| 1| 1|
| 91| 1|
| 8| 5|
| 91| 9|
| 20| 7|
| 19| 9|
+----------+-----------+
Can someone explain why this happens, and how I can keep the same sorting order with dropDuplicates applied?
Apache Spark version 3.1.2
dropDuplicates involves a shuffle. Ordering is therefore disrupted.
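If you need the output in a particular order, one option (a sketch, not part of the answer above) is simply to sort again after dropping the duplicates, since any sort applied before the shuffle is not preserved:

# deduplicate first, then impose the order you want on the result
deduped_df = sorted_df.dropDuplicates().orderBy("sorted_col")
deduped_df.show()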

Regarding the usage of F.count(F.col("some column").isNotNull()) in a window function

I am trying to test the usage of F.count(F.col().isNotNull()) in a window function. Please see the following code:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
w = Window().partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w)).show()
My understanding is that F.count(F.col("xyz")) should count the non-null items in the frame from xyz = -infinity up to xyz = null. How does the trailing isNotNull() affect this, and why do the first two rows get 2 in the xyz1 column?
F.count counts non-null values, and F.col("xyz").isNotNull() is a Boolean that is always either True or False, never null. Counting it therefore counts all the rows in the window frame, regardless of whether xyz is null or not.
What you could do instead is sum the isNotNull Booleans rather than counting them:
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
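As for why the original query returns 2 for the first two rows (my note, not part of the answer above): with an orderBy and no explicit frame, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and the two null rows are peers on the order key, so each of them sees a two-row frame; counting the never-null Boolean over that frame gives 2. Making the default frame explicit shows the same behaviour:

# same as the original window, but with the implicit range frame spelled out
w_range = (Window.partitionBy("I_id", "p_id")
                 .orderBy(F.col("xyz").asc_nulls_first())
                 .rangeBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w_range)).show()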

GraphFrames detect exclusive outbound relations

In my graph I need to detect vertices that do not have inbound relations. In the example below, "a" is the only node that no other node points to.
a --> b
b --> c
c --> d
c --> b
I would really appreciate any examples of how to detect "a"-type nodes in my graph.
Thanks
Unfortunately this is not as simple as it looks, because the graph.degrees, graph.inDegrees and graph.outDegrees functions do not return vertices with 0 edges
(see the Scala documentation, which holds true for Python too: https://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.GraphFrame),
so the following code will always return an empty DataFrame:
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)
# check for start points
g.inDegrees.filter("inDegree==0").show()
+---+--------+
| id|inDegree|
+---+--------+
+---+--------+
# or check for end points
g.outDegrees.filter("outDegree==0").show()
+---+---------+
| id|outDegree|
+---+---------+
+---+---------+
# or check for any vertices that are alone without edge
g.degrees.filter("degree==0").show()
+---+------+
| id|degree|
+---+------+
+---+------+
What works is a left, right or full join of the inDegrees and outDegrees results, filtering on the NULL values of the respective column.
The join gives you the merged columns, with NULL values for the start and end vertices:
g.inDegrees.join(g.outDegrees,on="id",how="full").show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a3| 1| 1|
| a4| 1| null|
| c7| 1| 1|
| b2| 1| 2|
| c9| 3| 1|
| c5| 1| 1|
| c1| null| 1|
| c6| 1| 1|
| a2| 1| 1|
| b3| 1| 1|
| b1| null| 1|
| c8| 3| null|
| a1| null| 1|
| c4| 1| 4|
| c3| 1| 1|
| b4| 1| 1|
| c2| 1| 3|
|c10| 1| null|
| b5| 2| 1|
+---+--------+---------+
Now you can filter for whatever you are searching for:
my_in_Degrees=g.inDegrees
my_out_Degrees=g.outDegrees
# get starting vertices (no inbound edges, i.e. no parents)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_in_Degrees.inDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| c1| null| 1|
| b1| null| 1|
| a1| null| 1|
+---+--------+---------+
# get ending vertices (no outbound edges, i.e. no children)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_out_Degrees.outDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a4| 1| null|
|c10| 1| null|
+---+--------+---------+
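A variant of the same idea (my sketch, not part of the answer above): anti-join the full vertex list against inDegrees. Any vertex that never appears in inDegrees has no inbound edges, and this also catches completely isolated vertices, which the inDegrees/outDegrees join cannot see.

# "a"-type vertices: no inbound edges at all, isolated vertices included
g.vertices.join(g.inDegrees, on="id", how="left_anti").show()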

Divide aggregate value using values from data frame in PySpark

I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done it like below:
import pyspark.sql.functions as F
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to change the code so that each populated column value is divided by the val from the data frame for that id.
I tried something like below, but didn't get the correct result:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / df.val).show()
How can I get what I want?
edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+
The aggregation needs an aggregate function; a plain column reference will not be recognized there.
Since the val column contains the same value within each id group, you can use the first built-in function:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+
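If you want the two-decimal figures shown in the expected result, you could additionally wrap the expression in F.round (my addition, not part of the answer above):

# same pipeline, with the pivoted values rounded to 2 decimals
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.round(F.count('cat') / F.first(df.val), 2)).show()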

Spark - Window with recursion? - Conditionally propagating values across rows

I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
| 1| 1| 0|
| 1| 2| 0|
| 1| 3| 0|
| 1| 4| 100|
| 1| 5| 0|
| 1| 6| 0|
| 1| 7| 200|
| 1| 8| 0|
| 1| 9| 10|
+-------+--------+-------+
Ultimately I want the new column purch_revenue to show, in every row, the revenue generated by the purchase that row belongs to.
As a workaround, I have also tried to introduce a purchase identifier purch_id, which is incremented each time a purchase is made. It is listed here just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
| 1| 1| 0| 100| 1|
| 1| 2| 0| 100| 1|
| 1| 3| 0| 100| 1|
| 1| 4| 100| 100| 1|
| 1| 5| 0| 200| 2|
| 1| 6| 0| 200| 2|
| 1| 7| 200| 200| 2|
| 1| 8| 0| 10| 3|
| 1| 9| 10| 10| 3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead function like this:
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0, fn.col("revenue"))\
    .otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0 and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
Is there a way to apply this recursively until revenue > 0?
Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do that but struggled to find one.
Window functions don't support recursion, but it is not required here. This type of sessionization can be easily handled with a cumulative sum:
from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("visit_id")

purch_id = sum(
    lag(when(col("revenue") > 0, 1).otherwise(0), 1, 0).over(w)
).over(w) + 1

df.withColumn("purch_id", purch_id).show()
+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
| 1| 1| 0| 1|
| 1| 2| 0| 1|
| 1| 3| 0| 1|
| 1| 4| 100| 1|
| 1| 5| 0| 2|
| 1| 6| 0| 2|
| 1| 7| 200| 2|
| 1| 8| 0| 3|
| 1| 9| 10| 3|
+-------+--------+-------+--------+
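To get the purch_revenue column the question ultimately asks for, you can build on the purch_id above (my sketch, not part of the answer): broadcast each purchase's revenue to every row of that purchase with a max over a window partitioned by user_id and purch_id. This assumes at most one non-zero revenue row per purchase, as in the example data.

from pyspark.sql.functions import max as max_

# whole-partition frame: every row of a purchase gets that purchase's revenue
w_purch = Window.partitionBy("user_id", "purch_id")

(df.withColumn("purch_id", purch_id)
   .withColumn("purch_revenue", max_("revenue").over(w_purch))
   .show())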
