Custom duplicate removal strategy with Spark joins

Custom duplicate removal strategy with Spark joins - apache-spark

I've more than 2 tables and I wish to join them and create a single table where queries will be faster.
Table-1
---------------
user | activityId
---------------
user1 | 123
user2 | 123
user3 | 123
user4 | 123
user5 | 123
---------------
Table-2
---------------------------------
user | activityId | event-1-time
---------------------------------
user2 | 123 | 1001
user2 | 123 | 1002
user3 | 123 | 1003
user5 | 123 | 1004
---------------------------------
Table-3
---------------------------------
user | activityId | event-2-time
---------------------------------
user2 | 123 | 10001
user5 | 123 | 10002
---------------------------------
Left join on table-1 over (user,activityId) with table-2 & table-3 will produce result like:
Joined-data
--------------------------------------------------------------------
user | activityId | event-1 | event-1-time | event-2 | event-2-time
--------------------------------------------------------------------
user1 | 123 | 0 | null | 0 | null
user2 | 123 | 1 | 1001 | 1 | 10001
user2 | 123 | 1 | 1002 | 1 | 10001
user3 | 123 | 1 | 1003 | 0 | null
user4 | 123 | 0 | null | 0 | null
user5 | 123 | 1 | 1004 | 1 | 10002
--------------------------------------------------------------------
I wish to remove the redundancy introduced with event-2 with same time i.e. event-2 appeared only once but reported twice since event-1 appeared twice.
In other words user and activityId grouped records should be distinct at every table level.
I want following output. I do not care about relationship(event-1 with event-2). Is there anything which allows to customize join and achieve this behavior
user | activityId | event-1 | event-1-time | event-2 | event-2-time
--------------------------------------------------------------------
user1 | 123 | 0 | null | 0 | null
user2 | 123 | 1 | 1001 | 1 | 10001
user2 | 123 | 1 | 1002 | 0 | null
user3 | 123 | 1 | 1003 | 0 | null
user4 | 123 | 0 | null | 0 | null
user5 | 123 | 1 | 1004 | 1 | 10002
--------------------------------------------------------------------
Edit:
I'm using Scala for joining these tables. Query used:
val joined = table1.join(table2, Seq("user","activityId"), "left").join(table3, Seq("user","activityId"), "left")
joined.select(table1("user"), table1("activityId"), when(table2("activityId").isNull,0).otherwise(1) as "event-1",
table2("timestamp") as "event-1-time"), when(table3("activityId").isNull, 0).otherwise(1) as "event-2", table3("timestamp") as "event-2-time").show

You should create an additional column populating with row index for each group of user ordering by activityId and then use that added column in the outer join process
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("user").orderBy("activityId")
import org.apache.spark.sql.functions._
val tempTable1 = table1.withColumn("rowNumber", row_number().over(windowSpec))
val tempTable2 = table2.withColumn("rowNumber", row_number().over(windowSpec)).withColumn("event-1", lit(1))
val tempTable3 = table3.withColumn("rowNumber", row_number().over(windowSpec)).withColumn("event-2", lit(1))
tempTable1
.join(tempTable2, Seq("user", "activityId", "rowNumber"), "outer")
.join(tempTable3, Seq("user", "activityId", "rowNumber"), "outer")
.drop("rowNumber")
.na.fill(0)
You should get your desired output dataframe as
+-----+----------+------------+-------+------------+-------+
|user |activityId|event-1-time|event-1|event-2-time|event-2|
+-----+----------+------------+-------+------------+-------+
|user1|123 |null |0 |null |0 |
|user2|123 |1002 |1 |null |0 |
|user2|123 |1001 |1 |10001 |1 |
|user3|123 |1003 |1 |null |0 |
|user4|123 |null |0 |null |0 |
|user5|123 |1004 |1 |10002 |1 |
+-----+----------+------------+-------+------------+-------+

Below is a code implementation of the requirement
from pyspark.sql import Row
ll = [('test',123),('test',123),('test',123),('test',123)]
rdd = sc.parallelize(ll)
test1 = rdd.map(lambda x: Row(user=x[0], activityid=int(x[1])))
test1_df = sqlContext.createDataFrame(test1)
mm = [('test',123,1001),('test',123,1002),('test',123,1003),('test',123,1004)]
rdd1 = sc.parallelize(mm)
test2 = rdd1.map(lambda x: Row(user=x[0],
activityid=int(x[1]),event_time_1=int(x[2])))
test2_df = sqlContext.createDataFrame(test2)
nn = [('test',123,10001),('test',123,10002)]
rdd2 = sc.parallelize(nn)
test3 = rdd2.map(lambda x: Row(user=x[0],
activityid=int(x[1]),event_time_2=int(x[2])))
test3_df = sqlContext.createDataFrame(test3)
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import dense_rank, rank
n = Window.partitionBy(test2_df.user,test2_df.activityid).orderBy(test2_df.event_time_1)
int2_df = test2_df.select("user","activityid","event_time_1",rank().over(n).alias("col_rank")).filter('col_rank = 1')
o = Window.partitionBy(test3_df.user,test3_df.activityid).orderBy(test3_df.event_time_2)
int3_df = test3_df.select("user","activityid","event_time_2",rank().over(o).alias("col_rank")).filter('col_rank = 1')
test1_df.distinct().join(int2_df,["user","activityid"],"leftouter").join(int3_df,["user","activityid"],"leftouter").show(10)
+----+----------+------------+--------+------------+--------+
|user|activityid|event_time_1|col_rank|event_time_2|col_rank|
+----+----------+------------+--------+------------+--------+
|test| 123| 1001| 1| 10001| 1|
+----+----------+------------+--------+------------+--------+

Related

How to convert Spark map keys to individual columns

I'm using spark 2.3 and scala 2.11.8.
I have a Dataframe like below,
--------------------------------------------------------
| ID | Name | Desc_map |
--------------------------------------------------------
| 1 | abcd | "Company" -> "aa" , "Salary" -> "1" ....|
| 2 | efgh | "Company" -> "bb" , "Salary" -> "2" ....|
| 3 | ijkl | "Company" -> "cc" , "Salary" -> "3" ....|
| 4 | mnop | "Company" -> "dd" , "Salary" -> "4" ....|
--------------------------------------------------------
Expected Dataframe,
----------------------------------------
| ID | Name | Company | Salary | .... |
----------------------------------------
| 1 | abcd | aa | 1 | .... |
| 2 | efgh | bb | 2 | .... |
| 3 | ijkl | cc | 3 | .... |
| 4 | mnop | dd | 4 | .... |
----------------------------------------
Any help is appreciated.

If data is your dataset that contains:
+---+----+----------------------------+
|ID |Name|Map |
+---+----+----------------------------+
|1 |abcd|{Company -> aa, Salary -> 1}|
|2 |efgh|{Company -> bb, Salary -> 2}|
|3 |ijkl|{Company -> cc, Salary -> 3}|
|4 |mnop|{Company -> aa, Salary -> 4}|
+---+----+----------------------------+
You can get your desired output through:
data = data.selectExpr(
"ID",
"Name",
"Map.Company",
"Map.Salary"
)
Final output:
+---+----+-------+------+
|ID |Name|Company|Salary|
+---+----+-------+------+
|1 |abcd|aa |1 |
|2 |efgh|bb |2 |
|3 |ijkl|cc |3 |
|4 |mnop|aa |4 |
+---+----+-------+------+
Good luck!

Collapse DataFrame using Window functions

I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1 | 09:00 | ABC | apple |
| 1 | 08:00 | NULL | NULL |
| 1 | 18:00 | XYZ | apple |
| 2 | 07:00 | NULL | banana |
| 5 | 23:00 | ABC | cherry |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1 | 18:00 | XYZ | apple | 3 |
| 2 | 07:00 | NULL | banana | 1 |
| 5 | 23:00 | ABC | cherry | 1 |
+----+-----------+-----------+-----------+-------+
I have tried:
window = Window.orderBy( [asc('ID'), desc('timestamp')] )
window_count = Window.orderBy( [asc('ID'), desc('timestamp')] ).rowsBetween(-sys.maxsize,sys.maxsize)
columns_metadata = [metadata1, metadata2]
df = df.select(
*(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
This without the use of pyspark.sql.Window.partitionBy, this does not give the desired output.

I read you wanted without partitioning by ID after I posted it. I could only think of this approach.
Your dataframe:
df = sqlContext.createDataFrame(
[
('1', '09:00', 'ABC', 'apple')
,('1', '08:00', '', '')
,('1', '18:00', 'XYZ', 'apple')
,('2', '07:00', '', 'banana')
,('5', '23:00', 'ABC', 'cherry')
]
,['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank and partition by ID over timestamp:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w1 = Window().partitionBy(df['ID']).orderBy(df['timestamp']).orderBy(F.desc('timestamp'))
w2 = Window().partitionBy(df['ID'])
df\
.withColumn("rank", F.rank().over(w1))\
.withColumn("count", F.count('ID').over(w2))\
.filter(F.col('rank') == 1)\
.select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
.show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+

PySpark: Timeslice and split rows in dataframe with 5 minutes interval on a specific condition

I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows of 5 minute time intervals for indicator values which are not 0.
Explanation:
The first entry is at time timestamp = 2019-12-03 12:00:00.0, indicator= 0, do nothing.
Moving on to the next entry with timestamp = 2019-12-03 12:30:00.0, indicator= 1, I want to split timestamp into rows with a 5 minutes interval till we reach the next entry which is timestamp = 2019-12-03 12:37:00.0, indicator= 0.
If there is a case where timestamp = 2019-12-03 13:15:00.0, indicator = 1 and the next timestamp = 2019-12-03 13:17:00.0, indicator = 0, I'd like to split the row considering both the times have indicator as 1 as 13:17:00.0 falls between 13:15:00.0 - 13:20:00.0 as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+

IIUC, you can filter rows based on indicators on the current and the next rows, and then use array + explode to create new rows (for testing purpose, I added some more rows into your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset id column by using F.row_number().over(w1) or F.monotonically_increasing_id() based on your requirements.

Getting a column as concatenated column from a reference table and primary id's from a Dataset

I'm trying to get a concatenated data as a single column using below datasets.
Sample DS:
val df = sc.parallelize(Seq(
("a", 1,2,3),
("b", 4,6,5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
from the Reference Dataset
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref= sc.parallelize(Seq(
(1,"apple","fruit"),
(2,"banana","fruit"),
(3,"cat","animal"),
(4,"dog","animal"),
(5,"elephant","animal"),
(6,"Flight","object"),
)).toDF("id", "descr", "parent")
I am trying to get the below desired OutPut
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
And also I need to concat only if(id2,id3) is not null. Otherwise only with id1.
I breaking my head for the solution.

Exploding the first dataframe df and joining to ref with followed by groupBy should work as you expected
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
.select("id", "value")
ref.join(dfNew, Seq("id"))
.groupBy("value")
.agg(
concat_ws("+", collect_list("descr")) as "desc",
concat_ws("+", collect_list("parent")) as "parent"
)
.drop("value")
.show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+

How to operate global variable in Spark SQL dataframe row by row sequentially on Spark cluster?

I have dataset which like this:
+-------+------+-------+
|groupid|rownum|column2|
+-------+------+-------+
| 1 | 1 | 7 |
| 1 | 2 | 9 |
| 1 | 3 | 8 |
| 1 | 4 | 5 |
| 1 | 5 | 1 |
| 1 | 6 | 0 |
| 1 | 7 | 15 |
| 1 | 8 | 1 |
| 1 | 9 | 13 |
| 1 | 10 | 20 |
| 2 | 1 | 8 |
| 2 | 2 | 1 |
| 2 | 3 | 4 |
| 2 | 4 | 2 |
| 2 | 5 | 19 |
| 2 | 6 | 11 |
| 2 | 7 | 5 |
| 2 | 8 | 6 |
| 2 | 9 | 15 |
| 2 | 10 | 8 |
still have more rows......
I want to add a new column "column3" , which if the continuous column2 values are less than 10,then they will be arranged a same number such as 1. if their appear a value larger than 10 in column2, this row will be dropped ，then the following column3 row’s value will increase 1. For example， when groupid = 1，the column3's value from rownum 1 to 6 will be 1 and the rownum7 will be dropped, the column3's value of rownum 8 will be 2 and the rownum9,10 will be dropped.After the procedure, the table will like this:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
| 1 | 1 | 7 | 1 |
| 1 | 2 | 9 | 1 |
| 1 | 3 | 8 | 1 |
| 1 | 4 | 5 | 1 |
| 1 | 5 | 1 | 1 |
| 1 | 6 | 0 | 1 |
| 1 | 7 | 15 | drop | this row will be dropped, in fact not exist
| 1 | 8 | 1 | 2 |
| 1 | 9 | 13 | drop | like above
| 1 | 10 | 20 | drop | like above
| 2 | 1 | 8 | 1 |
| 2 | 2 | 1 | 1 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 2 | 1 |
| 2 | 5 | 19 | drop | ...
| 2 | 6 | 11 | drop | ...
| 2 | 7 | 5 | 2 |
| 2 | 8 | 6 | 2 |
| 2 | 9 | 15 | drop | ...
| 2 | 10 | 8 | 3 |
In our project, the dataset is expressed as dataframe in spark sql
I try to solve this problem by udf in this way:
var last_rowNum: Int = 1
var column3_Num: Int = 1
def assign_column3_Num(rowNum:Int): Int = {
if (rowNum == 1){ //do nothing, just arrange 1
column3_Num = 1
last_rowNum = 1
return column3_Num
}
/*** if the difference between rownum is 1, they have the same column3
* value, if not, column3_Num++, so they are different
*/
if(rowNum - last_rowNum == 1){
last_rowNum = rowNum
return column3_Num
}else{
column3_Num += 1
last_rowNum = rowNum
return column3_Num
}
}
spark.sqlContext.udf.register("assign_column3_Num",assign_column3_Num _)
df.filter("column2>10") //drop the larger rows
.withColumn("column3",assign_column3_Num(col("column2"))) //add column3
as you can see, I use global variable. However, it's only effective in spark local[1] model. if i use local[8] or yarn-client, the result will totally wrong! this is because spark's running mechanism，they operate the global variable without distinguishing groupid and order!
So the question is how can i arrange right number when spark running on cluster?
use udf or udaf or RDD or other ?
thank you!

You can achieve your requirement by defining a udf function as below (comments are given for clarity)
import org.apache.spark.sql.functions._
def createNewCol = udf((rownum: collection.mutable.WrappedArray[Int], column2: collection.mutable.WrappedArray[Int]) => { // udf function
var value = 1 //value for column3
var previousValue = 0 //value for checking condition
var arrayBuffer = Array.empty[(Int, Int, Int)] //initialization of array to be returned
for((a, b) <- rownum.zip(column2)){ //zipping the collected lists and looping
if(b > 10 && previousValue < 10) //checking condition for column3
value = value +1 //adding 1 for column3
arrayBuffer = arrayBuffer ++ Array((a, b, value)) //adding the values
previousValue = b
}
arrayBuffer
})
Now utilize the algorithm defined in the udf function and to get the desired result, you would need to collect the values of rownum and column2 grouping them by groupid and sorting them by rownum and then call the udf function. Next steps would be to explode and select necessary columns. (commented for clarity)
df.orderBy("rownum").groupBy("groupid").agg(collect_list("rownum").as("rownum"), collect_list("column2").as("column2")) //collecting in order for generating values for column3
.withColumn("new", createNewCol(col("rownum"), col("column2"))) //calling udf function and storing the array of struct(rownum, column2, column3) in new column
.drop("rownum", "column2") //droping unnecessary columns
.withColumn("new", explode(col("new"))) //exploding the new column array so that each row can have struct(rownum, column2, column3)
.select(col("groupid"), col("new._1").as("rownum"), col("new._2").as("column2"), col("new._3").as("column3")) //selecting as separate columns
.filter(col("column2") < 10) // filtering the rows with column2 greater than 10
.show(false)
You should have your desired output as
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
|1 |1 |7 |1 |
|1 |2 |9 |1 |
|1 |3 |8 |1 |
|1 |4 |5 |1 |
|1 |5 |1 |1 |
|1 |6 |0 |1 |
|1 |8 |1 |2 |
|2 |1 |8 |1 |
|2 |2 |1 |1 |
|2 |3 |4 |1 |
|2 |4 |2 |1 |
|2 |7 |5 |2 |
|2 |8 |6 |2 |
|2 |10 |8 |3 |
+-------+------+-------+-------+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Custom duplicate removal strategy with Spark joins - apache-spark

Related

How to convert Spark map keys to individual columns

Collapse DataFrame using Window functions

PySpark: Timeslice and split rows in dataframe with 5 minutes interval on a specific condition

Getting a column as concatenated column from a reference table and primary id's from a Dataset

How to operate global variable in Spark SQL dataframe row by row sequentially on Spark cluster?

Categories

Resources