Cross join between two dataframes dependent on a common column - apache-spark

A crossJoin can be done as follows:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example, there are two dataframes each containing 4 rows, and the cross join gives me 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?

A cross join is a join that multiplies rows because the join key does not identify rows uniquely (in our case, the join key is trivial or there is no join key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark >= 2, this raises an error because Spark detects that we are trying to do a cross join (Cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
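If the full Cartesian product is really what you want, the error message already lists the two options. A minimal sketch, using the df1 and df2 defined above:

# Option 1: the explicit cross join API
df = df1.crossJoin(df2)

# Option 2: allow implicit Cartesian products, then the trivial-key join goes through
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])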
If your join key (user here) does not uniquely identify rows, you will also get a multiplication of rows, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.
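Coming back to the original question: the per-user "cross join" is just a regular join on user. A sketch, using the pandas frames df1 and df2 from the question (not the Spark frames above):

sdf1 = spark.createDataFrame(df1)   # 8 rows: (user, subgroup)
sdf2 = spark.createDataFrame(df2)   # 8 rows: (user, dates)
sdf1.join(sdf2, on='user').count()  # 4*4 + 4*4 = 32 rows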

Related

Generate ID in spark as per the below logic in Spark Scala

I have a dataframe with parent_id, service_id, product_relation_id, and product_name fields as given below, and I want to assign an id field as shown in the output table below.
Please note that:
one parent_id has many service_ids
one service_id has many product_names
ID generation should follow the pattern below:
Parent -- 1.n
Child 1 -- 1.n.1
Child 2 -- 1.n.2
Child 3 -- 1.n.3
Child 4 -- 1.n.4
How can we implement this logic in a way that also performs well on big data?
Scala Implementation
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val parentWindowSpec = Window.orderBy("parent_id")
val childWindowSpec = Window.partitionBy(
  "parent_version", "service_id"
).orderBy("product_relation_id")

val df = spark.read.options(
  Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true")
).csv("product.csv")

val df2 = df.withColumn(
  "parent_version", dense_rank.over(parentWindowSpec)
).withColumn(
  "child_version", row_number.over(childWindowSpec) - 1)

val df3 = df2.withColumn("id",
  when(col("product_name") === lit("Parent"),
    concat(lit("1."), col("parent_version")))
  .otherwise(concat(lit("1."), col("parent_version"), lit("."), col("child_version")))
).drop("parent_version").drop("child_version")
Output:
scala> df3.show
21/03/26 11:55:01 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+----------+-------------------+------------+-----+
|parent_id|service_id|product_relation_id|product_name| id|
+---------+----------+-------------------+------------+-----+
| 100| 1| 1-A| Parent| 1.1|
| 100| 1| 1-A| Child1|1.1.1|
| 100| 1| 1-A| Child2|1.1.2|
| 100| 1| 1-A| Child3|1.1.3|
| 100| 1| 1-A| Child4|1.1.4|
| 100| 2| 1-B| Parent| 1.1|
| 100| 2| 1-B| Child1|1.1.1|
| 100| 2| 1-B| Child2|1.1.2|
| 100| 2| 1-B| Child3|1.1.3|
| 100| 2| 1-B| Child4|1.1.4|
| 100| 3| 1-C| Parent| 1.1|
| 100| 3| 1-C| Child1|1.1.1|
| 100| 3| 1-C| Child2|1.1.2|
| 100| 3| 1-C| Child3|1.1.3|
| 100| 3| 1-C| Child4|1.1.4|
| 200| 5| 1-D| Parent| 1.2|
| 200| 5| 1-D| Child1|1.2.1|
| 200| 5| 1-D| Child2|1.2.2|
| 200| 5| 1-D| Child3|1.2.3|
| 200| 5| 1-D| Child4|1.2.4|
+---------+----------+-------------------+------------+-----+
only showing top 20 rows

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
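On the sample data, that naive aggregation would merge the two A intervals into one, roughly:

from pyspark.sql import functions as F

df.groupBy("value").agg(
    F.min("time").alias("start"),
    F.max("time").alias("end")
).show()
# +-----+-----+---+
# |value|start|end|
# +-----+-----+---+
# |    A|    1|  6|
# |    B|    4|  5|
# +-----+-----+---+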
The idea is to create an identifier for each group of consecutive values and use it to group by and compute the min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window

df = df.withColumn(
    "fg",
    F.when(
        F.lag('value').over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window
        .orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
    'value',
    "rn"
).agg(
    F.min('time').alias("start"),
    F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
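For reference, the whole thing can also be written as one chained expression; a sketch equivalent to the steps above, not a different method:

from pyspark.sql import functions as F, Window

w = Window.orderBy("time")
(df
    .withColumn("fg", F.when(F.lag("value").over(w) == F.col("value"), 0).otherwise(1))
    .withColumn("rn", F.sum("fg").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
    .groupBy("value", "rn")
    .agg(F.min("time").alias("start"), F.max("time").alias("end"))
    .drop("rn")
    .orderBy("start")
    .show())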

Pyspark - Creating a dataframe by user defined aggregate function and pivoting

I need to write a user-defined aggregate function that captures the number of days between the previous discharge_date and the following admit_date for each pair of consecutive visits.
I will also need to pivot on the "PERSON_ID" values.
I have the following input_df :
input_df :
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-03-15| 2018-03-16|
| 333|2018-06-10| 2018-06-11|
| 111|2018-03-01| 2018-03-02|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 111|2018-03-30| 2018-03-31|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-20| 2018-06-21|
| 111|2018-01-01| 2018-01-02|
+---------+----------+--------------+
First, I need to group by each person and sort the corresponding rows by ADMIT_DATE. That would yield "input_df2".
input_df2:
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-01-01| 2018-01-03|
| 111|2018-03-01| 2018-03-02|
| 111|2018-03-15| 2018-03-16|
| 111|2018-03-30| 2018-03-31|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-10| 2018-06-11|
| 333|2018-06-20| 2018-06-21|
+---------+----------+--------------+
The desired output_df :
+------------------+-----------------+-----------------+----------------+
|PERSON_ID_DISTINCT| FIRST_DIFFERENCE|SECOND_DIFFERENCE|THIRD_DIFFERENCE|
+------------------+-----------------+-----------------+----------------+
| 111| 1 month 26 days | 13 days| 14 days|
| 222| 3 days| NAN| NAN|
| 333| 8 days| 9 days| NAN|
+------------------+-----------------+-----------------+----------------+
I know the maximum number of times a person appears in my input_df, so I know how many columns need to be created, via:
print input_df.groupBy('PERSON_ID').count().sort('count', ascending=False).show(5)
Thanks a lot in advance,
You can use pyspark.sql.functions.datediff() to compute the difference between two dates in days. In this case, you just need to compute the difference between the current row's ADMIT_DATE and the previous row's DISCHARGE_DATE. You can do this by using pyspark.sql.functions.lag() over a Window.
For example, we can compute the duration between visits in days as a new column DURATION.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('PERSON_ID').orderBy('ADMIT_DATE')
input_df.withColumn(
    'DURATION',
    f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .sort('PERSON_ID', 'INDEX')\
    .show()
#+---------+----------+--------------+--------+-----+
#|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|DURATION|INDEX|
#+---------+----------+--------------+--------+-----+
#| 111|2018-01-01| 2018-01-02| null| 0|
#| 111|2018-03-01| 2018-03-02| 58| 1|
#| 111|2018-03-15| 2018-03-16| 13| 2|
#| 111|2018-03-30| 2018-03-31| 14| 3|
#| 222|2018-12-01| 2018-12-02| null| 0|
#| 222|2018-12-05| 2018-12-06| 3| 1|
#| 333|2018-06-01| 2018-06-02| null| 0|
#| 333|2018-06-10| 2018-06-11| 8| 1|
#| 333|2018-06-20| 2018-06-21| 9| 2|
#+---------+----------+--------------+--------+-----+
Notice, I also added an INDEX column using pyspark.sql.functions.row_number(). We can just filter for INDEX > 0 (because the first value will always be null) and then just pivot the DataFrame:
input_df.withColumn(
    'DURATION',
    f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .where('INDEX > 0')\
    .groupBy('PERSON_ID').pivot('INDEX').agg(f.first('DURATION'))\
    .sort('PERSON_ID')\
    .show()
#+---------+---+----+----+
#|PERSON_ID| 1| 2| 3|
#+---------+---+----+----+
#| 111| 58| 13| 14|
#| 222| 3|null|null|
#| 333| 8| 9|null|
#+---------+---+----+----+
Now you can rename the columns to whatever you desire.
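For example, a minimal sketch of that renaming, assuming pivoted holds the result of the pivot above (the same expression without the final .show()); the target names simply mirror the desired output in the question:

pivoted\
    .withColumnRenamed('PERSON_ID', 'PERSON_ID_DISTINCT')\
    .withColumnRenamed('1', 'FIRST_DIFFERENCE')\
    .withColumnRenamed('2', 'SECOND_DIFFERENCE')\
    .withColumnRenamed('3', 'THIRD_DIFFERENCE')\
    .show()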
Note: This assumes that ADMIT_DATE and DISCHARGE_DATE are of type date.
input_df.printSchema()
#root
# |-- PERSON_ID: long (nullable = true)
# |-- ADMIT_DATE: date (nullable = true)
# |-- DISCHARGE_DATE: date (nullable = true)
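If ADMIT_DATE and DISCHARGE_DATE were strings instead, a hedged sketch of converting them first (assuming Spark 2.2+, where to_date accepts a format, and the 'yyyy-MM-dd' layout shown in the data):

input_df = input_df\
    .withColumn('ADMIT_DATE', f.to_date('ADMIT_DATE', 'yyyy-MM-dd'))\
    .withColumn('DISCHARGE_DATE', f.to_date('DISCHARGE_DATE', 'yyyy-MM-dd'))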

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all three data frames. I have done it like below.
import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I would like to simplify my code.
1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code for each of these data frames and want to reduce that duplication.
2) I want to simplify the join statements into one statement.
How can I do this? Could anyone explain?
Here is one way, using when/otherwise to map the device column to categories and then pivoting to the desired output:
import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+
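If you also want the column order of the original final_df, a small follow-up sketch, assuming pivoted holds the result of the pivot above (the same expression without the final .show()):

pivoted.select('id', 'phones', 'pc', 'security').sort('id').show()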

splitting content of a pyspark dataframe column and aggregating them into new columns

I am trying to extract and split the data within a pyspark dataframe column and then aggregate it into new columns.
Input Table.
+--+-----------+
|id|description|
+--+-----------+
|1 | 3:2,3|2:1|
|2 | 2 |
|3 | 2:12,16 |
|4 | 3:2,4,6 |
|5 | 2 |
|6 | 2:3,7|2:3|
+--+-----------+
Desired Output.
+--+-----------+-------+-----------+
|id|description|sum_emp|org_changed|
+--+-----------+-------+-----------+
|1 | 3:2,3|2:1| 5 | 3 |
|2 | 2 | 2 | 0 |
|3 | 2:12,16 | 2 | 2 |
|4 | 3:2,4,6 | 3 | 3 |
|5 | 2 | 2 | 0 |
|6 | 2:3,7|2:3| 4 | 3 |
+--+-----------+-------+-----------+
The values before the ":" are to be summed. The values after the ":" are to be counted. The "|" marks a shift to a new record (it can be ignored).
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated
Scenario Explained:
Considering the 6th id for example. The 6 refers to a biz unit id. The 'Description' column describes the team within that given unit.
Now the meaning of the values 2:3,7|2:3 is as follows:
1) First, 2 followed by 3&7 = there are 2 folks in the team, and one of them has been in another org for 3 years and for 7 years (perhaps it's the second guy's first company).
2) Second, 2 followed by 3 = there are 2 folks again in a sub-team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that given biz unit.
org_changed = total number of organizations folks in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
    sc.parallelize([[1, "3:2,3|2:1"],
                    [2, "2"],
                    [3, "2:12,16"],
                    [4, "3:2,4,6"],
                    [5, "2"],
                    [6, "2:3,7|2:3"]]),
    ["id", "description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
First we'll split the records and explode the resulting array so we only have one record per line:
import pyspark.sql.functions as psf

df = df.withColumn(
    "record",
    psf.explode(psf.split("description", r'\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
    "record",
    psf.split("record", ':')
).withColumn(
    "nb_players",
    psf.col("record")[0]
).withColumn(
    "years",
    psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
Finally, for each id, we want to sum the number of players and the lengths of the years lists:
df.withColumn(
    "years_size",
    psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
    psf.first("description").alias("description"),
    psf.sum("nb_players").alias("sum_emp"),
    psf.sum("years_size").alias("org_changed")
).sort("id").show()
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+
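Note that sum_emp comes out as a double (5.0, 2.0, ...) because nb_players is still a string after the split, so Spark casts it implicitly when summing. If you prefer an integer, a small hedged tweak is to cast it when the column is created:

df = df.withColumn("nb_players", psf.col("record")[0].cast("int"))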
