Full outer join in pyspark data frames - apache-spark

I have created two data frames in pyspark as shown below. Both data frames have an id column, and I want to perform a full outer join on them.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id, how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like below when I do a full_outer_join
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I tried the following, but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id, how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, id 5 is missing from my result data frame.
How can I achieve what I want?

Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
    F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+

You can either rename the id column of dataframe b and drop it after the join (a sketch of that approach follows the output below), or use the list form of the join condition.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
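If you prefer the rename-and-drop route mentioned above, a minimal sketch could look like this (the b_id name is just an illustrative choice):
import pyspark.sql.functions as F

# rename b's id so the two id columns do not collide after the join
b2 = b.withColumnRenamed('id', 'b_id')
result = (a.join(b2, a.id == b2.b_id, how='full')
            .withColumn('id', F.coalesce(a.id, b2.b_id))  # keep a single id column
            .drop('b_id'))
result.select('name', 'Movie', 'id').show()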

Related

Spark SQL orderBy and global ordering across partitions

I want to sort the Dataframe so that the different partitions are sorted internally (and also across each other, i.e. ALL elements of one partition will be either <= or >= than ALL elements of another partition). This is important because I want to use Window functions with Window.partitionBy("partitionID"). However, there is something wrong with my understanding of how Spark works.
I run the following sample code:
val df = sc.parallelize(List((10),(8),(5),(9),(1),(6),(4),(7),(3),(2)), 5)
  .toDF("val")
  .withColumn("partitionID", spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
So far so good: 5 partitions, as expected, with no internal or global order.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted as expected, but the partitions themselves are not "sorted". My expected result is something along the lines of:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e. subsequent sorted elements belong to the same partition.
Can you point out what part of my logic is flawed and how to extract the intended behavior in this example? Every help is appreciated.
I run the above using Scala and Spark 1.6, if that is relevant.
Use repartitionByRange on the sort key:
val df2 = df
  .orderBy("val")
  .repartitionByRange(5, col("val"))
  .withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+
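For readers working in PySpark, a rough sketch of the same idea (assuming Spark 2.4+, where repartitionByRange is exposed in the Python API; sortWithinPartitions just makes the per-partition order explicit):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(v,) for v in [10, 8, 5, 9, 1, 6, 4, 7, 3, 2]], ['val'])

df2 = (df.repartitionByRange(5, F.col('val'))   # contiguous value ranges per partition
         .sortWithinPartitions('val')           # sort inside each partition
         .withColumn('partitionID2', F.spark_partition_id()))
df2.show()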

Pivot on two columns with both numeric and categorical value in pySpark

I have a data set in pyspark like this:
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
    user_row(1, 1, 'speed', '50'),
    user_row(1, 1, 'speed', '60'),
    user_row(1, 2, 'door', 'open'),
    user_row(1, 2, 'door', 'open'),
    user_row(1, 2, 'door', 'close'),
    user_row(1, 2, 'speed', '75'),
    user_row(2, 10, 'speed', '30'),
    user_row(2, 11, 'door', 'open'),
    user_row(2, 12, 'door', 'open'),
    user_row(2, 13, 'speed', '50'),
    user_row(2, 13, 'speed', '40')
]
user_df = spark.createDataFrame(data)
user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
| 1| 1| speed| 50|
| 1| 1| speed| 60|
| 1| 2| door| open|
| 1| 2| door| open|
| 1| 2| door|close|
| 1| 2| speed| 75|
| 2| 10| speed| 30|
| 2| 11| door| open|
| 2| 12| door| open|
| 2| 13| speed| 50|
| 2| 13| speed| 40|
+---+----+--------+-----+
What I want to get is something like below: group by id and time, pivot on category, and return the average if the value is numeric and the mode if it is categorical.
+---+----+--------+-----+
| id|time| door|speed|
+---+----+--------+-----+
| 1| 1| null| 55|
| 1| 2| open| 75|
| 2| 10| null| 30|
| 2| 11| open| null|
| 2| 12| open| null|
| 2| 13| null| 45|
+---+----+--------+-----+
I tried this, but for categorical values it returns null (I am not worried about the nulls in the speed column):
from pyspark.sql.functions import avg

df = user_df\
    .groupBy('id', 'time')\
    .pivot('category')\
    .agg(avg('value'))\
    .orderBy(['id', 'time'])
df.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|null| 75.0|
| 2| 10|null| 30.0|
| 2| 11|null| null|
| 2| 12|null| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
You can do an additional pivot and coalesce them. Try this:
import pyspark.sql.functions as F
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
    user_row(1, 1, 'speed', '50'),
    user_row(1, 1, 'speed', '60'),
    user_row(1, 2, 'door', 'open'),
    user_row(1, 2, 'door', 'open'),
    user_row(1, 2, 'door', 'close'),
    user_row(1, 2, 'speed', '75'),
    user_row(2, 10, 'speed', '30'),
    user_row(2, 11, 'door', 'open'),
    user_row(2, 12, 'door', 'open'),
    user_row(2, 13, 'speed', '50'),
    user_row(2, 13, 'speed', '40')
]
user_df = spark.createDataFrame(data)

# pivot once with both aggregations: avg works for numeric values, max for strings
df = user_df.groupBy('id', 'time')\
    .pivot('category')\
    .agg(F.avg('value').alias('avg'), F.max('value').alias('max'))

# pair up the *_avg and *_max columns and coalesce each pair
expr1 = [x for x in df.columns if '_avg' in x]
expr2 = [x for x in df.columns if '_max' in x]
expr = zip(expr1, expr2)
sel_expr = [F.coalesce(x[0], x[1]).alias(x[0].split('_')[0]) for x in expr]

df_final = df.select('id', 'time', *sel_expr).orderBy('id', 'time')
df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|open| 75.0|
| 2| 10|null| 30.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
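Note that max('value') is only a stand-in for the true mode (it returns the lexicographically largest value). If you happen to be on Spark 3.4 or later (an assumption; pyspark.sql.functions.mode is not available before that), the same coalesce trick works with the real mode:
df = user_df.groupBy('id', 'time')\
    .pivot('category')\
    .agg(F.avg('value').alias('avg'), F.mode('value').alias('mode'))
# then build sel_expr from the *_avg / *_mode columns exactly as above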
Try collecting the data into lists and transforming as required.
Spark 2.4+:
from pyspark.sql.functions import collect_list, col, expr

user_df.groupby('id', 'time').pivot('category').agg(collect_list('value'))\
    .select('id', 'time',
            col('door')[0].alias('door'),
            expr('''aggregate(speed, cast(0.0 as double), (acc, x) -> acc + x, acc -> acc/size(speed))''').alias('speed'))\
    .show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 2| 13|null| 45.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 10|null| 30.0|
| 1| 2|open| 75.0|
+---+----+----+-----+

GraphFrames detect exclusive outbound relations

In my graph I need to detect vertices that have no inbound relations. Using the example below, "a" is the only node that no other node points to.
a --> b
b --> c
c --> d
c --> b
I would really appreciate any examples to detect "a" type nodes in my graph.
Thanks
Unfortunately the approach is not as simple as it looks, because the graph.degrees, graph.inDegrees and graph.outDegrees functions do not return vertices with 0 edges.
(see documentation for Scala which holds true for Python too https://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.GraphFrame)
so the following code will always return an empty dataframe
g = GraphFrame(vertices, edges)
# check for start points
g.inDegrees.filter("inDegree==0").show()
+---+--------+
| id|inDegree|
+---+--------+
+---+--------+
# or check for end points
g.outDegrees.filter("outDegree==0").show()
+---+---------+
| id|outDegree|
+---+---------+
+---+---------+
# or check for any vertices that are alone without edge
g.degrees.filter("degree==0").show()
+---+------+
| id|degree|
+---+------+
+---+------+
What works is a left, right or full join of the inDegrees and outDegrees results, followed by a filter on the NULL values of the respective column.
The join gives you merged columns whose NULL values mark the start and end vertices:
g.inDegrees.join(g.outDegrees,on="id",how="full").show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a3| 1| 1|
| a4| 1| null|
| c7| 1| 1|
| b2| 1| 2|
| c9| 3| 1|
| c5| 1| 1|
| c1| null| 1|
| c6| 1| 1|
| a2| 1| 1|
| b3| 1| 1|
| b1| null| 1|
| c8| 3| null|
| a1| null| 1|
| c4| 1| 4|
| c3| 1| 1|
| b4| 1| 1|
| c2| 1| 3|
|c10| 1| null|
| b5| 2| 1|
+---+--------+---------+
Now you can filter for whatever you are looking for:
my_in_Degrees=g.inDegrees
my_out_Degrees=g.outDegrees
# get starting vertices (no inbound edges, i.e. nothing points at them)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_in_Degrees.inDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| c1| null| 1|
| b1| null| 1|
| a1| null| 1|
+---+--------+---------+
# get ending vertices (no outbound edges, i.e. they point at nothing)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_out_Degrees.outDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a4| 1| null|
|c10| 1| null|
+---+--------+---------+
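An alternative sketch that avoids the degree dataframes altogether is to anti-join the vertices against the edge endpoints; unlike inDegrees/outDegrees, this also catches completely isolated vertices (reusing the GraphFrame g from above):
# vertices with no inbound edge (nothing points at them)
no_inbound = g.vertices.join(g.edges.selectExpr('dst as id'), on='id', how='left_anti')
# vertices with no outbound edge (they point at nothing)
no_outbound = g.vertices.join(g.edges.selectExpr('src as id'), on='id', how='left_anti')
no_inbound.show()
no_outbound.show()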

Pyspark pivot data frame based on condition

I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to apply a condition: for each id, even if there are more than 3 rows per type, I want to consider only 3 or fewer.
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
Using the following logic should get you your desired result.
A Window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter to at most 3 rows per group and is concatenated with type. Finally, grouping and pivoting gives you the desired output:
from pyspark.sql import Window
from pyspark.sql import functions as f

windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id")\
    .pivot("type")\
    .agg(f.first("s_id"))\
    .show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part:
You just need an additional filter:
df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .filter(f.col("type") != "")\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id")\
    .pivot("type")\
    .agg(f.first("s_id"))\
    .show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns because they are not present in your updated input data. You can add them using the withColumn API and populate them with nulls, as in the sketch below.
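A minimal sketch, assuming the full list of expected pivot columns is known up front and that pivoted is the dataframe produced by the snippet above (both names are illustrative):
from pyspark.sql import functions as f

expected = ['android1', 'android2', 'android3', 'ios1', 'ios2', 'ios3']
for c in expected:
    if c not in pivoted.columns:
        # add any missing pivot column as an all-null column of the same type as s_id
        pivoted = pivoted.withColumn(c, f.lit(None).cast('long'))
pivoted.show(truncate=False)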

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all three data frames. I have done it like below.
from pyspark.sql import functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer')\
    .select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer')\
    .select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I would like to simplify my code.
1) I want to create phones_df, pc_df and security_df in a better way, because I am repeating the same code to create each of them and want to reduce that.
2) I want to simplify the join statements to one statement.
How can I do this? Could anyone explain?
Here is one way using when.otherwise to map the device column to categories, and then pivot on the category to get the desired output:
import pyspark.sql.functions as F
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+
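If you would rather see 0 than null for the missing counts, fillna can be applied to the pivoted result; pivoted_df below is just an illustrative name, reusing the imports and lists defined above:
pivoted_df = df.withColumn('cat',
        F.when(df.device.isin(phone_list), 'phones')
         .when(df.device.isin(pc_list), 'pc')
         .when(df.device.isin(security_list), 'security'))\
    .groupBy('id').pivot('cat').agg(F.count('cat'))

# replace nulls with 0 only in the count columns
pivoted_df.fillna(0, subset=['pc', 'phones', 'security']).show()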
