Spark SQL orderBy and global ordering across partitions - apache-spark

I want to sort the DataFrame so that the partitions are sorted internally (and also across each other, i.e. ALL elements of one partition are either <= or >= ALL elements of another partition). This is important because I want to use Window functions with Window.partitionBy("partitionID"). However, there is something wrong with my understanding of how Spark works.
I run the following sample code:
val df = sc.parallelize(List((10),(8),(5),(9),(1),(6),(4),(7),(3),(2)),5)
.toDF("val")
.withColumn("partitionID",spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
So far so good: 5 partitions as expected, with no order within or across partitions.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted, as expected, but the partitions themselves are not "sorted". My expected result is something along the lines of:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e. subsequent sorted elements belonging to the same partition.
Can you point out what part of my logic is flawed, and how to get the intended behavior in this example? Any help is appreciated.
I ran the above using Scala and Spark 1.6, if that is relevant.

val df2 = df
  .orderBy("val")
  .repartitionByRange(5, col("val"))              // range-partition on val: each partition gets one contiguous range of values
  .withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+
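Note that repartitionByRange is only available from Spark 2.3 onwards, so it will not work on the Spark 1.6 setup mentioned in the question. With this layout, every partition holds one contiguous range of val, so a window partitioned on partitionID2 only ever sees one of those ranges; sortWithinPartitions can be added if the rows inside each partition also need to be physically ordered. A minimal follow-up sketch (df3, w and rankInPartition are illustrative names):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Optionally enforce in-partition order, then run a window per partition id.
val df3 = df2.sortWithinPartitions("val")
val w = Window.partitionBy("partitionID2").orderBy("val")
df3.withColumn("rankInPartition", row_number().over(w)).show(false)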

Related

How to compute an aggregate from a specific set of keys for each record in Spark SQL

id pid tran1 tran2
1 1,2,3,4 5 3
2 2,4 10 6
3 3 15 9
4 4 20 12
I have the above data set.
I need to aggregate (sum) the tran1 and tran2 columns over all the elements listed in the pid column for a given id. For example, for id=1 I will be summing the data from the records whose id equals 1, 2, 3 or 4.
The desired output is:
id pid tran1 tran2
1 1,2,3,4 50 30
2 2,4 30 18
3 3 15 9
4 4 20 12
scala> df.show
+---+-------+-----+-----+
| id| pid|tran1|tran2|
+---+-------+-----+-----+
| 1|1,2,3,4| 5| 3|
| 2| 2,4| 10| 6|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+-------+-----+-----+
scala> val df1 = df.withColumn("pid", explode(split(col("pid"), ",")))
scala> val df2 = df1.alias("df1").join(df.alias("df"), col("df1.pid") === col("df.id"),"left").select(col("df1.id"),col("df1.pid"),col("df.tran1"),col("df.tran2"))
scala> df2.show
+---+---+-----+-----+
| id|pid|tran1|tran2|
+---+---+-----+-----+
| 1| 1| 5| 3|
| 1| 2| 10| 6|
| 1| 3| 15| 9|
| 1| 4| 20| 12|
| 2| 2| 10| 6|
| 2| 4| 20| 12|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+---+-----+-----+
scala> df2.groupBy(col("id")).agg(concat_ws(",",collect_list(col("pid"))).alias("pid"), sum(col("tran1")).alias("tran1"), sum(col("tran2")).alias("tran2")).orderBy(col("id")).show(false)
+---+-------+-----+-----+
|id |pid |tran1|tran2|
+---+-------+-----+-----+
|1 |1,2,3,4|50.0 |30.0 |
|2 |2,4 |30.0 |18.0 |
|3 |3 |15.0 |9.0 |
|4 |4 |20.0 |12.0 |
+---+-------+-----+-----+

Spark DataFrame select null value

I have a Spark DataFrame with a few columns containing nulls. I need to create a new DataFrame, adding a column "error_desc" which mentions all the columns with null values for every row. I need to do this dynamically, without mentioning each column name.
e.g. if my DataFrame is as below:
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
My final DataFrame should be:
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added a few more rows to the input DataFrame to cover more cases. You are not required to hard-code any column. Use the UDF below; it will give you your desired output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull: UserDefinedFunction = udf((columnNames: String, r: Row) => {
     |   // Append "<column> is null. " for every column of the row whose value is null.
     |   var check: String = ""
     |   val colList = columnNames.split(",").toList
     |   colList.foreach { x =>
     |     if (r.getAs[Any](x) == null) {
     |       check = check + x + " is null. "
     |     }
     |   }
     |   check
     | })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+
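As an aside, the same dynamic check can be written without a UDF using only built-in functions, since concat_ws skips null arguments; a minimal sketch (errorCol is an illustrative name):
import org.apache.spark.sql.functions._

// For every column emit "<name> is null." when it is null (otherwise null);
// concat_ws then drops the nulls and joins the remaining messages with a space.
val errorCol = concat_ws(" ", df.columns.map(c => when(col(c).isNull, lit(s"$c is null."))): _*)
df.withColumn("error_desc", errorCol).show(false)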

Pyspark pivot data frame based on condition

I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to apply a condition: for each id, even though there can be more than 3 rows per type, I want to keep only 3 (or fewer).
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
The following logic should get you your desired result.
A window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used for filtering and is concatenated with type. Finally, grouping and pivoting should give you the desired output.
from pyspark.sql import Window
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")
from pyspark.sql import functions as f
df.withColumn("ranks", f.row_number().over(windowSpec))\
.filter(f.col("ranks") < 4)\
.withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
.drop("ranks")\
.groupBy("id")\
.pivot("type")\
.agg(f.first("s_id"))\
.show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part:
You just need an additional filter, as below:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns because they are not present in your updated input data. You can add them with the withColumn API and populate them with null values.

Full outer join in pyspark data frames

I have created two data frames in pyspark, as below. In these data frames I have a column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like the one below when I do a full outer join:
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done it like below, but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, id 5 is missing from my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
    F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b and drop it later, or use a list in the join condition.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+

Split numerical count in Spark DataFrame column into several columns

Let's say I have a spark DataFrame like this
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Barack |2018-03-26|Action2 | 3|
|Barack |2018-03-26|Action3 | 1|
|Donald |2018-03-26|Action3 | 29|
|Hillary |2018-03-24|Action1 | 4|
|Hillary |2018-03-26|Action2 | 2|
+------------------+----------+--------------+-----+
and I'd like to have the counts for Action1/Action2/Action3 in separate columns, i.e. to convert it into another DataFrame like this:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Barack |2018-03-26| 0| 3| 0|
|Barack |2018-03-26| 0| 0| 1|
|Donald |2018-03-26| 0| 0| 29|
|Hillary |2018-03-24| 4| 0| 0|
|Hillary |2018-03-26| 0| 2| 0|
+------------------+----------+-------------+-------------+-------------+
As I'm a newbie to Spark, my attempt at this was quite dull and straightforward:
Get 3 new DFs by filtering on each "action"
Join the original DF with each of the new ones, using the second DF's "count" in the new DF
The code I tried looked like this:
val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
.join(a1,
($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"),
"left_outer")
.select($"o.user", $"o.dt", $"a1.count".as("action1_count"))
Then do the same with Action2/Action3, then join those.
However, even at this stage I already have several problems with this approach:
It doesn't work at all - it fails with an error whose cause I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];
Even if it succeeded, I assume I would get nulls where I need zeros.
I feel there should be a better way to achieve this, like some map construct or something, but at the moment I don't feel able to construct the transform required to convert the first dataframe into the second one.
So as I don't have a working solution right now, I'll be very thankful for any suggestions.
UPD: I might also get DFs that don't contain all 3 possible "action" values, for instance:
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Hillary |2018-03-24|Action1 | 4|
+------------------+----------+--------------+-----+
For those, I still need the resulting DF to have all 3 count columns:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Hillary |2018-03-24| 4| 0| 0|
+------------------+----------+-------------+-------------+-------------+
You can avoid multiple joins by using when to select the appropriate value for each count column.
About your join, I don't really think it should throw an exception like cannot resolve 'o.user'; you may want to check your code again.
val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")
val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))
+-------+----------+-------+-----+------+------+------+
|user |dt |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19 |19 |0 |0 |
|Albert |2018-03-25|Action1|1 |1 |0 |0 |
|Albert |2018-03-26|Action1|6 |6 |0 |0 |
|Barack |2018-03-26|Action2|3 |0 |3 |0 |
|Barack |2018-03-26|Action3|1 |0 |0 |1 |
|Donald |2018-03-26|Action3|29 |0 |0 |29 |
|Hillary|2018-03-24|Action1|4 |4 |0 |0 |
|Hillary|2018-03-26|Action2|2 |0 |2 |0 |
+-------+----------+-------+-----+------+------+------+
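If the exact column names from the expected output are needed, the original columns can be dropped and the helpers renamed afterwards; a small sketch (result is an illustrative name):
val result = df2
  .drop("action", "count")
  .withColumnRenamed("count1", "action1_count")
  .withColumnRenamed("count2", "action2_count")
  .withColumnRenamed("count3", "action3_count")
result.show(false)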
Here's one approach using pivot and first, with the advantage of not having to know what the action values are:
val df = Seq(
("Albert", "2018-03-24", "Action1", 19),
("Albert", "2018-03-25", "Action1", 1),
("Albert", "2018-03-26", "Action1", 6),
("Barack", "2018-03-26", "Action2", 3),
("Barack", "2018-03-26", "Action3", 1),
("Donald", "2018-03-26", "Action3", 29),
("Hillary", "2018-03-24", "Action1", 4),
("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")
val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
na.fill(0).
orderBy("user", "dt", "action")
// +-------+----------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1|
// | Donald|2018-03-26|Action3| 0| 0| 29|
// |Hillary|2018-03-24|Action1| 4| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0|
// +-------+----------+-------+-------+-------+-------+
[UPDATE]
Per comments, if you need more Action? columns than the distinct values present in the pivot column, you can traverse the missing Action? names and add them as zero-filled columns:
val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")
val missingActions = fullActionList.diff(
pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)
missingActions.foldLeft( pivotDF )( _.withColumn(_, lit(0)) ).
show
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0| 0| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1| 0| 0|
// | Donald|2018-03-26|Action3| 0| 0| 29| 0| 0|
// |Hillary|2018-03-24|Action1| 4| 0| 0| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0| 0| 0|
// +-------+----------+-------+-------+-------+-------+-------+-------+
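Similarly, if the actionN_count naming from the question is wanted, the pivoted columns can be renamed after zero-filling; a brief sketch reusing fullActionList (zeroFilled and renamed are illustrative names):
val zeroFilled = missingActions.foldLeft(pivotDF)(_.withColumn(_, lit(0)))
val renamed = fullActionList.foldLeft(zeroFilled) { (acc, a) =>
  acc.withColumnRenamed(a, a.toLowerCase + "_count")
}
renamed.show(false)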
