id pid tran1 tran2
1 1,2,3,4 5 3
2 2,4 10 6
3 3 15 9
4 4 20 12
I have the above data set.
I need to perform an aggregation on the tran1 and tran2 columns across all the elements listed in the pid column for a given id. For example, for id=1 I will be aggregating (summing) the data from the records with id equal to 1, 2, 3 or 4, so tran1 = 5 + 10 + 15 + 20 = 50 and tran2 = 3 + 6 + 9 + 12 = 30.
The desired output is:
id pid tran1 tran2
1 1,2,3,4 50 30
2 2,4 30 18
3 3 15 9
4 4 20 12
scala> df.show
+---+-------+-----+-----+
| id| pid|tran1|tran2|
+---+-------+-----+-----+
| 1|1,2,3,4| 5| 3|
| 2| 2,4| 10| 6|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+-------+-----+-----+
scala> import org.apache.spark.sql.functions._
scala> val df1 = df.withColumn("pid", explode(split(col("pid"), ",")))
scala> val df2 = df1.alias("df1").join(df.alias("df"), col("df1.pid") === col("df.id"),"left").select(col("df1.id"),col("df1.pid"),col("df.tran1"),col("df.tran2"))
scala> df2.show
+---+---+-----+-----+
| id|pid|tran1|tran2|
+---+---+-----+-----+
| 1| 1| 5| 3|
| 1| 2| 10| 6|
| 1| 3| 15| 9|
| 1| 4| 20| 12|
| 2| 2| 10| 6|
| 2| 4| 20| 12|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+---+-----+-----+
scala> df2.groupBy(col("id")).agg(concat_ws(",",collect_list(col("pid"))).alias("pid"), sum(col("tran1")).alias("tran1"), sum(col("tran2")).alias("tran2")).orderBy(col("id")).show(false)
+---+-------+-----+-----+
|id |pid |tran1|tran2|
+---+-------+-----+-----+
|1 |1,2,3,4|50.0 |30.0 |
|2 |2,4 |30.0 |18.0 |
|3 |3 |15.0 |9.0 |
|4 |4 |20.0 |12.0 |
+---+-------+-----+-----+
I have a spark dataframe in which a few columns can be null. I need to create a new dataframe by adding a new column "error_desc" which, for every row, lists all the columns with null values. I need to do this dynamically, without mentioning each column name.
e.g. if my dataframe is as below:
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be:
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added a few more rows to the input DataFrame to cover more cases. You are not required to hard-code any column name. Use the UDF below; it will give you your desired output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions._
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull: UserDefinedFunction = udf((colNames: String, r: Row) => {
     |   // build a message listing every column of this row whose value is null
     |   colNames.split(",")
     |     .filter(c => r.getAs[Any](c) == null)
     |     .map(c => s"$c is null. ")
     |     .mkString
     | })
scala> df.withColumn("error_desc", CheckNull(lit(df.columns.mkString(",")), struct(df.columns.map(col): _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+
I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to apply a condition: for each id, even though there may be more than 3 s_id values per type, I want to consider only 3 or fewer.
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
Using the following logic should get you your desired result.
A window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter and to concatenate with type. Finally, grouping and pivoting should give you your desired output:
from pyspark.sql import Window
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")
from pyspark.sql import functions as f
df.withColumn("ranks", f.row_number().over(windowSpec))\
.filter(f.col("ranks") < 4)\
.withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
.drop("ranks")\
.groupBy("id")\
.pivot("type")\
.agg(f.first("s_id"))\
.show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part
You just need an additional filter:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns because those values are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as sketched below.
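A minimal sketch of that fill-in step, assuming the pivoted result was kept in a variable (called pivoted here instead of ending the chain with .show()) and assuming a hypothetical list of the column names your final output should contain, spelled the way your pivot actually produces them:

from pyspark.sql import functions as f

# Hypothetical list of every column the final output should contain;
# adjust the spelling to whatever your pivot actually produces.
expected_cols = ["android1", "android2", "android3", "ios1", "ios2", "ios3"]

# Add any expected column the pivot did not produce, populated with nulls.
for c in expected_cols:
    if c not in pivoted.columns:
        # cast so the new column's type matches s_id (adjust the type if yours differs)
        pivoted = pivoted.withColumn(c, f.lit(None).cast("long"))

pivoted.select("id", *expected_cols).show(truncate=False)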
I have created two data frames in pyspark like below. Both data frames have a column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to get a result like below when I do the full outer join:
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done it like below but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id, how='full').select(a.name, a.id, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, I am missing id 5 in my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b and drop it later, or use a list in the join condition. The list form is shown here; a sketch of the rename-and-drop variant follows the output.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
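For completeness, here is a rough sketch of the rename-and-drop variant mentioned above. The temporary name b_id is arbitrary, and note that for a full outer join you still need to coalesce the two id columns before dropping, otherwise ids that exist only in b (like 5) would come out as null:

import pyspark.sql.functions as F

# give b's join column a distinct name so both ids survive the join
b2 = b.withColumnRenamed('id', 'b_id')

full_outer_join = (a.join(b2, a.id == b2.b_id, how='full')
                    .withColumn('id', F.coalesce(a.id, b2.b_id))  # keep a single id column
                    .drop('b_id'))

full_outer_join.select('name', 'Movie', 'id').show()

This keeps a single id column while preserving rows that exist on either side; the list form shown above is simply the shortest way to get the same result.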
Let's say I have a spark DataFrame like this
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Barack |2018-03-26|Action2 | 3|
|Barack |2018-03-26|Action3 | 1|
|Donald |2018-03-26|Action3 | 29|
|Hillary |2018-03-24|Action1 | 4|
|Hillary |2018-03-26|Action2 | 2|
+------------------+----------+--------------+-----+
and I'd like to have the counts for Action1/Action2/Action3 in separate columns, i.e. to convert it into another DataFrame like this:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Barack |2018-03-26| 0| 3| 0|
|Barack |2018-03-26| 0| 0| 1|
|Donald |2018-03-26| 0| 0| 29|
|Hillary |2018-03-24| 4| 0| 0|
|Hillary |2018-03-26| 0| 2| 0|
+------------------+----------+-------------+-------------+-------------+
As I'm a newbie to Spark, my attempt to achieve that was quite dull and straightforward:
Get 3 new DFs by filtering on each "action" value
Join the original DF with each of the new ones, using the second DF's "count" as the new count column
The code I tried looked like this:
val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
.join(a1,
($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"),
"left_outer")
.select($"o.user", $"o.dt", $"a1.count".as("action1_count"))
Then do the same with Action2/Action3, then join those.
However, even at this stage I've already run into several problems with this approach:
It doesn't work at all - I mean it fails with an error whose cause I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];
Even if it succeeded, I assume I would have got nulls where I need zeros.
I feel there should be a better way to achieve this, like some map construct or something. But at the moment I don't feel able to construct the transform required to convert the first dataframe into the second one.
So, as I don't have a working solution right now, I'll be very thankful for any suggestions.
UPD: I might also get DFs that don't contain all 3 possible "action" values, for instance
+------------------+----------+--------------+-----+
| user| dt| action|count|
+------------------+----------+--------------+-----+
|Albert |2018-03-24|Action1 | 19|
|Albert |2018-03-25|Action1 | 1|
|Albert |2018-03-26|Action1 | 6|
|Hillary |2018-03-24|Action1 | 4|
+------------------+----------+--------------+-----+
For those, I still need the resulting DF to have all 3 count columns:
+------------------+----------+-------------+-------------+-------------+
| user| dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert |2018-03-24| 19| 0| 0|
|Albert |2018-03-25| 1| 0| 0|
|Albert |2018-03-26| 6| 0| 0|
|Hillary |2018-03-24| 4| 0| 0|
+------------------+----------+-------------+-------------+-------------+
You can avoid multiple joins by using when to select the appropriate value for each count column.
About your join, I don't really think it should throw an exception like cannot resolve 'o.user'; you may want to check your code again.
val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")
val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))
+-------+----------+-------+-----+------+------+------+
|user |dt |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19 |19 |0 |0 |
|Albert |2018-03-25|Action1|1 |1 |0 |0 |
|Albert |2018-03-26|Action1|6 |6 |0 |0 |
|Barack |2018-03-26|Action2|3 |0 |3 |0 |
|Barack |2018-03-26|Action3|1 |0 |0 |1 |
|Donald |2018-03-26|Action3|29 |0 |0 |29 |
|Hillary|2018-03-24|Action1|4 |4 |0 |0 |
|Hillary|2018-03-26|Action2|2 |0 |2 |0 |
+-------+----------+-------+-----+------+------+------+
Here's one approach using pivot and first, with the advantage of not having to know what the action values are:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
("Albert", "2018-03-24", "Action1", 19),
("Albert", "2018-03-25", "Action1", 1),
("Albert", "2018-03-26", "Action1", 6),
("Barack", "2018-03-26", "Action2", 3),
("Barack", "2018-03-26", "Action3", 1),
("Donald", "2018-03-26", "Action3", 29),
("Hillary", "2018-03-24", "Action1", 4),
("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")
val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
  na.fill(0).
  orderBy("user", "dt", "action")
// +-------+----------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1|
// | Donald|2018-03-26|Action3| 0| 0| 29|
// |Hillary|2018-03-24|Action1| 4| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0|
// +-------+----------+-------+-------+-------+-------+
[UPDATE]
Per the comments, if you need more Action? columns than the distinct values present in the pivot column, you can traverse the missing Action? values and add them as zero-filled columns:
val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")
val missingActions = fullActionList.diff(
  pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)
missingActions.foldLeft(pivotDF)(_.withColumn(_, lit(0))).show
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | user| dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1| 19| 0| 0| 0| 0|
// | Albert|2018-03-25|Action1| 1| 0| 0| 0| 0|
// | Albert|2018-03-26|Action1| 6| 0| 0| 0| 0|
// | Barack|2018-03-26|Action2| 0| 3| 0| 0| 0|
// | Barack|2018-03-26|Action3| 0| 0| 1| 0| 0|
// | Donald|2018-03-26|Action3| 0| 0| 29| 0| 0|
// |Hillary|2018-03-24|Action1| 4| 0| 0| 0| 0|
// |Hillary|2018-03-26|Action2| 0| 2| 0| 0| 0|
// +-------+----------+-------+-------+-------+-------+-------+-------+