I'm using Spark 2.3 with Scala 2.11.8.
I have a DataFrame as below, where x1 and x2 are the types and their individual counts are in the respective columns x1cnt and x2cnt.
The expected DataFrame, shown further below, needs to have a column 'type' that holds x1 and x2 for each record and a column 'count' with the respective count.
The example only has two types, but there will be more.
Input DataFrame:
+--------------------+-----------------------+-----+-----+
| col1| col2|x1cnt|x2cnt|
+--------------------+-----------------------+-----+-----+
| 1| 17| 2| 4|
| 1| 21| 0| 6|
| 1| 917| 0| 8|
| 1| 1| 35| 55|
| 1| 901| 0| 0|
| 1| 902| 0| 74|
+--------------------+-----------------------+-----+-----+
Expected DataFrame:
+--------------------+-----------------------+-----+-----+
| col1| col2| type|count|
+--------------------+-----------------------+-----+-----+
| 1| 17| x1| 2|
| 1| 17| x2| 4|
| 1| 21| x1| 0|
| 1| 21| x2| 6|
| 1| 917| x1| 0|
| 1| 917| x2| 8|
| 1| 1| x1| 35|
| 1| 1| x2| 55|
| 1| 901| x1| 0|
| 1| 901| x2| 0|
| 1| 902| x1| 0|
| 1| 902| x2| 74|
+--------------------+-----------------------+-----+-----+
Any help is appreciated.
The STACK function acts like a reverse PIVOT; alias the generator output so the columns come out as 'type' and 'count':
select
    col1
    , col2
    , stack(2, 'x1', x1cnt, 'x2', x2cnt) as (type, count)
from
    table;
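Since the question mentions Scala, here is a minimal sketch of the same unpivot through the DataFrame API, assuming the input DataFrame is named df (the name is not given in the post); for more types, simply extend the argument list of stack:
// selectExpr lets us call the SQL stack() generator directly from the DataFrame API.
val result = df.selectExpr(
  "col1",
  "col2",
  "stack(2, 'x1', x1cnt, 'x2', x2cnt) as (type, count)"
)
result.show()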
I want to sort the DataFrame so that the different partitions are sorted internally (and also across each other, i.e. all elements of one partition are either <= or >= all elements of another partition). This is important because I want to use Window functions with Window.partitionBy("partitionID"). However, there seems to be something wrong with my understanding of how Spark works.
I ran the following sample code:
val df = sc.parallelize(List(10, 8, 5, 9, 1, 6, 4, 7, 3, 2), 5)
.toDF("val")
.withColumn("partitionID",spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
So far so good: 5 partitions, as expected, with no internal or external order.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted, as expected, but the partitions themselves are not "sorted". My expected result is something along the lines of:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e. consecutive sorted elements belong to the same partition.
Can you point out what part of my logic is flawed and how to achieve the intended behavior in this example? Any help is appreciated.
I ran the above using Scala and Spark 1.6, if that is relevant.
repartitionByRange (available in the Dataset API since Spark 2.3, so not in 1.6) range-partitions the data on val, so that consecutive values end up in the same partition:
val df2 = df
  .orderBy("val")
  .repartitionByRange(5, col("val"))
  .withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+
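If each partition also needs to be sorted internally before applying Window.partitionBy("partitionID2"), a minimal sketch (my own addition, assuming Spark 2.3+) is:
import org.apache.spark.sql.functions.{col, spark_partition_id}

// Range-partition on "val" so consecutive values land in the same partition,
// then sort inside each partition without another full shuffle.
val df3 = df
  .repartitionByRange(5, col("val"))
  .sortWithinPartitions("val")
  .withColumn("partitionID2", spark_partition_id())
df3.show(false)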
I have a data set in PySpark like this:
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
| 1| 1| speed| 50|
| 1| 1| speed| 60|
| 1| 2| door| open|
| 1| 2| door| open|
| 1| 2| door|close|
| 1| 2| speed| 75|
| 2| 10| speed| 30|
| 2| 11| door| open|
| 2| 12| door| open|
| 2| 13| speed| 50|
| 2| 13| speed| 40|
+---+----+--------+-----+
What I want to get is something like the result below: group by id and time, pivot on category, and return the average if the value is numeric and the mode if it is categorical.
+---+----+--------+-----+
| id|time| door|speed|
+---+----+--------+-----+
| 1| 1| null| 55|
| 1| 2| open| 75|
| 2| 10| null| 30|
| 2| 11| open| null|
| 2| 12| open| null|
| 2| 13| null| 45|
+---+----+--------+-----+
I tried this, but for the categorical values it returns null (I am not worried about the nulls in the speed column):
from pyspark.sql.functions import avg

df = user_df\
    .groupBy('id','time')\
    .pivot('category')\
    .agg(avg('value'))\
    .orderBy(['id', 'time'])
df.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|null| 75.0|
| 2| 10|null| 30.0|
| 2| 11|null| null|
| 2| 12|null| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
You can compute an additional aggregation in the pivot (max as a stand-in for the categorical values) and coalesce the resulting columns. Try this:
import pyspark.sql.functions as F
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
df = user_df.groupBy('id','time')\
    .pivot('category')\
    .agg(F.avg('value').alias('avg'), F.max('value').alias('max'))

# Pair up the *_avg and *_max columns produced by the pivot.
expr1 = [x for x in df.columns if '_avg' in x]
expr2 = [x for x in df.columns if '_max' in x]
expr = zip(expr1, expr2)

# Coalesce each pair: the avg is null for categorical values, so the max kicks in.
sel_expr = [F.coalesce(x[0], x[1]).alias(x[0].split('_')[0]) for x in expr]

df_final = df.select('id','time',*sel_expr).orderBy('id','time')
df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|open| 75.0|
| 2| 10|null| 30.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
Try collecting the values into a list and transforming them as required. This needs Spark 2.4+ for the aggregate higher-order function:
from pyspark.sql.functions import col, collect_list, expr

user_df.groupby('id','time').pivot('category').agg(collect_list('value'))\
    .select('id', 'time',
            col('door')[0].alias('door'),
            expr('''aggregate(speed, cast(0.0 as double), (acc, x) -> acc + x, acc -> acc/size(speed))''').alias('speed'))\
    .show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 2| 13|null| 45.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 10|null| 30.0|
| 1| 2|open| 75.0|
+---+----+----+-----+
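Neither answer computes a true mode (the first falls back to max, the second to the first element of the list). If the actual mode is needed, a possible sketch (my own addition, reusing user_df from the question) is to rank values by frequency with a window, pivot averages and modes separately, and coalesce them per category:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep, for every (id, time, category), only the most frequent value (ties broken arbitrarily).
w = Window.partitionBy('id', 'time', 'category').orderBy(F.desc('cnt'))
modes = (user_df.groupBy('id', 'time', 'category', 'value')
         .agg(F.count(F.lit(1)).alias('cnt'))
         .withColumn('rn', F.row_number().over(w))
         .filter('rn = 1'))

# Pivot the numeric average and the mode separately, then coalesce per category.
avgs = user_df.groupBy('id', 'time').pivot('category').agg(F.avg('value'))
mods = modes.groupBy('id', 'time').pivot('category').agg(F.first('value'))

cats = [c for c in avgs.columns if c not in ('id', 'time')]
sel = [F.coalesce(avgs[c], mods[c]).alias(c) for c in cats]
result = (avgs.join(mods, ['id', 'time'])
          .select('id', 'time', *sel)
          .orderBy('id', 'time'))
result.show()
Note that the coalesced columns will likely come back as strings, since the numeric averages are merged with string-typed modes.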
I have a Spark DataFrame in which a few columns are null. I need to create a new DataFrame, adding a new column "error_desc" that lists, for every row, all the columns with null values. I need to do this dynamically, without mentioning each column name.
eg: if my dataframe is below
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added a few more rows to the input DataFrame to cover more cases. You are not required to hard-code any column. Use the UDF below; it will give your desired output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions.{col, lit, struct, udf}
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
| var check:String = ""
| val ColList = Column.split(",").toList
| ColList.foreach{ x =>
| if (r.getAs(x) == null)
| {
| check = check + x.toString + " is null. "
| }}
| check
| })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+
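For reference, a UDF-free sketch of the same idea using only built-in functions (my own variant, not part of the answer above); concat_ws skips nulls, so only the null columns contribute a message:
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

// For each column, emit "<column> is null." when it is null, otherwise null;
// concat_ws then drops the nulls and joins the remaining messages.
val errorCols = df.columns.map(c => when(col(c).isNull, lit(s"$c is null.")))
val withErrors = df.withColumn("error_desc", concat_ws(" ", errorCols: _*))
withErrors.show(false)
Rows without any nulls end up with an empty string, matching the UDF output above.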
I have created two data frames in PySpark as below. Both data frames have a column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like the one below when I do a full outer join:
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have tried the following, but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, id 5 is missing from my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b and drop it later, or use a list in the join condition (a sketch of the rename-and-drop variant is shown after the output below).
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
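For completeness, a minimal sketch of the rename-and-drop variant mentioned above (the intermediate column name b_id is made up here):
from pyspark.sql import functions as F

# Rename b.id so both key columns survive the join, merge them with coalesce, then drop the extra one.
b2 = b.withColumnRenamed('id', 'b_id')
(a.join(b2, a.id == b2.b_id, how='full')
   .withColumn('id', F.coalesce(F.col('id'), F.col('b_id')))
   .drop('b_id')
   .select('name', 'Movie', 'id')
   .show())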