I'm using Spark 2.3 with Scala 2.11.8.
I have a DataFrame as below, where x1 and x2 are the types and their individual counts are in the respective columns x1cnt and x2cnt.
The expected DataFrame, shown below, needs to have a column 'type' that holds x1 and x2 for each record and a column 'count' with their respective counts.
The example only has two types, but there will be more.
Input DataFrame:
+--------------------+-----------------------+-----+-----+
| col1| col2|x1cnt|x2cnt|
+--------------------+-----------------------+-----+-----+
| 1| 17| 2| 4|
| 1| 21| 0| 6|
| 1| 917| 0| 8|
| 1| 1| 35| 55|
| 1| 901| 0| 0|
| 1| 902| 0| 74|
+--------------------+-----------------------+-----+-----+
Expected DataFrame:
+--------------------+-----------------------+-----+-----+
| col1| col2| type|count|
+--------------------+-----------------------+-----+-----+
| 1| 17| x1| 2|
| 1| 17| x2| 4|
| 1| 21| x1| 0|
| 1| 21| x2| 6|
| 1| 917| x1| 0|
| 1| 917| x2| 8|
| 1| 1| x1| 35|
| 1| 1| x2| 55|
| 1| 901| x1| 0|
| 1| 901| x2| 0|
| 1| 902| x1| 0|
| 1| 902| x2| 74|
+--------------------+-----------------------+-----+-----+
Any help is appreciated.
The STACK function acts like a reverse PIVOT:
select
  col1
  , col2
  , stack(2, 'x1', x1cnt, 'x2', x2cnt) as (type, count)
from
  table;
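If you prefer to stay in the DataFrame API instead of a SQL query, the same expression can be passed through selectExpr. A minimal PySpark sketch, assuming the input DataFrame is called df (the identical selectExpr string works from Scala as well):
# Minimal sketch, assuming the input DataFrame is named df.
# stack(2, ...) emits two rows per input row: one for x1 and one for x2.
result = df.selectExpr(
    "col1",
    "col2",
    "stack(2, 'x1', x1cnt, 'x2', x2cnt) as (type, count)"
)
result.show()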
I have a data set in PySpark like this:
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
| 1| 1| speed| 50|
| 1| 1| speed| 60|
| 1| 2| door| open|
| 1| 2| door| open|
| 1| 2| door|close|
| 1| 2| speed| 75|
| 2| 10| speed| 30|
| 2| 11| door| open|
| 2| 12| door| open|
| 2| 13| speed| 50|
| 2| 13| speed| 40|
+---+----+--------+-----+
What I want to get is something like the output below: group by id and time, pivot on category, and if the value is numeric return the average, and if it is categorical return the mode.
+---+----+--------+-----+
| id|time| door|speed|
+---+----+--------+-----+
| 1| 1| null| 55|
| 1| 2| open| 75|
| 2| 10| null| 30|
| 2| 11| open| null|
| 2| 12| open| null|
| 2| 13| null| 45|
+---+----+--------+-----+
I tried this, but for categorical values it returns null (I am not worried about the nulls in the speed column):
from pyspark.sql.functions import avg

df = user_df\
    .groupBy('id','time')\
    .pivot('category')\
    .agg(avg('value'))\
    .orderBy(['id', 'time'])
df.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|null| 75.0|
| 2| 10|null| 30.0|
| 2| 11|null| null|
| 2| 12|null| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
You can add a second aggregation (max of the raw value) alongside the avg in the pivot and coalesce the resulting columns. Try this:
import pyspark.sql.functions as F
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
#%%
#user_df.show()
df = user_df.groupBy('id','time')\
    .pivot('category')\
    .agg(F.avg('value').alias('avg'), F.max('value').alias('max'))
#%%
# The *_avg columns are null for non-numeric values, so coalesce each
# *_avg column with its *_max counterpart and keep the category name.
expr1 = [x for x in df.columns if '_avg' in x]
expr2 = [x for x in df.columns if '_max' in x]
expr = zip(expr1, expr2)
#%%
sel_expr = [F.coalesce(x[0], x[1]).alias(x[0].split('_')[0]) for x in expr]
#%%
df_final = df.select('id', 'time', *sel_expr).orderBy('id', 'time')
df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|open| 75.0|
| 2| 10|null| 30.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
Try collecting the values into a list per group with collect_list and then transforming as required.
Spark 2.4+:
from pyspark.sql.functions import col, collect_list, expr

user_df.groupby('id','time').pivot('category').agg(collect_list('value')).\
    select('id', 'time', col('door')[0].alias('door'),
           expr('''aggregate(speed, cast(0.0 as double), (acc, x) -> acc + x, acc -> acc/size(speed))''').alias('speed')).show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 2| 13|null| 45.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 10|null| 30.0|
| 1| 2|open| 75.0|
+---+----+----+-----+
I have a data frame in PySpark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to consider a condition: for each id, even though there may be more than 3 rows of a type, I want to consider only 3 (or fewer) of them.
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
The following logic should get you your desired result.
A window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter and to concatenate with type. Finally, grouping and pivoting give you the desired output.
from pyspark.sql import Window
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")
from pyspark.sql import functions as f
df.withColumn("ranks", f.row_number().over(windowSpec))\
.filter(f.col("ranks") < 4)\
.withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
.drop("ranks")\
.groupBy("id")\
.pivot("type")\
.agg(f.first("s_id"))\
.show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part:
You just need an additional filter:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as sketched below.
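A minimal sketch of that step, assuming the filtered pivot result above is stored in a variable named df_pivoted (a name chosen here for illustration) and following the column spelling of the edited data:
from pyspark.sql import functions as f

expected_cols = ['andriod1', 'andriod2', 'andriod3', 'ios1', 'ios2', 'ios3']
for c in expected_cols:
    if c not in df_pivoted.columns:
        # add the missing pivot column filled with nulls
        df_pivoted = df_pivoted.withColumn(c, f.lit(None).cast('long'))

df_pivoted.select('id', *expected_cols).show(truncate=False)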
I have created two data frames in PySpark like below. Both data frames have a column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like below when I do a full_outer_join
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done it like below, but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, I am missing id 5 in my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b (and drop it afterwards) or use a list in the join condition; a sketch of the rename-and-drop variant follows the output below.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
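A minimal sketch of the rename-and-drop variant mentioned above (the coalesce is needed because a full outer join leaves null ids on one side or the other):
import pyspark.sql.functions as F

# rename b's id so the two join keys no longer collide
b2 = b.withColumnRenamed('id', 'b_id')
a.join(b2, a.id == b2.b_id, how='full') \
 .withColumn('id', F.coalesce(F.col('id'), F.col('b_id'))) \
 .drop('b_id') \
 .select('name', 'Movie', 'id') \
 .show()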
I want to find the IDs of groups (or blocks) of trues in a Spark DataFrame. That is, I want to go from this:
>>> df.show()
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
to this:
>>> df.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
(the zeros are optional; they could be null, -1, or whatever is easier to implement)
I have a solution in Scala; it should be easy to adapt to PySpark (a PySpark sketch follows the output below). Consider the following dataframe df:
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
Then you could do:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax outside the shell

df
.withColumn("prev_bool",lag($"bool",1).over(Window.orderBy($"timestamp")))
.withColumn("block",sum(when(!$"prev_bool" and $"bool",1).otherwise(0)).over(Window.orderBy($"timestamp")))
.drop($"prev_bool")
.withColumn("block",when($"bool",$"block").otherwise(0))
.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
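For reference, a rough PySpark sketch of the same logic, assuming the same df with timestamp and bool columns (like the Scala version, the unpartitioned window will move all rows to a single partition):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy('timestamp')

result = (df
    .withColumn('prev_bool', F.lag('bool', 1).over(w))
    # a new block starts whenever bool flips from false to true
    .withColumn('block',
                F.sum(F.when(~F.col('prev_bool') & F.col('bool'), 1)
                       .otherwise(0)).over(w))
    .drop('prev_bool')
    # zero out the block id for the false rows, as in the Scala version
    .withColumn('block', F.when(F.col('bool'), F.col('block')).otherwise(0)))
result.show()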