I have data like this:
2014 a 1
2015 b 2
2014 a 2
2015 c 4
2014 b 2
How do I transform it to:
a b c
2014 3 2 0
2015 0 2 4
in Spark?
Thanks.
This is a prototypical application of a pivot table.
df.show()
//
//+------+------+----+
//|letter|number|year|
//+------+------+----+
//| a| 1|2014|
//| b| 2|2015|
//| a| 2|2014|
//| c| 4|2015|
//| b| 2|2014|
//+------+------+----+
val pivot = df.groupBy("year")
  .pivot("letter")
  .sum("number")
  .na.fill(0, Seq("a"))
  .na.fill(0, Seq("c"))
pivot.show()
//+----+---+---+---+
//|year| a| b| c|
//+----+---+---+---+
//|2014| 3| 2| 0|
//|2015| 0| 2| 4|
//+----+---+---+---+
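If you are working in PySpark rather than Scala, a minimal equivalent sketch (assuming the same column names letter, number, and year) could look like this:
# assumed PySpark equivalent of the Scala pivot above; df has columns letter, number, year
pivot = (
    df.groupBy("year")
      .pivot("letter")
      .sum("number")
      .na.fill(0)  # fill all null cells with 0 in a single call
)
pivot.show()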
I want to split my PySpark dataframe into groups with a monotonically increasing trend and keep the groups with size greater than 10.
Here is the part of the code I have tried:
from pyspark.sql import functions as F, Window
df = df1.withColumn(
    "FLAG_INCREASE",
    F.when(
        F.col("x")
        > F.lag("x").over(Window.partitionBy("x1").orderBy("time")),
        1,
    ).otherwise(0),
)
I don't know how to group by consecutive ones in PySpark; if anyone has a better solution for this, please share.
In pandas we can do the same thing like this:
df = df1.groupby((df1['x'].diff() < 0).cumsum())
How do I convert this code to PySpark?
Example dataframe:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 4
9 3
10 2
11 1
12 2
13 3
14 4
15 5
16 5
17 6
Expected output:
group1:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
group2:
x
0 1
1 2
2 3
3 4
4 5
5 5
6 6
I'll map out all the steps (and keep all columns in the output) to replicate (df1['x'].diff() < 0).cumsum(), which is easy to calculate using a lag.
However, it is important that your data has an ID column that captures the order of the dataframe, because unlike pandas, Spark does not retain a dataframe's ordering (due to its distributed nature). For this example, I've assumed that your data has an ID column named idx, which is the index you printed in your example input.
# input data
data_sdf.show(5)
# +---+---+
# |idx|val|
# +---+---+
# | 0| 1|
# | 1| 2|
# | 2| 2|
# | 3| 2|
# | 4| 3|
# +---+---+
# only showing top 5 rows
# calculating the group column
import sys
from pyspark.sql import functions as func, Window as wd

data_sdf. \
    withColumn('diff_with_prevval',
               func.col('val') - func.lag('val').over(wd.orderBy('idx'))
               ). \
    withColumn('diff_lt_0',
               func.coalesce((func.col('diff_with_prevval') < 0).cast('int'), func.lit(0))
               ). \
    withColumn('diff_lt_0_cumsum',
               func.sum('diff_lt_0').over(wd.orderBy('idx').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+---+-----------------+---------+----------------+
# |idx|val|diff_with_prevval|diff_lt_0|diff_lt_0_cumsum|
# +---+---+-----------------+---------+----------------+
# | 0| 1| null| 0| 0|
# | 1| 2| 1| 0| 0|
# | 2| 2| 0| 0| 0|
# | 3| 2| 0| 0| 0|
# | 4| 3| 1| 0| 0|
# | 5| 3| 0| 0| 0|
# | 6| 4| 1| 0| 0|
# | 7| 5| 1| 0| 0|
# | 8| 4| -1| 1| 1|
# | 9| 3| -1| 1| 2|
# | 10| 2| -1| 1| 3|
# | 11| 1| -1| 1| 4|
# | 12| 2| 1| 0| 4|
# | 13| 3| 1| 0| 4|
# | 14| 4| 1| 0| 4|
# | 15| 5| 1| 0| 4|
# | 16| 5| 0| 0| 4|
# | 17| 6| 1| 0| 4|
# +---+---+-----------------+---------+----------------+
You can now use the diff_lt_0_cumsum column in your groupBy() for further processing.
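To cover the rest of the original requirement (keeping only groups with more than 10 rows), one possible continuation, sketched under the assumption that the dataframe produced above is stored as grouped_sdf (a hypothetical name), is a window count:
from pyspark.sql import functions as func, Window as wd

# hypothetical continuation: grouped_sdf is data_sdf with the diff_lt_0_cumsum column attached
group_size = func.count('*').over(wd.partitionBy('diff_lt_0_cumsum'))

filtered_sdf = grouped_sdf \
    .withColumn('group_size', group_size) \
    .filter(func.col('group_size') > 10) \
    .drop('group_size')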
After applying sortWithinPartitions to a df and writing the output to a table, I'm getting a result I'm not sure how to interpret.
df
  .select($"type", $"id", $"time")
  .sortWithinPartitions($"type", $"id", $"time")
The result file looks something like:
1 a 5
2 b 1
1 a 6
2 b 2
1 a 7
2 b 3
1 a 8
2 b 4
It's not actually random, but neither is it sorted like I would expect it to be. Namely, first by type, then id, then time.
If I use a repartition before sorting, then I get the result I want, but for some reason the files weigh 5 times more (100 GB vs 20 GB).
I'm writing to a Hive ORC table with compression set to snappy.
Does anyone know why it's sorted like this, and why a repartition gives the right order but a larger size?
Using Spark 2.2.
The documentation of sortWithinPartitions states:
Returns a new Dataset with each partition sorted by the given expressions
The easiest way to think of this function is to imagine a fourth column (the partition id) that is used as the primary sorting criterion. The function spark_partition_id() returns the id of the partition a row belongs to.
For example, if you have just one large partition (something that you as a Spark user would never do!), sortWithinPartitions works as a normal sort:
df.repartition(1)
  .sortWithinPartitions("type", "id", "time")
  .withColumn("partition", spark_partition_id())
  .show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 1| a| 5| 0|
| 1| a| 6| 0|
| 1| a| 7| 0|
| 1| a| 8| 0|
| 2| b| 1| 0|
| 2| b| 2| 0|
| 2| b| 3| 0|
| 2| b| 4| 0|
+----+---+----+---------+
If there are more partitions, the results are only sorted within each partition:
df.repartition(4)
  .sortWithinPartitions("type", "id", "time")
  .withColumn("partition", spark_partition_id())
  .show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 2| b| 1| 0|
| 2| b| 3| 0|
| 1| a| 5| 1|
| 1| a| 6| 1|
| 1| a| 8| 2|
| 2| b| 2| 2|
| 1| a| 7| 3|
| 2| b| 4| 3|
+----+---+----+---------+
Why would one use sortWithinPartitions instead of sort? sortWithinPartitions does not trigger a shuffle, as the data is only moved within the executors. sort, however, will trigger a shuffle. Therefore sortWithinPartitions executes faster. If the data is partitioned by a meaningful column, sorting within each partition might be enough.
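You can see the difference in the physical plans yourself; here is a quick PySpark sketch (assuming a DataFrame named df with these columns):
# sortWithinPartitions: no Exchange (shuffle) node in the plan
df.sortWithinPartitions("type", "id", "time").explain()

# sort: the plan contains an Exchange caused by range partitioning
df.sort("type", "id", "time").explain()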
I want to create 3 rows for every row in a PySpark DF. I want to add a new column called loopVar = (val1, val2, val3). A different value must be added as the column value in each loop. Any idea how to do it?
Original:
a b c
1 2 3
1 2 3
Condition 1: loop = 1 and b is not null then loopvar = val1
Condition 2: loop = 2 and b is not null then loopvar = val2
Condition 3: loop = 3 and c is not null then loopvar = val3
Output :
a b c loopvar
1 2 3 val1
1 2 3 val1
1 2 3 val2
1 2 3 val2
1 2 3 val3
1 2 3 val3
Use a crossJoin:
df = spark.createDataFrame([[1,2,3], [1,2,3]]).toDF('a','b','c')
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([['val1'], ['val2'], ['val3']]).toDF('loopvar')
df2.show()
+-------+
|loopvar|
+-------+
| val1|
| val2|
| val3|
+-------+
df3 = df.crossJoin(df2)
df3.show()
+---+---+---+-------+
| a| b| c|loopvar|
+---+---+---+-------+
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
+---+---+---+-------+
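If you also need the null checks from the three conditions in the question, a hedged follow-up sketch (assuming val1 and val2 depend on b, and val3 on c, as stated) could filter the cross-joined result:
from pyspark.sql import functions as F

# keep a row only when the column required by its loopvar value is not null
df4 = df3.filter(
    (F.col("loopvar").isin("val1", "val2") & F.col("b").isNotNull())
    | ((F.col("loopvar") == "val3") & F.col("c").isNotNull())
)
df4.show()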
I have a spark df as follows:
p a b c
p1 2 2 1
p2 4 3 2
I want to transpose it to below format using PySpark code:
p col1 col2
p1 a 2
p1 b 2
p1 c 1
p2 a 4
p2 b 3
p2 c 2
How can I do this?
Try the arrays_zip and explode functions.
Example:
df.show()
#+---+---+---+---+
#| p| a| b| c|
#+---+---+---+---+
#| p1| 2| 2| 1|
#| p2| 4| 3| 2|
#+---+---+---+---+
df.withColumn("arr",explode(arrays_zip(array(lit("a"),lit("b"),lit("c")),array(col("a"),col("b"),col("c"))))).\
select("p","arr.*").\
withColumnRenamed("0","col1").\
withColumnRenamed("1","col2").\
show()
#dynamically getting column names from dataframe
arr=[ lit('{}'.format(d)) for d in df.columns if d !='p']
df.withColumn("arr",explode(arrays_zip(array(arr),array(col("a"),col("b"),col("c"))))).select("p","arr.*").\
withColumnRenamed("0","col1").\
withColumnRenamed("1","col2").\
show()
#+---+----+----+
#| p|col1|col2|
#+---+----+----+
#| p1| a| 2|
#| p1| b| 2|
#| p1| c| 1|
#| p2| a| 4|
#| p2| b| 3|
#| p2| c| 2|
#+---+----+----+
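An alternative sketch for the same unpivot uses the SQL stack() generator via selectExpr (the literal 3 is the number of columns being unpivoted):
# unpivot p, a, b, c into (p, col1, col2) with stack()
df.selectExpr(
    "p",
    "stack(3, 'a', a, 'b', b, 'c', c) as (col1, col2)"
).show()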
I have input transactions as shown below:
apples,mangos,eggs
milk,oranges,eggs
milk, cereals
mango,apples
I have to generate a Spark dataframe of the co-occurrence matrix, like this:
apple mango milk cereals eggs
apple 2 2 0 0 1
mango 2 2 0 0 1
milk 0 0 2 1 1
cereals 0 0 1 1 0
eggs 1 1 1 0 2
Apples and mangos are bought together twice, so matrix[apple][mango] = 2.
I am stuck for ideas on implementing this. Any suggestions would be a great help. I am using PySpark to implement this.
If the data looks like this:
df = spark.createDataFrame(
    ["apples,mangos,eggs", "milk,oranges,eggs", "milk,cereals", "mangos,apples"],
    "string"
).toDF("basket")
Import the required functions:
from pyspark.sql.functions import split, explode, monotonically_increasing_id
Split and explode:
long = (df
    .withColumn("id", monotonically_increasing_id())
    .select("id", explode(split("basket", ","))))
Self-join and crosstab:
long.withColumnRenamed("col", "col_").join(long, ["id"]).stat.crosstab("col_", "col").show()
# +--------+------+-------+----+------+----+-------+
# |col__col|apples|cereals|eggs|mangos|milk|oranges|
# +--------+------+-------+----+------+----+-------+
# | cereals| 0| 1| 0| 0| 1| 0|
# | eggs| 1| 0| 2| 1| 1| 1|
# | milk| 0| 1| 1| 0| 2| 1|
# | mangos| 2| 0| 1| 2| 0| 0|
# | apples| 2| 0| 1| 2| 0| 0|
# | oranges| 0| 0| 1| 0| 1| 1|
# +--------+------+-------+----+------+----+-------+
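If you want the matrix rows in a stable order, a small follow-up sketch just keeps the crosstab as a DataFrame and sorts it by its first column (crosstab names it "col__col" here):
cooc = (long.withColumnRenamed("col", "col_")
            .join(long, ["id"])
            .stat.crosstab("col_", "col")
            .orderBy("col__col"))
cooc.show()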