combine multiple rows in Spark - apache-spark

I wonder if there is any easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using Spark SQL most of the time.
Here is a data example:
id count1 count2 count3
1 null 1 null
1 3 null null
1 null null 5
2 null 1 null
2 1 null null
2 null null 2
the expected output is :
id count1 count2 count3
1 3 1 5
2 1 1 2
I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!

Spark SQL's sum ignores nulls (they contribute nothing to the total), so if you know there are no "overlapping" data elements, just group by the column you wish to aggregate to and sum.
Assuming that you want to keep your original column names (and not sum the id column), you'll need to specify the columns that are summed and then rename them after the aggregation.
before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
| 1| null| 1| null|
| 1| 3| null| null|
| 1| null| null| 5|
| 2| null| 1| null|
| 2| 1| null| null|
| 2| null| null| 2|
+---+------+------+------+
from pyspark.sql.functions import col

after = (
    before
    .groupBy('id')
    .sum(*[c for c in before.columns if c != 'id'])
    .select([col(f"sum({c})").alias(c) for c in before.columns if c != 'id'])
)
after.show()
+------+------+------+
|count1|count2|count3|
+------+------+------+
| 3| 1| 5|
| 1| 1| 2|
+------+------+------+
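If you also want to keep the id column in the result (as in the expected output above), here is a minimal sketch of the same idea, assuming the same before DataFrame:
from pyspark.sql.functions import col

# sum every column except id, then rename sum(c) back to c and keep id
sum_cols = [c for c in before.columns if c != 'id']
after = (
    before
    .groupBy('id')
    .sum(*sum_cols)
    .select('id', *[col(f"sum({c})").alias(c) for c in sum_cols])
)
after.show()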

Related

how does sortWithinPartitions sort?

After applying sortWithinPartitions to a df and writing the output to a table I'm getting a result I'm not sure how to interpret.
df
  .select($"type", $"id", $"time")
  .sortWithinPartitions($"type", $"id", $"time")
The result file looks somewhat like:
1 a 5
2 b 1
1 a 6
2 b 2
1 a 7
2 b 3
1 a 8
2 b 4
It's not actually random, but neither is it sorted like I would expect it to be. Namely, first by type, then id, then time.
If I try to use a repartition before sorting, then I get the result I want. But for some reason the files weigh 5 times more (100 GB vs 20 GB).
I'm writing to a Hive ORC table with compression set to Snappy.
Does anyone know why it's sorted like this and why a repartition gets the right order, but a larger size?
Using spark 2.2.
The documentation of sortWithinPartitions states:
Returns a new Dataset with each partition sorted by the given expressions
The easiest way to think of this function is to imagine a fourth column (the partition id) that is used as the primary sorting criterion. The function spark_partition_id() returns the id of the partition a row belongs to.
For example, if you have just one large partition (something that you as a Spark user would never do!), sortWithinPartitions works like a normal sort:
df.repartition(1)
  .sortWithinPartitions("type", "id", "time")
  .withColumn("partition", spark_partition_id())
  .show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 1| a| 5| 0|
| 1| a| 6| 0|
| 1| a| 7| 0|
| 1| a| 8| 0|
| 2| b| 1| 0|
| 2| b| 2| 0|
| 2| b| 3| 0|
| 2| b| 4| 0|
+----+---+----+---------+
If there are more partitions, the results are only sorted within each partition:
df.repartition(4)
  .sortWithinPartitions("type", "id", "time")
  .withColumn("partition", spark_partition_id())
  .show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 2| b| 1| 0|
| 2| b| 3| 0|
| 1| a| 5| 1|
| 1| a| 6| 1|
| 1| a| 8| 2|
| 2| b| 2| 2|
| 1| a| 7| 3|
| 2| b| 4| 3|
+----+---+----+---------+
Why would one use sortWithinPartitions instead of sort? sortWithinPartitions does not trigger a shuffle, as the data is only moved within the executors. sort, however, will trigger a shuffle. Therefore sortWithinPartitions executes faster. If the data is partitioned by a meaningful column, sorting within each partition might be enough.
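A quick way to see this difference is to compare the physical plans. Here is a PySpark sketch (assuming an existing SparkSession named spark): sort adds an Exchange (shuffle) step, while sortWithinPartitions only adds a per-partition Sort.
df = spark.createDataFrame(
    [(1, "a", 5), (2, "b", 1), (1, "a", 6), (2, "b", 2)],
    ["type", "id", "time"],
)

# the plan typically contains "Exchange rangepartitioning(...)" -- a full shuffle
df.sort("type", "id", "time").explain()

# the plan contains only a Sort node (global=false) -- no shuffle
df.sortWithinPartitions("type", "id", "time").explain()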

Mark all data points between two dates in pyspark

I am given a table with sales data and data about promotions attached to it.
When an entry has promo data filled in, it means that a promo campaign started on that day for item X, and it will end at promo_end_date.
Here is an example:
+--------+--------------+-----+-------+--------+
|    date|promo_end_date|sales|item_id|promo_id|
+--------+--------------+-----+-------+--------+
|1.1.2020|      3.1.2020|    1|      1|       A|
|2.1.2020|          null|    1|      1|    null|
|3.1.2020|          null|    1|      1|    null|
|4.1.2020|          null|    1|      1|    null|
|5.1.2020|      6.1.2020|    1|      1|       B|
|6.1.2020|          null|    1|      1|    null|
|1.1.2020|          null|    1|      2|    null|
|2.1.2020|          null|    1|      2|    null|
|3.1.2020|          null|    1|      2|    null|
|4.1.2020|      6.1.2020|    1|      2|       C|
|5.1.2020|          null|    1|      2|    null|
|6.1.2020|          null|    1|      2|    null|
+--------+--------------+-----+-------+--------+
I want to create a binary column on_promo that marks each day that falls within a promo campaign.
So it should look like this:
+--------+--------------+-----+-------+--------+--------+
|    date|promo_end_date|sales|item_id|promo_id|on_promo|
+--------+--------------+-----+-------+--------+--------+
|1.1.2020|      3.1.2020|    1|      1|       A|       1|
|2.1.2020|          null|    1|      1|    null|       1|
|3.1.2020|          null|    1|      1|    null|       1|
|4.1.2020|          null|    1|      1|    null|       0|
|5.1.2020|      6.1.2020|    1|      1|       B|       1|
|6.1.2020|          null|    1|      1|    null|       1|
|1.1.2020|          null|    1|      2|    null|       0|
|2.1.2020|          null|    1|      2|    null|       0|
|3.1.2020|          null|    1|      2|    null|       0|
|4.1.2020|      6.1.2020|    1|      2|       C|       1|
|5.1.2020|          null|    1|      2|    null|       1|
|6.1.2020|          null|    1|      2|    null|       1|
+--------+--------------+-----+-------+--------+--------+
I thought it could be done with a window function, where I would partition the data by item_id and promo_id and have two conditions: a start date and an end date. However, I can't think of a way to make PySpark take the promo_end_date column as the end-date condition.
You can get the most recent promo_end_date using last with ignorenulls=True, and then compare the date with the promo_end_date to know whether there is a current promotion:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'date', F.to_date('date', 'd.M.yyyy')
).withColumn(
    'promo_end_date', F.to_date('promo_end_date', 'd.M.yyyy')
).withColumn(
    'promo_end_date',
    F.last('promo_end_date', ignorenulls=True).over(Window.partitionBy('item_id').orderBy('date'))
).withColumn(
    'on_promo', F.when(F.col('date') <= F.col('promo_end_date'), 1).otherwise(0)
)
df2.show()
+----------+--------------+-----+-------+--------+--------+
| date|promo_end_date|sales|item_id|promo_id|on_promo|
+----------+--------------+-----+-------+--------+--------+
|2020-01-01| 2020-01-03| 1| 1| A| 1|
|2020-01-02| 2020-01-03| 1| 1| null| 1|
|2020-01-03| 2020-01-03| 1| 1| null| 1|
|2020-01-04| 2020-01-03| 1| 1| null| 0|
|2020-01-05| 2020-01-06| 1| 1| B| 1|
|2020-01-06| 2020-01-06| 1| 1| null| 1|
|2020-01-01| null| 1| 2| null| 0|
|2020-01-02| null| 1| 2| null| 0|
|2020-01-03| null| 1| 2| null| 0|
|2020-01-04| 2020-01-06| 1| 2| C| 1|
|2020-01-05| 2020-01-06| 1| 2| null| 1|
|2020-01-06| 2020-01-06| 1| 2| null| 1|
+----------+--------------+-----+-------+--------+--------+
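For reference, here is a minimal sketch for building the example input above (assuming a SparkSession named spark), so the snippet can be run end to end:
data = [
    ("1.1.2020", "3.1.2020", 1, 1, "A"),
    ("2.1.2020", None, 1, 1, None),
    ("3.1.2020", None, 1, 1, None),
    ("4.1.2020", None, 1, 1, None),
    ("5.1.2020", "6.1.2020", 1, 1, "B"),
    ("6.1.2020", None, 1, 1, None),
    ("1.1.2020", None, 1, 2, None),
    ("2.1.2020", None, 1, 2, None),
    ("3.1.2020", None, 1, 2, None),
    ("4.1.2020", "6.1.2020", 1, 2, "C"),
    ("5.1.2020", None, 1, 2, None),
    ("6.1.2020", None, 1, 2, None),
]
df = spark.createDataFrame(
    data, ["date", "promo_end_date", "sales", "item_id", "promo_id"]
)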

Loop 3 times and add a new value each time to a new column in spark DF

I want to create 3 rows for every row in a PySpark DF. I want to add a new column called loopVar = (val1, val2, val3). A different value must be added as the value in each loop. Any idea how I do it?
Original:
a b c
1 2 3
1 2 3
Condition 1: loop = 1 and b is not null then loopvar = val1
Condition 2: loop = 2 and b is not null then loopvar = val2
Condition 3: loop = 3 and c is not null then loopvar = val3
Output :
a b c loopvar
1 2 3 val1
1 2 3 val1
1 2 3 val2
1 2 3 val2
1 2 3 val3
1 2 3 val3
Use a crossJoin:
df = spark.createDataFrame([[1,2,3], [1,2,3]]).toDF('a','b','c')
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([['val1'], ['val2'], ['val3']]).toDF('loopvar')
df2.show()
+-------+
|loopvar|
+-------+
| val1|
| val2|
| val3|
+-------+
df3 = df.crossJoin(df2)
df3.show()
+---+---+---+-------+
| a| b| c|loopvar|
+---+---+---+-------+
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
+---+---+---+-------+
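If the null checks from the conditions above also matter, here is a sketch of one way to apply them after the crossJoin (assuming the loopvar values map to the conditions as listed): keep val1 and val2 rows only when b is not null, and val3 rows only when c is not null.
from pyspark.sql import functions as F

df4 = df3.where(
    (F.col('loopvar').isin('val1', 'val2') & F.col('b').isNotNull())
    | ((F.col('loopvar') == 'val3') & F.col('c').isNotNull())
)
df4.show()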

Filling gaps in time series Spark for different entities

I have a data frame containing daily events related to various entities in time.
I want to fill the gaps in those time series.
Here is the aggregate data I have (left), and on the right side, the data I want to have:
+---------+----------+-------+ +---------+----------+-------+
|entity_id| date|counter| |entity_id| date|counter|
+---------+----------+-------+ +---------+----------+-------+
| 3|2020-01-01| 7| | 3|2020-01-01| 7|
| 1|2020-01-01| 10| | 1|2020-01-01| 10|
| 2|2020-01-01| 3| | 2|2020-01-01| 3|
| 2|2020-01-02| 9| | 2|2020-01-02| 9|
| 1|2020-01-03| 15| | 1|2020-01-02| 0|
| 2|2020-01-04| 3| | 3|2020-01-02| 0|
| 1|2020-01-04| 14| | 1|2020-01-03| 15|
| 2|2020-01-05| 6| | 2|2020-01-03| 0|
+---------+----------+-------+ | 3|2020-01-03| 0|
| 3|2020-01-04| 0|
| 2|2020-01-04| 3|
| 1|2020-01-04| 14|
| 2|2020-01-05| 6|
| 1|2020-01-05| 0|
| 3|2020-01-05| 0|
+---------+----------+-------+
I have used this stack overflow topic, which was very useful:
Filling gaps in timeseries Spark
Here is my code (filtered to only one entity); it is in Python, but I think the API is the same in Scala:
from pyspark.sql import functions as sf

a = 60 * 60 * 24  # seconds in one day

(
    df
    .withColumn("date", sf.to_date("created_at"))
    .groupBy(
        sf.col("entity_id"),
        sf.col("date")
    )
    .agg(sf.count(sf.lit(1)).alias("counter"))
    .filter(sf.col("entity_id") == 1)
    .select(
        sf.col("date"),
        sf.col("counter")
    )
    .join(
        spark
        .range(
            df  # range start
            .filter(sf.col("entity_id") == 1)
            .select(sf.unix_timestamp(sf.min("created_at")).alias("min"))
            .first().min // a * a,
            (df  # range end
             .filter(sf.col("entity_id") == 1)
             .select(sf.unix_timestamp(sf.max("created_at")).alias("max"))
             .first().max // a + 1) * a,
            a  # range step: seconds in one day
        )
        .select(sf.to_date(sf.from_unixtime("id")).alias("date")),
        ["date"],  # column which will be used for the join
        how="right"  # type of join
    )
    .withColumn("counter", sf.when(sf.isnull("counter"), 0).otherwise(sf.col("counter")))
    .sort(sf.col("date"))
    .show(200)
)
This works very well, but now I want to avoid the filter and fill the time series gaps for every entity (entity_id == 2, entity_id == 3, ...). For your information, depending on the entity_id value, the minimum and maximum of the date column can be different; nevertheless, if your solution involves the global minimum and maximum of the whole data frame, that is OK for me as well.
If you need any other information, feel free to ask.
edit: add data example I want to have
When creating the elements of the date range, I would rather use the Pandas function than Spark's range, as the Spark range function has some shortcomings when dealing with date values. The number of distinct dates is usually small. Even when dealing with a time span of multiple years, the number of distinct dates is so small that it can easily be broadcast in a join.
import pandas as pd
from pyspark.sql import functions as F

# get the minimum and maximum date and collect them to the driver
min_date, max_date = df.select(F.min("date"), F.max("date")).first()

# use Pandas to create all dates and switch back to a PySpark DataFrame
timerange = pd.date_range(start=min_date, end=max_date, freq='1d')
all_dates = spark.createDataFrame(timerange.to_frame(), ['date'])

# get all combinations of dates and entity_ids
all_dates_and_ids = all_dates.crossJoin(df.select("entity_id").distinct())

# create the final result by doing a left join and filling null values with 0
result = all_dates_and_ids.join(df, on=['date', 'entity_id'], how="left_outer") \
    .fillna({'counter': 0}) \
    .orderBy(['date', 'entity_id'])
This gives
+-------------------+---------+-------+
| date|entity_id|counter|
+-------------------+---------+-------+
|2020-01-01 00:00:00| 1| 10|
|2020-01-01 00:00:00| 2| 3|
|2020-01-01 00:00:00| 3| 7|
|2020-01-02 00:00:00| 1| 0|
|2020-01-02 00:00:00| 2| 9|
|2020-01-02 00:00:00| 3| 0|
|2020-01-03 00:00:00| 1| 15|
|2020-01-03 00:00:00| 2| 0|
|2020-01-03 00:00:00| 3| 0|
|2020-01-04 00:00:00| 1| 14|
|2020-01-04 00:00:00| 2| 3|
|2020-01-04 00:00:00| 3| 0|
|2020-01-05 00:00:00| 1| 0|
|2020-01-05 00:00:00| 2| 6|
|2020-01-05 00:00:00| 3| 0|
+-------------------+---------+-------+
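If a pure-Spark alternative is preferred, here is a sketch (assuming Spark 2.4+ and the same aggregated df with date, entity_id and counter columns): generate one row per day between the global minimum and maximum date with sequence and explode, then reuse the same crossJoin and left join as above.
from pyspark.sql import functions as F

# one row per calendar day between the global min and max date
all_dates = (
    df.select(F.min("date").alias("mn"), F.max("date").alias("mx"))
      .select(F.explode(F.sequence("mn", "mx", F.expr("interval 1 day"))).alias("date"))
)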

Keep track of the previous row values with additional condition using pyspark

I'm using PySpark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column will contain the previous row's non-zero value; otherwise, no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help for the rows where multiple consecutive 0 amt values appear (for example, id = 2, 3).
Code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")

# this will hold the previous row's value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))

# this will replace an amt of 0 with the previous row's value, but not for consecutive rows having 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))

# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))

# replace all null values with the previous non-zero amt row value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any other better approach?
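One possible shorter variant of the same idea, as a sketch: null out the zeros inline and carry the last non-null value forward with last(..., ignorenulls=True), which avoids the intermediate prev_amt, amt_adjusted and zeroNonZero columns.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
DF = DF.withColumn(
    "modi_amt",
    # when() without otherwise() yields null for amt == 0, which last() then skips
    F.last(F.when(F.col("amt") != 0, F.col("amt")), ignorenulls=True).over(w),
)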
