Partially replace dataframe with rows from another - python-3.x

I have two similar dataframes, one has a single date and the other has multiple dates plus an additional column:
df:
| yyyy_mm_dd | id | region | country | product | count |
|------------|-----|--------|----------|---------|-------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 10 |
| 2021-06-14 | 111 | EMEA | England | P1 | 9 |
| 2021-06-14 | 111 | EMEA | France | P1 | 10 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 299 |
| 2021-06-14 | 111 | EMEA | England | P2 | 39 |
| 2021-06-14 | 111 | EMEA | France | P2 | 10 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 64 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 |
| 2021-06-14 | ... | ... | ... | ... | ... |
df1:
| yyyy_mm_dd | id | region | country | product | count | fullfilments |
|------------|-----|--------|----------|---------|-------|--------------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P1 | 1 | 3 |
| 2021-06-14 | 111 | EMEA | France | P1 | 2 | 4 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P2 | 2 | 1 |
| 2021-06-14 | 111 | EMEA | France | P2 | 1 | 5 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 2 | 2 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 | 1 |
| 2021-06-14 | ... | ... | ... | ... | ... | ... |
| 2021-06-13 | 111 | EMEA | Spain | P1 | 0 | 1 |
| 2021-06-13 | 111 | EMEA | England | P2 | 0 | 2 |
df1 has many dates of grouped data, while df has only one date. I would like to replace the count column in df1 with the count from df for matching rows (same yyyy_mm_dd, id, region, country, product) and retain fullfilments.
I could probably join the two and drop count from df1, but I only want to replace the count where the date matches and keep all other rows in df1 unchanged.

You can simply join and use the coalesce function.
After a left join from the multi-date dataframe to the single-date one, only the matching rows have a non-null new_count value. The coalesce function returns its first argument when it is not null and falls back to the second argument otherwise:
coalesce(a , b ) => a
coalesce(a , null) => a
coalesce(null, b ) => b
With your dataframes (note that here df1 is the single-date dataframe and df2 is the multi-date one):
from pyspark.sql import functions as f
df1 = spark.read.option("inferSchema","true").option("header","true").csv("test1.csv")
+----------+---+------+--------+-------+-----+
|yyyy_mm_dd|id |region|country |product|count|
+----------+---+------+--------+-------+-----+
|2021-06-14|111|EMEA |Spain |P1 |10 |
|2021-06-14|111|EMEA |England |P1 |9 |
|2021-06-14|111|EMEA |France |P1 |10 |
|2021-06-14|111|EMEA |Spain |P2 |299 |
|2021-06-14|111|EMEA |England |P2 |39 |
|2021-06-14|111|EMEA |France |P2 |10 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |
+----------+---+------+--------+-------+-----+
df2 = spark.read.option("inferSchema","true").option("header","true").csv("test2.csv")
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |1 |1 |
|2021-06-14|111|EMEA |England |P1 |1 |3 |
|2021-06-14|111|EMEA |France |P1 |2 |4 |
|2021-06-14|111|EMEA |Spain |P2 |1 |1 |
|2021-06-14|111|EMEA |England |P2 |2 |1 |
|2021-06-14|111|EMEA |France |P2 |1 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |2 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
The join of the two dataframes is then:
cols_to_join = ['yyyy_mm_dd', 'id', 'region', 'country', 'product']
df3 = df2.join(df1.withColumnRenamed('count', 'new_count'), cols_to_join, 'left') \
.withColumn('count', f.coalesce('new_count', 'count')).drop('new_count')
df3.show(truncate=False)
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |10 |1 |
|2021-06-14|111|EMEA |England |P1 |9 |3 |
|2021-06-14|111|EMEA |France |P1 |10 |4 |
|2021-06-14|111|EMEA |Spain |P2 |299 |1 |
|2021-06-14|111|EMEA |England |P2 |39 |1 |
|2021-06-14|111|EMEA |France |P2 |10 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
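The join-and-coalesce logic above can also be sketched without Spark. The following is a minimal plain-Python illustration, not the PySpark API: the `coalesce` helper, the `KEYS` tuple (a shortened set of join keys) and the tiny sample rows are assumptions made for the example.

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None

# Shortened join key for the illustration (the answer also uses region).
KEYS = ("yyyy_mm_dd", "id", "country", "product")

# Single-date rows that carry the counts to use.
new_rows = [
    {"yyyy_mm_dd": "2021-06-14", "id": 111, "country": "Spain", "product": "P1", "count": 10},
    {"yyyy_mm_dd": "2021-06-14", "id": 111, "country": "England", "product": "P1", "count": 9},
]

# Multi-date rows whose count should be replaced only where the key matches.
old_rows = [
    {"yyyy_mm_dd": "2021-06-14", "id": 111, "country": "Spain", "product": "P1", "count": 1, "fullfilments": 1},
    {"yyyy_mm_dd": "2021-06-14", "id": 111, "country": "England", "product": "P1", "count": 1, "fullfilments": 3},
    {"yyyy_mm_dd": "2021-06-13", "id": 111, "country": "Spain", "product": "P1", "count": 0, "fullfilments": 1},
]

# "Left join": look up the new count by key; a missing key yields None.
lookup = {tuple(row[k] for k in KEYS): row["count"] for row in new_rows}

result = [
    {**row, "count": coalesce(lookup.get(tuple(row[k] for k in KEYS)), row["count"])}
    for row in old_rows
]
```

Rows without a match keep their original count because the lookup returns `None` and `coalesce` falls through to the second argument.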

Every time you need to bring in a column from another dataframe, you have to join them:
import pyspark.sql.functions as f
df2 = df1.join(df.withColumnRenamed('count', 'new_count'),
on=['yyyy_mm_dd', 'id', 'region', 'country', 'product'], how='left')
df2 = (df2
.withColumn('count', f.coalesce('new_count', 'count'))
.drop('new_count'))
df2.show(truncate=False)

Related

Create a dataframe base on X days backward observation

Considering that I have the following DF:
|-----------------|
|Date | Cod |
|-----------------|
|2022-08-01 | A |
|2022-08-02 | A |
|2022-08-03 | A |
|2022-08-04 | A |
|2022-08-05 | A |
|2022-08-01 | B |
|2022-08-02 | B |
|2022-08-03 | B |
|2022-08-04 | B |
|2022-08-05 | B |
|-----------------|
And considering that I have a backward observation of 2 days, how can I generate the following output DF?
|------------------------------|
|RefDate | Date | Cod |
|------------------------------|
|2022-08-03 | 2022-08-01 | A |
|2022-08-03 | 2022-08-02 | A |
|2022-08-03 | 2022-08-03 | A |
|2022-08-04 | 2022-08-02 | A |
|2022-08-04 | 2022-08-03 | A |
|2022-08-04 | 2022-08-04 | A |
|2022-08-05 | 2022-08-03 | A |
|2022-08-05 | 2022-08-04 | A |
|2022-08-05 | 2022-08-05 | A |
|2022-08-03 | 2022-08-01 | B |
|2022-08-03 | 2022-08-02 | B |
|2022-08-03 | 2022-08-03 | B |
|2022-08-04 | 2022-08-02 | B |
|2022-08-04 | 2022-08-03 | B |
|2022-08-04 | 2022-08-04 | B |
|2022-08-05 | 2022-08-03 | B |
|2022-08-05 | 2022-08-04 | B |
|2022-08-05 | 2022-08-05 | B |
|------------------------------|
I know that I can use loops to generate this output DF, but loops don't perform well here since I can't cache the DF in memory (my original DF has approximately 6 billion rows). So, what is the best way to get this output?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType
data_1=[
("2022-08-01","A"),
("2022-08-02","A"),
("2022-08-03","A"),
("2022-08-04","A"),
("2022-08-05","A"),
("2022-08-01","B"),
("2022-08-02","B"),
("2022-08-03","B"),
("2022-08-04","B"),
("2022-08-05","B")
]
schema_1 = StructType([
StructField("Date", StringType(),True),
StructField("Cod", StringType(),True)
])
df_1 = spark.createDataFrame(data=data_1,schema=schema_1)
You could try a self join. If your cluster and session are configured well, it should work with 6B rows. In the snippet below, data_sdf is the input dataframe and func is pyspark.sql.functions.
data_sdf.alias('a'). \
join(data_sdf.alias('b'),
[func.col('a.cod') == func.col('b.cod'),
func.datediff(func.col('a.date'), func.col('b.date')).between(0, 2)],
'inner'
). \
drop(func.col('a.cod')). \
selectExpr('cod', 'a.date as ref_date', 'b.date as date'). \
show()
# +---+----------+----------+
# |cod| ref_date| date|
# +---+----------+----------+
# | B|2022-08-01|2022-08-01|
# | B|2022-08-02|2022-08-01|
# | B|2022-08-02|2022-08-02|
# | B|2022-08-03|2022-08-01|
# | B|2022-08-03|2022-08-02|
# | B|2022-08-03|2022-08-03|
# | B|2022-08-04|2022-08-02|
# | B|2022-08-04|2022-08-03|
# | B|2022-08-04|2022-08-04|
# | B|2022-08-05|2022-08-03|
# | B|2022-08-05|2022-08-04|
# | B|2022-08-05|2022-08-05|
# | A|2022-08-01|2022-08-01|
# | A|2022-08-02|2022-08-01|
# | A|2022-08-02|2022-08-02|
# | A|2022-08-03|2022-08-01|
# | A|2022-08-03|2022-08-02|
# | A|2022-08-03|2022-08-03|
# | A|2022-08-04|2022-08-02|
# | A|2022-08-04|2022-08-03|
# +---+----------+----------+
# only showing top 20 rows
This also generates rows for the first 2 reference dates, which do not have a full 2-day history; those rows can be filtered out.
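To make the self-join condition concrete, here is a small plain-Python sketch of the same idea, assuming daily data per Cod. The function name `backward_windows` and the rule used to discard the initial reference dates (skip any reference date less than `lookback_days` after the first date of its Cod) are assumptions for illustration, not part of the answer above.

```python
from datetime import date, timedelta

def backward_windows(rows, lookback_days=2):
    """Pair each reference date with the dates 0..lookback_days days before it, per Cod."""
    by_cod = {}
    for day, cod in rows:
        by_cod.setdefault(cod, set()).add(day)

    out = []
    for cod, days in by_cod.items():
        first = min(days)
        for ref in sorted(days):
            # Assumed rule: drop reference dates without a full look-back history.
            if (ref - first).days < lookback_days:
                continue
            for offset in range(lookback_days, -1, -1):
                candidate = ref - timedelta(days=offset)
                if candidate in days:
                    out.append((ref, candidate, cod))
    return out

# Same sample as the question: 2022-08-01 .. 2022-08-05 for Cod A and B.
sample = [(date(2022, 8, d), cod) for cod in ("A", "B") for d in range(1, 6)]
pairs = backward_windows(sample)
```

On the sample this yields three reference dates per Cod with three dates each, which is the shape of the expected output in the question.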

Look back based on X days and get col values based on condition spark

I have the following DF:
--------------------------------
|Id |Date |Value |cond |
|-------------------------------|
|1 |2022-08-03 | 100| 1 |
|1 |2022-08-04 | 200| 2 |
|1 |2022-08-05 | 150| 3 |
|1 |2022-08-06 | 300| 4 |
|1 |2022-08-07 | 400| 5 |
|1 |2022-08-08 | 150| 6 |
|1 |2022-08-09 | 500| 7 |
|1 |2022-08-10 | 150| 8 |
|1 |2022-08-11 | 150| 9 |
|1 |2022-08-12 | 700| 1 |
|1 |2022-08-13 | 800| 2 |
|1 |2022-08-14 | 150| 2 |
|1 |2022-08-15 | 300| 0 |
|1 |2022-08-16 | 200| 1 |
|1 |2022-08-17 | 150| 3 |
|1 |2022-08-18 | 150| 1 |
|1 |2022-08-19 | 250| 4 |
|1 |2022-08-20 | 150| 5 |
|1 |2022-08-21 | 400| 6 |
|2 |2022-08-03 | 100| 1 |
|2 |2022-08-04 | 200| 2 |
|2 |2022-08-05 | 150| 1 |
|2 |2022-08-06 | 300| 1 |
|2 |2022-08-07 | 400| 1 |
|2 |2022-08-08 | 150| 1 |
|2 |2022-08-09 | 125| 1 |
|2 |2022-08-10 | 150| 1 |
|2 |2022-08-11 | 150| 3 |
|2 |2022-08-12 | 170| 6 |
|2 |2022-08-13 | 150| 7 |
|2 |2022-08-14 | 150| 8 |
|2 |2022-08-15 | 300| 1 |
|2 |2022-08-16 | 150| 9 |
|2 |2022-08-17 | 150| 0 |
|2 |2022-08-18 | 400| 1 |
|2 |2022-08-19 | 150| 1 |
|2 |2022-08-20 | 500| 1 |
|2 |2022-08-21 | 150| 1 |
--------------------------------
And this one:
---------------------
|Date | cond |
|-------------------|
|2022-08-03 | 1 |
|2022-08-04 | 2 |
|2022-08-05 | 1 |
|2022-08-06 | 1 |
|2022-08-07 | 1 |
|2022-08-08 | 1 |
|2022-08-09 | 1 |
|2022-08-10 | 1 |
|2022-08-11 | 3 |
|2022-08-12 | 6 |
|2022-08-13 | 8 |
|2022-08-14 | 9 |
|2022-08-15 | 1 |
|2022-08-16 | 2 |
|2022-08-17 | 2 |
|2022-08-18 | 0 |
|2022-08-19 | 1 |
|2022-08-20 | 3 |
|2022-08-21 | 1 |
--------------------
My expected output is:
-------------------------------
|Id |Date |Avg |Count|
|-----------------------------|
|1 |2022-08-03 | 0| 0 |
|1 |2022-08-04 | 0| 0 |
|1 |2022-08-05 | 0| 0 |
|1 |2022-08-06 | 0| 0 |
|1 |2022-08-07 | 0| 0 |
|1 |2022-08-08 | 0| 0 |
|1 |2022-08-09 | 0| 0 |
|1 |2022-08-10 | 0| 0 |
|1 |2022-08-11 | 0| 0 |
|1 |2022-08-12 | 0| 0 |
|1 |2022-08-13 | 0| 0 |
|1 |2022-08-14 | 0| 0 |
|1 |2022-08-15 | 0| 0 |
|1 |2022-08-16 | 0| 0 |
|1 |2022-08-17 | 0| 0 |
|1 |2022-08-18 | 0| 0 |
|1 |2022-08-19 | 0| 0 |
|1 |2022-08-20 | 0| 0 |
|1 |2022-08-21 | 0| 0 |
|2 |2022-08-03 | 0| 0 |
|2 |2022-08-04 | 0| 0 |
|2 |2022-08-05 | 0| 1 |
|2 |2022-08-06 | 0| 2 |
|2 |2022-08-07 | 0| 3 |
|2 |2022-08-08 | 237,5| 4 |
|2 |2022-08-09 | 250| 4 |
|2 |2022-08-10 |243,75| 4 |
|2 |2022-08-11 | 0| 0 |
|2 |2022-08-12 | 0| 0 |
|2 |2022-08-13 | 0| 0 |
|2 |2022-08-14 | 0| 0 |
|2 |2022-08-15 |206,25| 4 |
|2 |2022-08-16 | 0| 0 |
|2 |2022-08-17 | 0| 0 |
|2 |2022-08-18 | 0| 0 |
|2 |2022-08-19 |243,75| 4 |
|2 |2022-08-20 | 0| 0 |
|2 |2022-08-21 | 337,5| 4 |
-------------------------------
The algorithm is:
Check whether Date and Cond are the same in the first and second DFs.
If the condition is true, I need to look back in DF1 over the four previous days (D-1, D-2, D-3, D-4) that have the same Cond and calculate the average (Avg) and count of these values. If more than 4 days are available, I need to use the 4 most recent values to calculate the Avg, and Count is always 4 in that case.
Example situations based on the inputs:
Id = 1, Date = 2022-08-08
Count is 0 because the condition is false, then Avg is 0 too.
Id = 2, Date = 2022-08-08
Count is 4 because the condition is true, then I get values of 2022-08-07, 2022-08-06, 2022-08-05, 2022-08-03. I exclude 2022-08-04 because Cond value there is 2, and the Date I'm using as reference Cond is 1.
Id = 2, Date = 2022-08-07
Count is 3 because the condition is true, but I have only the 3 values before that date, so I can't calculate the Avg since I need four values, so in that case Avg is zero.
I tried to use a window function, but with no success. I was able to produce the output DF in SQL (joins with OUTER APPLY), but Spark doesn't have OUTER APPLY. So, my questions are:
How can I generate the output DF?
What is the best way to generate the output DF?
MVCE to generate the input DFs in pyspark:
data_1=[
("1","2022-08-03",100,1),
("1","2022-08-04",200,2),
("1","2022-08-05",150,3),
("1","2022-08-06",300,4),
("1","2022-08-07",400,5),
("1","2022-08-08",150,6),
("1","2022-08-09",500,7),
("1","2022-08-10",150,8),
("1","2022-08-11",150,9),
("1","2022-08-12",700,1),
("1","2022-08-13",800,2),
("1","2022-08-14",150,2),
("1","2022-08-15",300,0),
("1","2022-08-16",200,1),
("1","2022-08-17",150,3),
("1","2022-08-18",150,1),
("1","2022-08-19",250,4),
("1","2022-08-20",150,5),
("1","2022-08-21",400,6),
("2","2022-08-03",100,1),
("2","2022-08-04",200,2),
("2","2022-08-05",150,1),
("2","2022-08-06",300,1),
("2","2022-08-07",400,1),
("2","2022-08-08",150,1),
("2","2022-08-09",125,1),
("2","2022-08-10",150,1),
("2","2022-08-11",150,3),
("2","2022-08-12",170,6),
("2","2022-08-13",150,7),
("2","2022-08-14",150,8),
("2","2022-08-15",300,1),
("2","2022-08-16",150,9),
("2","2022-08-17",150,0),
("2","2022-08-18",400,1),
("2","2022-08-19",150,1),
("2","2022-08-20",500,1),
("2","2022-08-21",150,1)
]
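from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema_1 = StructType([
StructField("Id", StringType(),True),
StructField("Date", StringType(),True),  # the dates above are strings; cast to date afterwards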
StructField("Value", IntegerType(),True),
StructField("Cond", IntegerType(),True)
])
df_1 = spark.createDataFrame(data=data_1,schema=schema_1)
data_2 = [
("2022-08-03", 1),
("2022-08-04", 2),
("2022-08-05", 1),
("2022-08-06", 1),
("2022-08-07", 1),
("2022-08-08", 1),
("2022-08-09", 1),
("2022-08-10", 1),
("2022-08-11", 3),
("2022-08-12", 6),
("2022-08-13", 8),
("2022-08-14", 9),
("2022-08-15", 1),
("2022-08-16", 2),
("2022-08-17", 2),
("2022-08-18", 0),
("2022-08-19", 1),
("2022-08-20", 3),
("2022-08-21", 1)
]
schema_2 = StructType([
StructField("Date", StringType(),True),  # the dates above are strings; cast to date afterwards
StructField("Cond", IntegerType(),True)
])
df_2 = spark.createDataFrame(data=data_2,schema=schema_2)
UPDATE: I updated the question to be clearer about the conditions for joining the DFs!
Do a left join to mark the rows whose Date and Cond match in both DFs.
Then use a window (pyspark.sql.Window) to collect the previous values into a list, and take the size of that list as Count.
Finally, use pyspark.sql.functions.aggregate (available since Spark 3.1) to compute the Avg.
from pyspark.sql import functions as F, Window
# cast to date, and rename columns for later use
df_1 = df_1.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumnRenamed("Date", "DateDf2")\
.withColumnRenamed("Cond", "CondDf2")
# left join
df = df_1.join(df_2, (df_1.Cond==df_2.CondDf2)&(df_1.Date==df_2.DateDf2), how='left')
windowSpec = Window.partitionBy("Id", "Cond").orderBy("Date")
# all the magic happens here!
df = (
# only start counting when "DateDf2" is not null, and put the values into a list
df.withColumn("value_list", F.when(F.isnull("DateDf2"), F.array()).otherwise(F.collect_list("Value").over(windowSpec.rowsBetween(-4, -1))))
.withColumn("Count", F.size("value_list"))
# use aggregate to sum up the list only if the size is 4! and divide by 4 to get average
.withColumn("Avg", F.when(F.col("count")==4, F.aggregate("value_list", F.lit(0), lambda acc,x: acc+x)/4).otherwise(F.lit(0)))
.select("Id", "Date", "Avg", "Count")
.orderBy("Id", "Date")
)
Output is:
+---+----------+------+-----+
|Id |Date |Avg |Count|
+---+----------+------+-----+
|1 |2022-08-03|0.0 |0 |
|1 |2022-08-04|0.0 |0 |
|1 |2022-08-05|0.0 |0 |
|1 |2022-08-06|0.0 |0 |
|1 |2022-08-07|0.0 |0 |
|1 |2022-08-08|0.0 |0 |
|1 |2022-08-09|0.0 |0 |
|1 |2022-08-10|0.0 |0 |
|1 |2022-08-11|0.0 |0 |
|1 |2022-08-12|0.0 |0 |
|1 |2022-08-13|0.0 |0 |
|1 |2022-08-14|0.0 |0 |
|1 |2022-08-15|0.0 |0 |
|1 |2022-08-16|0.0 |0 |
|1 |2022-08-17|0.0 |0 |
|1 |2022-08-18|0.0 |0 |
|1 |2022-08-19|0.0 |0 |
|1 |2022-08-20|0.0 |0 |
|1 |2022-08-21|0.0 |0 |
|2 |2022-08-03|0.0 |0 |
|2 |2022-08-04|0.0 |0 |
|2 |2022-08-05|0.0 |1 |
|2 |2022-08-06|0.0 |2 |
|2 |2022-08-07|0.0 |3 |
|2 |2022-08-08|237.5 |4 |
|2 |2022-08-09|250.0 |4 |
|2 |2022-08-10|243.75|4 |
|2 |2022-08-11|0.0 |0 |
|2 |2022-08-12|0.0 |0 |
|2 |2022-08-13|0.0 |0 |
|2 |2022-08-14|0.0 |0 |
|2 |2022-08-15|206.25|4 |
|2 |2022-08-16|0.0 |0 |
|2 |2022-08-17|0.0 |0 |
|2 |2022-08-18|0.0 |0 |
|2 |2022-08-19|243.75|4 |
|2 |2022-08-20|0.0 |0 |
|2 |2022-08-21|337.5 |4 |
+---+----------+------+-----+
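As a cross-check of the window logic, the same rule can be sketched in plain Python: for each row whose (Date, Cond) also appears in the second DF, take up to four earlier values with the same Id and Cond. This is a minimal sketch under stated assumptions: it uses only the Id 2 rows from the question, and the names `look_back`, `VALUES`, `CONDS` and `REF_CONDS` are hypothetical helpers for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

START = date(2022, 8, 3)
# Id 2 rows from the question, one entry per day starting at START.
VALUES = [100, 200, 150, 300, 400, 150, 125, 150, 150, 170,
          150, 150, 300, 150, 150, 400, 150, 500, 150]
CONDS = [1, 2, 1, 1, 1, 1, 1, 1, 3, 6, 7, 8, 1, 9, 0, 1, 1, 1, 1]
# Cond per day in the second DF.
REF_CONDS = [1, 2, 1, 1, 1, 1, 1, 1, 3, 6, 8, 9, 1, 2, 2, 0, 1, 3, 1]

rows = [("2", START + timedelta(days=i), v, c)
        for i, (v, c) in enumerate(zip(VALUES, CONDS))]
ref_cond = {START + timedelta(days=i): c for i, c in enumerate(REF_CONDS)}

def look_back(rows, ref_cond, n=4):
    """Return (id, date, avg, count) using the n previous rows with the same Id and Cond."""
    history = defaultdict(list)  # (id, cond) -> earlier values in date order
    out = []
    for id_, day, value, cond in sorted(rows, key=lambda r: (r[0], r[1])):
        previous = history[(id_, cond)][-n:]
        if ref_cond.get(day) == cond:
            count = len(previous)
            # The average is only defined when a full window of n values exists.
            avg = sum(previous) / n if count == n else 0.0
        else:
            count, avg = 0, 0.0
        out.append((id_, day, avg, count))
        history[(id_, cond)].append(value)
    return out

result = {(id_, day): (avg, count) for id_, day, avg, count in look_back(rows, ref_cond)}
```

Keeping a running history per (Id, Cond) plays the role of `Window.partitionBy("Id", "Cond").orderBy("Date").rowsBetween(-4, -1)` in the answer above.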
Here is an alternative solution:
from pyspark.sql import Window
import pyspark.sql.functions as F
df_1= df_1.withColumn("Date",F.col("Date").cast("timestamp"))
df_2= df_2.withColumn("Date",F.col("Date").cast("timestamp"))
window_spec = Window.partitionBy(["Id"]).orderBy("Date")
four_days_sld_wnd_exl_cuurent_row = Window.partitionBy(["Id"]).orderBy(["rnk"]).rangeBetween(-4, -1)
window_spec_count_cond_ = Window.partitionBy(["Id"]).orderBy(F.unix_timestamp("Date", 'yyyy-MM-dd') / 86400).rangeBetween(-4, -1)
agg_col_cond_ = (F.col("agg") ==0.0)
date_2_col_cond_ = (F.col("Date_2").isNull())
valid_4_days_agg_value =(F.when((~date_2_col_cond_) & (F.size(F.col("date_arrays_with_cond_1"))==4),
F.sum(F.col("Value")).over(four_days_sld_wnd_exl_cuurent_row)).otherwise(F.lit(0.0)))
count_cond_ = (F.when(~agg_col_cond_ & ~date_2_col_cond_,F.lit(4))
.when(agg_col_cond_ & date_2_col_cond_,F.lit(0))
.otherwise(F.size(F.collect_set(F.col("Date_2")).over(window_spec_count_cond_))))
df_jn = df_1.join(df_2,["Date","Cond"],"left")\
.select(df_1["*"],df_2["Date"].alias("Date_2")).orderBy("Id",df_1["Date"])
filter_having_cond_1=(F.col("Cond") == 1)
cond_columns_matching = (F.col("Date_2").isNull())
df_fnl_with_cond_val_1 = df_jn.filter(filter_having_cond_1)
df_fnl_with_cond_val_other=df_jn.filter(~filter_having_cond_1)\
.withColumn("agg",F.lit(0.0))\
.withColumn("count",F.lit(0))\
.drop("Date_2")
df_fnl_with_cond_val_1 = df_fnl_with_cond_val_1\
.withColumn("rnk",F.row_number().over(window_spec))\
.withColumn("date_arrays_with_cond_1", F.collect_set(F.col("Date")).over(four_days_sld_wnd_exl_cuurent_row))\
.withColumn("agg",valid_4_days_agg_value/4)\
.withColumn("count",count_cond_)\
.drop("date_arrays_with_cond_1","rnk","Date_2")
df_fnl = df_fnl_with_cond_val_1.unionByName(df_fnl_with_cond_val_other)
df_fnl.orderBy(["id","Date"]).show(50,0)
Output:
+---+-------------------+-----+----+------+-----+
|Id |Date |Value|Cond|agg |count|
+---+-------------------+-----+----+------+-----+
|1 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|1 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|1 |2022-08-05 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-06 00:00:00|300 |4 |0.0 |0 |
|1 |2022-08-07 00:00:00|400 |5 |0.0 |0 |
|1 |2022-08-08 00:00:00|150 |6 |0.0 |0 |
|1 |2022-08-09 00:00:00|500 |7 |0.0 |0 |
|1 |2022-08-10 00:00:00|150 |8 |0.0 |0 |
|1 |2022-08-11 00:00:00|150 |9 |0.0 |0 |
|1 |2022-08-12 00:00:00|700 |1 |0.0 |0 |
|1 |2022-08-13 00:00:00|800 |2 |0.0 |0 |
|1 |2022-08-14 00:00:00|150 |2 |0.0 |0 |
|1 |2022-08-15 00:00:00|300 |0 |0.0 |0 |
|1 |2022-08-16 00:00:00|200 |1 |0.0 |0 |
|1 |2022-08-17 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-18 00:00:00|150 |1 |0.0 |0 |
|1 |2022-08-19 00:00:00|250 |4 |0.0 |0 |
|1 |2022-08-20 00:00:00|150 |5 |0.0 |0 |
|1 |2022-08-21 00:00:00|400 |6 |0.0 |0 |
|2 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|2 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|2 |2022-08-05 00:00:00|150 |1 |0.0 |1 |
|2 |2022-08-06 00:00:00|300 |1 |0.0 |2 |
|2 |2022-08-07 00:00:00|400 |1 |0.0 |3 |
|2 |2022-08-08 00:00:00|150 |1 |237.5 |4 |
|2 |2022-08-09 00:00:00|125 |1 |250.0 |4 |
|2 |2022-08-10 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-11 00:00:00|150 |3 |0.0 |0 |
|2 |2022-08-12 00:00:00|170 |6 |0.0 |0 |
|2 |2022-08-13 00:00:00|150 |7 |0.0 |0 |
|2 |2022-08-14 00:00:00|150 |8 |0.0 |0 |
|2 |2022-08-15 00:00:00|300 |1 |206.25|4 |
|2 |2022-08-16 00:00:00|150 |9 |0.0 |0 |
|2 |2022-08-17 00:00:00|150 |0 |0.0 |0 |
|2 |2022-08-18 00:00:00|400 |1 |0.0 |0 |
|2 |2022-08-19 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-20 00:00:00|500 |1 |0.0 |0 |
|2 |2022-08-21 00:00:00|150 |1 |337.5 |4 |
+---+-------------------+-----+----+------+-----+

Extract values from column in spark dataframe and to two new columns

I have a spark dataframe that looks like this:
+----+------+-------------+
|user| level|value_pair |
+----+------+-------------+
| A | 25 |(23.52,25.12)|
| A | 6 |(0,0) |
| A | 2 |(11,12.12) |
| A | 32 |(17,16.12) |
| B | 22 |(19,57.12) |
| B | 42 |(10,3.2) |
| B | 43 |(32,21.0) |
| C | 33 |(12,0) |
| D | 32 |(265.21,19.2)|
| D | 62 |(57.12,50.12)|
| D | 32 |(75.12,57.12)|
| E | 63 |(0,0) |
+----+------+-------------+
How do I extract the values in the value_pair column and add them to two new columns called value1 and value2, using the comma as the separator?
+----+------+-------------+-------+
|user| level|value1 |value2 |
+----+------+-------------+-------+
| A | 25 |23.52 |25.12 |
| A | 6 |0 |0 |
| A | 2 |11 |12.12 |
| A | 32 |17 |16.12 |
| B | 22 |19 |57.12 |
| B | 42 |10 |3.2 |
| B | 43 |32 |21.0 |
| C | 33 |12 |0 |
| D | 32 |265.21 |19.2 |
| D | 62 |57.12 |50.12 |
| D | 32 |75.12 |57.12 |
| E | 63 |0 |0 |
+----+------+-------------+-------+
I know I can separate the values like so:
df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])
But how do I also get rid of the parentheses?
For the parentheses, as shown in the comments, you can use regexp_replace, but you need to escape them with a backslash (\), because ( and ) are special characters in regular expressions.
Also, I believe you need to remove the parentheses first and then split the column.
from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace
# raw strings keep the backslash intact for the regex engine
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(",""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)",""))
df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
.withColumn('value2', split(df['value_pair'], ',').getItem(1))
>>> df.show(truncate=False)
+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2 |
+----+-----+-----------+------+---------+
| A |25 |23.52,25.12|23.52 |25.12 |
| A |6 |0,0 |0 |0 |
| A |2 |11,12.12 |11 |12.12 |
| A |32 |17,16.12 |17 |16.12 |
| B |22 |19,57.12 |19 |57.12 |
| B |42 |10,3.2 |10 |3.2 |
| B |43 |32,21.0 |32 |21.0 |
| C |33 |12,0 |12 |0 |
| D |32 |265.21,19.2|265.21|19.2 |
| D |62 |57.12,50.12|57.12 |50.12 |
| D |32 |75.12,57.12|75.12 |57.12 |
| E |63 |0,0 |0 |0 |
+----+-----+-----------+------+---------+
Note that I slightly changed how your code grabs the two items (getItem instead of indexing).
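The escaping rule can be tried outside Spark with Python's own `re` module. This is only a sketch of the string handling, not the PySpark API; `split_pair` is a hypothetical helper, and the character class `[()]` is an alternative to escaping each parenthesis separately.

```python
import re

def split_pair(value_pair):
    """Strip the surrounding parentheses, then split on the comma."""
    # Inside a character class, ( and ) are literal, so no backslash is needed.
    stripped = re.sub(r"[()]", "", value_pair)
    first, second = stripped.split(",")
    return first, second

# Equivalent removal with explicitly escaped parentheses.
escaped = re.sub(r"\(|\)", "", "(23.52,25.12)")

pairs = [split_pair(p) for p in ["(23.52,25.12)", "(0,0)", "(11,12.12)"]]
```

The values stay strings here, as they do after `split` in the Spark version; cast them afterwards if numeric columns are needed.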

PySpark: Timeslice and split rows in dataframe with 5 minutes interval on a specific condition

I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows of 5 minute time intervals for indicator values which are not 0.
Explanation:
The first entry is at time timestamp = 2019-12-03 12:00:00.0, indicator= 0, do nothing.
Moving on to the next entry with timestamp = 2019-12-03 12:30:00.0, indicator= 1, I want to split timestamp into rows with a 5 minutes interval till we reach the next entry which is timestamp = 2019-12-03 12:37:00.0, indicator= 0.
If an entry has timestamp = 2019-12-03 13:15:00.0, indicator = 1 and the next entry has timestamp = 2019-12-03 13:17:00.0, indicator = 0, I'd like to split the row as if both times had indicator 1, because 13:17:00.0 falls within the interval 13:15:00.0 - 13:20:00.0, as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+
IIUC, you can filter rows based on the indicators of the current and the next row, and then use array + explode to create the new rows (for testing purposes, I added some more rows to your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset id column by using F.row_number().over(w1) or F.monotonically_increasing_id() based on your requirements.
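The lead + filter + explode steps can be followed in plain Python as well. The sketch below is an illustration only, under the assumption that each row is a tuple (id, sourceid, timestamp, indicator); the function name `five_minute_slices` is hypothetical.

```python
from datetime import datetime, timedelta

def five_minute_slices(rows):
    """Keep rows with indicator 1 whose next row (same sourceid) has indicator 0,
    and emit each of them twice: at its timestamp and 5 minutes later."""
    by_source = {}
    for row in sorted(rows, key=lambda r: (r[1], r[2])):
        by_source.setdefault(row[1], []).append(row)

    out = []
    for group in by_source.values():
        # zip pairs every row with its successor; the last row has none and is skipped,
        # like the null produced by lead() in the Spark version.
        for current, following in zip(group, group[1:]):
            if current[3] == 1 and following[3] == 0:
                id_, sourceid, ts, indicator = current
                out.append((id_, sourceid, ts, indicator))
                out.append((id_, sourceid, ts + timedelta(minutes=5), indicator))
    return out

def ts(hour, minute):
    return datetime(2019, 12, 3, hour, minute)

# The extended sample used in the answer above.
sample = [
    (0, 128, ts(12, 0), 0), (1, 128, ts(12, 30), 1), (2, 128, ts(12, 37), 0),
    (3, 128, ts(13, 12), 1), (4, 128, ts(13, 15), 1), (5, 128, ts(13, 17), 0),
    (6, 128, ts(13, 20), 1),
]
slices = five_minute_slices(sample)
```

Grouping by `sourceid` before pairing rows corresponds to `Window.partitionBy('sourceid').orderBy('timestamp')`.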

How to operate global variable in Spark SQL dataframe row by row sequentially on Spark cluster?

I have a dataset like this:
+-------+------+-------+
|groupid|rownum|column2|
+-------+------+-------+
| 1 | 1 | 7 |
| 1 | 2 | 9 |
| 1 | 3 | 8 |
| 1 | 4 | 5 |
| 1 | 5 | 1 |
| 1 | 6 | 0 |
| 1 | 7 | 15 |
| 1 | 8 | 1 |
| 1 | 9 | 13 |
| 1 | 10 | 20 |
| 2 | 1 | 8 |
| 2 | 2 | 1 |
| 2 | 3 | 4 |
| 2 | 4 | 2 |
| 2 | 5 | 19 |
| 2 | 6 | 11 |
| 2 | 7 | 5 |
| 2 | 8 | 6 |
| 2 | 9 | 15 |
| 2 | 10 | 8 |
(more rows follow)
I want to add a new column, column3. Consecutive rows whose column2 value is less than 10 get the same number, starting at 1. A row whose column2 value is larger than 10 is dropped, and the column3 value of the following rows increases by 1. For example, for groupid = 1, column3 is 1 for rownum 1 to 6, rownum 7 is dropped, column3 is 2 for rownum 8, and rownum 9 and 10 are dropped. After the procedure, the table looks like this:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
| 1 | 1 | 7 | 1 |
| 1 | 2 | 9 | 1 |
| 1 | 3 | 8 | 1 |
| 1 | 4 | 5 | 1 |
| 1 | 5 | 1 | 1 |
| 1 | 6 | 0 | 1 |
| 1 | 7 | 15 | drop | this row will be dropped, in fact not exist
| 1 | 8 | 1 | 2 |
| 1 | 9 | 13 | drop | like above
| 1 | 10 | 20 | drop | like above
| 2 | 1 | 8 | 1 |
| 2 | 2 | 1 | 1 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 2 | 1 |
| 2 | 5 | 19 | drop | ...
| 2 | 6 | 11 | drop | ...
| 2 | 7 | 5 | 2 |
| 2 | 8 | 6 | 2 |
| 2 | 9 | 15 | drop | ...
| 2 | 10 | 8 | 3 |
In our project, the dataset is a dataframe in Spark SQL.
I tried to solve this problem with a udf in this way:
var last_rowNum: Int = 1
var column3_Num: Int = 1
def assign_column3_Num(rowNum:Int): Int = {
if (rowNum == 1){ //do nothing, just arrange 1
column3_Num = 1
last_rowNum = 1
return column3_Num
}
/*** if the difference between rownum is 1, they have the same column3
* value, if not, column3_Num++, so they are different
*/
if(rowNum - last_rowNum == 1){
last_rowNum = rowNum
return column3_Num
}else{
column3_Num += 1
last_rowNum = rowNum
return column3_Num
}
}
val assignColumn3 = spark.sqlContext.udf.register("assign_column3_Num",assign_column3_Num _)
df.filter("column2 <= 10") //drop the larger rows
.withColumn("column3",assignColumn3(col("rownum"))) //add column3
As you can see, I use global variables. However, this only works in Spark local[1] mode. With local[8] or yarn-client the result is completely wrong, because Spark updates the global variables across tasks without regard to groupid or row order.
So the question is: how can I assign the right numbers when Spark runs on a cluster?
Should I use a udf, a udaf, an RDD, or something else?
Thank you!
You can achieve your requirement by defining a udf function as below (comments are given for clarity)
import org.apache.spark.sql.functions._
def createNewCol = udf((rownum: collection.mutable.WrappedArray[Int], column2: collection.mutable.WrappedArray[Int]) => { // udf function
var value = 1 //value for column3
var previousValue = 0 //value for checking condition
var arrayBuffer = Array.empty[(Int, Int, Int)] //initialization of array to be returned
for((a, b) <- rownum.zip(column2)){ //zipping the collected lists and looping
if(b > 10 && previousValue < 10) //checking condition for column3
value = value +1 //adding 1 for column3
arrayBuffer = arrayBuffer ++ Array((a, b, value)) //adding the values
previousValue = b
}
arrayBuffer
})
To apply the algorithm defined in the udf function, collect the values of rownum and column2 per groupid, sorted by rownum, and then call the udf function. After that, explode the result and select the necessary columns (commented for clarity). Be aware that Spark does not formally guarantee that collect_list keeps the orderBy order after a groupBy, so verify the order on your data.
df.orderBy("rownum").groupBy("groupid").agg(collect_list("rownum").as("rownum"), collect_list("column2").as("column2")) //collecting in order for generating values for column3
.withColumn("new", createNewCol(col("rownum"), col("column2"))) //calling udf function and storing the array of struct(rownum, column2, column3) in new column
.drop("rownum", "column2") //dropping unnecessary columns
.withColumn("new", explode(col("new"))) //exploding the new column array so that each row can have struct(rownum, column2, column3)
.select(col("groupid"), col("new._1").as("rownum"), col("new._2").as("column2"), col("new._3").as("column3")) //selecting as separate columns
.filter(col("column2") < 10) // filtering out the rows with column2 of 10 or more
.show(false)
You should have your desired output as
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
|1 |1 |7 |1 |
|1 |2 |9 |1 |
|1 |3 |8 |1 |
|1 |4 |5 |1 |
|1 |5 |1 |1 |
|1 |6 |0 |1 |
|1 |8 |1 |2 |
|2 |1 |8 |1 |
|2 |2 |1 |1 |
|2 |3 |4 |1 |
|2 |4 |2 |1 |
|2 |7 |5 |2 |
|2 |8 |6 |2 |
|2 |10 |8 |3 |
+-------+------+-------+-------+
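The counting rule inside the udf can be checked in isolation. Below is a plain-Python sketch of that rule (not the Scala code itself), assuming rows are tuples (groupid, rownum, column2); the function name `number_runs` and the `limit` parameter are hypothetical. State is kept per groupid instead of in global variables, which is the point of the approach above.

```python
def number_runs(rows, limit=10):
    """Assign column3 per group: bump the counter when a value above `limit`
    follows a value below it, then keep only the rows below `limit`."""
    out = []
    state = {}  # groupid -> (current column3 value, previous column2 value)
    for groupid, rownum, column2 in sorted(rows):
        value, previous = state.get(groupid, (1, 0))
        if column2 > limit and previous < limit:
            value += 1
        state[groupid] = (value, column2)
        if column2 < limit:
            out.append((groupid, rownum, column2, value))
    return out

# Sample data from the question.
group1 = [7, 9, 8, 5, 1, 0, 15, 1, 13, 20]
group2 = [8, 1, 4, 2, 19, 11, 5, 6, 15, 8]
sample = [(1, i + 1, v) for i, v in enumerate(group1)]
sample += [(2, i + 1, v) for i, v in enumerate(group2)]

numbered = number_runs(sample)
```

Because each group carries its own state and rows are processed in rownum order within the group, the result does not depend on how the groups are distributed.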
