PySpark ETL Update Dataframe - apache-spark

I have two PySpark dataframes and I want to update the "target" dataframe with the "staging" one according to the key...
What is the most optimized way to do this in PySpark?
target
+---+-----------------------+------+------+
|key|updated_timestamp |field0|field1|
+---+-----------------------+------+------+
|005|2019-10-26 21:02:30.638|cdao |coaame|
|001|2019-10-22 13:02:30.638|aaaaaa|fsdc |
|002|2019-12-22 11:42:30.638|stfi |? |
|004|2019-10-21 14:02:30.638|ct |ome |
|003|2019-10-24 21:02:30.638|io |me |
+---+-----------------------+------+------+
staging
+---+-----------------------+----------+---------+
|key|updated_timestamp |field0 |field1 |
+---+-----------------------+----------+---------+
|006|2020-03-06 01:42:30.638|new record|xxaaame |
|005|2019-10-29 09:42:30.638|cwwwwdao |coaaaaame|
|004|2019-10-29 21:03:35.638|cwwwwdao |coaaaaame|
+---+-----------------------+----------+---------+
output dataframe
+---+-----------------------+----------+---------+
|key|updated_timestamp |field0 |field1 |
+---+-----------------------+----------+---------+
|005|2019-10-29 09:42:30.638|cwwwwdao |coaaaaame|
|001|2019-10-22 13:02:30.638|aaaaaa |fsdc |
|002|2019-12-22 11:42:30.638|stfi |? |
|004|2019-10-29 21:03:35.638|cwwwwdao |coaaaaame|
|003|2019-10-24 21:02:30.638|io |me |
|006|2020-03-06 01:42:30.638|new record|xxaaame |
+---+-----------------------+----------+---------+

There are several ways to achieve that. Here is one using a full outer join:
from pyspark.sql import functions as F

output = staging.join(
    target,
    on='key',
    how='full'
).select(
    *(
        F.coalesce(staging[col], target[col]).alias(col)
        for col in staging.columns
    )
)
Note that this works correctly only if the updated values in staging are never NULL: coalesce falls back to the old target value whenever the staging value is NULL.
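If staging may legitimately carry NULLs that should overwrite the target, one workaround (a sketch, not part of the original answer) is to flag which rows come from staging and pick whole rows instead of coalescing column by column:

from pyspark.sql import functions as F

# Sketch assuming the staging/target dataframes above: mark staging rows,
# then take the staging value for those keys even when it is NULL.
flagged = staging.withColumn("_from_staging", F.lit(True))

output = flagged.join(target, on="key", how="full").select(
    "key",
    *[
        F.when(F.col("_from_staging"), flagged[c]).otherwise(target[c]).alias(c)
        for c in staging.columns if c != "key"
    ]
)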

Another solution using union:
output = staging.union(
    target.join(
        staging,
        on="key",
        how="left_anti"
    )
)
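The union approach keeps each staging row as-is, so NULL updates are preserved. To sanity-check either approach, the inputs can be rebuilt with literal rows (a sketch using a subset of the rows from the question, with timestamps kept as plain strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["key", "updated_timestamp", "field0", "field1"]
target = spark.createDataFrame([
    ("005", "2019-10-26 21:02:30.638", "cdao", "coaame"),
    ("001", "2019-10-22 13:02:30.638", "aaaaaa", "fsdc"),
    ("004", "2019-10-21 14:02:30.638", "ct", "ome"),
], cols)
staging = spark.createDataFrame([
    ("006", "2020-03-06 01:42:30.638", "new record", "xxaaame"),
    ("005", "2019-10-29 09:42:30.638", "cwwwwdao", "coaaaaame"),
    ("004", "2019-10-29 21:03:35.638", "cwwwwdao", "coaaaaame"),
], cols)

# staging rows win; target rows without a matching staging key are kept as-is
output = staging.union(target.join(staging, on="key", how="left_anti"))
output.show(truncate=False)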

Related

How does Spark SQL implement the group by aggregate

How does Spark SQL implement this group-by aggregate? I want to group by the name field and, based on the latest date, get the latest salary. How should the SQL be written?
The data is:
+----+------+-------+
|name|salary|date   |
+----+------+-------+
|AA  |  3000|2022-01|
|AA  |  4500|2022-02|
|BB  |  3500|2022-01|
|BB  |  4000|2022-02|
+----+------+-------+
The expected result is:
+----+------+
|name|salary|
+----+------+
|AA  |  4500|
|BB  |  4000|
+----+------+
Assuming the dataframe is registered as a temporary view named tmp: first use the row_number window function to assign a row number (rn) within each group (name), ordered by date in descending order, then keep the rows where rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
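For reference, the same logic expressed with the DataFrame API (a sketch, assuming the same df with name, salary and date columns) uses a Window specification instead of the temporary view:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each name by date descending, then keep the latest row
w = Window.partitionBy("name").orderBy(F.col("date").desc())

result = (df
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .select("name", "salary"))
result.show(truncate=False)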
First convert your string to a date.
Convert the date to a Unix timestamp (a numeric representation of the date, so you can use max).
Use first as an aggregate function that retrieves a value from your aggregated results. (It takes the first result, so if there is a date tie, it could pull either one.)
For example:
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.withColumn(
"dateUpdated",
unix_timestamp(
to_date(
col("updated") ,
"yyyy-MM-dd"
)
)
).groupBy("name")
.agg(
max("dateUpdated"),
first("salary").alias("Salary")
).show()
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
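One caveat not stated above: first("salary") is not guaranteed to return the salary belonging to the latest date; it simply grabs some value from the group. A sketch that ties the two together, assuming the same df and dateUpdated expression as above, is to take the max of a (dateUpdated, salary) struct:

from pyspark.sql import functions as F

# max over a struct compares field by field, so the row with the latest
# dateUpdated wins and its salary comes along with it
latest = (df
          .withColumn("dateUpdated",
                      F.unix_timestamp(F.to_date(F.col("updated"), "yyyy-MM-dd")))
          .groupBy("name")
          .agg(F.max(F.struct("dateUpdated", "salary")).alias("latest"))
          .select("name", F.col("latest.salary").alias("salary")))
latest.show()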
My usual trick is to "zip" date and salary together (depending on what you want to sort by first):
from pyspark.sql import functions as F

(df
 .groupBy('name')
 .agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
 .withColumn('max_salary', F.col('max_date_salary')[1])
 .show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+

select the trailer part of the string pyspark

I have strings as below where I need to select only the last part of the string using PySpark.
Input
/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_
/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_
Output
fsp_store_abcxyz_lmn_
fsp_store_schu_lev_bsd_s_
I am using the code below:
from pyspark.sql.functions import *
df=spark.sql("""select stack(2,"/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_","/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_") as (txt)""")
df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).display(10,False)
and my output is
store_abcxyz_lmn
store_schu_lev_bsd_s
However, my requirement is
fsp_store_schu_lev_bsd_s_
fsp_store_abcxyz_lmn_
Could you please help with the above challenge?
Try this:
import pyspark.sql.functions as F

df = spark.createDataFrame([{"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_"},
                            {"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ "}])
df.show(2, False)
+------------------------------------------------------+
|Path |
+------------------------------------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+
df.withColumn("result", F.reverse(F.split(F.reverse("Path"), "/")[0])).show(2, False)
+------------------------------------------------------+--------------------------+
|Path |result |
+------------------------------------------------------+--------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+--------------------------+
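As a side note (not part of the original answer), the same result can be obtained without the double reverse by using substring_index to take everything after the last "/":

import pyspark.sql.functions as F

# substring_index with a negative count keeps the last n "/"-separated segments
df.withColumn("result", F.substring_index("Path", "/", -1)).show(2, False)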

PySpark split using regex doesn't work on a dataframe column with string type

I have a PySpark dataframe with a string column (URL), and all records look like the following:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I want to basically extract the number after conversations/ from URL column using regex into another column.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract, regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+

how to get k-largest element and index in pyspark dataframe array

I have the following dataframe in pyspark:
+------------------------------------------------------------+
|probability |
+------------------------------------------------------------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174]|
|[0.06711381377029987,0.8775456658890036,0.05534052034069637]|
|[0.10847074295048188,0.04602848157663474,0.8455007754728833]|
+------------------------------------------------------------+
and I want to get the largest and second-largest values and their indexes:
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability                                                  |largest_1         |index_1|largest_2          |index_2|
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174] |0.5312608102025099|1      |0.27047928569511825|0      |
|[0.06711381377029987,0.8775456658890036,0.05534052034069637] |0.8775456658890036|1      |0.06711381377029987|0      |
|[0.10847074295048188,0.04602848157663474,0.8455007754728833] |0.8455007754728833|2      |0.10847074295048188|0      |
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
Here is another way using transform (requires Spark 2.4+): convert the array of doubles into an array of structs holding the value and index of each item in the original array, sort it with sort_array (descending), and then take the first N:
from pyspark.sql.functions import expr

df.withColumn('d', expr('sort_array(transform(probability, (x,i) -> (x as val, i as idx)), False)')) \
  .selectExpr(
      'probability',
      'd[0].val as largest_1',
      'd[0].idx as index_1',
      'd[1].val as largest_2',
      'd[1].idx as index_2'
  ).show(truncate=False)
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability |largest_1 |index_1|largest_2 |index_2|
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825, 0.5312608102025099, 0.19825990410237174]|0.5312608102025099|1 |0.27047928569511825|0 |
|[0.06711381377029987, 0.8775456658890036, 0.05534052034069637]|0.8775456658890036|1 |0.06711381377029987|0 |
|[0.10847074295048188, 0.04602848157663474, 0.8455007754728833]|0.8455007754728833|2 |0.10847074295048188|0 |
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
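To generalize this to the top k instead of hard-coding two columns, slice (also Spark 2.4+) can take the first k structs from the sorted array (a sketch reusing the same d expression as above):

from pyspark.sql.functions import expr

k = 2
top_k = (df
    .withColumn('d', expr('sort_array(transform(probability, (x,i) -> (x as val, i as idx)), False)'))
    .withColumn('top_k', expr(f'slice(d, 1, {k})'))
    .select('probability', 'top_k'))
top_k.show(truncate=False)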
From Spark 2.4+, you can use the array_sort and array_position built-in functions for this case.
Example:
df=spark.sql("select array(0.27047928569511825,0.5312608102025099,0.19825990410237174) probability union select array(0.06711381377029987,0.8775456658890036,0.05534052034069637) prbability union select array(0.10847074295048188,0.04602848157663474,0.8455007754728833) probability")
#DataFrame[probability: array<decimal(17,17)>]
#sample data
df.show(10,False)
#+---------------------------------------------------------------+
#|probability |
#+---------------------------------------------------------------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|
#+---------------------------------------------------------------+
df.withColumn("sort_arr",array_sort(col("probability"))).\
withColumn("largest_1",element_at(col("sort_arr"),-1)).\
withColumn("largest_2",element_at(col("sort_arr"),-2)).\
selectExpr("*","array_position(probability,largest_1) -1 index_1","array_position(probability,largest_2) -1 index_2").\
drop("sort_arr").\
show(10,False)
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|probability |largest_1 |largest_2 |index_1|index_2|
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|0.87754566588900360|0.06711381377029987|1 |0 |
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|0.53126081020250990|0.27047928569511825|1 |0 |
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|0.84550077547288330|0.10847074295048188|2 |0 |
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+

Pyspark - Join timestamp window against timestamp values

Is there any effective way of joining a list of timestamp windows against a list of timestamp values?
The dataframe A has these values:
+-------------+---------------------------------------------+----------------------+
|userid       |window                                       |total_unique_locations|
+-------------+---------------------------------------------+----------------------+
|da24a375-962a|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2                     |
|0fd2b419-d6ec|[2017-06-04 03:50:00.0,2017-06-04 03:55:00.0]|2                     |
|c8159400-fe0a|[2017-06-04 03:10:00.0,2017-06-04 03:15:00.0]|2                     |
|a4336494-3a10|[2017-06-04 03:00:00.0,2017-06-04 03:05:00.0]|3                     |
|b4590016-1af2|[2017-06-04 03:45:00.0,2017-06-04 03:50:00.0]|2                     |
|03b33b0a-e94e|[2017-06-04 03:30:00.0,2017-06-04 03:35:00.0]|2                     |
|e5e4c972-6599|[2017-06-04 03:25:00.0,2017-06-04 03:30:00.0]|5                     |
|345e81fb-5e12|[2017-06-04 03:50:00.0,2017-06-04 03:55:00.0]|2                     |
|bedd88f1-3751|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2                     |
|da401dab-e7f3|[2017-06-04 03:20:00.0,2017-06-04 03:25:00.0]|2                     |
+-------------+---------------------------------------------+----------------------+
where the data type of window is struct<start:timestamp,end:timestamp>
And the dataframe B has these values:
+-------------+---------------------+-------------------+
|userid       |eventtime            |distance           |
+-------------+---------------------+-------------------+
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.17218625176420413|
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.11145767867097957|
|9f034a1d-02c1|2017-06-04 03:00:00.0|0.14064932728588236|
|a3fac437-efcc|2017-06-04 03:00:00.0|0.08328915597349452|
|a3fac437-efcc|2017-06-04 03:00:00.0|0.07079054693441306|
+-------------+---------------------+-------------------+
I tried to use the regular join but it does not work as the window and eventtime have different data types.
A.join(B, A.userid == B.userid, A.window == B.eventtime).select("*")
Any suggestions?
The less efficient solution is to join or crossJoin with between:
from pyspark.sql.functions import col, window

a.join(b, col("eventtime").between(col("window.start"), col("window.end")))
The more efficient solution is to convert eventtime to a struct with the same definition as the existing window. For example:
(b
    .withColumn("event_window", window(col("eventtime"), "5 minutes"))
    .join(a, col("event_window") == col("window")))
You cannot join these two since the data type of window and eventtime are different.
val result = A.join(B,
  A("userid") === B("userid") &&
    (A("window.start") === B("eventtime") || A("window.end") === B("eventtime")),
  "left")
Hope this helps!
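Since the question is about PySpark, here is a rough Python equivalent of the Scala snippet above (a sketch, assuming A and B are the dataframes shown in the question):

# Left join matching on userid and on either boundary of the window struct
result = A.join(
    B,
    (A["userid"] == B["userid"]) &
    ((A["window"].getField("start") == B["eventtime"]) |
     (A["window"].getField("end") == B["eventtime"])),
    "left",
)
result.show(truncate=False)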
