I have a string as below where I need to select only the last part of the string using PySpark.
Input
/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_
/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_
Output
fsp_store_abcxyz_lmn_
fsp_store_schu_lev_bsd_s_
I am using the code below:
from pyspark.sql.functions import *
df = spark.sql("""select stack(2, "/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_", "/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_") as (txt)""")
df.withColumn("extract", regexp_extract(col("txt"), "_(.*)", 1)).show(10, False)
and my output is
store_abcxyz_lmn
store_schu_lev_bsd_s
However, my requirement is:
fsp_store_schu_lev_bsd_s_
fsp_store_abcxyz_lmn_
Could you please help with the above challenge?
Try this:
from pyspark.sql import functions as F

df = spark.createDataFrame([{"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_"},
                            {"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_"}])
df.show(2, False)
+------------------------------------------------------+
|Path |
+------------------------------------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+
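# reverse the path, split on "/", take the first piece (the reversed last segment), then reverse it back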
df.withColumn("result", F.reverse(F.split(F.reverse("Path"), "/")[0])).show(2, False)
+------------------------------------------------------+--------------------------+
|Path |result |
+------------------------------------------------------+--------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+--------------------------+
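As a side note (my own addition, not part of the original answer), substring_index gives the same result without the double reverse:
df.withColumn("result", F.substring_index("Path", "/", -1)).show(2, False)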
I have the following DataFrame containing dates in the format yyyyMMddTHH:mm:ss plus a UTC offset.
Data Preparation
from pyspark.sql import functions as F

sparkDF = spark.createDataFrame([("20201021T00:00:00+0530",),
("20211011T00:00:00+0530",),
("20200212T00:00:00+0300",),
("20211021T00:00:00+0530",),
("20211021T00:00:00+0900",),
("20211021T00:00:00-0500",)
]
,['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I'm aware of the date format needed to parse and convert the values to DateType.
Timestamp Parsed
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see, it's specific to +0530 strings. I'm aware of the fact that I can use multiple patterns and coalesce the first non-null value.
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this? There might be a bunch of other UTC offsets within the data source. Is there a standard timezone pattern available within Spark to parse all the cases?
I think you have got the 2nd argument of your to_date function wrong, which is causing the null values in your output.
The +0530 in your timestamp is the UTC offset, which just denotes how many hours and minutes ahead (for +) or behind (for -) the timestamp is with respect to UTC.
Please refer to the response by Basil here: Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object. It has full details on the same.
To answer your question: if you replace +0530 with Z in the pattern, it should solve your problem.
Here is the Spark code in Scala that I tried and that worked:
val data = Seq("20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
import spark.implicits._
val sparkDF = data.toDF("custom_time")
import org.apache.spark.sql.functions._
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))
spark_DF2.show(false)
In the resulting output there are no null values.
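For PySpark users, a minimal equivalent sketch (my addition, reusing the sparkDF defined above) would be:

sparkDF.select(
    F.to_date(F.col('timestamp'), "yyyyMMdd'T'HH:mm:ssZ").alias('timestamp_parsed')
).show()

Note that, as the next answer explains, parsing with an offset pattern converts the instant before extracting the date, so a timestamp like 20201021T00:00:00+0530 may land on the previous day depending on the session time zone.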
You can use x, X or Z for the offset pattern, as you can find on the Spark datetime pattern documentation page. You can then parse your date with the following complete pattern: yyyyMMdd'T'HH:mm:ssxx
However, if you use this kind of offset pattern, your date will first be converted to UTC, meaning a midnight timestamp with a positive offset will be matched to the previous day. For instance, "20201021T00:00:00+0530" will be matched to 2020-10-20 using to_date with the previous pattern.
If you want the displayed date to match the date part of the string, ignoring the offset, you should first extract the date string from the complete timestamp string using the regexp_extract function, then apply to_date.
If you take your example "20201021T00:00:00+0530", what you want to extract with a regexp is the 20201021 part, and then apply to_date on it. You can do that with the following pattern: ^(\\d+). If you're interested, you can find out how to build other patterns in Java's Pattern documentation.
So your code should be:
from pyspark.sql import functions as F
sparkDF.select(
F.to_date(
        F.regexp_extract(F.col('timestamp'), '^(\\d+)', 1), 'yyyyMMdd'
).alias('timestamp_parsed')
).show()
And with your input you will get:
+----------------+
|timestamp_parsed|
+----------------+
|2020-10-21 |
|2021-10-11 |
|2020-02-12 |
|2021-10-21 |
|2021-10-21 |
|2021-10-21 |
+----------------+
You can create a "udf" in Spark and use it. Below is the code in Scala.
import spark.implicits._
//just to create the dataset for the example you have given
val data = Seq(
("20201021T00:00:00+0530"),
("20211011T00:00:00+0530"),
("20200212T00:00:00+0300"),
("20211021T00:00:00+0530"),
("20211021T00:00:00+0900"),
("20211021T00:00:00-0500"))
val dataset = data.toDF("timestamp")
import java.time.{OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions

val udfToDateUTC = functions.udf((epochMilliUTC: String) => {
val formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HH:mm:ssZ")
val res = OffsetDateTime.parse(epochMilliUTC, formatter).withOffsetSameInstant(ZoneOffset.UTC)
res.toString()
})
dataset.select(dataset.col("timestamp"),udfToDateUTC(dataset.col("timestamp")).alias("timestamp_parsed")).show(false)
//output
+----------------------+-----------------+
|timestamp |timestamp_parsed |
+----------------------+-----------------+
|20201021T00:00:00+0530|2020-10-20T18:30Z|
|20211011T00:00:00+0530|2021-10-10T18:30Z|
|20200212T00:00:00+0300|2020-02-11T21:00Z|
|20211021T00:00:00+0530|2021-10-20T18:30Z|
|20211021T00:00:00+0900|2021-10-20T15:00Z|
|20211021T00:00:00-0500|2021-10-21T05:00Z|
+----------------------+-----------------+
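If you need the same thing in PySpark, a rough equivalent sketch (my own translation, not from the original answer) could use a Python UDF:

from datetime import datetime, timezone
from pyspark.sql import functions as F

@F.udf('string')
def to_utc_string(ts):
    # parse the offset-aware string and normalize the instant to UTC
    return datetime.strptime(ts, '%Y%m%dT%H:%M:%S%z').astimezone(timezone.utc).isoformat()

sparkDF.select('timestamp', to_utc_string('timestamp').alias('timestamp_parsed')).show(truncate=False)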
from pyspark.sql.functions import date_format
customer_data = df.select("<column_name>", date_format("<column_name>", 'yyyyMMdd').alias('customer'))
I have a PySpark dataframe with a string column (URL), and all records look the following way:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I basically want to extract the number after conversations/ from the URL column into another column using regex.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract and regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
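Equivalently (Spark 2.4+), element_at over a split should give the same result; this variant is my own addition, not part of the original answer:

df1 = df.withColumn("CONV_ID", F.element_at(F.split("URL", "/"), -1))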
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
I'm having trouble splitting a dataframe's column into more columns in PySpark:
I have a list of lists and I want to transform it into a dataframe, each value in one column.
What I have tried:
I created a dataframe from this list:
[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
Using this code:
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
The result I got is:
+----+--------------------+
|col1| col2|
+----+--------------------+
| 0|[COL-4560, COL-96...|
| 1|[DLL-7760, NAT-98...|
+----+--------------------+
I want to separate the values by comma into columns, so I tried:
from pyspark.sql import functions as F
df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
But I get an error on this line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')):
AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
My ideal final output would be like this:
+----------+----------+----------+----------+----------+
| SKU | REC_01 | REC_02 | REC_03 | REC_04 |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+----------+----------+----------+----------+----------+
Some rows may have four values, but some may have more or less; I don't know the exact number of columns the final dataframe will have.
Does anyone have any idea of what is happening? Thank you very much in advance.
The col2 column of dataframe df_from_list is already of array type, so there is no need to split (split works on StringType; here we have ArrayType).
Here are the steps that will work for you.
recs_list=[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
from pyspark.sql import functions as F
df2 = df_from_list
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
cols=['SKU','REC_01','REC_02','REC_03','REC_04']
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()
#+--------+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03| REC_04|
#+--------+--------+--------+--------+--------+
#|COL-4560|COL-9655|NWG-0610|D81-3754| null|
#|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
#+--------+--------+--------+--------+--------+
I have two PySpark dataframes and I want to update the "target" dataframe with the "staging" one according to the key...
What is the best optimized way to do this in PySpark?
target
+---+-----------------------+------+------+
|key|updated_timestamp |field0|field1|
+---+-----------------------+------+------+
|005|2019-10-26 21:02:30.638|cdao |coaame|
|001|2019-10-22 13:02:30.638|aaaaaa|fsdc |
|002|2019-12-22 11:42:30.638|stfi |? |
|004|2019-10-21 14:02:30.638|ct |ome |
|003|2019-10-24 21:02:30.638|io |me |
+---+-----------------------+------+------+
staging
+---+-----------------------+----------+---------+
|key|updated_timestamp |field0 |field1 |
+---+-----------------------+----------+---------+
|006|2020-03-06 01:42:30.638|new record|xxaaame |
|005|2019-10-29 09:42:30.638|cwwwwdao |coaaaaame|
|004|2019-10-29 21:03:35.638|cwwwwdao |coaaaaame|
+---+-----------------------+----------+---------+
output dataframe
+---+-----------------------+----------+---------+
|key|updated_timestamp |field0 |field1 |
+---+-----------------------+----------+---------+
|005|2019-10-29 09:42:30.638|cwwwwdao |coaaaaame|
|001|2019-10-22 13:02:30.638|aaaaaa |fsdc |
|002|2019-12-22 11:42:30.638|stfi |? |
|004|2019-10-29 21:03:35.638|cwwwwdao |coaaaaame|
|003|2019-10-24 21:02:30.638|io |me |
|006|2020-03-06 01:42:30.638|new record|xxaaame |
+---+-----------------------+----------+---------+
There are several ways to achieve that. Here is one using a full outer join:
from pyspark.sql import functions as F
output = staging.join(
target,
on='key',
how='full'
).select(
*(
F.coalesce(staging[col], target[col]).alias(col)
for col
in staging.columns
)
)
This works only if the updated value is not NULL.
Another solution uses union with a left anti-join, which keeps only the target rows whose key is absent from staging:
output = staging.union(
target.join(
staging,
on="key",
how="left_anti"
)
)
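If both dataframes share the same schema, a third option (my own variation, not from the original answers) is to keep the most recent row per key, using updated_timestamp to break ties:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank rows per key, newest first, then keep only the top row
w = Window.partitionBy('key').orderBy(F.col('updated_timestamp').desc())
output = (target.unionByName(staging)
          .withColumn('rn', F.row_number().over(w))
          .filter('rn = 1')
          .drop('rn'))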
I am using spark-sql 2.4.1 with Java 8.
I have the following scenario:
val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).toDF("item1_value","item2_value","item3_value")
.withColumn("item1_value", $"item1_value".cast(DoubleType))
.withColumn("item2_value", $"item2_value".cast(DoubleType))
.withColumn("item3_value", $"item3_value".cast(DoubleType))
df.show(20)
I need an expected output something like this:
-----------------------------------------------------------------------------------
col_name | sum_of_column | avg_of_column | vari_of_column
-----------------------------------------------------------------------------------
"item1_value" | sum("item1_value") | avg("item1_value") | variance("item1_value")
"item2_value" | sum("item2_value") | avg("item2_value") | variance("item2_value")
"item3_value" | sum("item3_value") | avg("item3_value") | variance("item3_value")
----------------------------------------------------------------------------------
How can I achieve this dynamically? Tomorrow I may have additional columns.
This is sample code that can achieve this. You can make the column list dynamic and add more aggregate functions if needed.
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).
toDF("item1_value","item2_value","item3_value").
withColumn("item1_value", $"item1_value".cast(DoubleType)).
withColumn("item2_value", $"item2_value".cast(DoubleType)).
withColumn("item3_value", $"item3_value".cast(DoubleType))
val aggregateColumns = Seq("item1_value","item2_value","item3_value")
val aggDFs = aggregateColumns.map( c => {
  df.groupBy().agg(lit(c).as("col_name"), sum(c).as("sum_of_column"), avg(c).as("avg_of_column"), variance(c).as("var_of_column"))
})
val combinedDF = aggDFs.reduce(_ union _)
This returns the following output:
scala> df.show(10,false)
+-----------+-----------+-----------+
|item1_value|item2_value|item3_value|
+-----------+-----------+-----------+
|0.9192019 |0.1992019 |0.9955999 |
|0.9292018 |0.2992019 |0.99662018 |
|0.9392017 |0.3992019 |0.99772 |
+-----------+-----------+-----------+
scala> combinedDF.show(10,false)
+-----------+------------------+------------------+---------------------+
|col_name |sum_of_column |avg_of_column |var_of_column |
+-----------+------------------+------------------+---------------------+
|item1_value|2.7876054 |0.9292018 |9.999800000999957E-5 |
|item2_value|0.8976057000000001|0.2992019 |0.010000000000000002 |
|item3_value|2.9899400800000002|0.9966466933333334|1.1242332201333484E-6|
+-----------+------------------+------------------+---------------------+
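For a PySpark version of the same idea (my own translation, assuming the equivalent df already exists), the pattern maps directly:

from functools import reduce
from pyspark.sql import functions as F

aggregate_columns = ["item1_value", "item2_value", "item3_value"]
# build one single-row aggregate dataframe per column, then union them all
agg_dfs = [
    df.groupBy().agg(F.lit(c).alias("col_name"),
                     F.sum(c).alias("sum_of_column"),
                     F.avg(c).alias("avg_of_column"),
                     F.variance(c).alias("var_of_column"))
    for c in aggregate_columns
]
combined_df = reduce(lambda a, b: a.union(b), agg_dfs)
combined_df.show(truncate=False)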