Pyspark - How to decode a column in URL format

Do you know how to decode the 'campaign' column below in Pyspark? The records in this column are strings in URL format:
+--------------------+------------------------+
|user_id             |campaign                |
+--------------------+------------------------+
|alskd9239as23093    |MM+%7C+Cons%C3%B3rcios+%|
|lfifsf093039388     |Aquisi%C3%A7%C3%A3o+%7C |
|kasd877191kdsd999   |Aquisi%C3%A7%C3%A3o+%7C |
+--------------------+------------------------+
I know that it is possible to do this with the urllib library in Python. However, my dataset is large and it takes too long to convert it to a pandas dataframe. Do you know how to do this with a Spark DataFrame?

There is no need to convert to an intermediate pandas dataframe; you can use a PySpark user-defined function (udf) to unquote the quoted strings:
from urllib.parse import unquote
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = df.withColumn('campaign', F.udf(unquote, StringType())('campaign'))
If there are null values in the campaign column, then you have to do a null check before unquoting the strings:
f = lambda s: unquote(s) if s else s
df = df.withColumn('campaign', F.udf(f, StringType())('campaign'))
+-----------------+-----------------+
|          user_id|         campaign|
+-----------------+-----------------+
| alskd9239as23093|MM+|+Consórcios+%|
|  lfifsf093039388|      Aquisição+||
|kasd877191kdsd999|      Aquisição+||
+-----------------+-----------------+
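If the column is very large, a vectorized pandas UDF can reduce the per-row serialization overhead of a plain udf. This is a minimal sketch, not from the original answer, assuming Spark 3.x with pyarrow installed:
import pandas as pd
from urllib.parse import unquote
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Vectorized variant: unquote a whole pandas Series per Arrow batch,
# skipping nulls the same way as the lambda above.
@F.pandas_udf(StringType())
def unquote_batch(s: pd.Series) -> pd.Series:
    return s.map(lambda v: unquote(v) if v else v)

df = df.withColumn('campaign', unquote_batch('campaign'))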

Related

Apply wordninja.split() using pandas_udf

I have a dataframe df with the column sld of type string which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it on the column sld:
df1 = df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed that there is a similar problem with the well-known function split(), but the workaround is to use string.str as mentioned here. This doesn't work on wordninja.split.
Is there any workaround for this issue?
Edit: I think in a nutshell the issue is:
the pandas_udf input is a pd.Series, while wordninja.split expects a string.
My df looks like this:
+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+
I want something like this:
+-------------+---------------------+
|sld          |test                 |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+
Just use .apply to perform the computation on each element of the pandas Series, something like this:
import pandas as pd
import wordninja
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    splitted = x.apply(lambda s: wordninja.split(s))
    return splitted
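Applied to the question's dataframe in the same way (a short usage sketch, column names taken from the question):
from pyspark.sql.functions import col

df1 = df.withColumn('test', split_word(col('sld')))
df1.show(truncate=False)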
One way is using udf.
import wordninja
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])

@F.udf(ArrayType(StringType()))
def split_word(x):
    return wordninja.split(x)

df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# | sld| col2|
# +-----------+-----------------+
# |hellofriend| [hello, friend]|
# |restinpeace|[rest, in, peace]|
# | this| [this]|
# | that| [that]|
# +-----------+-----------------+

Splitting a string column into into 2 in PySpark

Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
+-------------------------------+
|file name                      |
+-------------------------------+
|/level1/level2/level3/file1.ext|
|/level1/file1000.ext           |
|/level1/level2/file20.ext      |
+-------------------------------+
Output:
+------------+----------------------+
|file name   |path                  |
+------------+----------------------+
|file1.ext   |/level1/level2/level3/|
|file1000.ext|/level1/              |
|file20.ext  |/level1/level2/       |
+------------+----------------------+
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
df \
    .withColumn("file_name", element_at(split("raw", "/"), -1)) \
    .withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
    .withColumn("path", regexp_extract("raw", "^.*/", 0)) \
    .show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw |file_name |file_name2 |path |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext |file1.ext |/level1/level2/level3/|
|/level1/file1000.ext |file1000.ext|file1000.ext|/level1/ |
|/level1/level2/file20.ext |file20.ext |file20.ext |/level1/level2/ |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('/level1/level2/level3/file1.ext',),
     ('/level1/file1000.ext',),
     ('/level1/level2/file20.ext',)],
    ['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name |file |path |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext |/level1/level2/level3/|
# |/level1/file1000.ext |file1000.ext|/level1/ |
# |/level1/level2/file20.ext |file20.ext |/level1/level2/ |
# +-------------------------------+------------+----------------------+
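To reproduce the exact output columns from the question (file name and path), the two ideas can be combined. This is a sketch, not from either answer, assuming the input column is literally named file name as in the question:
from pyspark.sql import functions as F

out = (
    df.withColumn('path', F.regexp_extract(F.col('file name'), '^.*/', 0))
      .withColumn('file name', F.substring_index(F.col('file name'), '/', -1))
      .select('file name', 'path')
)
out.show(truncate=False)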

Parse Date Format

I have the following DataFrame containing the date format - yyyyMMddTHH:mm:ss+UTC
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",),
                               ("20211011T00:00:00+0530",),
                               ("20200212T00:00:00+0300",),
                               ("20211021T00:00:00+0530",),
                               ("20211021T00:00:00+0900",),
                               ("20211021T00:00:00-0500",)],
                              ['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I'm aware of the date format needed to parse and convert the values to DateType.
Timestamp Parsed
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see, this is specific to +0530 strings. I'm aware of the fact that I can use multiple patterns and coalesce the first non-null value.
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this? Since there might be a bunch of other UTC offsets within the data source, is there a standard offset pattern available within Spark to parse all the cases?
I think you have got the 2nd argument of your to_date function wrong, which is causing the null values in your output.
The +0530 in your timestamp is the UTC offset, which just denotes how many hours and minutes ahead (for +) or behind (for -) the timestamp is with respect to UTC.
Please refer to the response by Basil here: Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object. That link has full details on the same.
To answer your question: if you replace +0530 in the pattern with Z, it should solve your problem.
Here is the Spark code in Scala that I tried, and it worked:
val data = Seq("20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
import spark.implicits._
val sparkDF = data.toDF("custom_time")
import org.apache.spark.sql.functions._
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))
spark_DF2.show(false)
The output contains no null values.
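For reference, since the question is in PySpark, the same pattern can be applied there directly; this is a sketch, not part of the original answer, and as the next answer explains, the resulting date can shift because the offset is applied before the conversion to a date:
from pyspark.sql import functions as F

sparkDF.select(
    F.to_date(F.col('timestamp'), "yyyyMMdd'T'HH:mm:ssZ").alias('timestamp_parsed')
).show()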
You can usually use x, X or Z for the offset pattern, as described on the Spark datetime pattern documentation page. You can then parse your date with the following complete pattern: yyyyMMdd'T'HH:mm:ssxx
However, if you use those kinds of offset patterns, your date will first be converted to UTC, meaning all timestamps with a positive offset will be matched to the previous day. For instance, "20201021T00:00:00+0530" will be matched to 2020-10-20 using to_date with the previous pattern.
If you want to get displayed date as a date, ignoring offset, you should first extract date string from complete timestamp string using regexp_extract function, then perform to_date.
If you take your example "20201021T00:00:00+0530", what you want to extract with a regexp is 20201021 part and apply to_date on it. You can do it with the following pattern: ^(\\d+). If you're interested, you can find how to build other patterns in java's Pattern documentation.
So your code should be:
from pyspark.sql import functions as F
sparkDF.select(
F.to_date(
F.regexp_extract(F.col('timestamp'), '^(\\d+)', 0), 'yyyyMMdd'
).alias('timestamp_parsed')
).show()
And with your input you will get:
+----------------+
|timestamp_parsed|
+----------------+
|2020-10-21 |
|2021-10-11 |
|2020-02-12 |
|2021-10-21 |
|2021-10-21 |
|2021-10-21 |
+----------------+
You can create "udf" in spark and use it. Below is the code in scala.
import spark.implicits._
//just to create the dataset for the example you have given
val data = Seq(
("20201021T00:00:00+0530"),
("20211011T00:00:00+0530"),
("20200212T00:00:00+0300"),
("20211021T00:00:00+0530"),
("20211021T00:00:00+0900"),
("20211021T00:00:00-0500"))
val dataset = data.toDF("timestamp")
val udfToDateUTC = functions.udf((epochMilliUTC: String) => {
  val formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HH:mm:ssZ")
  val res = OffsetDateTime.parse(epochMilliUTC, formatter).withOffsetSameInstant(ZoneOffset.UTC)
  res.toString()
})
dataset.select(dataset.col("timestamp"),udfToDateUTC(dataset.col("timestamp")).alias("timestamp_parsed")).show(false)
//output
+----------------------+-----------------+
|timestamp |timestamp_parsed |
+----------------------+-----------------+
|20201021T00:00:00+0530|2020-10-20T18:30Z|
|20211011T00:00:00+0530|2021-10-10T18:30Z|
|20200212T00:00:00+0300|2020-02-11T21:00Z|
|20211021T00:00:00+0530|2021-10-20T18:30Z|
|20211021T00:00:00+0900|2021-10-20T15:00Z|
|20211021T00:00:00-0500|2021-10-21T05:00Z|
+----------------------+-----------------+
from pyspark.sql.functions import date_format

customer_data = df.select("<column_name>", date_format("<column_name>", 'yyyyMMdd').alias('customer'))
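For context, date_format formats an existing date/timestamp column into a string, so the raw string has to be parsed first. A sketch on the question's data, using the xx offset pattern mentioned above (the ts column name is just illustrative):
from pyspark.sql import functions as F

parsed = sparkDF.withColumn('ts', F.to_timestamp('timestamp', "yyyyMMdd'T'HH:mm:ssxx"))
parsed.select('timestamp', F.date_format('ts', 'yyyyMMdd').alias('yyyymmdd')).show()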

add the day information to timestamp in a dataframe

I am trying to read a CSV file into a dataframe. The cell values only contain the hour information and are missing the day information. I would like to read this CSV file into a dataframe and transform the timing information into a format like 2021-05-07 04:04.00, i.e., I would like to add the day information. How can I achieve that?
I used the following code, but it seems that PySpark just adds the day as 1970-01-01, which appears to be some kind of default.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df_1 = spark.read.csv('test1.csv', header=True)
df_1 = df_1.withColumn('Timestamp', to_timestamp(col('Timing'), 'HH:mm'))
df_1.show(truncate=False)
And I got the following result.
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|1970-01-01 04:04:00|
|19:04.0|1970-01-01 19:04:00|
+-------+-------------------+
You can concat a date string before calling to_timestamp:
import pyspark.sql.functions as F
df2 = df_1.withColumn(
'Timestamp',
F.to_timestamp(
F.concat_ws(' ', F.lit('2021-05-07'), 'Timing'),
'yyyy-MM-dd HH:mm.s'
)
)
df2.show()
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|2021-05-07 04:04:00|
|19:04.0|2021-05-07 19:04:00|
+-------+-------------------+
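If the day should come from the current date rather than a hard-coded literal, the same idea works. A sketch, not from the original answer:
import pyspark.sql.functions as F

df2 = df_1.withColumn(
    'Timestamp',
    F.to_timestamp(
        F.concat_ws(' ', F.date_format(F.current_date(), 'yyyy-MM-dd'), 'Timing'),
        'yyyy-MM-dd HH:mm.s'
    )
)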

PySpark split using regex doesn't work on a dataframe column with string type

I have a PySpark data frame with a string column (URL), and all records look like the following:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I basically want to use regex to extract the number after conversations/ from the URL column into another column.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract, and regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
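If the conversation IDs are always numeric, a slightly stricter pattern captures only the digits after conversations/. A sketch, reusing the dataframe names from above:
import pyspark.sql.functions as F

df1 = df.withColumn('CONV_ID', F.regexp_extract('URL', r'conversations/(\d+)', 1))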
