Map list of multiple substrings in PySpark - string

I have a dataframe with one column like this:
Locations
Germany:city_Berlin
France:town_Montpellier
Italy:village_Amalfi
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
Locations
Germany:Berlin
France:Montpellier
Italy:Amalfi
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one line solution, without having to create separate functions or convoluted things.

I wouldn't map. It looks to me like you want to remove the strings immediately to the right of : when they end with _. If so, use regex. Code below:
from pyspark.sql.functions import regexp_replace

df.withColumn('new_Locations', regexp_replace('Locations', r'(?<=:)[a-z_]+', '')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations              |new_Locations     |
+---+-----------------------+------------------+
|1  |Germany:city_Berlin    |Germany:Berlin    |
|2  |France:town_Montpellier|France:Montpellier|
|4  |Italy:village_Amalfi   |Italy:Amalfi      |
+---+-----------------------+------------------+

F.regexp_replace('Locations', r'(?<=:).*_', '')
.* matches any sequence of characters, but it sits between (?<=:) and _.
_ is the character that must follow the characters matched by .*.
(?<=:) is the syntax for a positive lookbehind. It is not part of the match, but it ensures that a : must appear right before the .*_ part.
Another option - list of strings to remove
strings = ['city', 'big_city', 'town', 'big_town', 'village']
F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", '')
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('Germany:big_city_Berlin',),
('France:big_town_Montpellier',),
('Italy:village_Amalfi',)],
['Locations'])
df = df.withColumn('loc1', F.regexp_replace('Locations', r'(?<=:).*_', ''))
strings = ['city', 'big_city', 'town', 'big_town', 'village']
df = df.withColumn('loc2', F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", ''))
df.show(truncate=0)
# +---------------------------+------------------+------------------+
# |Locations                  |loc1              |loc2              |
# +---------------------------+------------------+------------------+
# |Germany:big_city_Berlin    |Germany:Berlin    |Germany:Berlin    |
# |France:big_town_Montpellier|France:Montpellier|France:Montpellier|
# |Italy:village_Amalfi       |Italy:Amalfi      |Italy:Amalfi      |
# +---------------------------+------------------+------------------+

Not 100% sure about your case, but here is another solution for your problem (Spark 3.x).
from pyspark.sql.functions import expr
df.withColumn("loc_map", expr("str_to_map(Locations, ',', ':')")) \
.withColumn("Locations", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))")) \
.drop("loc_map") \
.show(10, False)
First convert Locations into a map using str_to_map
Then iterate through the values of the map and transform each value using transform_values.
Inside transform_values, we use element_at, split and size to identify and return the last item of the array produced by splitting the value on _.
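A self-contained sketch of this approach, assuming Spark 3.x and the sample data from the question (note that the result is a map column, such as {Germany -> Berlin}, rather than a plain string):
from pyspark.sql.functions import expr

df = spark.createDataFrame(
    [('Germany:city_Berlin',),
     ('France:town_Montpellier',),
     ('Italy:village_Amalfi',)],
    ['Locations'])

(df
 .withColumn("loc_map", expr("str_to_map(Locations, ',', ':')"))
 # keep only the part after the last '_' in each map value
 .withColumn("Locations", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))"))
 .drop("loc_map")
 .show(truncate=False))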

Eventually I went with this solution:
df = df.withColumn('Locations', F.regexp_replace('Locations', '(city_)|(town_)|(village_)', ''))
I wasn't aware that I could include several search strings in the regexp_replace function.
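For reference, the same one-liner can be driven from a Python list, so the substrings to strip live in one place (a small sketch; the list name and contents are illustrative):
from pyspark.sql import functions as F

substrings = ['city_', 'town_', 'village_']  # illustrative list of prefixes to drop
df = df.withColumn('Locations', F.regexp_replace('Locations', '|'.join(substrings), ''))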

Related

Extracting a specific part from a string column in Pyspark

In one of my projects, I need to transform a string column whose values look like the ones below:
"[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"
"[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz"
I only need the alphanumeric values after "ALL/" and before "_ID", so the 1st record should be "abc12345" and the second record should be "abc12".
In PySpark, I am using substring in withColumn to get the first 8 characters after the "ALL/" position, which gives me "abc12345" and "abc12_ID".
Then I am using regexp_replace in withColumn: if the value matches (rlike) "_ID$", I replace "_ID" with "", otherwise I keep the column value. This gives the expected result: "abc12345" and "abc12".
But is there a better solution for this?
Maybe like this? In one regexp_extract?
F.regexp_extract('col_name', r'ALL\/([^\W_]+)', 1)
Test:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz",),
("[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz",)],
["col_name"])
df = df.withColumn("col2", F.regexp_extract("col_name", r"ALL\/([^\W_]+)", 1))
df.show(truncate=0)
# +------------------------------------------------------+--------+
# |col_name                                              |col2    |
# +------------------------------------------------------+--------+
# |[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz|abc12345|
# |[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz   |abc12   |
# +------------------------------------------------------+--------+

Splitting a string column into 2 in PySpark

Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
file name
/level1/level2/level3/file1.ext
/level1/file1000.ext
/level1/level2/file20.ext
Output:
file name      path
file1.ext      /level1/level2/level3/
file1000.ext   /level1/
file20.ext     /level1/level2/
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
df \
.withColumn("file_name", element_at(split("raw", "/"), -1) ) \
.withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
.withColumn("path", regexp_extract("raw", "^.*/", 0)) \
.show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw                            |file_name   |file_name2  |path                  |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext   |file1.ext   |/level1/level2/level3/|
|/level1/file1000.ext           |file1000.ext|file1000.ext|/level1/              |
|/level1/level2/file20.ext      |file20.ext  |file20.ext  |/level1/level2/       |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df=spark.createDataFrame(
[('/level1/level2/level3/file1.ext',),
('/level1/file1000.ext',),
('/level1/level2/file20.ext',)],
['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name                      |file        |path                  |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext   |/level1/level2/level3/|
# |/level1/file1000.ext           |file1000.ext|/level1/              |
# |/level1/level2/file20.ext      |file20.ext  |/level1/level2/       |
# +-------------------------------+------------+----------------------+

Spark filter weird behaviour with space character '\xa0'

I have a Delta dataframe containing multiple columns and rows.
I did the following:
Delta.limit(1).select("IdEpisode").show()
+---------+
|IdEpisode|
+---------+
| 287 860|
+---------+
But then, when I do this:
Delta.filter("IdEpisode == '287 860'").show()
It returns 0 rows which is weird because we can clearly see the Id present in the dataframe.
I figured it was about the ' ' in the middle but I don't see why it would be a problem and how to fix it.
IMPORTANT EDIT:
Doing Delta.limit(1).select("IdEpisode").collect()[0][0]
returned: '287\xa0860'
And then doing:
Delta.filter("IdEpisode == '287\xa0860'").show()
returned the rows I had been looking for. Any explanation?
This character is called NO-BREAK SPACE (U+00A0). It is not a regular space, which is why your filter does not match it.
You can remove it using the regexp_replace function before applying the filter:
import pyspark.sql.functions as F
Delta = spark.createDataFrame([('287\xa0860',)], ['IdEpisode'])
# replace NBSP character with normal space in column
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '[\\u00A0]', ' '))
Delta.filter("IdEpisode = '287 860'").show()
#+---------+
#|IdEpisode|
#+---------+
#| 287 860|
#+---------+
You can also clean your column by using the regex \p{Z} to replace all kinds of spaces with a regular space:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '\\p{Z}', ' '))
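If you would rather not rewrite the stored values, a hedged alternative is to normalize the spaces only inside the filter predicate:
import pyspark.sql.functions as F

# compare against a space-normalized copy of the column; the data itself stays untouched
Delta.filter(F.regexp_replace("IdEpisode", r"\p{Z}", " ") == "287 860").show()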

PySpark: select a column based on the condition another columns values match some specific values, then create the match result as a new column

I asked a similar question before, but unfortunately I now have to reimplement it in PySpark.
For example,
app col1
app1 anybody love me?
app2 I hate u
app3 this hat is good
app4 I don't like this one
app5 oh my god
app6 damn you.
app7 such nice girl
app8 xxxxx
app9 pretty prefect
app10 don't love me.
app11 xxx anybody?
I want to match a keyword list like ['anybody', 'love', 'you', 'xxx', 'don't'] and select the matched keyword result as a new column, named keyword as follows:
app keyword
app1 [anybody, love]
app4 [don't]
app6 [you]
app8 [xxx]
app10 [don't, love]
app11 [xxx]
As in the accepted answer, a suitable way is to create a temporary dataframe from the string list and then inner join the two dataframes,
selecting the app and keyword rows that match the condition.
-- Hiveql implementation
select t.app, k.keyword
from mytable t
inner join (values ('anybody'), ('you'), ('xxx'), ('don''t')) as k(keyword)
on t.col1 like concat('%', k.keyword, '%')
But I am not familiar with PySpark and find it awkward to reimplement this.
Could anyone help me?
Thanks in advance.
Please find below two possible approaches:
Option 1
The first option is to use the dataframe API to implement the analogous join as in your previous question. Here we convert the keywords list into a dataframe and then join it with the large dataframe (notice that we broadcast the small dataframe to ensure better performance):
from pyspark.sql.functions import broadcast
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
# create keywords dataframe (the keyword list is assumed here to match the output shown below)
keywords = ["xxx", "don't"]
kdf = spark.createDataFrame([(k,) for k in keywords], "key string")
# +-----+
# |  key|
# +-----+
# |  xxx|
# |don't|
# +-----+
df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner")
# +-----+---------------------+-----+
# |app  |col1                 |key  |
# +-----+---------------------+-----+
# |app4 |I don't like this one|don't|
# |app8 |xxxxx                |xxx  |
# |app10|don't love me.       |don't|
# +-----+---------------------+-----+
The join condition is based on the contains function of the Column class.
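To get the array-per-app shape shown in the question, the join result can then be grouped (a sketch building on the df and kdf defined above):
from pyspark.sql.functions import broadcast, collect_list

matched = df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner")
matched.groupBy("app").agg(collect_list("key").alias("keyword")).show(truncate=False)
# e.g. app4 -> [don't], app8 -> [xxx], app10 -> [don't] (row order may vary)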
Option 2
You can also use the PySpark higher-order function filter in combination with rlike within an expr:
from pyspark.sql.functions import lit, expr, array
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
keywords = ["xxx", "don't"]
df.withColumn("keywords", array([lit(k) for k in keywords])) \
.withColumn("keywords", expr("filter(keywords, k -> col1 rlike k)")) \
.where("size(keywords) > 0") \
.show(10, False)
# +-----+---------------------+--------+
# |app  |col1                 |keywords|
# +-----+---------------------+--------+
# |app4 |I don't like this one|[don't] |
# |app8 |xxxxx                |[xxx]   |
# |app10|don't love me.       |[don't] |
# +-----+---------------------+--------+
Explanation
with array([lit(k) for k in keywords]) we generate an array which contains the keywords that our search will be based on and then we append it to the existing dataframe using withColumn.
next, with filter(keywords, k -> col1 rlike k) we go through the items of keywords and keep only those that appear in the col1 text (rlike treats each keyword as a regular expression). If at least one keyword matches, size(keywords) will be greater than 0, which constitutes our where condition for retrieving the records.

Split 1 long txt column into 2 columns in pyspark:databricks

I have a pyspark dataframe column which has data as below.
event_list
PL:1547497782:1547497782~ST:1548593509:1547497782
PU:1547497782:1547497782~MU:1548611698:1547497782:1~MU:1548612195:1547497782:0~ST:1548627786:1547497782
PU:1547497782:1547497782~PU:1547497782:1547497782~ST:1548637508:1547497782
PL:1548631949:0
PL:1548619200:0~PU:1548623089:1548619435~PU:1548629541:1548625887~RE:1548629542:1548625887~PU:1548632702:1548629048~ST:1548635966:1548629048
PL:1548619583:1548619584~ST:1548619610:1548619609
PL:1548619850:0~ST:1548619850:0~PL:1548619850:0~ST:1548619850:0~PL:1548619850:1548619851~ST:1548619856:1548619855
I am only interested in the first 10 digits after PL: and the first 10 digits after ST: (if they exist). For the PL split, I used
df.withColumn('PL', split(df['event_list'], '\:')[1])
for ST:, since the records have different lengths, that logic does not work. I can use this
df.withColumn('ST', split(df['event_list'], '\ST:')[1])
which returns ST:1548619856:1548619855, and then split the first part again. I have 1.5M records, so I was wondering if there is a better way.
Here is the expected output:
PL ST
154749778 1548593509
null 1548627786
null 1548637508
154863194 null
154861920 1548635966
154861958 1548619610
154861985 1548619856
One way is using SparkSQL builtin function str_to_map:
df.selectExpr("str_to_map(event_list, '~', ':') as map1") \
.selectExpr(
"split(map1['PL'],':')[0] as PL",
"split(map1['ST'],':')[0] as ST"
).show()
+----------+----------+
|        PL|        ST|
+----------+----------+
|1547497782|1548593509|
|      null|1548627786|
|      null|1548637508|
|1548631949|      null|
|1548619200|1548635966|
|1548619583|1548619610|
|1548619850|1548619850|
+----------+----------+
Note: you can replace the above split function with substr (i.e. substr(map1['PL'], 1, 10)) in case you need exactly the first 10 chars.
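For completeness, here is a sketch of that substr variant:
df.selectExpr("str_to_map(event_list, '~', ':') as map1") \
  .selectExpr(
      "substr(map1['PL'], 1, 10) as PL",
      "substr(map1['ST'], 1, 10) as ST"
  ).show()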
Try a combination of substring_index and substring:
from pyspark.sql.functions import substring, substring_index

df.select(
    substring(
        substring_index(df['event_list'], 'PL:', -1),  # everything after the last 'PL:'
        1, 10                                          # take the first 10 characters
    ).alias('PL'))
Another way is to use regexp_extract, something like
from pyspark.sql.functions import col, regexp_extract

result = df.withColumn("PL", regexp_extract(col("event_list"), r"PL:(.{0,10}):", 1)) \
           .withColumn("ST", regexp_extract(col("event_list"), r"ST:(.{0,10}):", 1))
