In one of my projects, I need to transform a string column whose values look like the ones below
"[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz"
"[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz"
I only need the alphanumeric value after "ALL/" and before "_ID", so the first record should give "abc12345" and the second record "abc12".
In PySpark, I am using substring inside withColumn to get the first 8 characters after the "ALL/" position, which gives me "abc12345" and "abc12_ID".
Then I am using regexp_replace inside withColumn: if the value matches rlike "_ID$", I replace "_ID" with "", otherwise I keep the column value. This gives the expected results: "abc12345" and "abc12".
But is there a better solution for this?
Maybe like this, in a single regexp_extract?
F.regexp_extract('col_name', r'ALL\/([^\W_]+)', 1)
Test:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz",),
("[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz",)],
["col_name"])
df = df.withColumn("col2", F.regexp_extract("col_name", r"ALL\/([^\W_]+)", 1))
df.show(truncate=0)
# +------------------------------------------------------+--------+
# |col_name |col2 |
# +------------------------------------------------------+--------+
# |[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz|abc12345|
# |[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz |abc12 |
# +------------------------------------------------------+--------+
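The character class `[^\W_]` in that pattern means "word characters except the underscore", i.e. letters and digits, so the match naturally stops right before `_ID`. The construct behaves the same way in plain Python's `re`, so it can be sanity-checked outside Spark (a quick check, not part of the Spark job):

```python
import re

# [^\W_]+ = one or more word characters excluding "_", i.e. letters and digits
pattern = r'ALL/([^\W_]+)'

m = re.search(pattern, "[44252-565333] result[0] - /out/ALL/abc12345_ID.xml.gz")
print(m.group(1))  # abc12345

m = re.search(pattern, "[44252-565333] result[0] - /out/ALL/abc12_ID.xml.gz")
print(m.group(1))  # abc12
```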
Related
I have a dataframe with one column like this:
Locations
Germany:city_Berlin
France:town_Montpellier
Italy:village_Amalfi
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
Locations
Germany:Berlin
France:Montpellier
Italy:Amalfi
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one line solution, without having to create separate functions or convoluted things.
I wouldn't map. It looks to me like you want to remove the strings immediately to the right of : when they end with _. If so, use a regex. Code below:
df.withColumn('new_Locations', regexp_replace('Locations', r'(?<=:)[a-z_]+', '')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations |new_Locations |
+---+-----------------------+------------------+
|1 |Germany:city_Berlin |Germany:Berlin |
|2 |France:town_Montpellier|France:Montpellier|
|4 |Italy:village_Amalfi |Italy:Amalfi |
+---+-----------------------+------------------+
F.regexp_replace('Locations', r'(?<=:).*_', '')
.* matches any run of characters, but here it is constrained between (?<=:) and _.
_ is the literal character that must follow whatever .* matched.
(?<=:) is the syntax for a "positive lookbehind". It is not part of the match itself, but it ensures that a : must appear immediately before the text matched by .*_.
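The lookbehind behaves the same way in plain Python's `re`, so its effect can be verified outside Spark (a quick check only):

```python
import re

# (?<=:) asserts a ":" right before the match without consuming it;
# .*_ then greedily eats everything up to the last underscore.
print(re.sub(r'(?<=:).*_', '', 'Germany:big_city_Berlin'))  # Germany:Berlin
print(re.sub(r'(?<=:).*_', '', 'Italy:village_Amalfi'))     # Italy:Amalfi
```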
Another option - list of strings to remove
strings = ['city', 'big_city', 'town', 'big_town', 'village']
F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", '')
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('Germany:big_city_Berlin',),
('France:big_town_Montpellier',),
('Italy:village_Amalfi',)],
['Locations'])
df = df.withColumn('loc1', F.regexp_replace('Locations', r'(?<=:).*_', ''))
strings = ['city', 'big_city', 'town', 'big_town', 'village']
df = df.withColumn('loc2', F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", ''))
df.show(truncate=0)
# +---------------------------+------------------+------------------+
# |Locations |loc1 |loc2 |
# +---------------------------+------------------+------------------+
# |Germany:big_city_Berlin |Germany:Berlin |Germany:Berlin |
# |France:big_town_Montpellier|France:Montpellier|France:Montpellier|
# |Italy:village_Amalfi |Italy:Amalfi |Italy:Amalfi |
# +---------------------------+------------------+------------------+
Not 100% sure about your case, but please find here another solution for your problem (Spark 3.x).
from pyspark.sql.functions import expr
df.withColumn("loc_map", expr("str_to_map(Locations, ',', ':')")) \
.withColumn("Locations", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))")) \
.drop("loc_map") \
.show(10, False)
First, convert Locations into a map using str_to_map.
Then iterate through the values of the map and transform each value using transform_values.
Inside transform_values, we use element_at, split and size to identify and return the last item of the array produced by splitting on _.
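The element_at/split/size combination simply returns the last underscore-separated token of each value. The same idea can be sketched in plain Python (an illustration of the logic only, not a Spark alternative):

```python
# Parse "Country:prefix_City" the way str_to_map splits key from value on ':',
# then keep only the last '_'-separated token of the value.
def clean_location(location):
    country, value = location.split(':', 1)
    return f"{country}:{value.split('_')[-1]}"

print(clean_location('Germany:big_city_Berlin'))  # Germany:Berlin
print(clean_location('Italy:village_Amalfi'))     # Italy:Amalfi
```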
Eventually I went with this solution:
df = df.withColumn('Locations', F.regexp_replace('Locations', '(city_)|(town_)|(village_)', ''))
I wasn't aware that I could include several search strings in the regexp_replace function.
I currently have a list and a Spark dataframe:
['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
I am having a tough time figuring out a way to create a new column that takes the first matching word from the tags column of each row and puts it in the newly created column for that row.
For example, let's say the first row in the tags column has only "murder" in it; I would want that to show in the new column. Then, if the next row had "boring", "silly" and "cult" in it, I would want it to show "cult" in the new column, since it matches the list. If the last row in the tags column had "revenge" and "cult" in it, I would want it to show only "revenge", since it's the first word that matches the list.
from pyspark.sql import functions as F
df = spark.createDataFrame([('murder',), ('boring silly cult',), ('revenge cult',)], ['tags'])
mylist = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
pattern = '|'.join([f'({x})' for x in mylist])
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 0))
df.show()
# +-----------------+---------------+
# | tags|first_from_list|
# +-----------------+---------------+
# | murder| murder|
# |boring silly cult| cult|
# | revenge cult| revenge|
# +-----------------+---------------+
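One caveat worth noting (an assumption about your data, not something raised in the question): the joined pattern matches substrings, so a tag like "occult" would also match "cult". If whole-word matching matters, each alternative can be wrapped in \b word boundaries; the effect is easy to check with plain Python's `re`:

```python
import re

mylist = ['murder', 'cult', 'revenge']

loose = '|'.join(f'({x})' for x in mylist)          # substring matching
strict = '|'.join(rf'(\b{x}\b)' for x in mylist)    # whole-word matching

print(re.search(loose, 'occult film').group(0))   # cult  (matched inside "occult")
print(re.search(strict, 'occult film'))           # None  (no whole-word match)
print(re.search(strict, 'cult film').group(0))    # cult
```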
You could use a PySpark UDF (User Defined Function).
First, let's write a Python function to find the first match between a list (in this case the list you provided) and a string, i.e. the value of the tags column:
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    for tag in tags.split():
        if tag in genres:
            return tag  # return immediately so the first match wins
    return ''
Then, we need to convert this function into a PySpark udf so that we can use it in combination with the .withColumn() operation:
from pyspark.sql.functions import udf

find_first_matchUDF = udf(lambda z: find_first_match(z))
Now we can apply the udf function to generate a new column. Assuming df is the name of your DataFrame:
from pyspark.sql.functions import col
new_df = df.withColumn("first_match", find_first_matchUDF(col("tags")))
This approach only works if all tags in your tags column are separated by whitespace.
P.S.
You can avoid the second step by using a decorator:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    for tag in tags.split():
        if tag in genres:
            return tag  # return immediately so the first match wins
    return ''
new_df = df.withColumn("first_match", find_first_match(col("tags")))
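Since find_first_match is plain Python, the first-match logic can be sanity-checked without Spark at all (here written to return as soon as a tag is found, so the first match wins):

```python
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge',
              'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    for tag in tags.split():
        if tag in genres:
            return tag  # stop at the first matching tag
    return ''

print(find_first_match('revenge cult'))       # revenge (first match wins)
print(find_first_match('boring silly cult'))  # cult
print(find_first_match('nothing here'))       # (empty string)
```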
Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
file name
/level1/level2/level3/file1.ext
/level1/file1000.ext
/level1/level2/file20.ext
Output:
file name     path
file1.ext     /level1/level2/level3/
file1000.ext  /level1/
file20.ext    /level1/level2/
I know I could use substring with hard-coded positions, but this is not a good case for hard-coding, as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
df \
.withColumn("file_name", element_at(split("raw", "/"), -1) ) \
.withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
.withColumn("path", regexp_extract("raw", "^.*/", 0)) \
.show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw |file_name |file_name2 |path |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext |file1.ext |/level1/level2/level3/|
|/level1/file1000.ext |file1000.ext|file1000.ext|/level1/ |
|/level1/level2/file20.ext |file20.ext |file20.ext |/level1/level2/ |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df=spark.createDataFrame(
[('/level1/level2/level3/file1.ext',),
('/level1/file1000.ext',),
('/level1/level2/file20.ext',)],
['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name |file |path |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext |/level1/level2/level3/|
# |/level1/file1000.ext |file1000.ext|/level1/ |
# |/level1/level2/file20.ext |file20.ext |/level1/level2/ |
# +-------------------------------+------------+----------------------+
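For intuition, the split-after-the-last-slash rule that both options implement corresponds to a one-line rsplit in plain Python (illustration only, not a Spark alternative):

```python
def split_path(raw):
    # rsplit on the last '/' only; re-append it to keep the trailing slash on the path
    path, file_name = raw.rsplit('/', 1)
    return path + '/', file_name

print(split_path('/level1/level2/level3/file1.ext'))  # ('/level1/level2/level3/', 'file1.ext')
print(split_path('/level1/file1000.ext'))             # ('/level1/', 'file1000.ext')
```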
I asked a similar question before but, sadly, I have to reimplement it in PySpark.
For example,
app col1
app1 anybody love me?
app2 I hate u
app3 this hat is good
app4 I don't like this one
app5 oh my god
app6 damn you.
app7 such nice girl
app8 xxxxx
app9 pretty prefect
app10 don't love me.
app11 xxx anybody?
I want to match a keyword list like ["anybody", "love", "you", "xxx", "don't"] and select the matched keywords as a new column, named keyword, as follows:
app keyword
app1 [anybody, love]
app4 [don't]
app6 [you]
app8 [xxx]
app10 [don't, love]
app11 [xxx]
As in the accepted answer, the suitable way is to create a temporary dataframe converted from the string list, then inner join the two dataframes together and select the rows of app and keyword that match the condition.
-- Hiveql implementation
select t.app, k.keyword
from mytable t
inner join (values ('anybody'), ('you'), ('xxx'), ('don''t')) as k(keyword)
on t.col1 like concat('%', k.keyword, '%')
But I am not familiar with PySpark and find it awkward to reimplement this.
Could anyone help me?
Thanks in advance.
Please find below two possible approaches:
Option 1
The first option is to use the dataframe API to implement a join analogous to the one in your previous question. Here we convert the keywords list into a dataframe and then join it with the large dataframe (notice that we broadcast the small dataframe to ensure better performance):
from pyspark.sql.functions import broadcast
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
# create keywords dataframe
keywords = ["xxx", "don't"]
kdf = spark.createDataFrame([(k,) for k in keywords], "key string")
# +-----+
# | key|
# +-----+
# | xxx|
# |don't|
# +-----+
df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner")
# +-----+---------------------+-----+
# |app |col1 |key |
# +-----+---------------------+-----+
# |app4 |I don't like this one|don't|
# |app8 |xxxxx |xxx |
# |app10|don't love me. |don't|
# +-----+---------------------+-----+
The join condition is based on the contains function of the Column class.
Option 2
You can also use the PySpark higher-order function filter in combination with rlike inside an expr:
from pyspark.sql.functions import lit, expr, array
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
keywords = ["xxx", "don't"]
df.withColumn("keywords", array([lit(k) for k in keywords])) \
.withColumn("keywords", expr("filter(keywords, k -> col1 rlike k)")) \
.where("size(keywords) > 0") \
.show(10, False)
# +-----+---------------------+--------+
# |app |col1 |keywords|
# +-----+---------------------+--------+
# |app4 |I don't like this one|[don't] |
# |app8 |xxxxx |[xxx] |
# |app10|don't love me. |[don't] |
# +-----+---------------------+--------+
Explanation
With array([lit(k) for k in keywords]) we generate an array containing the keywords that our search will be based on, and append it to the existing dataframe using withColumn.
Next, with expr("filter(keywords, k -> col1 rlike k)") we go through the items of keywords, keeping only those that are present in the col1 text. If at least one keyword matches, filter returns a non-empty array and size(keywords) > 0 holds, which is the where condition for retrieving the records.
I am trying to create a new dataframe column (b) by removing the last character from (a).
Column a is a string of varying length, so I am trying the following code -
from pyspark.sql.functions import *
df.select(substring('a', 1, length('a') -1 ) ).show()
I get a TypeError: 'Column' object is not callable.
It seems to be due to using multiple functions, but I can't understand why, as these work on their own -
If I hardcode the column length, this will work:
df.select(substring('a', 1, 10 ) ).show()
or if I use length on its own, it works:
df.select(length('a') ).show()
Why can I not use multiple functions?
Is there an easier method of removing the last character from all rows in a column?
Using substr
df.select(col('a').substr(lit(0), length(col('a')) - 1))
or using regexp_extract:
df.select(regexp_extract(col('a'), '(.*).$', 1))
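In the (.*).$ pattern, group 1 captures greedily and the final . consumes exactly one trailing character. The pattern behaves the same way in plain Python's `re` (a quick check outside Spark):

```python
import re

# group 1 captures greedily; the trailing . swallows exactly one final character
print(re.match(r'(.*).$', 'abcde').group(1))  # abcd
print(re.match(r'(.*).$', 'ab').group(1))     # a
```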
The substring function does not work here because its pos and len parameters need to be Python integers, not columns:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
Your code is almost correct; you just need to use Python's len function. Be aware, though, that len('dummy') is the length of the column-name string itself (5), evaluated before Spark ever runs, so this only gives the right answer when every value happens to have the same length as the column name.
df = spark.createDataFrame([('abcde',)],['dummy'])
from pyspark.sql.functions import substring
df.select('dummy',substring('dummy', 1, len('dummy') -1).alias('substr_dummy')).show()
#+-----+------------+
#|dummy|substr_dummy|
#+-----+------------+
#|abcde| abcd|
#+-----+------------+