Split 1 long txt column into 2 columns in pyspark:databricks - apache-spark

I have a pyspark dataframe column which has data as below.
event_list
PL:1547497782:1547497782~ST:1548593509:1547497782
PU:1547497782:1547497782~MU:1548611698:1547497782:1~MU:1548612195:1547497782:0~ST:1548627786:1547497782
PU:1547497782:1547497782~PU:1547497782:1547497782~ST:1548637508:1547497782
PL:1548631949:0
PL:1548619200:0~PU:1548623089:1548619435~PU:1548629541:1548625887~RE:1548629542:1548625887~PU:1548632702:1548629048~ST:1548635966:1548629048
PL:1548619583:1548619584~ST:1548619610:1548619609
PL:1548619850:0~ST:1548619850:0~PL:1548619850:0~ST:1548619850:0~PL:1548619850:1548619851~ST:1548619856:1548619855
I am only interested in the first 10 digits after PL: and the first 10 digits after ST: (if they exist). For the PL split, I used
df.withColumn('PL', split(df['event_list'], '\:')[1])
For ST:, since records have different lengths, that logic does not work. I can use this
df.withColumn('ST', split(df['event_list'], '\ST:')[1])
which returns ST:1548619856:1548619855, and then I split that part again. I have 1.5M records, so I was wondering if there is a better way.
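A chained version of that idea would look roughly like this (a minimal sketch of the two-step split described above; note it picks up the first ST: occurrence in each row):
from pyspark.sql.functions import split

# take what follows the first 'ST:', then keep the digits before the next ':'
df = df.withColumn('ST', split(split(df['event_list'], 'ST:')[1], ':')[0])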
Here is the expected output:
PL          ST
154749778   1548593509
null        1548627786
null        1548637508
154863194   null
154861920   1548635966
154861958   1548619610
154861985   1548619856

One way is to use the Spark SQL builtin function str_to_map:
df.selectExpr("str_to_map(event_list, '~', ':') as map1") \
  .selectExpr(
      "split(map1['PL'], ':')[0] as PL",
      "split(map1['ST'], ':')[0] as ST"
  ).show()
+----------+----------+
| PL| ST|
+----------+----------+
|1547497782|1548593509|
| null|1548627786|
| null|1548637508|
|1548631949| null|
|1548619200|1548635966|
|1548619583|1548619610|
|1548619850|1548619850|
+----------+----------+
Note: you can replace the above split function with the substr function (i.e. substr(map1['PL'], 1, 10)) in case you need exactly the first 10 chars.
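For example, the substr variant of the second selectExpr would look like this (a sketch of the substitution described in the note above):
df.selectExpr("str_to_map(event_list, '~', ':') as map1") \
  .selectExpr(
      "substr(map1['PL'], 1, 10) as PL",
      "substr(map1['ST'], 1, 10) as ST"
  ).show()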

Try a combination of substring_index and substring:
from pyspark.sql.functions import substring, substring_index

df.select(
    substring(
        substring_index(df['event_list'], 'PL:', -1),  # everything after the last 'PL:'
        1, 10                                          # take the first 10 chars
    ).alias('PL'))
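The same pattern can be applied to ST, but note that substring_index returns the whole string when the delimiter is absent, so rows without ST: need a guard (a sketch, not part of the original answer):
from pyspark.sql.functions import substring, substring_index, when

df.select(
    when(df['event_list'].contains('ST:'),
         substring(substring_index(df['event_list'], 'ST:', -1), 1, 10)
    ).alias('ST'))   # null when 'ST:' is absent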

Another way is to use regexp_extract, something like
val result = df
  .withColumn("PL", regexp_extract(col("event_list"), "PL\\:(.{0,10})\\:", 1))
  .withColumn("ST", regexp_extract(col("event_list"), "ST\\:(.{0,10})\\:", 1))

Related

Map list of multiple substrings in PySpark

I have a dataframe with one column like this:
Locations
Germany:city_Berlin
France:town_Montpellier
Italy:village_Amalfi
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
Locations
Germany:Berlin
France:Montpellier
Italy:Amalfi
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one line solution, without having to create separate functions or convoluted things.
I wouldn't map. It looks to me like you want to replace strings immediately to the left of : if they end with _. If so, use regex. Code below:
df.withColumn('new_Locations', regexp_replace('Locations', '(?<=\:)[a-z_]+','')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations |new_Locations |
+---+-----------------------+------------------+
|1 |Germany:city_Berlin |Germany:Berlin |
|2 |France:town_Montpellier|France:Montpellier|
|4 |Italy:village_Amalfi |Italy:Amalfi |
+---+-----------------------+------------------+
F.regexp_replace('Locations', r'(?<=:).*_', '')
.* matches any characters, but here it sits between (?<=:) and _.
_ is the symbol that must follow the characters matched by .*.
(?<=:) is the syntax for a "positive lookbehind". It is not part of the match, but it ensures that a : symbol appears right before the .*_ part.
Another option - list of strings to remove
strings = ['city', 'big_city', 'town', 'big_town', 'village']
F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", '')
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Germany:big_city_Berlin',),
     ('France:big_town_Montpellier',),
     ('Italy:village_Amalfi',)],
    ['Locations'])
df = df.withColumn('loc1', F.regexp_replace('Locations', r'(?<=:).*_', ''))
strings = ['city', 'big_city', 'town', 'big_town', 'village']
df = df.withColumn('loc2', F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", ''))
df.show(truncate=0)
# +---------------------------+------------------+------------------+
# |Locations |loc1 |loc2 |
# +---------------------------+------------------+------------------+
# |Germany:big_city_Berlin |Germany:Berlin |Germany:Berlin |
# |France:big_town_Montpellier|France:Montpellier|France:Montpellier|
# |Italy:village_Amalfi |Italy:Amalfi |Italy:Amalfi |
# +---------------------------+------------------+------------------+
Not 100% sure about your case, but here is another solution for your problem (Spark 3.x).
from pyspark.sql.functions import expr

df.withColumn("loc_map", expr("str_to_map(Locations, ',', ':')")) \
  .withColumn("Locations", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))")) \
  .drop("loc_map") \
  .show(10, False)
First, convert Locations into a map using str_to_map.
Then iterate through the values of the map and transform each value using transform_values.
Inside transform_values, we use element_at, split and size to identify and return the last item of the array produced by splitting on _.
Eventually I went with this solution:
df = df.withColumn('Locations', F.regexp_replace('Locations', '(city_)|(town_)|(village_)', ''))
I wasn't aware that I could include several search strings in the regexp_replace function.

Splitting a string column into 2 in PySpark

Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
file name
/level1/level2/level3/file1.ext
/level1/file1000.ext
/level1/level2/file20.ext
Output:
file name      path
file1.ext      /level1/level2/level3/
file1000.ext   /level1/
file20.ext     /level1/level2/
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
df \
    .withColumn("file_name", element_at(split("raw", "/"), -1)) \
    .withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
    .withColumn("path", regexp_extract("raw", "^.*/", 0)) \
    .show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw |file_name |file_name2 |path |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext |file1.ext |/level1/level2/level3/|
|/level1/file1000.ext |file1000.ext|file1000.ext|/level1/ |
|/level1/level2/file20.ext |file20.ext |file20.ext |/level1/level2/ |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('/level1/level2/level3/file1.ext',),
     ('/level1/file1000.ext',),
     ('/level1/level2/file20.ext',)],
    ['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name |file |path |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext |/level1/level2/level3/|
# |/level1/file1000.ext |file1000.ext|/level1/ |
# |/level1/level2/file20.ext |file20.ext |/level1/level2/ |
# +-------------------------------+------------+----------------------+

Spark filter weird behaviour with space character '\xa0'

I have a Delta dataframe containing multiple columns and rows.
I did the following:
Delta.limit(1).select("IdEpisode").show()
+---------+
|IdEpisode|
+---------+
| 287 860|
+---------+
But then, when I do this:
Delta.filter("IdEpisode == '287 860'").show()
It returns 0 rows, which is weird because we can clearly see the Id present in the dataframe.
I figured it was about the ' ' in the middle, but I don't see why it would be a problem or how to fix it.
IMPORTANT EDIT:
Doing Delta.limit(1).select("IdEpisode").collect()[0][0]
returned: '287\xa0860'
And then doing:
Delta.filter("IdEpisode == '287\xa0860'").show()
returned the rows I've been looking for. Any explanation?
This character is called NO-BREAK SPACE. It's not a regular space, which is why your filter does not match it.
You can remove it using the regexp_replace function before applying the filter:
import pyspark.sql.functions as F
Delta = spark.createDataFrame([('287\xa0860',)], ['IdEpisode'])
# replace NBSP character with normal space in column
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '[\\u00A0]', ' '))
Delta.filter("IdEpisode = '287 860'").show()
#+---------+
#|IdEpisode|
#+---------+
#| 287 860|
#+---------+
You can also clean your column by using the regex \p{Z} to replace all kinds of spaces with a regular space:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '\\p{Z}', ' '))
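If you prefer not to modify the stored column, you can also normalize only at comparison time (a sketch using the same \p{Z} idea):
Delta.filter(F.regexp_replace("IdEpisode", r"\p{Z}", " ") == "287 860").show()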

PATINDEX in spark sql

I have this statement in SQL:
CASE WHEN AAAA IS NOT NULL THEN AAAA
     ELSE RTRIM(LEFT(BBBB, PATINDEX('%[0-9]%', BBBB) - 1))
END AS NAME
I need to convert this to Spark SQL. I tried using indexOf, but it doesn't take the string '%[0-9]%'. How do I convert the above statement to Spark SQL? Please help.
Thanks!
Here is my code to do this in Scala Spark. I used a UDF.
Edit: Assuming the string needs to be cut from the first occurrence of a number.
import spark.implicits._

val df = Seq(("SOUTH TEXAS SYNDICATE 454C"),
             ("SANDERS 34-27 #3TF"),
             ("K. R. BRACKEN B 3H"),
             ("ALEXANDER-WESSENDORFF 1 (SA) A5 A 5H"),
             ("USZYNSKI-FURLOW (SA) B 3H"))
  .toDF("name")
df.createOrReplaceTempView("temp")

// find the first numeric substring and return its index within the name
val getIndexOfFirstNumber = (s: String) => {
  val str = s.split("\\D+").filter(_.nonEmpty).toList
  s.indexOf(str(0))
}
spark.udf.register("getIndexOfFirstNumber", getIndexOfFirstNumber)

spark.sql("""
  select name, substr(name, 0, getIndexOfFirstNumber(name) - 1) as final_name
  from temp
""").show(20, false)
Result ::
+------------------------------------+----------------------+
|name |final_name |
+------------------------------------+----------------------+
|SOUTH TEXAS SYNDICATE 454C |SOUTH TEXAS SYNDICATE |
|SANDERS 34-27 #3TF |SANDERS |
|K. R. BRACKEN B 3H |K. R. BRACKEN B |
|ALEXANDER-WESSENDORFF 1 (SA) A5 A 5H|ALEXANDER-WESSENDORFF |
|USZYNSKI-FURLOW (SA) B 3H |USZYNSKI-FURLOW (SA) B|
+------------------------------------+----------------------+
Based on Manish's answer I built this; it's more generic and written in Python. You can use it in Spark SQL as well.
The example is not for numbers but for the string DATE.
import re

def PATINDEX(string, s):
    # return the 1-based position of the first match of the pattern, or 0 if there is none
    if s:
        match = re.search(string, s)
        if match:
            return match.start() + 1
        else:
            return 0
    else:
        return 0

spark.udf.register("PATINDEX", PATINDEX)
PATINDEX('DATE', 'a2aDATEs2s')
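For instance, applied to the name/number case from the Scala answer above, it could be used from Spark SQL roughly like this (a sketch; note that I register the UDF here with an explicit int return type, which the snippet above does not do, so that the substr arithmetic stays integral):
# hypothetical usage, assuming a 'temp' view with a 'name' column as in the Scala answer
spark.createDataFrame([("SOUTH TEXAS SYNDICATE 454C",), ("K. R. BRACKEN B 3H",)], ["name"]) \
    .createOrReplaceTempView("temp")

spark.udf.register("PATINDEX", PATINDEX, "int")  # int return type (assumption, not in the original)
spark.sql("""
    select name, substr(name, 1, PATINDEX('[0-9]', name) - 1) as final_name
    from temp
""").show(truncate=False)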
You can use the below method to remove the leading zeroes using Databricks or Spark SQL.
REPLACE(LTRIM(REPLACE('0000123045','0',' ')),' ','0')
EXPLANATION:
The inner REPLACE function replaces each zero with a space.
Example: '    123 45'
The LTRIM function removes the spaces from the left.
Example: '123 45'
The outer REPLACE function then replaces the remaining spaces with zeroes.
Example: '123045'
Similarly, you can use RTRIM instead of LTRIM to remove trailing zeroes.
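Put together in Spark SQL, it would look something like this (a small sketch):
spark.sql("SELECT REPLACE(LTRIM(REPLACE('0000123045', '0', ' ')), ' ', '0') AS result").show()
# +------+
# |result|
# +------+
# |123045|
# +------+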

How to use isin function with values from text file?

I'd like to filter a dataframe using an external file.
This is how I use the filter now:
val Insert = Append_Ot.filter(
  col("Name2").equalTo("brazil") ||
  col("Name2").equalTo("france") ||
  col("Name2").equalTo("algeria") ||
  col("Name2").equalTo("tunisia") ||
  col("Name2").equalTo("egypte"))
Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.
So I create this file:
val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
  .map(_.split(" ")(1))
  .collect
This gives me:
filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)
And then, I use isin function on Name2 column.
val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))
But this gives me an empty dataframe. Why?
I am just adding some information to philantrovert's answer about filtering a dataframe from an external file.
His answer is perfect, but there might be a case mismatch, so you will have to check for that as well.
tl;dr Make sure that the letters use consistent case, i.e. they are all in upper or lower case. Simply use the upper or lower standard functions.
Let's say you have an input file like:
1 Algeria
2 tunisia
3 brazil
4 Egypt
You read the text file and change all the countries to lowercase:
val countries = sc.textFile("path to input file").map(_.split(" ")(1).trim)
  .collect.toSeq
val array = Array(countries.map(_.toLowerCase) : _*)
Then you have your dataframe
val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")
where you apply the following condition:
import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))
You should get the following output:
+-------+
|Name2 |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+
The empty dataframe might be due to spelling mismatch too.
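For reference, if you are working in PySpark rather than Scala, the same approach (lowercase both sides, then isin) would look roughly like this (a sketch reusing the file path from the question):
from pyspark.sql import functions as F

# read the file, take the second field of each line, and lowercase it
countries = (spark.sparkContext.textFile("/user/zh/worskspace/filter_nmb.txt")
             .map(lambda line: line.split(" ")[1].strip().lower())
             .collect())

Insert = Append_Ot.where(F.lower(F.col("Name2")).isin(countries))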
