PATINDEX in spark sql - apache-spark

I have this statement in SQL:
CASE WHEN AAAA IS NOT NULL THEN AAAA
ELSE RTRIM(LEFT(BBBB, PATINDEX('%[0-9]%', BBBB) - 1))
END AS NAME
I need to convert this to Spark SQL. I tried using indexOf, but it doesn't accept the pattern '%[0-9]%'. How do I convert the above statement to Spark SQL? Please help.
Thanks!

Here is my code to do this in Scala Spark; I used a UDF.
Edit: assuming the string needs to be cut at the first occurrence of a number.
import spark.implicits._

val df = Seq(("SOUTH TEXAS SYNDICATE 454C"),
  ("SANDERS 34-27 #3TF"),
  ("K. R. BRACKEN B 3H"))
  .toDF("name")
df.createOrReplaceTempView("temp")

val getIndexOfFirstNumber = (s: String) => {
  val str = s.split("\\D+").filter(_.nonEmpty).toList
  s.indexOf(str(0))
}
spark.udf.register("getIndexOfFirstNumber", getIndexOfFirstNumber)

spark.sql("""
  select name, substr(name, 0, getIndexOfFirstNumber(name) - 1) as final_name
  from temp
""").show(20, false)
Result:
+------------------------------------+----------------------+
|name |final_name |
+------------------------------------+----------------------+
|SOUTH TEXAS SYNDICATE 454C |SOUTH TEXAS SYNDICATE |
|SANDERS 34-27 #3TF |SANDERS |
|K. R. BRACKEN B 3H |K. R. BRACKEN B |
|ALEXANDER-WESSENDORFF 1 (SA) A5 A 5H|ALEXANDER-WESSENDORFF |
|USZYNSKI-FURLOW (SA) B 3H |USZYNSKI-FURLOW (SA) B|
+------------------------------------+----------------------+
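The UDF's logic (find the index of the first digit, then keep everything before it) can be sketched and sanity-checked in plain Python; note that re.search(r'\d') locates the first digit directly, which agrees with the split-then-indexOf approach above for the sample names:

```python
import re

def get_index_of_first_number(s):
    # 0-based position of the first digit in the string, -1 if none
    m = re.search(r'\d', s)
    return m.start() if m else -1

name = 'SOUTH TEXAS SYNDICATE 454C'
idx = get_index_of_first_number(name)   # 22
print(name[:idx - 1])                   # SOUTH TEXAS SYNDICATE
```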

Based on Manish's answer I built this; it's more generic and written in Python. You can use it in Spark SQL as well.
The example is not for numbers but for the string DATE.
import re

def PATINDEX(string, s):
    if s:
        match = re.search(string, s)
        if match:
            return match.start() + 1
        else:
            return 0
    else:
        return 0

spark.udf.register("PATINDEX", PATINDEX)
PATINDEX('DATE', 'a2aDATEs2s')
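The UDF above mimics T-SQL's PATINDEX (1-based position of the first match, 0 when there is none), except that its first argument is a Python regex rather than a T-SQL pattern like '%[0-9]%'. A quick standalone sanity check (pure Python, no Spark session needed):

```python
import re

def PATINDEX(string, s):
    # 1-based position of the first regex match, 0 if no match or empty input
    if s:
        match = re.search(string, s)
        if match:
            return match.start() + 1
    return 0

print(PATINDEX('DATE', 'a2aDATEs2s'))     # 4
print(PATINDEX('[0-9]', 'SANDERS 34-27')) # 9
print(PATINDEX('[0-9]', 'NO DIGITS'))     # 0
```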

You can use the method below to remove leading zeroes in Databricks or Spark SQL.
REPLACE(LTRIM(REPLACE('0000123045','0',' ')),' ','0')
EXPLANATION:
The inner REPLACE turns each zero into a space.
Example: '    123 45'
LTRIM then removes the leading spaces.
Example: '123 45'
The outer REPLACE turns the remaining spaces back into zeroes.
Example: '123045'
Similarly, you can use RTRIM instead of LTRIM to remove trailing zeroes.
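The three steps can be checked in plain Python. This is a sketch of the SQL expression's behaviour; note it assumes the input contains no real spaces, since any pre-existing space would be turned into a zero by the last step:

```python
def trim_leading_zeros(s):
    # REPLACE(LTRIM(REPLACE(s, '0', ' ')), ' ', '0')
    step1 = s.replace('0', ' ')     # '    123 45'
    step2 = step1.lstrip(' ')       # '123 45'
    return step2.replace(' ', '0')  # '123045'

print(trim_leading_zeros('0000123045'))  # 123045
```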
Do an upvote if you like my answer.
Thanks

Related

Spark filter weird behaviour with space character '\xa0'

I have a Delta dataframe containing multiple columns and rows.
I did the following:
Delta.limit(1).select("IdEpisode").show()
+---------+
|IdEpisode|
+---------+
| 287 860|
+---------+
But then, when I do this:
Delta.filter("IdEpisode == '287 860'").show()
It returns 0 rows, which is weird because we can clearly see the Id present in the dataframe.
I figured it was about the ' ' in the middle, but I don't see why it would be a problem or how to fix it.
IMPORTANT EDIT:
Doing Delta.limit(1).select("IdEpisode").collect()[0][0]
returned: '287\xa0860'
And then doing:
Delta.filter("IdEpisode == '287\xa0860'").show()
returned the rows I've been looking for. Any explanation?
This character is called NO-BREAK SPACE. It's not a regular space, which is why it does not match your filter.
You can remove it using the regexp_replace function before applying the filter:
import pyspark.sql.functions as F
Delta = spark.createDataFrame([('287\xa0860',)], ['IdEpisode'])
# replace NBSP character with normal space in column
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '[\\u00A0]', ' '))
Delta.filter("IdEpisode = '287 860'").show()
#+---------+
#|IdEpisode|
#+---------+
#| 287 860|
#+---------+
You can also clean your column by using the regex \p{Z} to replace all kinds of spaces with a regular space:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
Delta = Delta.withColumn("IdEpisode", F.regexp_replace("IdEpisode", '\\p{Z}', ' '))
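The mismatch is easy to reproduce outside Spark. Note that Python's built-in re module does not understand \p{Z} (that is Java regex syntax, which Spark's regexp_replace uses), so this sketch matches the NBSP code point directly:

```python
import re

s = '287\xa0860'
print(s == '287 860')        # False: NBSP (U+00A0) is not a regular space

cleaned = re.sub('\u00A0', ' ', s)
print(cleaned == '287 860')  # True after normalising the space
```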

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numeric values are contiguous. Could you please help me with a solution?
df.withColumn("col2",
  when($"col2".isNull,
    regexp_replace(regexp_replace(regexp_extract($"col1", "([0-9]+_)+", 0), "_", ","), ".$", ""))
    .otherwise($"col2"))
  .show(false)
I can do this by creating a UDF, but I am thinking is it possible with the spark in-built functions. My Spark version is 2.2.0
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
  c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows:
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
  ("This_is_111_222_444_test", Array(3296)),
  ("This_is_555_666_test", null.asInstanceOf[Array[Int]]),
  ("This_is_999_test", null.asInstanceOf[Array[Int]]))
  .toDF("col1", "col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query:
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to perform about the same.
// With UDF (size, not length, is the function for array columns)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
// OP solution:
bigData.withColumn("col2",
  when($"col2".isNull,
    regexp_replace(regexp_replace(regexp_extract($"col1", "([0-9]+_)+", 0), "_", ","), ".$", ""))
    .otherwise($"col2"))
  .count
Here is another way; please check the performance.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
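Note that filter and the other higher-order array functions used in that expression were only added in Spark 2.4, while the question mentions 2.2.0. The logic of the expression, mimicked in plain Python (using isdigit() as a stand-in for CAST(x AS INT) IS NOT NULL, which is close enough for these inputs):

```python
def extract_nums(col1):
    # split on '_' and keep only the purely numeric tokens, comma-joined
    return ','.join(t for t in col1.split('_') if t.isdigit())

print(extract_nums('This_is_111_222_333_test'))  # 111,222,333
print(extract_nums('This_is_999_test'))          # 999
```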

Split 1 long txt column into 2 columns in pyspark:databricks

I have a pyspark dataframe column which has data as below.
event_list
PL:1547497782:1547497782~ST:1548593509:1547497782
PU:1547497782:1547497782~MU:1548611698:1547497782:1~MU:1548612195:1547497782:0~ST:1548627786:1547497782
PU:1547497782:1547497782~PU:1547497782:1547497782~ST:1548637508:1547497782
PL:1548631949:0
PL:1548619200:0~PU:1548623089:1548619435~PU:1548629541:1548625887~RE:1548629542:1548625887~PU:1548632702:1548629048~ST:1548635966:1548629048
PL:1548619583:1548619584~ST:1548619610:1548619609
PL:1548619850:0~ST:1548619850:0~PL:1548619850:0~ST:1548619850:0~PL:1548619850:1548619851~ST:1548619856:1548619855
I am only interested in the first 10 digits after PL: and the first 10 digits after ST: (if they exist). For the PL split, I used
df.withColumn('PL', split(df['event_list'], '\:')[1])
For ST:, since records have different lengths, that logic does not work. I can use this
df.withColumn('ST', split(df['event_list'], '\ST:')[1])
which returns ST:1548619856:1548619855, and split the first part again. I have 1.5M records, so I was wondering if there is a better way.
Here is the expected output:
PL ST
154749778 1548593509
null 1548627786
null 1548637508
154863194 null
154861920 1548635966
154861958 1548619610
154861985 1548619856
One way is to use the Spark SQL builtin function str_to_map:
df.selectExpr("str_to_map(event_list, '~', ':') as map1") \
.selectExpr(
"split(map1['PL'],':')[0] as PL",
"split(map1['ST'],':')[0] as ST"
).show()
+----------+----------+
| PL| ST|
+----------+----------+
|1547497782|1548593509|
| null|1548627786|
| null|1548637508|
|1548631949| null|
|1548619200|1548635966|
|1548619583|1548619610|
|1548619850|1548619850|
+----------+----------+
Note: you can replace the above split function with the substr function (i.e. substr(map1['PL'], 1, 10)) in case you need exactly the first 10 chars.
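What str_to_map does here can be sketched in plain Python: split the string into pairs on '~', then split each pair on the first ':' only, so the rest of the pair becomes the value. (With duplicate keys, as in the last sample row, the later pair wins in this sketch; Spark's behaviour with duplicate keys varies by version.)

```python
def str_to_map(text, pair_delim='~', kv_delim=':'):
    # split into pairs, then split each pair on the FIRST kv_delim only
    out = {}
    for pair in text.split(pair_delim):
        key, _, value = pair.partition(kv_delim)
        out[key] = value
    return out

m = str_to_map('PL:1547497782:1547497782~ST:1548593509:1547497782')
print(m['PL'].split(':')[0])  # 1547497782
print(m['ST'].split(':')[0])  # 1548593509
```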
Try a combination of substring_index and substring:
df.select(
    substring(
        substring_index(df['event_list'], 'PL:', -1),  # part of the string after the last 'PL:'
        1, 10).alias('PL'))  # take the first 10 chars
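Spark's substring_index(str, delim, -1) returns the part after the last occurrence of delim, with the delimiter itself excluded; a plain-Python sketch of that behaviour (only the count = -1 case used here):

```python
def substring_index(s, delim):
    # mimic Spark's substring_index(s, delim, -1): the part after the
    # LAST occurrence of delim, delimiter excluded
    return s.rsplit(delim, 1)[-1]

print(substring_index('www.apache.org', '.'))  # org
part = substring_index('PL:1548619583:1548619584~ST:1548619610:1548619609', 'PL:')
print(part[:10])  # 1548619583
```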
Another way is to use regexp_extract, something like:
val result = df.withColumn("PL", regexp_extract(col("event_list"),"PL\\:(.{0,10})\\:",1))
.withColumn("ST", regexp_extract(col("event_list"),"ST\\:(.{0,10})\\:",1))
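The same patterns can be checked with Python's re module before wiring them into regexp_extract (group 1 captures up to 10 characters between the tag and the next colon):

```python
import re

def extract(tag, event_list):
    # equivalent of regexp_extract(col, tag + ':(.{0,10}):', 1)
    m = re.search(tag + r':(.{0,10}):', event_list)
    return m.group(1) if m else None

row = 'PL:1548619583:1548619584~ST:1548619610:1548619609'
print(extract('PL', row))                 # 1548619583
print(extract('ST', row))                 # 1548619610
print(extract('ST', 'PL:1548631949:0'))   # None
```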

Replace a substring of a string in pyspark dataframe

How can I replace substrings of a string? For example, I created a data frame based on the following JSON format.
line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}
I need to replace the substrings "1:", "2:", "3:" with "a:", "b:", "c:", etc., so the result will be:
line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}
Please consider that this is just an example; the real replacement is substring replacement, not character replacement.
Any guidance, in either Scala or PySpark, is helpful.
from pyspark.sql.functions import *
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))
Details here:
Pyspark replace strings in Spark dataframe column
Let's say you have a collection of strings for possible modification (simplified for this example).
val data = Seq("1:0.01"
,"3:0.03,4:0.04"
,"2:0.01,3:0.02"
,"5:0.02")
And you have a dictionary of required conversions.
val num2name = Map("1" -> "A"
,"2" -> "Bo"
,"3" -> "Cy"
,"4" -> "Dee")
From here you can use replaceSomeIn() to make the substitutions.
data.map("(\\d+):".r //note: Map key is only part of the match pattern
.replaceSomeIn(_, m => num2name.get(m group 1) //get replacement
.map(_ + ":"))) //restore ":"
//res0: Seq[String] = List(A:0.01
// ,Cy:0.03,Dee:0.04
// ,Bo:0.01,Cy:0.02
// ,5:0.02)
As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.
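A plain-Python equivalent of the replaceSomeIn approach uses re.sub with a replacement function, likewise leaving unknown keys such as "5" untouched:

```python
import re

num2name = {"1": "A", "2": "Bo", "3": "Cy", "4": "Dee"}

def rename(s):
    # look up the captured digits; fall back to the original digits if unknown
    return re.sub(r'(\d+):', lambda m: num2name.get(m.group(1), m.group(1)) + ':', s)

data = ["1:0.01", "3:0.03,4:0.04", "2:0.01,3:0.02", "5:0.02"]
print([rename(s) for s in data])
# ['A:0.01', 'Cy:0.03,Dee:0.04', 'Bo:0.01,Cy:0.02', '5:0.02']
```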
This is the way I solved it in PySpark:
from collections import OrderedDict
from pyspark.sql.functions import udf, col

def _blahblah_name_replacement(val, ordered_mapping):
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _blahblah_name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))

How to use isin function with values from text file?

I'd like to filter a dataframe using an external file.
This is how I use the filter now:
val Insert = Append_Ot.filter(
col("Name2").equalTo("brazil") ||
col("Name2").equalTo("france") ||
col("Name2").equalTo("algeria") ||
col("Name2").equalTo("tunisia") ||
col("Name2").equalTo("egypte"))
Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.
So I create this file:
val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
.map(_.split(" ")(1))
.collect
This gives me:
filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)
And then, I use isin function on Name2 column.
val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))
But this gives me an empty dataframe. Why?
I am just adding some information to philantrovert's answer on filtering a dataframe from an external file.
His answer is correct, but there might be a case mismatch, so you will have to check for that as well.
tl;dr Make sure the letters use consistent case, i.e. they are all upper case or all lower case. Simply use the upper or lower standard functions.
Let's say you have an input file such as:
1 Algeria
2 tunisia
3 brazil
4 Egypt
You read the text file and change all the countries to lowercase:
val countries = sc.textFile("path to input file").map(_.split(" ")(1).trim)
.collect.toSeq
val array = Array(countries.map(_.toLowerCase) : _*)
Then you have your dataframe
val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")
where you apply the following condition:
import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))
You should get the following output:
+-------+
|Name2 |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+
The empty dataframe might be due to spelling mismatch too.
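The lower-then-isin idea reduces to this in plain Python (a sketch using the example country lists from above):

```python
# lowercase the allowed values once, then compare rows case-insensitively,
# mirroring lower($"Name2").isin(array : _*)
allowed = [c.lower() for c in ['Algeria', 'tunisia', 'brazil', 'Egypt']]
rows = ['brazil', 'tunisia', 'algeria', 'name']
matched = [r for r in rows if r.lower() in allowed]
print(matched)  # ['brazil', 'tunisia', 'algeria']
```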
