How to use Split function in spark sql with delimiter |#|? - apache-spark

My column has data like this:
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delimited by |#| (pipe, #, pipe).
How can I split this with Spark SQL?
I am trying spark.sql("select split(col,'|#|')").show() but it is not giving me the proper result.
I tried escaping it with \ but still no luck.
Does anyone know what is going on here?
Note: I need a solution for Spark SQL only.

I am not sure whether I have understood your problem statement properly, but splitting a string by its delimiter is fairly simple and can be done in a variety of ways.
One of the methods is to use SUBSTRING_INDEX -
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by -
%sql
select *,
       substring_index(col1, '|#|', 1) as value1,
       substring_index(col1, '|#|', 2) as value2,
       substring_index(col1, '|#|', 3) as value3
from testSplit
Result - value1, value2 and value3 hold the string up to the first, second and third occurrence of |#| respectively.
OR - use SPLIT (see the Split function documentation) -
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Result - SplitString holds the array of split values, e.g. [abc, pqr, xyz].
Do let me know if this fulfills your requirement or not.

Check below code.
scala> import org.apache.spark.sql.functions.split
scala> val adf = Seq("abc|#|pqr|#|xyz", "aaa|#|sss|#|sdf").toDF("data")
scala> adf.createOrReplaceTempView("split_data")
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data           |split_data     |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data           |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data, split(data, '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data           |split_data     |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
Note: inside the spark.sql function, pass your select query between triple quotes (""" """) and escape special symbols with \\.
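For completeness, a minimal sketch of the same query without triple quotes (assuming the split_data view above): in a regular double-quoted Scala string every backslash has to be escaped again, so each escaped regex character needs four backslashes.
scala> // Same query as above, double-quoted: "\\\\|#\\\\|" reaches Spark SQL as '\\|#\\|'.
scala> spark.sql("select data, split(data, '\\\\|#\\\\|') as split_data from split_data").show(false)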

Related

Find out substring from url/value of a key from url

I have a table which has a url column.
I need to find all the values corresponding to the tag parameter.
TableA
#+---------------------------------------------------------------------+
#| url |
#+---------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |
#| https://www.google.com/active/search?tag=inreview&type=addtional |
#| https://www.google.com/filter/search?&type=nonactive |
#+---------------------------------------------------------------------+
output
#+------------------+
#| Tag |
#+------------------+
#| final |
#| presubmitted |
#| inreview |
#+------------------+
I am able to do it in Spark SQL via the query below:
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Is there any option via the DataFrame API or a regular expression?
PySpark:
from pyspark.sql.functions import split

df \
    .withColumn("partialURL", split("url", "tag=")[1]) \
    .withColumn("tag", split("partialURL", "&")[0]) \
    .drop("partialURL")
You can try the below implementation -
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val extract: String => String = StringUtils.substringBetween(_, "tag=", "&")
val parse = udf(extract)
val urlDS = Seq("https://www.amazon.in/primeday?tag=final&value=true",
"https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2",
"https://www.google.com/active/search?tag=inreview&type=addtional",
"https://www.google.com/filter/search?&type=nonactive").toDS
urlDS.withColumn("tag",parse($"value")).show()
+----------------------------------------------------------------+------------+
|value |tag |
+----------------------------------------------------------------+------------+
|https://www.amazon.in/primeday?tag=final&value=true |final |
|https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |presubmitted|
|https://www.google.com/active/search?tag=inreview&type=addtional|inreview |
|https://www.google.com/filter/search?&type=nonactive |null |
+----------------------------------------------------------------+------------+
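As a side note, since the question also asks for a DataFrame option: the built-in parse_url used in the question can be called from the DataFrame API through expr, so no UDF is needed. A minimal sketch reusing urlDS from above:
import org.apache.spark.sql.functions.expr

// parse_url pulls the 'tag' query parameter directly; URLs without it yield null.
urlDS.withColumn("tag", expr("parse_url(value, 'QUERY', 'tag')")).show(false)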
The fastest solution is likely substring based, similar to Pardeep's answer. An alternative approach is to use a regex that does some light input checking, similar to:
^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)
This checks that the string starts with a http/https/ftp protocol, the colon and slashes, at least one character (lazily), and either tag=<string of interest> appears somewhere in the middle or at the very end of the string.
You can check the matches visually on regex101: the first three sample URLs match, while the last one does not.
The tag values you want are in capture group 1, so if you use regexp_extract (see the PySpark docs), pass an idx of 1 to extract them.
The main difference between this answer and Pardeep's is that this one won't extract values from strings that don't conform to the regex; for example, the last sample URL above doesn't match. In these edge cases regexp_extract returns an empty string, which you can treat as missing and process as you wish afterwards.
Since we're invoking a regex engine, this approach is likely a little slower, but the performance difference might be imperceptible in your application.
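A sketch of how the regex could be wired up with regexp_extract in the Scala API (assuming the urlDS Dataset from the earlier answer; with this function, non-matching rows come back as empty strings):
import org.apache.spark.sql.functions.{col, regexp_extract}

// Capture group 1 holds the tag value, so idx = 1 selects it.
val tagPattern = "^(?:(?:(?:https?|ftp):)?//).+?tag=(.*?)(?:&.*?$|$)"
urlDS.withColumn("tag", regexp_extract(col("value"), tagPattern, 1)).show(false)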

Specify delimiter in collect_set function in Spark SQL

I want to add a delimiter in the collect_set function which I'm using in Spark SQL.
If it is not available, please let me know how I can achieve it in an alternative way.
Use concat, concat_ws, and collect_set to specify a delimiter in Spark SQL.
Example:
val df=Seq(("a",1),("a",3),("b",2)).toDF("id","sa")
df.createOrReplaceTempView("tmp")
spark.sql("""select concat('[',                              -- opening bracket
                           concat_ws(';', collect_set(sa)),  -- custom delimiter
                           ']')                              -- closing bracket
                    as cnct_deli
             from tmp
             group by id""").show()
Result:
+---------+
|cnct_deli|
+---------+
| [2]|
| [1;3]|
+---------+
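The same result can be produced with the DataFrame API instead of SQL; a sketch using the df defined above:
import org.apache.spark.sql.functions.{collect_set, concat, concat_ws, lit}

// Collect the distinct values per id, join them with ';' and wrap them in brackets.
df.groupBy("id")
  .agg(concat(lit("["), concat_ws(";", collect_set("sa")), lit("]")).as("cnct_deli"))
  .show()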

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In pyspark I am reading like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the function of char '"' is now done by '\u0000', but if my CSV file contains a '\u0000' char, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
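The quote option behaves the same way through the Scala reader, in case you need it there; a minimal sketch with the same placeholder path:
// Disabling quoting by setting quote to an empty string, as described above.
val dfr = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "")
  .csv("path/to/some/file.csv")

dfr.show()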
This is just a workaround if the option suggested by @pault doesn't work -
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in list(range(4)):
    df = df.withColumn('Col' + str(i), split(df.Column, ',')[i])

df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+

Spark Scala - splitting string syntax issue

I'm trying to split a String in a DataFrame column using Spark SQL and Scala,
and there seems to be a difference in the way the split condition works between the two.
Using Scala, this works -
val seq = Seq("12.1")
val df = seq.toDF("val")
val afterSplit = df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne"))
afterSplit.show(false)
However, in Spark SQL when I use this, firstPartSQL shows a blank.
df.registerTempTable("temp")
val s1 = sqlContext.sql("select split(val, '\\.')[0] as firstPartSQL from temp")
Instead, when I use this (with the separator expressed as [.] instead of \.),
the expected value shows up.
val s1 = sqlContext.sql("select split(val, '[.]')[0] as firstPartSQL from temp")
Any ideas why this is happening?
When you use regex patterns in spark-sql with double quotes, spark.sql("....."), the pattern is considered a string within another string, so two things happen. Consider this
scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]
scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
| 12|
+-------+
scala> df.createOrReplaceTempView("temp")
With the DataFrame API (df.withColumn(...)), the regex string is passed directly to the split function, so escaping the dot once ("\\.") is enough.
But when it comes to spark-sql, the pattern is first converted into a string and then passed again as a string to the split() function,
so the spark-sql side needs to receive \\. for the split to work.
The way to get that is to add two more backslashes:
scala> "\\."
res12: String = \.
scala> "\\\\."
res13: String = \\.
scala>
If you just pass "\\." to spark-sql, it is first converted into \. and then into ".", which in a regex context means "any" character,
i.e. split on every character. Since each character is adjacent to the next, you get an array of empty strings.
The string "12.1" has length four, and the pattern also matches the final boundary "$" of the string, so up to split(val, '\.')[4] you get an
empty string, and split(val, '\.')[5] gives null.
To verify this, you can pass the same delimiter string "\\." to regex_replace() function and see what happens
scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| | 9999|
+------------+------+
scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>
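You can also check the indexing behaviour described above directly; with the single-escaped pattern the split happens on every character, so indices 0 through 4 return empty strings and index 5 returns null:
scala> spark.sql("select split(val, '\\.')[4] as idx4, split(val, '\\.')[5] as idx5 from temp").show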
If you still want to use the same pattern between df and sql, then go with a raw string, i.e. triple quotes.
scala> raw"\\."
res23: String = \\.
scala>
scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>

How to swap minus sign from last position in a string to first position in hive?

How can I swap a negative sign from the last position of a string (or integer) to the first position, in Hive and/or Spark?
example: 22-
required: -22
My code is:
val Validation1 = spark.sql("Select case when substr(YTTLSVAL-,-1,1)='-' then cast(concat('-',substr(YTTLSVAL-,1,length(YTTLSVAL-)-1)) as int) else cast(YTTLSVAL- as int) end as column_name")
scala> Seq("-abcd", "def", "23-", "we").toDF("value").createOrReplaceTempView("values")
scala> val f = (x: String) => if(x.endsWith("-")) s"-${x.dropRight(1)}" else x
scala> spark.udf.register("myudf", f)
scala> spark.sql("select *, myudf(*) as custval from values").show
+-----+-------+
|value|custval|
+-----+-------+
|-abcd| -abcd|
| def| def|
| 23-| -23|
| we| we|
+-----+-------+
EDIT
On second thought, since UDFs are discouraged unless you absolutely need them (they create a black box for Spark's optimisation engine), please use the approach below, which uses regexp_replace instead. I have tested this and it works:
scala> spark.sql("select REGEXP_REPLACE ( value, '^(\\.+)(-)$','-$1') as custval from values").show
You could try REGEXP_REPLACE. This pattern searches for a number followed by - at the end and, if found, moves the - in front of the number.
SELECT REGEXP_REPLACE ( val, '^(\\d+)-$','-$1')
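For example, against the values view registered in the answer above (where the column is named value), the full statement could look like the sketch below; only 23- gets rewritten to -23, the other rows pass through unchanged:
scala> spark.sql("""select value, regexp_replace(value, '^(\\d+)-$', '-$1') as custval from values""").show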