Specify delimiter in collect_set function in Spark SQL - apache-spark

I want to specify a delimiter for the output of the collect_set function that I'm using in Spark SQL.
If that is not supported, please let me know how I can achieve it in an alternative way.

Use concat, concat_ws and collect_set together to specify a delimiter in Spark SQL.
Example:
val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("id", "sa")
df.createOrReplaceTempView("tmp")
spark.sql("""select concat('[',                     -- opening bracket
                           concat_ws(';',           -- custom delimiter
                                     collect_set(sa)),
                           ']')                     -- closing bracket
                    as cnct_deli
             from tmp
             group by id""").show()
Result:
+---------+
|cnct_deli|
+---------+
| [2]|
| [1;3]|
+---------+
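If you'd rather stay in the DataFrame API, here is a minimal sketch of the same idea (using the df, id and sa names from the example above): group by id, join the collected set with ';' and wrap it in brackets.
import org.apache.spark.sql.functions.{collect_set, concat, concat_ws, lit}
// Same output as the SQL version: one bracketed, ';'-delimited string per id
df.groupBy("id")
  .agg(concat(lit("["), concat_ws(";", collect_set("sa")), lit("]")).as("cnct_deli"))
  .show()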

Related

pyspark equivalent of postgres regexp_substr fails to extract value

I'm trying to adapt some Postgres SQL code I have to PySpark SQL. In the Postgres SQL I'm using the regexp_substr function to parse out ' .5G' if it shows up in a string in the productname column (I've included example code below). On the PySpark side I'm trying to use the regexp_extract function, but it's only returning null. I've compared the output of the regexp_replace function in Postgres to the PySpark one, and it returns the same value, so the issue must be in the regexp_extract function. I've created a sample input dataframe along with the PySpark code I'm currently running below. Can someone please tell me what I'm doing wrong and suggest how to fix it? Thank you.
postgres:
select
regexp_substr(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]'))), ' .5G') as A
from df
output:
' .5G'
code:
# creating dummy data
df = sc.parallelize([('LEMON MERINGUE .5G CAKE SUGAR', )]).toDF(["productname"])
# turning dataframe into view
df.createOrReplaceTempView("df")
# example query trying to extract ' .5G'
testquery=("""select
regexp_extract('('+trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]','')))+')', ' .5G',1) as A
from df a
""")
# creating dataframe with extracted value in column
test_df=spark.sql(testquery)
test_df.show(truncate=False)
output:
+----+
|A |
+----+
|null|
+----+
You need to wrap ' .5G' in parentheses in the pattern, not wrap the column value in parentheses.
testquery = """
select
regexp_extract(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]',''))), '( .5G)', 1) as A
from df a
"""
test_df = spark.sql(testquery)
test_df.show(truncate=False)
+----+
|A |
+----+
| .5G|
+----+
Also note that you cannot concatenate strings with +; use concat for that purpose.
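For illustration, a hedged sketch (written here in Scala; the SQL inside spark.sql is the part that matters and is identical from PySpark) of what the original query seemed to be aiming for, wrapping the extracted token in literal parentheses with concat rather than +:
// Assumes the same df view and productname column as in the question
spark.sql("""
  select concat('(',
                regexp_extract(trim(upper(regexp_replace(a.productname,
                                  '[,/#!$%^&*;:{}=_`~()-]', ''))),
                               '( .5G)', 1),
                ')') as A
  from df a
""").show(false)   // should display ( .5G), i.e. the extracted token in literal parentheses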

How to use the split function in Spark SQL with delimiter |#|?

My column has data like this:
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delimited by |#| (pipe, #, pipe).
How can I split this with Spark SQL?
I am trying spark.sql("select split(col,'|#|')").show() but it is not giving me the proper result.
I tried escaping with \ but still no luck.
Does anyone know what is going on here?
Note: I need a solution for Spark SQL only.
I am not sure whether I have understood your problem statement properly, but splitting a string by its delimiter is fairly simple and can be done in a variety of ways.
One of the methods is to use SUBSTRING_INDEX -
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by -
%sql
select *,substring_index(col1,'|#|',1) as value1, substring_index(col1,'|#|',2) as value2, substring_index(col1,'|#|',3) as value3 from testSplit
Result: value1, value2 and value3 hold the first one, two and three |#|-separated segments respectively (e.g. abc, abc|#|pqr and abc|#|pqr|#|xyz for the first row).
Or use the split function (see the Split Function Documentation):
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Result: SplitString is an array of the individual segments, e.g. [abc, pqr, xyz] for the first row and [aaa, sss, sdf] for the second.
Do let me know if this fulfills your requirement.
Check the code below.
scala> import org.apache.spark.sql.functions.split
scala> val adf = Seq("abc|#|pqr|#|xyz", "aaa|#|sss|#|sdf").toDF("data")
scala> adf.createOrReplaceTempView("split_data")
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data,split('abc|#|pqr|#|xyz', '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
Note: inside the spark.sql function, pass your select query between triple quotes (""" """) and escape special symbols with \\.
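As a further sketch (assuming the testSplit view from the first answer), individual pieces can be pulled out of the array returned by split by index, entirely in Spark SQL; the indexing is 0-based:
// Sketch: split once per column and index into the resulting array
spark.sql("""
  select col1,
         split(col1, '\\|#\\|')[0] as part1,
         split(col1, '\\|#\\|')[1] as part2,
         split(col1, '\\|#\\|')[2] as part3
  from testSplit
""").show(false)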

Spark Read csv with missing quotes

val data = spark.read
  .option("delimiter", "\t")
  .option("quote", "\"")
  .csv("file:///opt/spark/test1.tsv")
This incorrectly interprets lines with missing quotes, even though the tab delimiter exists.
For example, the line:
"aaa" \t "b'bb \t 222
is interpreted as "aaa", "b'bb 222"
instead of
"aaa", "b'bb", "222".
According to the documentation, delimiters inside quotes are ignored.
I can get around the problem by redefining the default quote, for example:
.option("quote","+")
but it's not a good solution.
If quotes are not closed properly, the only option is to keep them when creating the dataframe and drop them later using custom logic.
scala> val df = spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv")
scala> df.show()
+-----+-----+---+
| _c0| _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb| 22|
+-----+-----+---+
Now, if you know which column might have an issue, just apply the following logic.
scala> import org.apache.spark.sql.functions.regexp_replace
scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
| _c0| _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22| aaa|
+-----+-----+---+------------------+
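If more than one column may carry stray quotes, a minimal sketch (assuming the df read above) that strips the quote character from every column via a foldLeft:
import org.apache.spark.sql.functions.{col, regexp_replace}
// Sketch: remove the quote character from all columns at once
val cleaned = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "\"", ""))
}
cleaned.show()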

Is there a way Spark won't escape the backslash coming at the beginning of each column?

I have a column that has a Windows address as follows:
\aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf
After reading it into a dataset, when I try to read the column, it strips the first backslash and prints the value as follows. Is there a way to skip this?
aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf
By default, Apache Spark does not remove the backslash.
val df1 = sc.parallelize(
  Seq(
    (1, "khan /, vaquar", "30", "/aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf"),
    (2, "Zidan /, khan", "5", "vkhan1MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf"),
    (3, "Zerina khan", "1", "test")
  )
).toDF("id", "name", "age", "string").show
Please share your full code to debug the issue further.
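One thing worth ruling out (an assumption on my part, since the full code isn't shown): if the value is built as a Scala string literal rather than read from a file, a single leading backslash has to be written as \\ in source code, otherwise the compiler treats it as the start of an escape sequence:
// Hedged check, not from the original post: a literal backslash must be escaped in Scala source
val p = "\\aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf"
println(p)  // prints \aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf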

Spark Scala - splitting string syntax issue

I'm trying to split a String in a DataFrame column using Spark SQL and Scala, and there seems to be a difference in the way the split condition works between the two.
Using Scala, this works:
val seq = Seq("12.1")
val df = seq.toDF("val")
val afterSplit = df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne"))
afterSplit.show(false)
However, in Spark SQL, when I use this, firstPartSQL shows a blank.
df.registerTempTable("temp")
val s1 = sqlContext.sql("select split(val, '\\.')[0] as firstPartSQL from temp")
Instead, when I use this (the separator represented as [.] instead of \.), the
expected value shows up.
val s1 = sqlContext.sql("select split(val, '[.]')[0] as firstPartSQL from temp")
Any ideas why this is happening?
When you use regex patterns in spark-sql with double quotes, spark.sql("....."), the pattern is treated as a string within another string, so two things happen. Consider this:
scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]
scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
| 12|
+-------+
scala> df.createOrReplaceTempView("temp")
With the DataFrame API, the regex string is passed directly to split(), so you only need one level of escaping ("\\." in Scala source, which split() receives as \.).
But when it comes to spark-sql, the pattern is first parsed as a SQL string literal and then passed again as a string to the split() function,
so the string you hand to spark-sql needs to contain \\. (which the SQL parser turns into \., the regex for a literal dot).
The way to get that from a double-quoted Scala string is to add two more backslashes:
scala> "\\."
res12: String = \.
scala> "\\\\."
res13: String = \\.
scala>
If you just pass "\\." to spark-sql, Scala first converts it to \. and the SQL parser then converts that to ".", which in regex context means "any" character,
i.e. split on any character. Since every character is adjacent to the next, you get an array of empty strings.
The string "12.1" has length four, and the pattern also matches the final boundary of the string, so up to split(val, '\.')[4] you'll get an
empty string; for split(val, '\.')[5] you'll get null.
To verify this, you can pass the same delimiter string "\\." to the regexp_replace() function and see what happens:
scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| | 9999|
+------------+------+
scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>
If you still want to use the same pattern for both the DataFrame API and SQL, then go with a raw string, i.e. triple quotes.
scala> raw"\\."
res23: String = \\.
scala>
scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>
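As a small extra check (assuming the temp view from above), size() shows how many elements the two escapings produce, which matches the array-of-empty-strings explanation:
// '\\.' reaches split as \. -> regex for a literal dot -> ["12", "1"]      -> size 2
// '\.'  reaches split as .  -> splits on every character (all empties)     -> size 5
spark.sql("""select size(split(val, '\\.')) as escaped_dot,
                    size(split(val, '\.')) as any_char
             from temp""").show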
