Spark Scala - splitting string syntax issue - apache-spark

i'm trying to split String in a DataFrame column using SparkSQL and Scala,
and there seems to be a difference in the way the split condition is working the two
Using Scala,
This works -
val seq = Seq("12.1")
val df = seq.toDF("val")
Scala Code ->
val seq = Seq("12.1")
val df = seq.toDF("val")
val afterSplit = df2.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne"))
afterSplit.show(false)
However, in Spark SQL when i use this, firstParkSQL shows a Blank.
df.registerTempTable("temp")
val s1 = sqlContext.sql("select split(val, '\\.')[0] as firstPartSQL from temp")
Instead, when i use this (separate condition represented as [.] instead of \.
expected value shows up.
val s1 = sqlContext.sql("select split(val, '[.]')[0] as firstPartSQL from temp")
Any ideas why this is happening ?

When you use regex patterns in spark-sql with double quotes spark.sql("....."),it is considered as string within another string, so two things happen. Consider this
scala> val df = Seq("12.1").toDF("val")
df: org.apache.spark.sql.DataFrame = [val: string]
scala> df.withColumn("FirstPart", split($"val", "\\.")).select($"FirstPart".getItem(0).as("PartOne")).show
+-------+
|PartOne|
+-------+
| 12|
+-------+
scala> df.createOrReplaceTempView("temp")
With df(), the regex-string for split is directly passed to the split string, so you just need to escape the backslash alone (\).
But when it comes to spark-sql, the pattern is first converted into string and then again passed as string to split() function,
So you need to get \\. before you use that in the spark-sql
The way to get that is to add 2 more \
scala> "\\."
res12: String = \.
scala> "\\\\."
res13: String = \\.
scala>
If you just pass "\\." in the spark-sql, first it gets converted into \. and then to ".", which in regex context becomes (.) "any" character
i.e split on any character, and since each character is adjacent to each other, you get an array of empty string.
The length of the string "12.1" is four and also it matches the final boundary "$" of the string as well.. so upto split(val, '\.')[4], you'll get the
empty string. When you issue split(val, '\.,')[5], you'll get null
To verify this, you can pass the same delimiter string "\\." to regex_replace() function and see what happens
scala> spark.sql("select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| | 9999|
+------------+------+
scala> spark.sql("select split(val, '\\\\.')[0] as firstPartSQL, regexp_replace(val,'\\\\.','9') as reg_ex from temp").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>
If you still want to use the same pattern between df and sql, then go with raw string i.e triple quotes.
scala> raw"\\."
res23: String = \\.
scala>
scala> spark.sql("""select split(val, '\\.')[0] as firstPartSQL, regexp_replace(val,'\\.','9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala> spark.sql("""select split(val, "\\.")[0] as firstPartSQL, regexp_replace(val,"\\.",'9') as reg_ex from temp""").show
+------------+------+
|firstPartSQL|reg_ex|
+------------+------+
| 12| 1291|
+------------+------+
scala>

Related

How to use Split function in spark sql with delemter |#|?

My column is having data as,
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delemeted by |#| (pipe ,# , pipe).
How to split this with spark sql.
I am trying spark.sql("select split(col,'|#|')").show() but it is not giving me proper result.
I tried escaping \ but still no luck.
Can anyone knows what is going on here..
Note: I need solution for spark sql only.
I am not sure if I have understood your problem statement properly or not but to split a string by its delimiter is fairly simple and can be done in a variety of ways.
One of the methods is to use SUBSTRING_INDEX -
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by -
%sql
select *,substring_index(col1,'|#|',1) as value1, substring_index(col1,'|#|',2) as value2, substring_index(col1,'|#|',3) as value3 from testSplit
Result -
OR - Split Function Documentation
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Result -
Do let me know if this fulfills your requirement or not .
Check below code.
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data,split('abc|#|pqr|#|xyz', '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[abc, pqr, xyz]|
+---------------+---------------+
Note: inside spark.sql function pass your select query between """ """ & escape special symbols with \\.

Convert string type to array type in spark sql

I have table in Spark SQL in Databricks and I have a column as string. I converted as new columns as Array datatype but they still as one string. Datatype is array type in table schema
Column as String
Data1
[2461][2639][2639][7700][7700][3953]
Converted to Array
Data_New
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then write as parquet and use as spark sql table in databricks
When I search for string using array_contains function I get results as false
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for all string then query turns the results as true
Please suggest if I can separate these string as array and can find any array using array_contains function.
Just remove leading and trailing brackets from the string then split by ][ to get an array of strings:
df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1 |Data_New |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
Actually this is not an array, this is a full string so you need a regex or similar
expr = "[2461]"
df_new.filter(df_new["Data_New"].rlike(expr))
import
from pyspark.sql import functions as sf, types as st
create table
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
| col1|
+--------------------+
|[2461][2639][2639...|
| null|
+--------------------+
convert type
def spliter(x):
if x is not None:
return x[1:-1].split("][")
else:
return None
udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
| col1| array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
| null| null| null|
+--------------------+--------------------+-----+

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. Example works fine with just by keeping person.type and removing person.1 in selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
I have solved this problem by using person.*
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+

How to swap minus sign from last position in a string to first position in hive?

How to swap negative sign from last position to first in a string or integer to first position in hive and/ spark?
example: 22-
required: -22
My code is:
val Validation1 = spark.sql("Select case when substr(YTTLSVAL-,-1,1)='-' then cast(concat('-',substr(YTTLSVAL-,1,length(YTTLSVAL-)-1)) as int) else cast(YTTLSVAL- as int) end as column_name")
scala> Seq("-abcd", "def", "23-", "we").toDF("value").createOrReplaceTempView("values")
scala> val f = (x: String) => if(x.endsWith("-")) s"-${x.dropRight(1)}" else x
scala> spark.udf.register("myudf", f)
scala> spark.sql("select *, myudf(*) as custval from values").show
+-----+-------+
|value|custval|
+-----+-------+
|-abcd| -abcd|
| def| def|
| 23-| -23|
| we| we|
+-----+-------+
EDIT
On second thought, since, UDF's are discouraged unless you absolutely need them (since they create a black box for spark's optimisation engine), please use below way that uses regex_replace instead. Have tested this and it works:
scala> spark.sql("select REGEXP_REPLACE ( value, '^(\\.+)(-)$','-$1') as custval from values").show
You could try REGEXP_REPLACE. This patter searches for a number followed by - at the end and puts it before the number if found.
SELECT REGEXP_REPLACE ( val, '^(\\d+)-$','-$1')
Column functions

Spark: create DataFrame with specified schema fields

In Spark, create case class to specify schema, then create RDD from a file and convert it to DF. e.g.
case class Example(name: String, age: Long)
val exampleDF = spark.sparkContext
.textFile("example.txt")
.map(_.split(","))
.map(attributes => Example(attributes(0), attributes(1).toInt))
.toDF()
The question is, if the content in the txt file is like "ABCDE12345FGHIGK67890", without any symbols or spaces. How to extract specified length of string for the schema field. e.g. extract 'BCD' for name and '23' for age. Is this possible to use map and split?
Thanks !!!
You can use subString to pull the data from specific index as below
case class Example (name : String, age: Int)
val example = spark.sparkContext
.textFile("test.txt")
.map(line => Example(line.substring(1, 4), line.substring(6,8).toInt)).toDF()
example.show()
Output:
+----+---+
|name|age|
+----+---+
| BCD| 23|
| BCD| 23|
| BCD| 23|
+----+---+
I hope this helps!
In the map function where you are splitting by commas, just put a function that converts the input string to a list of values in the required order.

Resources