What is the Spark SQL equivalent of Redshift's POSIX operator?

I am executing the following Redshift SQL commands, which use the POSIX operator (~) for pattern matching. Each returns true if there are 9 consecutive digits anywhere in the string, and false otherwise:
select '123456789' ~ '\\d{9}' as val; --TRUE
select 'abcd123456789' ~ '\\d{9}' as val; --TRUE
select '123456789ab' ~ '\\d{9}' as val; --TRUE
How do I do the same pattern matching in Spark SQL?

I believe rlike should do the trick:
spark.sql("""SELECT '123456789' rlike '\\\d{9}' as val""").show()
spark.sql("""SELECT 'ab123456789' rlike '\\\d{9}' as val""").show()
spark.sql("""SELECT '123456789abcd' rlike '\\\d{9}' as val""").show()
All result in:
+----+
| val|
+----+
|true|
+----+
And:
spark.sql("""SELECT '12345678abcd' rlike '\\\d{9}' as val""").show()
Results in:
+-----+
|  val|
+-----+
|false|
+-----+
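The behaviour above can be cross-checked without a cluster. The sketch below is plain Scala, not Spark code: it assumes, as the Spark documentation describes, that rlike takes a Java regular expression and succeeds when the pattern is found anywhere in the string. The object and method names are hypothetical.

```scala
import java.util.regex.Pattern

// Hypothetical helper: mimics an rlike-style partial match with plain Java regex.
object RlikeSketch {
  private val nineDigits = Pattern.compile("\\d{9}")

  // find() looks for the pattern anywhere in the input, unlike matches(),
  // which would require the whole string to consist of exactly nine digits.
  def hasNineDigits(s: String): Boolean = nineDigits.matcher(s).find()

  def main(args: Array[String]): Unit =
    Seq("123456789", "ab123456789", "123456789abcd", "12345678abcd")
      .foreach(s => println(s"$s -> ${hasNineDigits(s)}"))
}
```

To require that the whole string be nine digits, anchor the pattern as ^\d{9}$.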

Related

How to use the split function in Spark SQL with the delimiter |#|?

My column has data like this:
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delimited by |#| (pipe, #, pipe).
How do I split this with Spark SQL?
I am trying spark.sql("select split(col,'|#|')").show() but it does not give the proper result.
I tried escaping with \ but still no luck.
Does anyone know what is going on here?
Note: I need a solution for Spark SQL only.
I am not sure I have understood your problem statement properly, but splitting a string by its delimiter is fairly simple and can be done in several ways.
One method is to use SUBSTRING_INDEX, which treats the delimiter as a literal string rather than a regex:
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by:
%sql
select *, substring_index(col1,'|#|',1) as value1, substring_index(substring_index(col1,'|#|',2),'|#|',-1) as value2, substring_index(col1,'|#|',-1) as value3 from testSplit
Or use the SPLIT function (see the split function documentation), which takes a regex, so the pipes must be escaped:
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Do let me know whether this fulfills your requirement.
Check the code below.
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data           |split_data     |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data           |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data,split(data, '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data           |split_data     |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
Note: pass the select query to spark.sql between triple quotes (""" """) and escape regex special characters with \\.
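Why the escaping matters can be seen with the standard library alone. The sketch below is plain Scala, not Spark code, and assumes only that Spark's split interprets its second argument as a Java regex. In a regex, an unescaped | means alternation, so |#| reads as "empty, or #, or empty" and can match the empty string. The object and method names are hypothetical.

```scala
import java.util.regex.Pattern

// Hypothetical helper contrasting escaped and unescaped delimiters.
object SplitSketch {
  // Escaped by hand: each pipe is matched literally instead of acting as alternation.
  def splitEscaped(s: String): Seq[String] = s.split("\\|#\\|").toSeq

  // Pattern.quote builds a regex that matches the given delimiter literally.
  def splitQuoted(s: String, delimiter: String): Seq[String] =
    s.split(Pattern.quote(delimiter)).toSeq

  // Unescaped: "|#|" can match the empty string, so the input is not split
  // on the intended three-character delimiter.
  def splitUnescaped(s: String): Seq[String] = s.split("|#|").toSeq

  def main(args: Array[String]): Unit = {
    println(splitEscaped("abc|#|pqr|#|xyz"))
    println(splitQuoted("abc|#|pqr|#|xyz", "|#|"))
    println(splitUnescaped("abc|#|pqr|#|xyz"))
  }
}
```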

rlike in a Scala DataFrame is giving an error

I am trying to convert the Hive SQL statement below into a Spark DataFrame expression, but I am getting an error.
case when (lower(message_txt) rlike '.*sampletext(\\s?is\\s?)newtext.*' ) then 'P' else 'Y'
Sample data: message_txt = "This is new sampletext, followed by newtext"
Please help me with the equivalent Spark DataFrame statement.
Use when(lower($"value").rlike(""".sampletext(\s?is\s?)newtext."""),lit('P')).otherwise("Y"):
scala> df.withColumn("condition",when(lower($"value").rlike(""".sampletext(\s?is\s?)newtext."""),lit('P')).otherwise("Y")).show(false)
+-------------------------------------------+---------+
|value                                      |condition|
+-------------------------------------------+---------+
|This is new sampletext, followed by newtext|Y        |
+-------------------------------------------+---------+
Add END at the end of the CASE statement in SQL.
Example:
In Spark SQL:
val df=Seq(("This is new sampletext, followed by newtext")).toDF("message_txt")
df.createOrReplaceTempView("tmp")
spark.sql("select case when (lower(message_txt) rlike '.sampletext(\\s?is\\s?)newtext.' ) then 'P' else 'Y' end from tmp").show()
//Result
//+--------------------------------------------------------------------------------+
//|CASE WHEN lower(message_txt) RLIKE .sampletext(s?iss?)newtext. THEN P ELSE Y END|
//+--------------------------------------------------------------------------------+
//|                                                                               Y|
//+--------------------------------------------------------------------------------+
In dataframe API:
df.withColumn("status", when(lower(col("message_txt")).rlike(".sampletext(\\s?is\\s?)newtext."),"P").otherwise("Y")).show()
//Result
//+--------------------+------+
//|         message_txt|status|
//+--------------------+------+
//|This is new sampl...|     Y|
//+--------------------+------+
UPDATE:
Checking for the strings sampletext and newtext, in that order, in the message_txt column:
//using rlike
df.withColumn("status", when(lower(col("message_txt")).rlike("sampletext.*newtext"),"P").otherwise("Y")).show()
//using like
df.withColumn("status", when(lower(col("message_txt")).like("%sampletext%newtext%"),"P").otherwise("Y")).show()
//+--------------------+------+
//|         message_txt|status|
//+--------------------+------+
//|This is new sampl...|     P|
//+--------------------+------+
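The difference between the two results can be reproduced with plain Java regex. This is a sketch, not Spark code: it assumes rlike behaves like a partial (find) match on a Java regular expression, and the object and method names are hypothetical. The original pattern requires "is" directly between sampletext and newtext, which the sample row does not contain, so it yields 'Y'. The relaxed pattern only requires sampletext to appear before newtext.

```scala
import java.util.regex.Pattern

// Hypothetical helper mirroring:
//   case when lower(message_txt) rlike <pattern> then 'P' else 'Y' end
object CaseWhenSketch {
  private val strict  = Pattern.compile(".*sampletext(\\s?is\\s?)newtext.*")
  private val relaxed = Pattern.compile("sampletext.*newtext")

  // Lower-cases the input first, then applies a partial match.
  private def status(messageTxt: String, pattern: Pattern): String =
    if (pattern.matcher(messageTxt.toLowerCase).find()) "P" else "Y"

  def strictStatus(messageTxt: String): String  = status(messageTxt, strict)
  def relaxedStatus(messageTxt: String): String = status(messageTxt, relaxed)

  def main(args: Array[String]): Unit = {
    val sample = "This is new sampletext, followed by newtext"
    println(strictStatus(sample))
    println(relaxedStatus(sample))
  }
}
```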

How to handle a conditional join in Spark

I am using spark-sql 2.4.x and datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have some finance data coming from a Kafka topic. The data (base dataset) contains the companyId, year, and prev_year fields.
If year === prev_year, then I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, then I need to return the base dataset itself.
How can I do this in Spark SQL?
You can refer to the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2016|     2017|  12|
|        1|2017|     2017|21.4|
|        2|2018|     2017|11.7|
|        2|2018|     2018|44.6|
|        3|2016|     2017|34.5|
|        4|2017|     2017|  56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
|        1|12.3|
|        2|12.5|
|        3|22.3|
|        4|34.6|
|        5|45.2|
+---------+----+
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2017|     2017|21.4|
|        2|2018|     2018|44.6|
|        4|2017|     2017|  56|
|        1|2016|     2017|12.3|
|        2|2018|     2017|12.5|
|        3|2016|     2017|22.3|
+---------+----+---------+----+
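The split, join, and union pattern comes down to a per-row rule, sketched below with plain Scala collections instead of DataFrames. The sketch follows the rule as the question words it (look up the exchange rate when year equals prev_year, otherwise keep the row as is); swapping the comparison gives the opposite behaviour. The Record case class, the object name, and the fallback to the original rate when no exchange rate exists are all assumptions made for illustration.

```scala
// Hypothetical in-memory model of the conditional join.
object ConditionalJoinSketch {
  final case class Record(companyId: Int, year: Int, prevYear: Int, rate: Double)

  // Rows where year == prevYear take their rate from exchangeRates when one exists
  // (assumption: otherwise the original rate is kept); other rows are unchanged.
  def applyRates(rows: Seq[Record], exchangeRates: Map[Int, Double]): Seq[Record] =
    rows.map { r =>
      if (r.year == r.prevYear) r.copy(rate = exchangeRates.getOrElse(r.companyId, r.rate))
      else r
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Record(1, 2016, 2017, 12.0), Record(1, 2017, 2017, 21.4))
    applyRates(rows, Map(1 -> 12.3)).foreach(println)
  }
}
```

In DataFrame terms, the if branch corresponds to the filtered subset that is joined and the else branch to the subset that is passed through before the union.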

Spark SQL query giving a data type mismatch error

I have a small SQL query that works as expected in Hive, but the same query fails in Spark SQL.
The table has user information, and below is the query:
spark.sql("select * from users where (id,id_proof) not in ((1232,345))").show;
I am getting the exception below in Spark:
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('age', deleted_inventory.`age`, 'id_proof', deleted_inventory.`id_proof`) IN (named_struct('col1',1232, 'col2', 345)))' due to data type mismatch: Arguments must be same type but were: StructType(StructField(id,IntegerType,true), StructField(id_proof,IntegerType,true)) != StructType(StructField(col1,IntegerType,false), StructField(col2,IntegerType,false));
Both id and id_proof are of integer type.
Try using a WITH clause (common table expression); it works.
scala> val df = Seq((101,121), (1232,345),(222,2242)).toDF("id","id_proof")
df: org.apache.spark.sql.DataFrame = [id: int, id_proof: int]
scala> df.show(false)
+----+--------+
|id  |id_proof|
+----+--------+
|101 |121     |
|1232|345     |
|222 |2242    |
+----+--------+
scala> df.createOrReplaceTempView("girish")
scala> spark.sql("with t1( select 1232 id,345 id_proof ) select id, id_proof from girish where (id,id_proof) not in (select id,id_proof from t1) ").show(false)
+---+--------+
|id |id_proof|
+---+--------+
|101|121     |
|222|2242    |
+---+--------+
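For reference, the semantics the query is after (exclude a row only when both columns match the same excluded pair) can be sketched with plain Scala tuples. This is not Spark code; the object and method names are hypothetical, and the sketch ignores SQL's null handling for NOT IN because both columns here are non-null integers.

```scala
// Hypothetical helper: pair-wise "NOT IN" over (id, id_proof) tuples.
object NotInSketch {
  // Keeps the pairs that are not members of the excluded set.
  def notIn(rows: Seq[(Int, Int)], excluded: Set[(Int, Int)]): Seq[(Int, Int)] =
    rows.filterNot(excluded.contains)

  def main(args: Array[String]): Unit =
    notIn(Seq((101, 121), (1232, 345), (222, 2242)), Set((1232, 345))).foreach(println)
}
```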

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. The example works fine if I keep only person.type and remove person.1 from selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
  .add("type", DataTypes.StringType)
  .add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
  .select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
  .selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
  .outputMode("update")
  .format("console")
  .start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value                    |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
  .add("type", DataTypes.StringType)
  .add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
|     person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
|   1|
+----+
|23.2|
+----+
I solved this problem by using person.*:
+-----+--------+
|type |1       |
+-----+--------+
|abc  |23.2    |
+-----+--------+
