rlike in Scala DataFrame is giving an error - apache-spark

I am trying to convert the below Hive SQL statement into a Spark DataFrame expression but am getting an error.
case when (lower(message_txt) rlike '.*sampletext(\\s?is\\s?)newtext.*' ) then 'P' else 'Y'
Sample data: message_txt = "This is new sampletext, followed by newtext"
Please help me with the equivalent Spark DataFrame statement.

Use when(lower($"value").rlike(""".*sampletext(\s?is\s?)newtext.*"""), lit("P")).otherwise("Y")
scala> df.withColumn("condition", when(lower($"value").rlike(""".*sampletext(\s?is\s?)newtext.*"""), lit("P")).otherwise("Y")).show(false)
+-------------------------------------------+---------+
|value                                      |condition|
+-------------------------------------------+---------+
|This is new sampletext, followed by newtext|Y        |
+-------------------------------------------+---------+

Add end at the end of the case statement in SQL.
Example:
In Spark SQL:
val df=Seq(("This is new sampletext, followed by newtext")).toDF("message_txt")
df.createOrReplaceTempView("tmp")
spark.sql("select case when (lower(message_txt) rlike '.sampletext(\\s?is\\s?)newtext.' ) then 'P' else 'Y' end from tmp").show()
//Result
//+----------------------------------------------------------------------------------+
//|CASE WHEN lower(message_txt) RLIKE .*sampletext(s?iss?)newtext.* THEN P ELSE Y END|
//+----------------------------------------------------------------------------------+
//|                                                                                 Y|
//+----------------------------------------------------------------------------------+
In the DataFrame API:
df.withColumn("status", when(lower(col("message_txt")).rlike(".*sampletext(\\s?is\\s?)newtext.*"),"P").otherwise("Y")).show()
//Result
//+--------------------+------+
//|         message_txt|status|
//+--------------------+------+
//|This is new sampl...|     Y|
//+--------------------+------+
UPDATE:
Checking for the strings sampletext and newtext anywhere in the message_txt column:
//using rlike
df.withColumn("status", when(lower(col("message_txt")).rlike("sampletext.*newtext"),"P").otherwise("Y")).show()
//using like
df.withColumn("status", when(lower(col("message_txt")).like("%sampletext%newtext%"),"P").otherwise("Y")).show()
//+--------------------+------+
//|         message_txt|status|
//+--------------------+------+
//|This is new sampl...|     P|
//+--------------------+------+
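If you would rather keep the Hive expression as a string, the corrected CASE ... END (note the closing end) can also be wrapped in expr. A minimal sketch, assuming the same df and message_txt column as above:
import org.apache.spark.sql.functions.expr

// expr() parses the SQL expression string, so the Hive CASE can be reused as-is,
// provided it ends with END. Pattern and column name are taken from the question.
val out = df.withColumn("status",
  expr("""case when lower(message_txt) rlike '.*sampletext(\\s?is\\s?)newtext.*' then 'P' else 'Y' end"""))
out.show(false)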

Related

Escape newline character in DataFrame

I have a Parquet table in Hive which I read via Spark and write to a delimited file. The code I use is this
var x = spark.table("myschema.my_table")
x.write.mode("overwrite").format("csv").save("/tmp/abc")
So far, so good. But the Hive table can contain data that has \n in it. Now when I write the data, that character breaks the line into a new one, creating an extra broken record. The character can be there in any column. How can I set it to replace it with a space while writing? I tried the following but it didn't work
x.write.mode("overwrite").format("csv").option("multiline", "true").save("/tmp/abc")
Spark provides no option to replace \n with a space while writing a DataFrame to CSV. Check the available options here.
You can use regexp_replace to remove or replace \n and then write the DataFrame to CSV:
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
val inDF = List(("a\nb", "c\nd", "d\ne", "ef")).toDF("col1", "col2", "col3", "col4")
inDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a
b| c
d| d
e| ef|
+----+----+----+----+ */
val outDF = inDF.columns.foldLeft(inDF)((df, c) => df.withColumn(c, regexp_replace(col(c), "\n", "")))
outDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| ab| cd| de| ef|
+----+----+----+----+ */
outDF.write.option("header", true).csv("outputPath")
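Applied to the table from the original question, a sketch of the same idea (this one replaces the newline with a space, as described above, and only touches the string columns) could look like this:
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Read the Hive table from the question, replace every newline in the string
// columns with a space, then write the delimited file as before.
val x = spark.table("myschema.my_table")
val stringCols = x.schema.fields.collect { case f if f.dataType == StringType => f.name }
val cleaned = stringCols.foldLeft(x)((df, c) => df.withColumn(c, regexp_replace(col(c), "\n", " ")))
cleaned.write.mode("overwrite").format("csv").save("/tmp/abc")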

Convert string type to array type in spark sql

I have a table in Spark SQL in Databricks, and one column is a string. I converted it into a new column with an Array datatype, but it is still one string. The datatype is array type in the table schema.
Column as String
Data1
[2461][2639][2639][7700][7700][3953]
Converted to Array
Data_New
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then I write it as Parquet and use it as a Spark SQL table in Databricks.
When I search for a string using the array_contains function, I get false as the result:
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for the whole string, the query returns true.
Please suggest how I can split this string into an array so that I can find any element using the array_contains function.
Just remove leading and trailing brackets from the string then split by ][ to get an array of strings:
df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1                               |Data_New                            |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
Actually this is not an array, it is one full string, so you need a regex or similar (escape the brackets so they match literally, and take the single element out of the array):
expr = "\\[2461\\]"
df_new.filter(df_new["Data_New"][0].rlike(expr))
import
from pyspark.sql import functions as sf, types as st
create table
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
|                col1|
+--------------------+
|[2461][2639][2639...|
|                null|
+--------------------+
convert type
def spliter(x):
    if x is not None:
        return x[1:-1].split("][")
    else:
        return None

udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
|                col1|          array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
|                null|                null| null|
+--------------------+--------------------+-----+

What is the substitute of posix operator of Redshift in SparkSQL?

I am executing the following Redshift SQL commands, which use the POSIX operator (~) for pattern matching (it returns true if there are 9 consecutive digits anywhere in the string, else false):
select '123456789' ~ '\\d{9}' as val; --TRUE
select 'abcd123456789' ~ '\\d{9}' as val; --TRUE
select '123456789ab' ~ '\\d{9}' as val; --TRUE
How do I do the same pattern matching in Spark SQL?
I believe rlike should do the trick:
spark.sql("""SELECT '123456789' rlike '\\\d{9}' as val""").show()
spark.sql("""SELECT 'ab123456789' rlike '\\\d{9}' as val""").show()
spark.sql("""SELECT '123456789abcd' rlike '\\\d{9}' as val""").show()
All result in:
+----+
| val|
+----+
|true|
+----+
And:
spark.sql("""SELECT '12345678abcd' rlike '\\\d{9}' as val""").show()
Results in:
+-----+
|  val|
+-----+
|false|
+-----+
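The same check also works with the DataFrame API via Column.rlike; a small sketch, assuming a DataFrame df with a string column named str_col (the names are illustrative, not from the question):
import org.apache.spark.sql.functions.col

// True whenever nine consecutive digits appear anywhere in the string, since
// rlike performs a regex search rather than a full match.
df.withColumn("val", col("str_col").rlike("\\d{9}")).show(false)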

Spark SQL query giving a data type mismatch error

I have a small SQL query which works perfectly fine in plain SQL, and the same query works in Hive as expected, but it fails in Spark.
The table has user information, and below is the query:
spark.sql("select * from users where (id,id_proof) not in ((1232,345))").show;
I am getting the below exception in Spark:
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('age', deleted_inventory.`age`, 'id_proof', deleted_inventory.`id_proof`) IN (named_struct('col1',1232, 'col2', 345)))' due to data type mismatch: Arguments must be same type but were: StructType(StructField(id,IntegerType,true), StructField(id_proof,IntegerType,true)) != StructType(StructField(col1,IntegerType,false), StructField(col2,IntegerType,false));
Both id and id_proof are of integer type.
Try using a with clause (CTE) instead; it works.
scala> val df = Seq((101,121), (1232,345),(222,2242)).toDF("id","id_proof")
df: org.apache.spark.sql.DataFrame = [id: int, id_proof: int]
scala> df.show(false)
+----+--------+
|id  |id_proof|
+----+--------+
|101 |121     |
|1232|345     |
|222 |2242    |
+----+--------+
scala> df.createOrReplaceTempView("girish")
scala> spark.sql("with t1( select 1232 id,345 id_proof ) select id, id_proof from girish where (id,id_proof) not in (select id,id_proof from t1) ").show(false)
+---+--------+
|id |id_proof|
+---+--------+
|101|121     |
|222|2242    |
+---+--------+
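If you prefer to stay in the DataFrame API rather than SQL, the same multi-column NOT IN can be expressed with a left_anti join; a sketch reusing the df from the answer above:
// Rows whose (id, id_proof) pair appears in `excluded` are dropped.
// Note: unlike NOT IN, an anti join keeps rows where id or id_proof is null.
val excluded = Seq((1232, 345)).toDF("id", "id_proof")
df.join(excluded, Seq("id", "id_proof"), "left_anti").show(false)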

Udf not working

Can you help me optimize this code and make it work?
This is the original data:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|         null|
|         Venlafaxine|         null|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
I expect to get data like this:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|         Venlafaxine|  Venlafaxine|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
This is the code:
distinct_df = spark.sql("select distinct medicine_name as medicine_name from medicine where medicine_name is not null")
distinct_df.createOrReplaceTempView("distinctDF")
def getMax(num1, num2):
    pmax = (num1 >= num2) * num1 + (num2 > num1) * num2
    return pmax

def editDistance(s1, s2):
    ed = (getMax(length(s1), length(s2)) - levenshtein(s1, s2)) / \
        getMax(length(s1), length(s2))
    return ed

editDistanceUdf = udf(lambda x, y: editDistance(x, y), FloatType())

def getSimilarity(str):
    res = spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
    res['medicine_name'].take(1)
    return res
getSimilarityUdf = udf(lambda x: getSimilarity(x), StringType())
res_df = df.withColumn('m_name', when((df.medicine_name.isNull)|(df.medicine_name.=="null")),getSimilarityUdf(df.original_name)
.otherwise(df.medicine_name)).show()
Now I'm getting this error:
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'function' object has no attribute '_get_object_id'
There are a number of problems with your code:
You cannot use SparkSession or distributed objects in a udf, so getSimilarity just cannot work. If you want to compare objects like this, you have to join.
If length and levenshtein come from pyspark.sql.functions, they cannot be used inside UserDefinedFunctions. They are designed to generate SQL expressions, mapping from Column to Column.
Column.isNull is a method, not a property, so it should be called:
df.medicine_name.isNull()
The following
df.medicine_name.=="null"
is not syntactically valid Python (it looks like a Scala calque) and would raise a SyntaxError.
Even if SparkSession access were allowed in a UserDefinedFunction, this wouldn't be a valid substitution:
spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
You should use string formatting methods
spark.sql("select medicine_name, editDistanceUdf({str}, medicine_name) from distinctDf where editDistanceUdf({str}, medicine_name)>=0.85 order by 2".format(str=str))
There may be other problems, but since you didn't provide an MCVE, anything else would be pure guessing.
Once you fix the smaller mistakes you have two choices:
Use crossJoin:
combined = df.alias("left").crossJoin(spark.table("distinctDf").alias("right"))
Then apply the udf, filter, and use one of the methods listed in Find maximum row per group in Spark DataFrame to pick the closest match per group.
Use built-in approximate matching tools as explained in Efficient string matching in Apache Spark
