Spark DataFrame: filter a column starting with a string - apache-spark

I have a requirement to filter a DataFrame on the condition that a column value starts with a predefined string.
I am trying the following:
val domainConfigJSON = sqlContext.read
.jdbc(url, "CONFIG", prop)
.select("DID", "CONF", "KEY").filter("key like 'config.*'")
And I am getting this exception:
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
You have an error in your SQL syntax; check the manual that
corresponds to your MariaDB server version for the right syntax to use
near 'KEY = 'config.*'' at line 1
Using Spark 1.6.1.

You can use the startsWith function present in the Column class.
myDataFrame.filter(col("columnName").startswith("PREFIX"))

I used the same function but was getting errors, so I checked what the error was: we actually need to use startsWith(literal: String), whereas the function above is written with lowercase startswith().
Ex : df.filter(col("ACCOUNT_NUMBER").startsWith("9"))
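Applied to the original question, a minimal sketch (reusing the url and prop from the question, and assuming the intended prefix is "config.") would look like this; note that startsWith takes a plain prefix rather than a LIKE pattern:

import org.apache.spark.sql.functions.col

val domainConfigJSON = sqlContext.read
  .jdbc(url, "CONFIG", prop)
  .select("DID", "CONF", "KEY")
  .filter(col("KEY").startsWith("config."))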

Related

Cannot use custom SQL function with arguments inside transform scope [Spark SQL] (Error in SQL statement: AnalysisException: Resolved attribute(s)...)

I am using a Spark SQL context in Azure Databricks.
My query uses the transform function for handling an array like so:
SELECT
  colA,
  colB,
  transform(colC,
    x -> named_struct(
      "innerColA", functionA(x.innerColA), -- does not work
      "innerColB", [...x.innerColB...], -- works (same logic as functionA)
      "test1", test1(), -- works
      "test2", test2(x.innerColA) -- does not work
    )
  )
FROM
  tableA
I get the following error regarding the use of functionA:
Error in SQL statement: AnalysisException: Resolved attribute(s) x#2723416 missing from in operator !Project [cast(lambda x#2723416 as string) AS arg1#2723417].
functionA is simple enough that, if I rewrite its logic directly into the query, it works (as shown with "innerColB" in my code example).
I have tested with simple functions that don't take any arguments and they can be used without any issues:
CREATE OR REPLACE FUNCTION test1() RETURNS STRING RETURN "test"
But if the function takes any arguments, it throws that error:
CREATE OR REPLACE FUNCTION test2(arg1 STRING) RETURNS STRING RETURN "test"
Is that a limitation of SparkSQL? Are there any workarounds?

Problem with selecting a JSON object in PySpark, which may sometimes have null values

I have this big nested JSON object from which I need to make a DataFrame. One of the inner JSON elements sometimes comes in empty and sometimes comes with some values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I need all three columns: Door Number, Place and Zip code. When the address is empty, I should put null values in the respective columns, and fill them when there is data.
The code I tried:
test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
.withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
.withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check whether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I am not able to understand why PySpark evaluates the expression in otherwise even when the condition in when is met. (I tried giving simple values in otherwise instead of the JSON path and it worked.)
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) // add the remaining fields the same way
val schema = StructType(StructField("student_address", addressSchema, true) :: Nil) // the point is that student_address is nullable
val df = spark.read.schema(schema).json("example.json")
Alternatively, use get_json_object: the first argument is the student_address column, and the path selects the field you want by name.
df.withColumn("place",when(df.student_address== "", lit(None)).otherwise(get_json_object(col("student_address"),"$.student_address.place")))

UDF in Spark 1.6 Reassignment to val error

I am using Spark 1.6.
The UDF below is used to clean address data.
sqlContext.udf.register("cleanaddress", (AD1:String,AD2: String, AD3:String)=>Boolean = _.matches("^[a-zA-Z0-9]*$"))
UDF Name : cleanaddress
The three input parameters come from DataFrame columns (AD1, AD2 and AD3).
Could someone please help me fix the error below?
I am trying to write a UDF which accepts three parameters (the 3 address columns of the DataFrame), computes, and returns only the filtered records.
Error:
Error:(38, 91) reassignment to val
sqlContext.udf.register("cleanaddress", (AD1:String, AD2: String, AD3:String)=>Boolean = _.matches("^[a-zA-Z0-9]*$"))
Your logic is not quite clear from the given code. What you can do is return an array of valid addresses like this:
sqlContext.udf.register("cleanaddress", (AD1:String, AD2: String, AD3:String)=> Seq(AD1,AD2,AD3).filter(_.matches("^[a-zA-Z0-9]*$")))
Note that this will return a complex column (i.e. an array)
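A brief usage sketch, calling the registered UDF from SQL (the table name addresses is hypothetical):

// Assumes a registered temp table "addresses" with string columns AD1, AD2, AD3.
val cleaned = sqlContext.sql(
  "SELECT cleanaddress(AD1, AD2, AD3) AS valid_addresses FROM addresses")
cleaned.show()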

Spark SQL query with IN operator in CASE WHEN cannot be cast to SparkPlan

I'm trying to execute the test query like this:
SELECT COUNT(CASE WHEN name IN (SELECT name FROM requiredProducts) THEN name END)
FROM myProducts
which throws the following exception:
java.lang.ClassCastException:
org.apache.spark.sql.execution.datasources.LogicalRelation cannot be cast to
org.apache.spark.sql.execution.SparkPlan
I suspect that the IN operator cannot be used inside CASE WHEN. Is that really so? The Spark documentation is silent about this.
The IN operator with a subquery does not work in a projection, regardless of whether it is contained in a CASE WHEN; it only works in filters. It works fine if you specify the values in the IN clause directly rather than using a subquery.
I am not sure how to generate the exact exception you got above, but when I attempt to run a similar query in Spark Scala, it returns a more descriptive error:
org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in a Filter: Project [CASE WHEN agi_label#5 IN (list#96 []) THEN 1 ELSE 0 END AS CASE WHEN (agi_label IN (listquery())) THEN 1 ELSE 0 END#97]
I have run into this issue in the past. Your best bet is probably to restructure it to use a left join to requiredProducts and then check for a null in the case statement. For example, something like this might work:
SELECT COUNT(CASE WHEN rp.name is not null THEN mp.name END)
FROM myProducts mp
LEFT JOIN requiredProducts rp ON mp.name = rp.name
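For reference, a hedged sketch of the same left-join workaround in the DataFrame API (assuming DataFrames myProducts and requiredProducts, each with a name column):

import org.apache.spark.sql.functions._

// Count the myProducts rows whose name appears in requiredProducts;
// count() ignores the nulls produced by when() without an otherwise().
val counted = myProducts.as("mp")
  .join(requiredProducts.as("rp"), col("mp.name") === col("rp.name"), "left")
  .agg(count(when(col("rp.name").isNotNull, col("mp.name"))).as("cnt"))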

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
  sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0), s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
  sqlContext.sql("SELECT value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name by using pattern matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)
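For illustration, a small sketch that binds both columns of the query above by position while giving them names (the tuple it builds is just an assumption):

// Matches the two-column result of `original` (myVal, then name);
// the positions must still line up with the query.
val pairs = original.map { case Row(myVal: Int, name: String) => (name, myVal) }
pairs.collect().foreach(println)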
