Spark SQL "select column AS ..." not finding column - apache-spark

I am trying to run a SQL query on a Spark DataFrame. I have registered the DataFrame as a temp table, and now I am trying to run a SELECT that applies a UDF to a column and then picks up the rows that pass a certain condition.
The problem is that my WHERE clause references the transformed column, but it cannot see the name declared with AS.
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", delimiter)
.load(path);
df.registerTempTable("df");
String sqlDfQuery = "SELECT parseDateTime(start) as start1 FROM df WHERE start1 > 1";
if (sqlContext.sql(sqlDfQuery).take(1) != null) return true;
When I run that, I get back:
org.apache.spark.sql.AnalysisException: cannot resolve 'start1' given input columns: [scores, start, ...
parseDateTime is a UDF defined like this:
sqlContext.udf().register("parseDateTime", (String dt) -> new DateTime(dt).getMillis(), DataTypes.LongType);
Should I not be trying to do that?

This happens because Spark evaluates the WHERE filter before the SELECT aliases are resolved, so start1 does not exist yet at that point.
You could do a nested select statement to solve this issue.
Something like the following:
String sqlDfQuery = "SELECT start1 FROM (
SELECT parseDateTime(start) AS start1 FROM df) TMP
WHERE start1 > 1 ";
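Alternatively, since the alias simply isn't visible in WHERE, you could repeat the expression in the filter itself; a minimal sketch of the same query written that way (same UDF and table as above):
String sqlDfQuery = "SELECT parseDateTime(start) AS start1 FROM df WHERE parseDateTime(start) > 1";
Both forms avoid referencing the alias before it is resolved.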

Related

When reading data from Elasticsearch in Spark using the spark-ES connector, string columns come back with leading and trailing spaces

I am reading data from Elasticsearch in Spark using the spark-es connector. A few of the string columns come back with leading and trailing spaces, while other columns do not. Why this inconsistency? Is there anything I am missing?
val esTableData = spark.read
.format("org.elasticsearch.spark.sql")
.option("pushdown", "true").option("es.ignoreNulls","true")
.option("es.field.read.empty.as.null", "no")
.load("<index_path>")
esTableData.registerTempTable("tmp_table")
val res1 = spark.sql("select * from tmp_table where clm_a='ABC'")
res1.count
This gives a count of zero; if I trim the column, I get the expected count.
res10: Long = 0
val res2 = spark.sql("select * from tmp_table where trim(clm)='ABC'")
res2.count
res7: Long = 80
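For reference, the same trim workaround can be applied with the DataFrame API before registering the temp table; a minimal sketch, assuming the clm_a column from the query above:
import org.apache.spark.sql.functions.trim
// trim the suspect column up front, then filter on the cleaned value
val cleaned = esTableData.withColumn("clm_a", trim(esTableData("clm_a")))
cleaned.filter(cleaned("clm_a") === "ABC").count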

Dynamically loop a dataset for all column names

I am working on a project where I have around 500 column names, and I need to apply the coalesce function to every one of those columns after a join.
df1 schema
-id
-col1
...
-col500
df2 schema
-id
-col1
...
-col500
Dataset<Row> newDS= df1.join(df2, "id")
.select(
df1.col("id"),
functions.coalesce(df1.col("col1"),df2.col("col1")).as("col1"),
functions.coalesce(df1.col("col2"),df2.col("col2")).as("col2"),
...
functions.coalesce(df1.col("col500"),df2.col("col500")).as("col500"),
)
.show();
What I have tried
Dataset<Row> j1 = df1.join(df2, "id");
Dataset<Row> gh1 = spark.emptyDataFrame();
String[] f = df1.columns();
for (String h : f) {
    if (h == "id") {
        gh1 = j1.select(df1.col("id"));
    } else {
        gh1 = j1.select(functions.coalesce(df1.col(h), df2.col(h)).as(h));
    }
}
gh1.show();
df1.columns() returns a String[], so you cannot call stream() on it directly; wrap it with Stream.of instead:
Column[] coalescedColumns =
Stream.of(df1.columns())
.map(name -> functions.coalesce(df1.col(name),df2.col(name)).as(name))
.toArray(Column[]::new);
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id == b.id").select(coalescedColumns);
If I understand correctly, you have two dataframes with the same schema and you want to coalesce their 500 columns pairwise without having to write everything out by hand.
This can be achieved easily by building the list of columns programmatically. Since select does not accept a sequence of columns but rather a variable number of column arguments, you need to add : _* to let Scala know that it should treat all the elements of the sequence as separate arguments.
val cols = df1.columns.filter(_ != "id")
df1
.join(df2, "id")
.select(col("id") +: cols.map(n => coalesce(df1.col(n), df2.col(n)) as n) : _* )
In Java, you can pass an array of values to methods expecting a variable number of arguments, so you can rewrite your code like this:
Column[] coalescedColumns = Stream.of(df1.columns())
.map(name -> functions.coalesce(df1.col(name),df2.col(name)).as(name))
.toArray(Column[]::new);
Dataset<Row> newDS = df1.join(df2, "id").select(coalescedColumns);
I didn't exclude the id column, since coalesce works as expected on that column as well.

Check for empty row within spark dataframe?

I am running over several CSV files trying to do some checks, and for one file I am getting a NullPointerException; I suspect there are some empty rows.
So I am running the following, and for some reason it gives me an OK output:
check_empty = lambda row : not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()
Am I missing something within the filter function, or is it not possible to extract empty rows from dataframes?
You could use df.dropna() to drop empty rows and then compare the counts.
Something like
df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
You could use a built-in CSV reader option for dealing with such scenarios.
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED") // Drop empty/malformed rows
.load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
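If the goal is specifically to find rows in which every column is null, rather than rows with any null, a minimal sketch using the DataFrame na functions (assuming the data is already loaded as df):
// keep only rows that have at least one non-null column
val dfNonEmpty = df.na.drop("all")
// completely empty rows are the difference in counts
val numEmptyRows = df.count() - dfNonEmpty.count()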

Manipulating a dataframe within a Spark UDF

I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below.
Suppose I have a dataframe df1 that has columns named ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want to sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:
val y_list = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID: String, value: Int): Double = {
df1.filter($"ID" === ID).select(y_list:_*).first.toSeq.toList.take(value).foldLeft(0.0)(_+_)
}
sqlContext.udf.register("udf_test", udf_test _)
val df_result = df2.withColumn("Result", callUDF("udf_test", $"ID", $"Value"))
This gives me errors of the form:
java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: Y1)
I looked this up and realized that Spark Column is not serializable. I am wondering:
1) Is there any way to manipulate a dataframe within a UDF?
2) If not, what's the best way to achieve this type of operation? My real case is more complicated than this: it requires me to select values from multiple small dataframes based on some columns in a big dataframe, and compute a value back into the big dataframe.
I am using Spark 1.6.3. Thanks!
You can't use Dataset operations inside UDFs. A UDF can only operate on existing column values and produce one result column; it can't filter a Dataset or perform aggregations, though it can be used inside filter. A UDAF can also aggregate values.
Instead, you can use .as[SomeCaseClass] to turn the DataFrame into a Dataset and use normal, strongly typed functions inside filter, map, and reduce.
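For example, here is a minimal sketch of the typed approach; the case class and the "someId" value are hypothetical, just to illustrate the pattern:
import sqlContext.implicits._   // needed for the .as[...] encoder

// hypothetical row shape matching a few of df1's columns
case class Record(ID: String, Y1: Double, Y2: Double)

val subtotal = df1.as[Record]
  .filter(_.ID == "someId")     // plain Scala predicate instead of a UDF
  .map(r => r.Y1 + r.Y2)        // strongly typed transformation
  .reduce(_ + _)                // assumes at least one matching row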
Edit: If you want to join your bigDF with every small DF in smallDFs List, you can do:
import org.apache.spark.sql.functions._
val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
broadcast is a function that adds a broadcast hint to the small DataFrame, so that the join uses the more efficient broadcast join instead of a sort-merge join.
1) No, you can only use plain Scala code within UDFs.
2) If I interpreted your code correctly, you can achieve your goal with:
df2
  .join(
    df1.select($"ID", y_list.foldLeft(lit(0))(_ + _).as("Result")),
    Seq("ID")
  )
import org.apache.spark.sql.functions._
val events = Seq(
  (1,1,2,3,4),
  (2,1,2,3,4),
  (3,1,2,3,4),
  (4,1,2,3,4),
  (5,1,2,3,4)
).toDF("ID", "amt1", "amt2", "amt3", "amt4")

var prev_amt5 = 0
var i = 1

def getamt5value(ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int): Int = {
  if (i == 1) {
    i = i + 1
    prev_amt5 = 0
  } else {
    i = i + 1
  }
  if (ID == 0) {
    if (amt1 == 0) {
      val cur_amt5 = 1
      prev_amt5 = cur_amt5
      cur_amt5
    } else {
      val cur_amt5 = 1 * (amt2 + amt3)
      prev_amt5 = cur_amt5
      cur_amt5
    }
  } else if (amt4 == 0 || (prev_amt5 == 0 & amt1 == 0)) {
    val cur_amt5 = 0
    prev_amt5 = cur_amt5
    cur_amt5
  } else {
    val cur_amt5 = prev_amt5 + amt2 + amt3 + amt4
    prev_amt5 = cur_amt5
    cur_amt5
  }
}

val getamt5 = udf { (ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int) =>
  getamt5value(ID, amt1, amt2, amt3, amt4)
}
events.withColumn("amnt5", getamt5(events.col("ID"), events.col("amt1"), events.col("amt2"), events.col("amt3"), events.col("amt4"))).show()

Taking value from one dataframe and passing that value into loop of SqlContext

I'm looking to do something like this:
I have a dataframe that is one column of IDs, called ID_LIST. I would like to loop through ID_LIST with foreach, pass each ID into a Spark SQL call, and return the results to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First: I'm not sure how to pull the individual value out; I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later?
Thanks, any help is appreciated!
Answer To First Question
When you perform the "foreach" Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints the Row, the first row being "[123]". It is boxing [] the elements in the row. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc... try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously, read the documentation on Row in Spark. This will transform your code to:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
items.foreach(i => println(i.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe which we will union everything to, looping over the distinct items from the first dataframe.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
.add("col1", StringType, true)
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)
// Loop over, select, and union to the empty df.
// Collect the ids to the driver first so sqlContext can be called inside the loop.
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()
