Calling UDF on Dataframe with Serialization Issue - apache-spark

I was looking at some blog examples of UDFs that appear to work, but when I run them they give the infamous "Task not serializable" error.
I find it strange that this was published with no mention of the problem. I'm running Spark 2.4.
The code is pretty straightforward; has something changed in Spark?
import org.apache.spark.sql.functions.{col, udf}

def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

val df = sc.parallelize(Seq(
  ("r1 ", 1, 1, 3, -2),
  ("r 2", 6, 4, -2, -2),
  ("r 3", 4, 1, 1, 0),
  ("r4", 1, 2, 4, 5)
)).toDF("ID", "a", "b", "c", "d")

df.select(lowerRemoveAllWhitespaceUDF(col("ID"))).show(false)
returns:
org.apache.spark.SparkException: Task not serializable
The example is from this blog, which I find good: https://medium.com/#mrpowers/spark-user-defined-functions-udfs-6c849e39443b
Has something changed?
I also looked at the top-voted answer here, using an Object that extends Serializable, but no joy either. Puzzled.
EDIT
Things seem to have changed; this format is needed:
val squared = udf((s: Long) => s * s)
I am still interested in why the Object approach failed.
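For reference, a minimal usage sketch of the lambda-style UDF (the nums DataFrame and column names below are just illustrative):
// Hypothetical usage: apply the lambda-style UDF to a Long column.
val nums = Seq(1L, 2L, 3L).toDF("n")
nums.select(squared(col("n")).as("n_squared")).show(false)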

I couldn't reproduce the error (tried on Spark 1.6, 2.3, and 2.4), but I do remember facing this kind of error a long time ago, so I'll put in my best guess.
The problem happens because of the difference between a method and a function in Scala, as described in detail here.
The short version is that when you write def, it is equivalent to a method in Java, i.e. part of a class, and it can be invoked via an instance of that class.
When you write udf((s: Long) => s * s) it creates an instance of the trait Function1. For this to happen an anonymous class implementing Function1 is generated, whose apply method is something like def apply(s: Long): Long = { s * s }, and an instance of this class is passed as the parameter to udf.
However, when you write udf[String, String](lowerRemoveAllWhitespace), the method lowerRemoveAllWhitespace needs to be converted to a Function1 instance and passed to udf. This is where the serialization fails: the apply method on this instance will try to invoke lowerRemoveAllWhitespace on an instance of another object (which cannot be serialized and sent to the worker JVM process), causing the exception.
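To make the capture concrete, here is a rough sketch of the idea (the enclosing class below is hypothetical; in practice the culprit is usually the REPL line object or whatever driver-side class holds the def):
import org.apache.spark.sql.functions.udf

// Hypothetical stand-in for the (non-serializable) scope that owns the method.
class NotSerializableHolder {
  def lowerRemoveAllWhitespace(s: String): String =
    s.toLowerCase().replaceAll("\\s", "")

  // Eta-expansion: this Function1 closes over `this`, so serializing it
  // drags the whole NotSerializableHolder instance along with it.
  val fromMethod: String => String = lowerRemoveAllWhitespace _
}

// A self-contained function value captures nothing from a non-serializable scope.
object WhitespaceFunctions extends Serializable {
  val lowerRemoveAllWhitespace: String => String =
    s => s.toLowerCase().replaceAll("\\s", "")
}

// Passing the function value (or a plain lambda) to udf avoids the problematic capture.
val cleanUDF = udf(WhitespaceFunctions.lowerRemoveAllWhitespace)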

The example that was posted was from a reputable source, but I cannot get it to run without a serialization error in Spark 2.4; trying Objects etc. did not help either.
I solved the issue using the udf((...)) approach, which seems to allow only a single expression, but that was enough and, voila, no serialization error. Here is a slightly different example using primitives.
val sumContributionsPlus = udf((n1: Int, n2: Int, n3: Int, n4: Int) => Seq(n1,n2,n3,n4).foldLeft(0)( (acc, a) => if (a > 0) acc + a else acc))
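A minimal sketch of applying it, assuming the df with integer columns a, b, c and d from the question:
// Hypothetical usage: sum only the positive contributions of the four columns.
df.withColumn("posSum", sumContributionsPlus(col("a"), col("b"), col("c"), col("d"))).show(false)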
On a final note, the whole discussion around UDFs, Spark native functions and Column functions becomes confusing when published examples no longer appear to work.

Related

In C# (.Net for Spark), how to use When() method as a condition to add new column to a DataFrame?

I have some experience with pyspark. Our team is migrating a Spark project from Python to C# (.NET for Spark), and I'm running into problems.
Suppose we have got a Spark dataframe df with an existing column as col1.
In pyspark, I could do something like:
df = df.withColumn('new_col_name', when((df.col1 <= 5), lit('Group A')) \
        .when((df.col1 > 5) & (df.col1 <= 8), lit('Group B')) \
        .when((df.col1 > 8), lit('Group C')))
The question is how to do the equivalent in C#?
I've tried many things but keep getting exceptions when using the When() method.
For example, the following code would generate the exception:
df = df.WithColumn("new_col_name", df.Col("col1").When(df.Col("col1").EqualTo(3), Functions.Lit("Group A")));
Exception:
[MD2V4P4C] [Error] [JvmBridge] java.lang.IllegalArgumentException: when() can only be applied on a Column previously generated by when() function
I searched around and didn't find many examples for .NET for Spark. Any help would be much appreciated.
I think the problem is that the first call needs to be the When function that isn't a member function on the Column object (the static one on Functions), and then every call after that in the chain uses the version of When that is a member function, so:
var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Range(100);
df.WithColumn("new_col_name",
    When(Col("Id").Lt(5), Lit("Group A"))
        .When(Col("Id").Between(5, 8), Lit("Group B"))
        .When(Col("Id").Gt(8), Lit("Group C"))
).Show();
You can also use >, >=, <, etc. like this, but personally I prefer the more explicit version above:
df.WithColumn("new_col_name",
    When(Col("Id") < 5, Lit("Group A"))
        .When(Col("Id") >= 5 & Col("Id") <= 8, Lit("Group B"))
        .When(Col("Id") > 8, Lit("Group C"))
).Show();
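For comparison, the same chaining rule applies in the Scala API used elsewhere on this page: the first when comes from org.apache.spark.sql.functions, and subsequent ones are Column methods. A rough sketch, assuming a SparkSession named spark as in spark-shell:
import org.apache.spark.sql.functions.{col, lit, when}

// Rough Scala equivalent of the C# chain above.
spark.range(100).toDF("Id")
  .withColumn("new_col_name",
    when(col("Id") < 5, lit("Group A"))
      .when(col("Id") >= 5 && col("Id") <= 8, lit("Group B"))
      .when(col("Id") > 8, lit("Group C")))
  .show()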

Scala interning: how does different initialisation affect comparison?

I am new to Scala but I know Java. As far as I understand, the difference is that == in Scala acts like .equals in Java, meaning it compares values, while eq in Scala acts like == in Java, meaning it compares reference identity rather than value.
However, after running the code below:
val greet_one_v1 = "Hello"
val greet_two_v1 = "Hello"
println(
  (greet_one_v1 == greet_two_v1),
  (greet_one_v1 eq greet_two_v1)
)

val greet_one_v2 = new String("Hello")
val greet_two_v2 = new String("Hello")
println(
  (greet_one_v2 == greet_two_v2),
  (greet_one_v2 eq greet_two_v2)
)
I get the following output:
(true,true)
(true,false)
My theory is that the initialisation of these strings differs. Hence, how is val greet_one_v1 = "Hello" different from val greet_one_v2 = new String("Hello")? Or, if my theory is incorrect, why do I have different outputs?
As correctly answered by Luis Miguel Mejía Suárez, the answer lies in string interning, which the JVM (Java Virtual Machine) performs automatically for string literals. To get a distinct String object it has to be instantiated explicitly with new, as in my example above; otherwise the JVM reuses the same interned object for identical literal values as an optimisation.
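A small sketch of my own that makes the interning visible, using the JVM's String.intern:
// Literals share one interned instance; new String always allocates a fresh object.
val literal = "Hello"
val allocated = new String("Hello")

println(literal eq allocated)          // false: different objects
println(literal eq allocated.intern()) // true: intern() returns the shared, interned instance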

Check Spark Dataframe row has ANY column meeting a condition and stop when first such column found

The following code can be used to filter out rows that contain a value of 1 in any of the non-ID columns. Imagine there are a lot of columns.
import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop as soon as the first such condition is met, i.e. at the first matching column. OK, we all know that.
But I cannot find an elegant way to achieve this without using a UDF or very specific logic; the map will process all columns.
Could a fold(Left) therefore be used that terminates when the first occurrence is found, or is there some other approach? Maybe I am overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark does not do this:
df
  .withColumn("ones", df.columns.tail.map(x => when(col(x) === 1, true)
    .otherwise(false)).reduceLeft(_ or _))
  .where(!$"ones")
  .show()
But I'm not sure whether Spark supports short-circuiting here; I think it does not (https://issues.apache.org/jira/browse/SPARK-18712).
So, alternatively, you can apply a custom function to your rows using the lazily evaluated exists on Scala's Seq:
df
  .map{ r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Int] == 1)) }
  .toDF("ID", "ones")
  .show()
This approach is similar to a UDF, so I'm not sure whether that's what you will accept.
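If you want to keep the original rows rather than add a boolean flag, the same row-level check can be used as a filter (a sketch of my own; exists still stops at the first matching column):
// Hypothetical variant: drop rows where any non-ID column equals 1, keeping the original schema.
df
  .filter(r => !r.toSeq.tail.exists(c => c.asInstanceOf[Int] == 1))
  .show()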

Not able to convert from RDD to Dataset successfully at runtime [duplicate]

This question already has answers here:
How to convert a simple DataFrame to a DataSet Spark Scala with case class?
(2 answers)
Closed 4 years ago.
I am trying to run this small Spark program. Spark Version 2.1.1
val rdd = sc.parallelize(List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt")))
import spark.implicits._
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails] // Error Line
carDetails.map(car => {
  val name = if (car.name == "Tesla") "S" else car.name
  CarDetails(car.year, name, car.model)
}).collect().foreach(print)
It is throwing error on this line:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails]
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`year`' given input columns: [_1, _2, _3];
There is no compilation error!
I tried many changes, such as using a List instead of an RDD, and also tried converting to a DS first and then calling as[CarDetails], but it didn't work. Now I am clueless.
Why is it taking the columns as _1, _2 and _3 when I have already defined the case class?
case class CarDetails(year: Int, name: String, model: String)
I tried to change from Int to Long for year in case class. It still did not work.
Edit:
I changed this line after referring to the probable duplicate question, and it worked:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd)
  .withColumnRenamed("_1", "year")
  .withColumnRenamed("_2", "name")
  .withColumnRenamed("_3", "model")
  .as[CarDetails]
But, I am still not clear as to why I need to rename the columns even after explicitly mapping to a case class.
The rules of as conversion are explained in detail in the API docs:
The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
To explain this with code: conversion from a case class to a Tuple* is valid (fields are matched structurally):
Seq(CarDetails(2012, "Tesla", "S")).toDF.as[(Int, String, String)]
but conversion from a Tuple* to an arbitrary case class is not (fields are matched by name). You have to rename the fields first:
Seq((2012, "Tesla", "S")).toDF("year", "name", "model").as[CarDetails]
It has quite interesting practical implications:
A tuple-typed Dataset cannot contain extraneous fields:
case class CarDetailsWithColor(
  year: Int, name: String, model: String, color: String)

Seq(
  CarDetailsWithColor(2012, "Tesla", "S", "red")
).toDF.as[(Int, String, String)]
// org.apache.spark.sql.AnalysisException: Try to map struct<year:int,name:string,model:string,color:string> to Tuple3, but failed as the number of fields does not line up.;
While case class typed Dataset can:
Seq(
  (2012, "Tesla", "S", "red")
).toDF("year", "name", "model", "color").as[CarDetails]
Of course, starting with case class typed variant would save you all the trouble:
sc.parallelize(Seq(CarDetails(2012, "Tesla", "S"))).toDS
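If the data really does start life as tuples, another option (my own sketch) is to map it into the case class before creating the Dataset, so no renaming is needed:
// Hypothetical sketch: convert the tuple RDD from the question into case class instances first.
val carDS = sc.parallelize(List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt")))
  .map { case (year, name, model) => CarDetails(year, name, model) }
  .toDS()
carDS.show()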

Can a DataFrame be converted to Dataset of a case class if a column name contains a space?

I have a Spark DataFrame where a column name contains a space. Is it possible to convert these rows into case classes?
For example, if I do this:
val data = Seq(1, 2, 3).toDF("a number")
case class Record(`a number`: Int)
data.as[Record]
I get this exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`a$u0020number`' given input columns: [a number];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
...
Is there any way to do this?
(Of course I can work around this by renaming the column before converting to a case class. I was hoping to have the case class match the input schema exactly.)
Can you try this solution? It worked for me without changing the column name.
import sqlContext.implicits._
case class Record(`a number`: Int)
val data = Seq(1, 2, 3)
val recDF = data.map(x => Record(x)).toDF()
recDF.collect().foreach(println)
[1]
[2]
[3]
I'm using Spark 1.6.0. The only part of your code that doesn't work for me is the part where you're setting up your test data. I have to use a sequence of tuples instead of a sequence of integers:
case class Record(`a number`:Int)
val data = Seq(Tuple1(1),Tuple1(2),Tuple1(3)).toDF("a number")
data.as[Record]
// returns org.apache.spark.sql.Dataset[Record] = [a$u0020number: int]
If you need a Dataframe instead of a Dataset you can always use another toDF:
data.as[Record].toDF
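For completeness, a sketch of the rename workaround mentioned in the question (RecordRenamed and a_number are my own, hypothetical names; this assumes spark.implicits._ is in scope):
// Hypothetical workaround: rename the offending column, use an ordinary field name in the
// case class, and rename back on the way out if the original schema is needed.
case class RecordRenamed(a_number: Int)

val spaced = Seq(1, 2, 3).toDF("a number")
val ds = spaced.withColumnRenamed("a number", "a_number").as[RecordRenamed]
ds.toDF().withColumnRenamed("a_number", "a number").show()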
