Spark UDAF: java.lang.InternalError: Malformed class name - apache-spark

I am using Spark 1.5.0 from the CDH 5.5.2 distro. I switched to Scala 2.10.5 from 2.10.4. I am using the following code for a UDAF. Is this somehow a String vs. UTF8String issue? If so, any help will be greatly appreciated.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String
import scala.collection.mutable.ArrayBuffer

object GroupConcat extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)
  def bufferSchema = new StructType().add("buff", ArrayType(StringType))
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, ArrayBuffer.empty[String])
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0))
      buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  def evaluate(buffer: Row) = UTF8String.fromString(
    buffer.getSeq[String](0).mkString(","))
}
However, I get this error message at runtime:
Exception in thread "main" java.lang.InternalError: Malformed class name
at java.lang.Class.getSimpleName(Class.java:1190)
at org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:464)
at java.lang.String.valueOf(String.java:2847)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at scala.StringContext.standardInterpolator(StringContext.scala:122)
at scala.StringContext.s(StringContext.scala:90)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression2.toString(interfaces.scala:96)
at org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:174)
at org.apache.spark.sql.GroupedData$$anonfun$1.apply(GroupedData.scala:86)
at org.apache.spark.sql.GroupedData$$anonfun$1.apply(GroupedData.scala:80)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:80)
at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:227)

I received the same exception because my object that extends UserDefinedAggregateFunction was defined inside another function.
Change this:
object Driver {
  def main(args: Array[String]) {
    object GroupConcat extends UserDefinedAggregateFunction {
      ...
    }
  }
}
To this:
object Driver {
  def main(args: Array[String]) {
    ...
  }

  object GroupConcat extends UserDefinedAggregateFunction {
    ...
  }
}
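With GroupConcat defined as a top-level object, it can be passed straight to agg. A minimal usage sketch, assuming a DataFrame df with a grouping column "k" and a string column "v" (both column names are hypothetical):

  // GroupConcat(...) returns a Column that can be used like any built-in aggregate
  val result = df.groupBy("k").agg(GroupConcat(df("v")).as("concatenated"))
  result.show()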

I ran into this as a conflict with packages I was importing. If you are importing anything, try testing in a spark-shell with nothing imported.
When you define your UDAF, check what the returned name looks like. It should be something like
FractionOfDayCoverage: org.apache.spark.sql.expressions.UserDefinedAggregateFunction{def dataType: org.apache.spark.sql.types.DoubleType.type; def evaluate(buffer: org.apache.spark.sql.Row): Double} = $anon$1@27506b6d
That $anon$1@27506b6d at the end is a reasonable name that will work. When I imported the conflicting package, the returned name was three times as long. Here is an example:
$$$$bec6d1991b88c272b3efac29d720f546$$$$anon$1@6886601d
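For context, the InternalError itself is thrown by java.lang.Class.getSimpleName, which on older JVMs cannot parse Scala-mangled names of that longer shape. If you need a readable name for such a class in your own code, a minimal workaround sketch (it does not patch Spark, it only avoids getSimpleName):

  // getSimpleName can throw java.lang.InternalError("Malformed class name")
  // on heavily mangled Scala class names; fall back to parsing getName instead.
  def safeSimpleName(cls: Class[_]): String =
    try cls.getSimpleName
    catch {
      case _: InternalError =>
        cls.getName.split('.').last.split('$').filter(_.nonEmpty).lastOption.getOrElse(cls.getName)
    }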

Related

Error java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer

I am connecting Spark Streaming to an external source (MS SQL Server) and publishing the table data to Kafka. I am getting a
java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
error. Please find the details below.
**CustomReceiver.scala**
package com.sparkdemo.app

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import java.util.List
import java.util.Map
import com.sparkdemo.entity.Inventory
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}
import java.net.ConnectException
import scala.util.{Try, Success, Failure}
import collection.JavaConversions._

class CustomReceiver(topic: String, kafkaParams: Map[String, Object]) extends Receiver[Inventory](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    val dataService = new DataService()
    var records: Inventory = dataService.selectAll()
    new Thread("Socket Receiver") {
      override def run() {
        Try {
          val consumer = new KafkaConsumer(kafkaParams)
          consumer.subscribe(topic)
          while (!isStopped && records != null) {
            // store(tokenData)
            // tokenData = new DataService().selectAll();
            val records = new DataService().selectAll();
            store(records)
          }
        } match {
          case e: ConnectException =>
            restart("Error connecting to...", e)
          case t: Throwable =>
            restart("Error receiving data", t)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    println("Nothing")
  }
}
**DataService.scala**
package com.sparkdemo.app;

import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.Statement
import java.util._
import scala.collection.JavaConversions._
import com.sparkdemo.entity.Inventory

class DataService {

  var connect: Connection = DriverManager.getConnection(
    "jdbc:sqlserver://localhost;databaseName=TestDB;user=SA;password=Sqlserver#007")
  var statement: Statement = connect.createStatement()
  var resultSet: ResultSet = null
  var inv: Inventory = new Inventory()

  Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

  def selectAll(): Inventory = {
    resultSet = statement.executeQuery("select * from Inventory")
    while (resultSet.next()) {
      val inv: Inventory = new Inventory()
      inv.setId(resultSet.getInt("id"))
      inv.setName(resultSet.getString("name"))
      inv.setQuantity(resultSet.getInt("quantity"))
    }
    inv
  }
}
Scala main class **Stream.scala**
package com.sparkdemo.app

import org.apache.spark.streaming.dstream.DStream
import com.sparkdemo.config.Config
import org.apache.kafka.common.serialization.{ StringDeserializer, StringSerializer }
import org.apache.kafka.clients.producer.{ ProducerRecord, KafkaProducer }
import java.util.Properties
import collection.JavaConversions._
import com.sparkdemo.entity.Inventory

object Stream extends Serializable {

  def main(args: Array[String]) {
    import org.apache.spark.streaming._

    def getKafkaParams: Map[String, Object] = {
      Map[String, Object](
        "auto.offset.reset" -> "earliest",
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "group3")
    }

    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val topic1 = "topic1"
    val topic2 = "topic2"
    val producer: KafkaProducer[String, Object] = new KafkaProducer(properties)

    val ssc = Config.configReceiver()
    val stream = ssc.receiverStream(new CustomReceiver(topic1, getKafkaParams))
    stream.map(Inventory => producer.send(new ProducerRecord[String, Object](topic2, Inventory)))
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Entity class: **Inventory.scala**
package com.sparkdemo.entity

import scala.beans.{BeanProperty}

class Inventory extends Serializable {
  @BeanProperty
  var id: Int = _
  @BeanProperty
  var name: String = _
  @BeanProperty
  var quantity: Int = _
}
Error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:546)
at com.sparkdemo.app.Stream$.main(Stream.scala:36)
at com.sparkdemo.app.Stream.main(Stream.scala)
Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.producer.KafkaProducer, value: org.apache.kafka.clients.producer.KafkaProducer@557286ad)
- field (class: com.sparkdemo.app.Stream$$anonfun$main$1, name: producer$1, type: class org.apache.kafka.clients.producer.KafkaProducer)
- object (class com.sparkdemo.app.Stream$$anonfun$main$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 12 more
You have run into an issue where the KafkaProducer is unintentionally shipped to the executors because of the code below:
  stream.map(Inventory => producer.send(new ProducerRecord[String, Object](topic2, Inventory)))
You can use mapPartitions and create the producer inside it, so that the driver-side producer is not shipped to the executors.
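A minimal sketch of that pattern, written with foreachRDD plus foreachPartition rather than mapPartitions since the stream is only being published, not transformed further. The broker address, topic, and serializer settings are taken from the question; serializing the entity via toString is an assumption:

  // Create one producer per partition on the executor instead of closing over
  // the driver-side producer (KafkaProducer is not serializable).
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { inventories =>
      val props = new java.util.Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      inventories.foreach { inventory =>
        // turn the entity into a String (e.g. JSON) so StringSerializer can handle it
        producer.send(new ProducerRecord[String, String](topic2, inventory.toString))
      }
      producer.close()
    }
  }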
The problem is the type of serializer you are using for the Object-typed value:
  properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Please write a serializer for the Object-typed values. You can refer to the link below:
Send Custom Java Objects to Kafka Topic
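A minimal sketch of such a serializer, assuming plain Java serialization is acceptable for the Inventory entity (the class name InventorySerializer is hypothetical):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}
  import java.util.{Map => JMap}
  import org.apache.kafka.common.serialization.Serializer
  import com.sparkdemo.entity.Inventory

  // Serializes the Inventory entity with Java serialization; Inventory
  // already extends Serializable, so no further changes are needed.
  class InventorySerializer extends Serializer[Inventory] {
    override def configure(configs: JMap[String, _], isKey: Boolean): Unit = ()
    override def serialize(topic: String, data: Inventory): Array[Byte] = {
      if (data == null) return null
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(data)
      out.close()
      bytes.toByteArray
    }
    override def close(): Unit = ()
  }

The producer would then be declared as KafkaProducer[String, Inventory] and value.serializer set to the fully qualified name of this class.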

Spark Scala custom UnaryTransformer in a pipeline fails on read of persisted pipeline model

I have developed this simple LogTransformer by extending UnaryTransformer to apply a log transformation to the "age" column in the DataFrame. I am able to apply this transformer, include it as a pipeline stage, and persist the pipeline model after training.
class LogTransformer(override val uid: String) extends UnaryTransformer[Int, Double, LogTransformer] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("logTransformer"))

  override protected def createTransformFunc: Int => Double = (feature: Int) => { Math.log10(feature) }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == DataTypes.IntegerType, s"Input type must be integer type but got $inputType.")
  }

  override protected def outputDataType: DataType = DataTypes.DoubleType

  override def copy(extra: ParamMap): LogTransformer = defaultCopy(extra)
}

object LogTransformer extends DefaultParamsReadable[LogTransformer]
But when I read the persisted model I get the following exception.
val MODEL_PATH = "model/census_pipeline_model"
cvModel.bestModel.asInstanceOf[PipelineModel].write.overwrite.save(MODEL_PATH)
val same_pipeline_model = PipelineModel.load(MODEL_PATH)
exception in thread "main" java.lang.NoSuchMethodException: dsml.Census$LogTransformer$2.read()
at java.lang.Class.getMethod(Class.java:1786)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
at dsml.Census$.main(Census.scala:572)
at dsml.Census.main(Census.scala)
Any pointers on how to fix that would be helpful. Thank You.

Get "java.lang.NoClassDefFoundError" when running spark project with spark-submit on multiple machines

I'm a beginner with Scala/Spark and I got stuck when shipping my code to the official environment.
To be short, I can't put my SparkSession object in a class method and I don't know why. If I do, it runs fine on a single local machine, but throws java.lang.NoClassDefFoundError: Could not initialize class XXX when I package my code into a single jar file and run it on multiple machines using spark-submit.
For example, when I put my code in a structure like this:
object Main {
  def main(...) {
    Task.start
  }
}

object Task {
  case class Data(name: String, ...)

  val spark = SparkSession.builder().appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile(path)
    ds.map(someMethod) // it dies here!
  }

  def loadFile(path: String) = {
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
It gives me java.lang.NoClassDefFoundError in every place where I use a self-defined method inside those dataset transformation functions (like map, filter, etc.).
However, if I rewrite it as:
object Task {
  case class Data(name: String, ...)

  def start() {
    val spark = SparkSession.builder().appName("Task").getOrCreate()
    import spark.implicits._
    var ds = loadFile(spark, path)
    ds.map(someMethod) // it works!
  }

  def loadFile(spark: SparkSession, path: String) = {
    import spark.implicits._
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
It works, but it means that I need to pass the spark variable through every method that needs it, and I need to write import spark.implicits._ every time a method needs it.
I think something goes wrong when Spark tries to shuffle my object between nodes, but I don't know exactly what the reason is or what the correct way to write my code would be.
Thanks
No, you don't need to pass the SparkSession object and import implicits in every method that needs them. You can make the SparkSession a member of the object, outside any function, and use it in all the functions.
Below is a modified example of your code:
import org.apache.spark.sql.{Dataset, SparkSession}

object Main {
  def main(args: Array[String]): Unit = {
    Task.start()
  }
}

object Task {
  case class Data(fname: String, lname: String)

  val spark = SparkSession.builder().master("local").appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile("person.json")
    ds.map(someMethod).show()
  }

  def loadFile(path: String): Dataset[Data] = {
    spark.read.json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.fname
  }
}
Hope this helps!

Spark Custom AccumulatorV2

I have defined a Custom Accumulator as:
import org.apache.spark.util.LongAccumulator

class CustomAccumulator extends LongAccumulator with java.io.Serializable {
  override def add(v: Long): Unit = {
    super.add(v)
    if (v % 100 == 0) println(v)
  }
}
And registered it as:
val cusAcc = new CustomAccumulator
sc.register(cusAcc, "customAccumulator")
My issue is that when I try to use it as:
val count = sc.customAccumulator
I get the following error:
<console>:51: error: value customAccumulator is not a member of org.apache.spark.SparkContext
val count = sc.customAccumulator
I am new to Spark and Scala, and may be missing something very trivial. Any guidance will be greatly appreciated.
According to the Spark API,
AccumulatorV2 is not accessed through org.apache.spark.SparkContext; it lives in the org.apache.spark.util package.
Since Spark 2.0.0 you should use the register method of the abstract class AccumulatorV2
(org.apache.spark.util.AccumulatorV2#register).
Something like this:
  cusAcc.register(sc, scala.Option("customAccumulator"), false)
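Either way, the accumulator is then used through the variable you created and registered, not through a member of SparkContext (which is what the compile error above is complaining about). A minimal usage sketch, with an illustrative RDD:

  val cusAcc = new CustomAccumulator
  sc.register(cusAcc, "customAccumulator")

  // Reference the accumulator variable directly inside the action;
  // there is no sc.customAccumulator member to call.
  sc.parallelize(1L to 1000L).foreach(v => cusAcc.add(v))
  println(cusAcc.value)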

How to refer to the outer class this?

I've been searching the official Groovy documentation for how to make a call like
MyOuterClass.this
inside a nested class MyInnerClass, but it doesn't seem to cover this, and I did not find anything by Googling either.
So, let's say I have this code:
class MyOuterClass {
  class MyInnerClass {
  }
}
How can I refer to the this pointer of MyOuterClass inside a method of MyInnerClass?
Here is an attempt:
public class Outer {
  def sayHello() { println "Hello !" }

  public class Inner {
    def tellHello() {
      Outer.this.sayHello()
    }
  }
}
def objOuter = new Outer()
def objInner = new Outer.Inner()
objInner.tellHello()
and here is the error stack trace:
java.lang.NullPointerException: Cannot invoke method sayHello() on null object
at Outer$Inner.tellHello(inner_outer.groovy:5)
at Outer$Inner$tellHello.call(Unknown Source)
at inner_outer.run(inner_outer.groovy:12)
(I am using Groovy 2.4.5.)
The only problem is that you're not passing the outer object to your new Inner class statement. Use this:
def objOuter = new Outer()
def objInner = new Outer.Inner(objOuter)
Instead of:
def objOuter = new Outer()
def objInner = new Outer.Inner()
and your code will work.
Hope this helps!
