I have defined a Custom Accumulator as:
import org.apache.spark.util.LongAccumulator
class CustomAccumulator extends LongAccumulator with java.io.Serializable {
  override def add(v: Long): Unit = {
    super.add(v)
    if (v % 100 == 0) println(v)
  }
}
And registered it as:
val cusAcc = new CustomAccumulator
sc.register(cusAcc, "customAccumulator")
My issue is that when I try to use it as:
val count = sc.customAccumulator
I get the following error:
<console>:51: error: value customAccumulator is not a member of org.apache.spark.SparkContext
val count = sc.customAccumulator
I am new to Spark and Scala, and may be missing something very trivial. Any guidance will be greatly appreciated.
According to the Spark API, accumulators are no longer members of org.apache.spark.SparkContext; the accumulator classes have been moved to org.apache.spark.util. Since Spark 2.0.0 you should use the register method of the abstract class AccumulatorV2: org.apache.spark.util.AccumulatorV2#register.
Something like this:
cusAcc.register(sc, scala.Option("customAccumulator"), false);
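Note that once the accumulator is registered with the SparkContext, you read it back through the variable you created (cusAcc.value); it never becomes a member of sc. A minimal sketch of that flow, with a placeholder RDD standing in for your real data:

val cusAcc = new CustomAccumulator
sc.register(cusAcc, "customAccumulator")

// update the accumulator from inside an action running on the executors
sc.parallelize(1L to 1000L).foreach(v => cusAcc.add(v))

// read the merged result back on the driver
val count = cusAcc.value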
I would like to use the Dataset.map function to transform the rows of my dataset. The sample looks like this:
val result = testRepository.readTable(db, tableName)
.map(testInstance.doSomeOperation)
.count()
where testInstance is an instance of a class that extends java.io.Serializable, but testRepository does not extend it. The code throws the following error:
Job aborted due to stage failure.
Caused by: NotSerializableException: TestRepository
Question
I understand why testInstance.doSomeOperation needs to be serializable, since it's inside the map and will be distributed to the Spark workers. But why does testRepository need to be serialized? I don't see why that is necessary for the map. Changing the definition to class TestRepository extends java.io.Serializable solves the issue, but that is not desirable in the larger context of the project.
Is there a way to make this work without making TestRepository serializable, or why is it required to be serializable?
Minimal working example
Here's a full example with the code from both classes that reproduces the NotSerializableException:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class MyTableSchema(id: String, key: String, value: Double)

val db = "temp_autodelete"
val tableName = "serialization_test"

class TestRepository {
  def readTable(database: String, tableName: String): Dataset[MyTableSchema] = {
    spark.table(f"$database.$tableName")
      .as[MyTableSchema]
  }
}

val testRepository = new TestRepository()

class TestClass() extends java.io.Serializable {
  def doSomeOperation(row: MyTableSchema): MyTableSchema = {
    row
  }
}

val testInstance = new TestClass()

val result = testRepository.readTable(db, tableName)
  .map(testInstance.doSomeOperation)
  .count()
The reason is that your map operation is reading from something that already takes place on the executors.
If you look at your pipeline:
val result = testRepository.readTable(db, tableName)
  .map(testInstance.doSomeOperation)
  .count()
The first thing you do is testRepository.readTable(db, tableName). If we look inside the readTable method, we see that it performs a spark.table operation. The API docs give the following signature for this method:
def table(tableName: String): DataFrame
This is not an operation that takes place solely on the driver (imagine reading a file of >1 TB only on the driver), and it creates a DataFrame, which is itself a distributed dataset. That means that the testRepository.readTable(db, tableName) function needs to be distributed, and so your testRepository object needs to be distributed.
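If making TestRepository serializable really is off the table, one pattern that is sometimes used is to keep the row-level logic on a small standalone serializable object, so the closure passed to map only references that object. This is just a sketch (names like RowOps are illustrative, not from the question), and whether it avoids the exception depends on how and where the classes are defined; notebook cells, for example, can capture their enclosing context:

// Sketch: the repository builds the Dataset with a driver-side call, and the
// function shipped to the executors lives on a small serializable object.
object RowOps extends Serializable {
  def doSomeOperation(row: MyTableSchema): MyTableSchema = row
}

val ds = testRepository.readTable(db, tableName) // driver-side call returning a distributed Dataset
val result = ds.map(RowOps.doSomeOperation).count()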
Hope this helps you!
I don't understand why the lines are crossed out, or why there is an error where the dot is.
package com.ggenius.whattowearkotlin.data.network

import android.content.Context
import android.net.ConnectivityManager
import com.ggenius.whattowearkotlin.internal.NoConnectivityException
import okhttp3.Interceptor
import okhttp3.Response

class ConnectivityInterceptorImpl(
    context: Context?
) : ConnectivityInterceptor {

    private val appContext = context.applicationContext

    override fun intercept(chain: Interceptor.Chain): Response {
        if (!isOnline())
            throw NoConnectivityException()
        return chain.proceed(chain.request())
    }

    private fun isOnline(): Boolean {
        val connectivityManager = appContext.getSystemService(Context.CONNECTIVITY_SERVICE)
            as ConnectivityManager
        val networkInfo = connectivityManager.activeNetworkInfo
        return networkInfo != null && networkInfo.isConnected
    }
}
Check screenshot
This line is crossed out
val networkInfo = connectivityManager.activeNetworkInfo
Because the getActiveNetworkInfo() method is deprecated, as documented in the API reference:
https://developer.android.com/reference/android/net/ConnectivityManager#getActiveNetworkInfo()
Why are the lines crossed out?
The lines are crossed out because they are deprecated. In Android Studio, usages of deprecated functions, classes, variables, etc. are styled with strikethrough by default; that is simply the default editor setting. Here is a screenshot.
Why is there an error where the dot is?
context.applicationContext shows an error because your context is nullable. Whenever you declare a variable with a type followed by ?, it is nullable in Kotlin.
Since you pass context: Context?, context is nullable. To access a nullable object's properties or methods you need to use the safe-call operator (objectName followed by ?.). This prevents a NullPointerException and is what makes Kotlin null safe.
In your example you need to do:
private val appContext = context?.applicationContext
// Note that appContext will also be nullable.
Is there any easy way or any standard library method to convert a Kotlin data class object to a map/dictionary of its properties by property names? Can reflection be avoided?
I was using the Jackson approach, but it turns out its performance is terrible on Android for the first serialization (GitHub issue here), and it's dramatically worse on older Android versions (see benchmarks here).
But you can do this much faster with Gson. Conversion in both directions is shown here:
import com.google.gson.Gson
import com.google.gson.reflect.TypeToken

val gson = Gson()

// convert a data class to a map
fun <T> T.serializeToMap(): Map<String, Any> {
    return convert()
}

// convert a map to a data class
inline fun <reified T> Map<String, Any>.toDataClass(): T {
    return convert()
}

// convert an object of type I to type O
inline fun <I, reified O> I.convert(): O {
    val json = gson.toJson(this)
    return gson.fromJson(json, object : TypeToken<O>() {}.type)
}

// example usage
data class Person(val name: String, val age: Int)

fun main() {
    val person = Person("Tom Hanley", 99)

    val map = mapOf(
        "name" to "Tom Hanley",
        "age" to 99
    )

    val personAsMap: Map<String, Any> = person.serializeToMap()
    val mapAsPerson: Person = map.toDataClass()
}
This extension function uses reflection, but maybe it'll help someone like me coming across this in the future:
import kotlin.reflect.full.memberProperties

inline fun <reified T : Any> T.asMap(): Map<String, Any?> {
    val props = T::class.memberProperties.associateBy { it.name }
    return props.keys.associateWith { props[it]?.get(this) }
}
I had the same use case today, for testing, and ended up using the Jackson object mapper to convert a Kotlin data class into a Map. Runtime performance is not a big concern in my case. I haven't checked in detail, but I believe it uses reflection under the hood; that doesn't concern me since it happens behind the scenes.
For Example,
val dataclass = DataClass(p1 = 1, p2 = 2)
val dataclassAsMap = objectMapper.convertValue(dataclass, object : TypeReference<Map<String, Any>>() {})
// expect dataclassAsMap == mapOf("p1" to 1, "p2" to 2)
kotlinx.serialization has an experimental Properties format that makes it very simple to convert Kotlin classes into maps and vice versa:
@ExperimentalSerializationApi
@kotlinx.serialization.Serializable
data class Category constructor(
    val id: Int,
    val name: String,
    val icon: String,
    val numItems: Long
) {
    // the map representation of this class
    val asMap: Map<String, Any> by lazy { Properties.encodeToMap(this) }

    companion object {
        // factory to create Category from a map
        fun from(map: Map<String, Any>): Category =
            Properties.decodeFromMap(map)
    }
}
The closest you can get is with delegated properties stored in a map.
Example (from link):
class User(val map: Map<String, Any?>) {
    val name: String by map
    val age: Int by map
}
Using this with data classes may not work very well, however.
Kpropmap is a reflection-based library that attempts to make working with Kotlin data classes and Maps easier. It has the following capabilities that are relevant:
Can transform maps to and from data classes, though note that if all you need is converting from a data class to a Map, just use reflection directly as per @KenFehling's answer.
data class Foo(val a: Int, val b: Int)
// Data class to Map
val propMap = propMapOf(foo)
// Map to data class
val foo1 = propMap.deserialize<Foo>()
Can read and write Map data in a type-safe way by using the data class's KProperty instances for type information.
Given a data class and a Map, can do other neat things like detect changed values and extraneous Map keys that don't have corresponding data class properties.
Represent "partial" data classes (kind of like lenses). For example, say your backend model contains a Foo with 3 required immutable properties represented as vals. However, you want to provide an API to patch Foo instances. As it is a patch, the API consumer will only send the updated properties. The REST API layer for this obviously cannot deserialize directly to the Foo data class, but it can accept the patch as a Map. Use kpropmap to validate that the Map has the correct types, and apply the changes from the Map to a copy of the model instance:
data class Foo(val a: Int, val b: Int, val c: Int)
val f = Foo(1, 2, 3)
val p = propMapOf("b" to 5)
val f1 = p.applyProps(f) // f1 = Foo(1, 5, 3)
Disclaimer: I am the author.
I have the following table definition:
import com.outworkers.phantom.builder.primitives.Primitive
import com.outworkers.phantom.dsl._

abstract class DST[V, P <: TSP[V], T <: DST[V, P, T]] extends Table[T, P] {

  object entityKey extends StringColumn with PartitionKey {
    override lazy val name = "entity_key"
  }

  abstract class entityValue(implicit ev: Primitive[V]) extends PrimitiveColumn[V] {
    override lazy val name = "entity_value"
  }
}
In the concrete table subclass:
abstract class SDST[P <: TSP[String]] extends DST[String, P, SDST[P]] {
  override def tableName: String = "\"SDS\""

  object entityValue extends entityValue
}
Database class
class TestDatabase(override val connector: CassandraConnection) extends Database[TestDatabase](connector) {

  object SDST extends SDST[SDSR] with connector.Connector {
    override def fromRow(r: Row): SDSR =
      SDSR(entityKey(r), entityValue(r))
  }
}
The CREATE TABLE query generated by phantom-dsl looks like this:
database.create()
c.o.phantom Executing query: CREATE TABLE IF NOT EXISTS test."SDS" (entity_key text,PRIMARY KEY (entity_key))
As you can see, the inherited entity_value column is missing from the CREATE TABLE DDL.
Please let me know if I am missing something in the implementation.
Omitted class definitions like SDSR and TSP are simple case classes.
Thanks
Phantom doesn't currently support table to table inheritance. The reasons behind that decision are complexities inherent in the Macro API that we rely on to power the DSL.
This feature is planned for a future release, but until then we do not expect this to work, as the table helper macro basically does not read inherited columns.
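Until that lands, a workaround consistent with this limitation is to declare every column directly inside the concrete table so the macro can see it. A rough sketch based on the definitions in the question (TSP and the other omitted types are the question's own case classes):

// Sketch: declare entity_value directly in the concrete table instead of
// inheriting it, so the macro can pick it up when generating the DDL.
abstract class SDST[P <: TSP[String]] extends Table[SDST[P], P] {
  override def tableName: String = "\"SDS\""

  object entityKey extends StringColumn with PartitionKey {
    override lazy val name = "entity_key"
  }

  object entityValue extends StringColumn {
    override lazy val name = "entity_value"
  }
}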
I'm a beginner with Scala/Spark and I got stuck when shipping my code to the official environment.
In short, I can't define my SparkSession object at the object level (outside of a method), and I don't know why. If I do, it runs fine on a single local machine, but throws java.lang.NoClassDefFoundError: Could not initialize class XXX when I package my code into a single jar file and run it on multiple machines using spark-submit.
For example
When I structure my code like this:
object Main {
  def main(...) {
    Task.start
  }
}

object Task {
  case class Data(name: String, ...)

  val spark = SparkSession.builder().appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile(path)
    ds.map(someMethod) // it dies here!
  }

  def loadFile(path: String) = {
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
It will give me "java.lang.NoClassDefFoundError" on each places where I put a self-defined method in those dataset transformation functions (like map, filter... etc).
However, if I rewrite it as
object Task {
  case class Data(name: String, ...)

  def start() {
    val spark = SparkSession.builder().appName("Task").getOrCreate()
    import spark.implicits._

    var ds = loadFile(spark, path)
    ds.map(someMethod) // it works!
  }

  def loadFile(spark: SparkSession, path: String) = {
    import spark.implicits._
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
This works, but it means I need to pass the spark variable into every method that needs it, and I have to write import spark.implicits._ every time a method needs it.
I think something goes wrong when Spark tries to ship my object between nodes, but I don't know exactly what the reason is or what the correct way to write my code would be.
Thanks
No, you don't need to pass the SparkSession object and import the implicits in every method that needs them. You can make the SparkSession variable an object-level value outside the functions and use it in all of them.
Below is a modified example of your code:
import org.apache.spark.sql.{Dataset, SparkSession}

object Main {
  def main(args: Array[String]): Unit = {
    Task.start()
  }
}

object Task {
  case class Data(fname: String, lname: String)

  val spark = SparkSession.builder().master("local").appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile("person.json")
    ds.map(someMethod).show()
  }

  def loadFile(path: String): Dataset[Data] = {
    spark.read.json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.fname
  }
}
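If the object-level val still hits NoClassDefFoundError under spark-submit, a variant that is often used is to make the session a lazy val, so the Task object can be class-loaded on the executors without eagerly building a SparkSession during object initialization. This is only a sketch, not something the example above strictly requires:

// Sketch: lazily construct the session; appName and the surrounding object are
// taken from the example above, the laziness is the only change.
object Task {
  lazy val spark: SparkSession = SparkSession.builder().appName("Task").getOrCreate()
  import spark.implicits._

  // ... same methods as in the example above ...
}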
Hope this helps!