I have a basic question on Spark closures. I cannot distinguish the code behavior between scenarios 2 and 3: both produce the same output, but based on my understanding, scenario 3 should not work as expected.
The code below is common to all scenarios:
class A implements Serializable {
    String t;

    A(String t) {
        this.t = t;
    }
}

// Initialize the Spark context
JavaSparkContext context = ....
// Create the RDD
JavaRDD<String> rdd = context.parallelize(Arrays.asList("a", "b", "c", "d", "e"), 3);
Scenario 1: Don't do this, because A is initialized in the driver and not visible on the executors.

A a = new A("pqr");
rdd.map(i -> i + a.t).collect();
Scenario 2: The recommended way of sharing an object.

Broadcast<A> broadCast = context.broadcast(new A("pqr"));
rdd.map(i -> broadCast.getValue().t + i).collect();
// output: [pqra, pqrb, pqrc, pqrd, pqre]
Scenario 3: Why does this code work as expected even though I instantiate A in the driver?

class TestFunction implements Function<String, String>, Serializable {
    private A val;

    public TestFunction() { }

    public TestFunction(A a) {
        this.val = a;
    }

    @Override
    public String call(String s) throws Exception {
        return val.t + s;
    }
}

TestFunction mapFunction = new TestFunction(new A("pqr"));
System.out.println(rdd.map(mapFunction).collect());
// output: [pqra, pqrb, pqrc, pqrd, pqre]
Note: I am running the program in cluster mode.
The generated Java bytecode for scenarios 1 and 3 is almost the same. The benefit of using a broadcast variable (scenario 2) is that the broadcast object is sent to each executor only once and is then reused by every task on that executor. Scenarios 1 and 3 send the object A to the executors with every task.
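A minimal sketch (reusing the class A above; the other names are assumed, not from the original post) of how this difference shows up in practice: running two jobs over the same RDD ships the closure-captured object with every task of both jobs, whereas the broadcast value is transferred to each executor once and then read from that executor's cache.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastReuseDemo {
    public static void main(String[] args) {
        JavaSparkContext context = new JavaSparkContext("local[*]", "broadcast-demo");
        JavaRDD<String> rdd = context.parallelize(Arrays.asList("a", "b", "c", "d", "e"), 3);

        // Captured directly: serialized into the closure of every task, in both jobs.
        A plain = new A("pqr");
        rdd.map(i -> plain.t + i).collect();
        rdd.map(i -> plain.t + i).collect();

        // Broadcast: shipped to each executor once, then read from its local cache
        // by all tasks of both jobs running on that executor.
        Broadcast<A> shared = context.broadcast(new A("pqr"));
        rdd.map(i -> shared.getValue().t + i).collect();
        rdd.map(i -> shared.getValue().t + i).collect();

        context.stop();
    }
}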
Related
I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I simply read and return the data.
However, if I apply a MapFunction to the data before returning from the function, I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow.
I understand that Spark needs to serialize objects for distributed processing; however, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow method there. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.
public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
                .add("text", "string", false)
                .add("category", "string", false);

        Dataset<Row> data = spark.read()
                .schema(schema)
                .csv(dataPath);

        /*
         * works fine up to here if I just call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();
        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}
Anonymous inner classes hold a hidden/implicit reference to their enclosing class. Use a lambda expression instead, or go with Roma Anankin's solution.
You could make Workflow implement Serializable and mark the SparkSession field as transient.
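A sketch of the first suggestion (the field and constructor names here are assumed, not the asker's actual code): neither a lambda nor a static nested class carries the implicit reference to the enclosing Workflow instance that an anonymous inner class does, so Spark no longer tries to serialize Workflow.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    // A static nested class has no implicit reference to the enclosing Workflow.
    private static final class CleaningFunction implements MapFunction<Row, Row> {
        @Override
        public Row call(Row row) {
            /* some mapping logic */
            return row;
        }
    }

    public Dataset<Row> readData() {
        StructType schema = new StructType()
                .add("text", "string", false)
                .add("category", "string", false);

        Dataset<Row> data = spark.read().schema(schema).csv(dataPath);

        // Option 1: a lambda captures only what it uses, never 'this'.
        Dataset<Row> cleanedData = data.map(
                (MapFunction<Row, Row>) row -> row /* some mapping logic */,
                RowEncoder.apply(schema));

        // Option 2: the static nested class defined above.
        return cleanedData.map(new CleaningFunction(), RowEncoder.apply(schema));
    }
}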
I am working on a Spark application that expands edges by adding the adjacent vertices to each edge. I am using the map/reduce paradigm for the process: I want to partition the total set of edges and expand them on different worker nodes.
To accomplish that, I need to read the partitioned adjacency list in the worker nodes based on the key value. But I am getting an error while trying to load files inside the reduceByKey() method, saying that the task is not serializable. My code:
public class MyClass implements Serializable {
    public static void main(String args[]) throws IOException {
        SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> file = sc.textFile("hdfs://localhost:9000/mainFile.txt");
        ... ... ... // Mapping done successfully

        JavaPairRDD<String, String> rdd1 = pairs.reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String v1, String v2) throws Exception {
                ... ... ...
                JavaRDD<String> adj = sc.textFile("hdfs://localhost:9000/adjacencyList_" + key + "txt");
                // Here I want to expand the edges after reading the adjacency list.
            }
        });
    }
}
But I am getting the error Task not serializable, Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable
I think this is because I am using the same Spark context in the worker nodes as in the driver program. If I try to create a new SparkContext inside the reduceByKey() method, it also gives me an error saying that only one SparkContext should be running in this JVM.
Can anyone tell me how I can read a file inside the reduceByKey() method? Is there any other way to accomplish my task? I want the edges to be expanded in the worker nodes so that the work runs in a distributed way.
Thanks in advance.
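The thread above does not include an answer, but one common direction, sketched here with assumed file names and line formats, is to load the adjacency data as its own keyed RDD in the driver and combine the two datasets with a join, instead of calling SparkContext inside reduceByKey (SparkContext exists only in the driver and cannot be shipped to the workers).

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class EdgeExpansion {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Edges keyed by vertex, produced by the earlier mapping step
        // (assumed "vertex edge" lines separated by whitespace).
        JavaPairRDD<String, String> pairs = sc.textFile("hdfs://localhost:9000/mainFile.txt")
                .mapToPair(line -> {
                    String[] parts = line.split("\\s+");
                    return new Tuple2<>(parts[0], parts[1]);
                });

        // Load the whole adjacency list once as a keyed RDD (assumed "vertex neighbour"
        // lines) instead of reading per-key files inside the tasks.
        JavaPairRDD<String, String> adjacency = sc.textFile("hdfs://localhost:9000/adjacencyList.txt")
                .mapToPair(line -> {
                    String[] parts = line.split("\\s+");
                    return new Tuple2<>(parts[0], parts[1]);
                });

        // The join runs on the workers and pairs every edge with its adjacent vertices,
        // with no SparkContext needed inside the tasks.
        JavaPairRDD<String, Tuple2<String, String>> expanded = pairs.join(adjacency);
        expanded.take(10).forEach(System.out::println);

        sc.stop();
    }
}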
I am trying to figure out if I can work with Kotlin and Spark,
and use the former's data classes instead of Scala's case classes.
I have the following data class:
data class Transaction(var context: String = "", var epoch: Long = -1L, var items: HashSet<String> = HashSet()) :
    Serializable {

    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }
}
And the relevant part of the main routine looks like this:
val transactionEncoder = Encoders.bean(Transaction::class.java)
val transactions = inputDataset
.groupByKey(KeyExtractor(), KeyExtractor.getKeyEncoder())
.mapGroups(TransactionCreator(), transactionEncoder)
.collectAsList()
transactions.forEach { println("collected Transaction=$it") }
With TransactionCreator defined as:
class TransactionCreator : MapGroupsFunction<Tuple2<String, Timestamp>, Row, Transaction> {

    companion object {
        @JvmStatic
        private val serialVersionUID = 1L
    }

    override fun call(key: Tuple2<String, Timestamp>, values: MutableIterator<Row>): Transaction {
        val seq = generateSequence { if (values.hasNext()) values.next().getString(2) else null }
        val items = seq.toCollection(HashSet())
        return Transaction(key._1, key._2.time, items).also { println("inside call Transaction=$it") }
    }
}
However, I think I'm running into some sort of serialization problem,
because the set ends up empty after collection.
I see the following output:
inside call Transaction=Transaction(context=context1, epoch=1000, items=[c])
inside call Transaction=Transaction(context=context1, epoch=0, items=[a, b])
collected Transaction=Transaction(context=context1, epoch=0, items=[])
collected Transaction=Transaction(context=context1, epoch=1000, items=[])
I've tried a custom KryoRegistrator to see if it was a problem with Kotlin's HashSet:
class MyRegistrator : KryoRegistrator {
    override fun registerClasses(kryo: Kryo) {
        kryo.register(HashSet::class.java, JavaSerializer()) // kotlin's HashSet
    }
}
But it doesn't seem to help.
Any other ideas?
Full code here.
It does seem to be a serialization issue.
The documentation of Encoders.bean states (Spark v2.4.0):
collection types: only array and java.util.List currently, map support is in progress
Porting the Transaction data class to Java and changing items to a java.util.List seems to help.
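For reference, a sketch of what such a port could look like (the field names simply mirror the Kotlin data class; this is not the asker's final code). Encoders.bean works with a JavaBean-style class: a public no-arg constructor, getters/setters, and, per the quoted docs, java.util.List for the collection field.

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class Transaction implements Serializable {
    private String context = "";
    private long epoch = -1L;
    private List<String> items = new ArrayList<>();

    public Transaction() { }

    public Transaction(String context, long epoch, List<String> items) {
        this.context = context;
        this.epoch = epoch;
        this.items = items;
    }

    public String getContext() { return context; }
    public void setContext(String context) { this.context = context; }

    public long getEpoch() { return epoch; }
    public void setEpoch(long epoch) { this.epoch = epoch; }

    public List<String> getItems() { return items; }
    public void setItems(List<String> items) { this.items = items; }
}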
I am using Spark 2.0.0.
Is there a way to pass parameters from the Spark driver to the executors? I tried the following.
class SparkDriver {
    public static void main(String argv[]) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("yarn");
        SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
        Dataset<Row> input = sparkSession.read().load("inputfilepath");
        Dataset<Row> modifiedinput = input.mapPartitions(new customMapPartition(5), Encoders.bean(Row.class));
    }
}

class customMapPartition implements MapPartitionsFunction {
    private static final long serialVersionUID = -6513655566985939627L;
    private static Integer variableThatHastobePassed = null;

    public customMapPartition(Integer passedInteger) {
        customMapPartition.variableThatHastobePassed = passedInteger;
    }

    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        System.out.println("number that is passed " + variableThatHastobePassed);
        return input;
    }
}
As mentioned above, I wrote a custom MapPartitionsFunction to pass the parameter, and I access the static variable in the function's call method. This worked when I ran locally with .setMaster("local"), but it did not work when run on a cluster with .setMaster("yarn") (the System.out.println statements printed null).
Is there a way to pass parameters from the driver to the executors?
My bad, I was using
private static Integer variableThatHastobePassed = null;
The variable should not be declared as static.
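A sketch of the corrected function (generics and naming adjusted here for clarity; it assumes the same call site as above). With an instance field, the value is serialized together with the function object and shipped to the executors, so every task sees it.

import java.util.Iterator;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Row;

class CustomMapPartition implements MapPartitionsFunction<Row, Row> {
    private static final long serialVersionUID = -6513655566985939627L;
    private final Integer variableThatHasToBePassed;   // instance field, not static

    public CustomMapPartition(Integer passedInteger) {
        this.variableThatHasToBePassed = passedInteger;
    }

    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        System.out.println("number that is passed " + variableThatHasToBePassed);
        return input;
    }
}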
The Spark docs say that:
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
If I create a Java SimpleDateFormat and use it in RDD operations, I get the exception NumberFormatException: multiple points.
I know SimpleDateFormat is not thread-safe. But, as the Spark docs say, this SimpleDateFormat object is copied to each task, so there should not be multiple threads accessing the same object.
I speculate that all tasks in one executor share the same SimpleDateFormat object; am I right?
This program prints the same object, java.text.SimpleDateFormat@f82ede60:
object NormalVariable {
  // creating dateFormat here instead doesn't change anything
  // val dateFormat = new SimpleDateFormat("yyyy.MM.dd")
  def main(args: Array[String]) {
    val dateFormat = new SimpleDateFormat("yyyy.MM.dd")
    val conf = new SparkConf().setAppName("Spark Test").setMaster("local[*]")
    val spark = new SparkContext(conf)
    val dates = Array[String]("1999.09.09", "2000.09.09", "2001.09.09", "2002.09.09", "2003.09.09")
    println(dateFormat)
    val results = spark.parallelize(dates).map { i =>
      println(dateFormat)
      dateFormat.parse(i)
    }.collect()
    println(results.mkString(" "))
    spark.stop()
  }
}
As you know, SimpleDateFormat is not thread-safe.
If Spark is using a single core per executor (--executor-cores 1), then everything should work fine. But as soon as you configure more than one core per executor, your code runs multi-threaded: the SimpleDateFormat is shared by multiple Spark tasks concurrently, and it is likely to corrupt the data or throw various exceptions.
To fix this, you can use one of the same approaches as for non-Spark code, namely ThreadLocal, which ensures you get one copy of the SimpleDateFormat per thread.
In Java, this looks like:
public class DateFormatTest {

    private static final ThreadLocal<DateFormat> df = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            return new SimpleDateFormat("yyyyMMdd");
        }
    };

    public Date convert(String source) throws ParseException {
        Date d = df.get().parse(source);
        return d;
    }
}
and the equivalent code in Scala works just the same - shown here as a spark-shell session:
import java.text.SimpleDateFormat
object SafeFormat extends ThreadLocal[SimpleDateFormat] {
  override def initialValue = {
    new SimpleDateFormat("yyyyMMdd HHmmss")
  }
}
sc.parallelize(Seq("20180319 162058")).map(SafeFormat.get.parse(_)).collect
res6: Array[java.util.Date] = Array(Mon Mar 19 16:20:58 GMT 2018)
So you would define the ThreadLocal at the top level of your job class or object, then call df.get to obtain the SimpleDateFormat within your RDD operations.
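If you prefer to stay in Java end to end, here is a minimal sketch (class and variable names are assumed) of calling the ThreadLocal from inside a JavaRDD operation, mirroring the spark-shell snippet above:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

public class SafeDateParsing {
    // One SimpleDateFormat per thread, so concurrent tasks on the same executor
    // never share an instance.
    private static final ThreadLocal<DateFormat> DF =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy.MM.dd"));

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "safe-date-parsing");
        List<String> dates = Arrays.asList("1999.09.09", "2000.09.09", "2001.09.09");

        List<Date> parsed = sc.parallelize(dates)
                .map(s -> DF.get().parse(s))   // DF.get() inside the RDD operation
                .collect();

        System.out.println(parsed);
        sc.stop();
    }
}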
See:
http://fahdshariff.blogspot.co.uk/2010/08/dateformat-with-multiple-threads.html
"Java DateFormat is not threadsafe" what does this leads to?