I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I simply read and return the data.
However, if I apply a MapFunction to the data before returning from the function, I get
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow.
I know how Spark works and that it needs to serialize objects for distributed processing. However, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow class function from it. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
                .add("text", "string", false)
                .add("category", "string", false);

        Dataset<Row> data = spark.read()
                .schema(schema)
                .csv(dataPath);

        /*
         * works fine till here if I call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();
        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}
Anonymous inner classes hold a hidden/implicit reference to their enclosing class. Use a lambda expression instead, or go with Roma Anankin's solution.
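For illustration, a minimal sketch of the lambda variant of the map call in readData() (same logic as in the question): a lambda that only touches its own parameter and local variables does not capture the enclosing Workflow instance, so Spark no longer tries to serialize it.
Dataset<Row> cleanedData = data.map(
        (MapFunction<Row, Row>) row -> {
            /* some mapping logic */
            return row;
        },
        RowEncoder.apply(schema));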
You could make Workflow implement Serializable and mark the SparkSession field as transient.
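A minimal sketch of that approach, assuming the rest of the class stays as in the question (the constructor shown here is illustrative):
import java.io.Serializable;
import org.apache.spark.sql.SparkSession;

public class Workflow implements Serializable {

    // transient: the SparkSession lives on the driver and is not shipped to executors
    private transient SparkSession spark;

    public Workflow(SparkSession spark) {
        this.spark = spark;
    }

    // readData() unchanged from the question
}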
Related
I would like to use the Dataset.map function to transform the rows of my dataset. The sample looks like this:
val result = testRepository.readTable(db, tableName)
.map(testInstance.doSomeOperation)
.count()
where testInstance is an instance of a class that extends java.io.Serializable, but testRepository does not. The code throws the following error:
Job aborted due to stage failure.
Caused by: NotSerializableException: TestRepository
Question
I understand why testInstance.doSomeOperation needs to be serializable, since it's inside the map and will be distributed to the Spark workers. But why does testRepository need to be serialized? I don't see why that is necessary for the map. Changing the definition to class TestRepository extends java.io.Serializable solves the issue, but that is not desirable in the larger context of the project.
Is there a way to make this work without making TestRepository serializable, or why is it required to be serializable?
Minimal working example
Here's a full example with the code from both classes that reproduces the NotSerializableException:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class MyTableSchema(id: String, key: String, value: Double)
val db = "temp_autodelete"
val tableName = "serialization_test"
class TestRepository {
  def readTable(database: String, tableName: String): Dataset[MyTableSchema] = {
    spark.table(f"$database.$tableName")
      .as[MyTableSchema]
  }
}
val testRepository = new TestRepository()
class TestClass() extends java.io.Serializable {
  def doSomeOperation(row: MyTableSchema): MyTableSchema = {
    row
  }
}
val testInstance = new TestClass()
val result = testRepository.readTable(db, tableName)
.map(testInstance.doSomeOperation)
.count()
The reason is that your map operation is reading from something that already takes place on the executors.
If you look at your pipeline:
val result = testRepository.readTable(db, tableName)
.map(testInstance.doSomeOperation)
.count()
The first thing you do is testRepository.readTable(db, tableName). If we look inside the readTable method, we see that it performs a spark.table operation. The API docs give the following signature for this method:
def table(tableName: String): DataFrame
This is not an operation that takes place solely on the driver (imagine reading a file of more than 1 TB on the driver alone), and it creates a DataFrame, which is itself a distributed dataset. That means that the testRepository.readTable(db, tableName) function needs to be distributed, and so your testRepository object needs to be distributed as well.
Hope this helps you!
I am working on a Spark application that expands edges by adding the adjacent vertices to those edges. I am using the map/reduce paradigm for the process, where I want to partition the total number of edges and expand them on different worker nodes.
To accomplish that I need to read the partitioned adjacent list in the worker nodes based on the key value. But I am getting an error while trying to load files inside the reduceByKey() method. It says that the task is not serializable. My code:
public class MyClass implements Serializable {
    public static void main(String args[]) throws IOException {
        SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> file = sc.textFile("hdfs://localhost:9000/mainFile.txt");
        ... ... ... // Mapping done successfully
        JavaPairRDD<String, String> rdd1 = pairs.reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String v1, String v2) throws Exception {
                ... ... ...
                JavaRDD<String> adj = sc.textFile("hdfs://localhost:9000/adjacencyList_" + key + "txt");
                // Here I want to expand the edges after reading the adjacency list.
            }
        });
    }
}
But I am getting an error Task not serializable. Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable
I think this is due to the fact that I am using the same Spark context in the worker nodes as in the driver program. If I try to create a new SparkContext inside the reduceByKey() method, it also gives me an error saying that only one SparkContext should be running in this JVM.
Can anyone tell me how I can read a file inside the reduceByKey() method? Is there any other way to accomplish my task? I want the expansion of the edges to happen on the worker nodes so that it can be run in a distributed way.
Thanks in advance.
I am using spark-sql version 2.4.1.
I am creating a broadcast variable as below:
Broadcast<Map<String,Dataset>> bcVariable = javaSparkContext.broadcast(//read dataset);
I am passing the bcVariable to a function:
Service.calculateFunction(sparkSession, bcVariable.getValue());
public static class Service {
    public static void calculateFunction(
            SparkSession sparkSession,
            Map<String, Dataset> dataSet) {

        System.out.println("---> size : " + dataSet.size()); // printing size 1

        for (Entry<String, Dataset> aEntry : dataSet.entrySet()) {
            System.out.println(aEntry.getKey());  // printing key
            aEntry.getValue().show();             // throws null pointer exception
        }
    }
}
What is wrong here? How do I pass a dataset/dataframe to the function?
Try 2:
Broadcast<Dataset> bcVariable = javaSparkContext.broadcast(//read dataset);
I am passing the bcVariable to a function:
Service.calculateFunction(sparkSession, bcVariable.getValue());
public static class Service {
    public static void calculateFunction(
            SparkSession sparkSession,
            Dataset dataSet) {

        System.out.println("---> size : " + dataSet.size()); // throwing null pointer exception
    }
}
What is wrong here? How do I pass a dataset/dataframe to the function?
Try 3:
Dataset metaData = //read dataset from oracle table i.e. meta-data.
I am passing the metaData to a function:
Service.calculateFunction(sparkSession, metaData );
public static class Service {
    public static void calculateFunction(
            SparkSession sparkSession,
            Dataset metaData) {

        System.out.println("---> size : " + metaData.size()); // throwing null pointer exception
    }
}
What is wrong here? How do I pass a dataset/dataframe to the function?
The value to be broadcast has to be a regular (serializable) object, not a DataFrame.
Service.calculateFunction(sparkSession, metaData) is executed on executors and hence metaData is null (as it was not serialized and sent over the wire from the driver to executors).
broadcast[T](value: T): Broadcast[T]
Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
Think of the DataFrame data abstraction as representing a distributed computation that is described in a SQL-like language (the Dataset API or SQL). It simply does not make any sense to have it anywhere but on the driver, where computations can be submitted for execution (as tasks on executors).
You simply have to "convert" the data this computation represents (in DataFrame terms) using DataFrame.collect.
Once you have collected the data, you can broadcast it and reference it using the .value method (getValue() in the Java API).
The code could look as follows:
Dataset<Row> metaData = // read dataset
List<Row> metaDataRows = metaData.collectAsList();
Broadcast<List<Row>> bcVariable = javaSparkContext.broadcast(metaDataRows);
Service.calculateFunction(sparkSession, bcVariable.getValue());
The only change compared to your code is the collect step (collectAsList in the Java Dataset API) before broadcasting.
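For completeness, here is a minimal sketch of what the receiving side could then look like, adapting the Service class from the question so that it takes the collected List<Row> instead of a Dataset (the names mirror the question and are otherwise illustrative):
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Service {
    public static void calculateFunction(SparkSession sparkSession, List<Row> metaData) {
        // metaData is now plain, local data rather than a distributed Dataset,
        // so it can safely be broadcast and used inside executor-side code.
        System.out.println("---> size : " + metaData.size());
        for (Row row : metaData) {
            System.out.println(row);
        }
    }
}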
We are currently exploring Apache Spark (with Hadoop) for performing large-scale data transformation (in Java).
We are using the new-looking (and experimental) DataSourceV2 interfaces to build our custom output data files. A component of this is an implementation of the org.apache.spark.sql.sources.v2.writer.DataWriter interface. It all works beautifully, except for one problem:
The org.apache.spark.sql.sources.v2.writer.DataWriter.write(record) method is often (but not always)
called twice for the same input record.
Here is what I hope is enough code for you to get the gist of what we're doing:
Basically we have many large sets of input data that we land via a Spark application
into Hadoop tables using code that looks something like:
final Dataset<Row> jdbcTableDataset = sparkSession.read()
.format("jdbc")
.option("url", sqlServerUrl)
.option("dbtable", tableName)
.option("user", jdbcUser)
.option("password", jdbcPassword)
.load();
final DataFrameWriter<Row> dataFrameWriter = jdbcTableDataset.write();
dataFrameWriter.save(hdfsDestination + "/" + tableName);
There are roughly fifty of these tables, for what it is worth. I know that there are no duplicates
in the data because jdbcTableDataset.count() and jdbcTableDataset.distinct().count()
return the same value.
The transformation process involves performing join operations on these tables and writing
the result to files in the (shared) file system in a custom format. The resulting rows contain a unique key,
a dataGroup column, a dataSubGroup column and about 40 other columns. The selected records are
ordered by dataGroup, dataSubGroup and key.
Each output file is distinguished by the dataGroup column, which is used to partition the write operation:
final Dataset<Row> selectedData = dataSelector.selectData();
selectedData
.write()
.partitionBy("dataGroup")
.format("au.com.mycompany.myformat.DefaultSource")
.save("/path/to/shared/directory/");
To give you an idea of the scale, the resulting selected data consists of fifty to sixty million
records, unevenly split across roughly 3000 dataGroup files. Large, but not enormous.
The partitionBy("dataGroup") neatly ensures that each dataGroup file is processed by a
single executor. So far so good.
My data source implements the new-looking (and experimental) DataSourceV2 interface:
package au.com.mycompany.myformat;
import java.io.Serializable;
import java.util.Optional;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.WriteSupport;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class DefaultSource implements DataSourceRegister, WriteSupport, Serializable {

    private static final Logger logger = LoggerFactory.getLogger(DefaultSource.class);

    public DefaultSource() {
        logger.info("created");
    }

    @Override
    public String shortName() {
        logger.info("shortName");
        return "myformat";
    }

    @Override
    public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
        return Optional.of(new MyFormatSourceWriter(writeUUID, schema, mode, options));
    }
}
There's a DataSourceWriter implementation:
public class MyFormatSourceWriter implements DataSourceWriter, Serializable {
...
}
and a DataWriterFactory implementation:
public class MyDataWriterFactory implements DataWriterFactory<InternalRow> {
...
}
and finally a DataWriter implementation. It seems that a DataWriter is created and sent to
each executor. Therefore each DataWriter will process many of the dataGroups.
Each record has a unique key column.
public class MyDataWriter implements DataWriter<InternalRow>, Serializable {

    private static final Logger logger = LoggerFactory.getLogger(MyDataWriter.class);

    ...

    MyDataWriter(File buildDirectory, StructType schema, int partitionId) {
        this.buildDirectory = buildDirectory;
        this.schema = schema;
        this.partitionId = partitionId;
        logger.debug("Created MyDataWriter for partition {}", partitionId);
    }

    private String getFieldByName(InternalRow row, String fieldName) {
        return Optional.ofNullable(row.getUTF8String(schema.fieldIndex(fieldName)))
                .orElse(UTF8String.EMPTY_UTF8)
                .toString();
    }

    /**
     * Rows are written here. Each row has a unique key column as well as a dataGroup
     * column. Right now we are frequently getting called with the same record twice.
     */
    @Override
    public void write(InternalRow record) throws IOException {
        String nextDataFileName = getFieldByName(record, "dataGroup") + ".myExt";

        // some non-trivial logic for determining the right output file
        ...

        // write the output record
        outputWriter.append(getFieldByName(record, "key")).append(',')
                .append(getFieldByName(record, "prodDate")).append(',')
                .append(getFieldByName(record, "nation")).append(',')
                .append(getFieldByName(record, "plant")).append(',')
                ...
    }

    @Override
    public WriterCommitMessage commit() throws IOException {
        ...
        outputWriter.close();
        ...
        logger.debug("Committed partition {} with {} data files for zip file {} for a total of {} zip files",
                partitionId, dataFileCount, dataFileName, dataFileCount);
        return new MyWriterCommitMessage(partitionId, dataFileCount);
    }

    @Override
    public void abort() throws IOException {
        logger.error("Failed to collect data for schema: {}", schema);
        ...
    }
}
Right now I'm working around this by keeping track of the last key that was processed and ignoring
duplicates.
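For reference, that workaround could be sketched roughly as follows inside the MyDataWriter class above; the lastKey field is illustrative (not part of the original code) and assumes the incoming rows are ordered so that a duplicate can only repeat the immediately preceding key:
// Illustrative sketch of the dedup workaround described above.
private String lastKey = null;

@Override
public void write(InternalRow record) throws IOException {
    String key = getFieldByName(record, "key");
    if (key.equals(lastKey)) {
        // Same record delivered twice: skip the duplicate.
        return;
    }
    lastKey = key;
    // ... existing logic for choosing the output file and appending the record ...
}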
The Spark docs say that
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
If I create a Java SimpleDateFormat and use it in RDD operations, I get the exception NumberFormatException: multiple points.
I know SimpleDateFormat is not thread-safe. But according to the Spark docs, this SimpleDateFormat object is copied to each task, so there should not be multiple threads accessing it.
I speculate that all tasks in one executor share the same SimpleDateFormat object. Am I right?
This program prints the same object, java.text.SimpleDateFormat@f82ede60:
object NormalVariable {

  // creating dateFormat here doesn't change the behaviour
  // val dateFormat = new SimpleDateFormat("yyyy.MM.dd")

  def main(args: Array[String]) {
    val dateFormat = new SimpleDateFormat("yyyy.MM.dd")

    val conf = new SparkConf().setAppName("Spark Test").setMaster("local[*]")
    val spark = new SparkContext(conf)
    val dates = Array[String]("1999.09.09", "2000.09.09", "2001.09.09", "2002.09.09", "2003.09.09")

    println(dateFormat)

    val resultes = spark.parallelize(dates).map { i =>
      println(dateFormat)
      dateFormat.parse(i)
    }.collect()

    println(resultes.mkString(" "))
    spark.stop()
  }
}
As you know, SimpleDateFormat is not thread safe.
If Spark is using a single core per executor (--executor-cores 1), then everything should work fine. But as soon as you configure more than one core per executor, your code runs multi-threaded: the SimpleDateFormat is shared by multiple Spark tasks concurrently, and it is likely to corrupt the data and throw various exceptions.
To fix this, you can use one of the same approaches as for non-Spark code, namely ThreadLocal, which ensures you get one copy of the SimpleDateFormat per thread.
In Java, this looks like:
public class DateFormatTest {

    private static final ThreadLocal<DateFormat> df = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            return new SimpleDateFormat("yyyyMMdd");
        }
    };

    public Date convert(String source) throws ParseException {
        Date d = df.get().parse(source);
        return d;
    }
}
and the equivalent code in Scala works just the same - shown here as a spark-shell session:
import java.text.SimpleDateFormat
object SafeFormat extends ThreadLocal[SimpleDateFormat] {
  override def initialValue = {
    new SimpleDateFormat("yyyyMMdd HHmmss")
  }
}
sc.parallelize(Seq("20180319 162058")).map(SafeFormat.get.parse(_)).collect
res6: Array[java.util.Date] = Array(Mon Mar 19 16:20:58 GMT 2018)
So you would define the ThreadLocal at the top level of your job class or object, then call df.get to obtain the SimpleDateFormat within your RDD operations.
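As a rough illustration, the same pattern in a standalone Java Spark job could look like the sketch below; the class name ParseJob, the helper method, and the sample inputs are illustrative and not part of the original answer.
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParseJob {

    // One SimpleDateFormat per thread: safe even with several cores per executor.
    private static final ThreadLocal<DateFormat> df =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy.MM.dd"));

    static JavaRDD<Date> parseDates(JavaSparkContext sc) {
        return sc.parallelize(Arrays.asList("1999.09.09", "2000.09.09"))
                 // df is a static field, so the lambda does not drag any extra object
                 // into the closure; each task thread calls df.get() for its own copy.
                 .map(s -> df.get().parse(s));
    }
}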
See:
http://fahdshariff.blogspot.co.uk/2010/08/dateformat-with-multiple-threads.html
"Java DateFormat is not threadsafe" what does this leads to?