Load CSV File in BigQuery with Dataproc (Spark) - apache-spark

I'm trying to read data from a CSV file in GCS and save it in a BigQuery table.
This is my CSV file:
1,Marc,B12,2017-03-24
2,Marc,B12,2018-01-31
3,Marc,B21,2017-03-17
4,Jeam,B12,2017-12-30
5,Jeam,B12,2017-09-02
6,Jeam,B11,2018-06-30
7,Jeam,B21,2018-03-02
8,Olivier,B20,2017-12-30
And this is my code:
val spark = SparkSession
  .builder()
  .appName("Hyp-session-bq")
  .config("spark.master", "local")
  .getOrCreate()
val sc: SparkContext = spark.sparkContext
val conf = sc.hadoopConfiguration

//Input Parameters
val projectId = conf.get("fs.gs.project.id")
val bucket = conf.get("fs.gs.system.bucket")
val inputTable = s"$projectId:rpc.testBig"

//Input Configuration
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf, inputTable)

//Output Parameters
val outPutTable = s"$projectId:rpc.outTestBig"
// Temp output bucket that is deleted upon completion of job
val outPutGcsPath = "gs://" + bucket + "/hadoop/tmp/outTestBig"
BigQueryOutputConfiguration.configure(conf,
  outPutTable,
  null,
  outPutGcsPath,
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_,_]].getName)
// Truncate the table before writing output to allow multiple runs.
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY, "WRITE_TRUNCATE")

val text_file = sc.textFile("gs://test_files/csvfiles/test.csv")
val lignes = text_file.flatMap(x => x.split(" "))

case class schemaFile(id: Int, name: String, symbole: String, date: String)

def parseStringWithCaseClass(str: String): schemaFile = schemaFile(
  id = str.split(",")(0).toInt,
  name = str.split(",")(1),
  symbole = str.split(",")(2),
  date = str.split(",")(3)
)

val result1 = lignes.map(x => parseStringWithCaseClass(x))
val x = result1.map(elem => (null, new Gson().toJsonTree(elem)))
val y = x.saveAsNewAPIHadoopDataset(conf)
When I run the code I get this error:
ERROR org.apache.spark.internal.io.SparkHadoopMapReduceWriter: Aborting job job_20180226083501_0008.
com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Load configuration must specify at least one source URI",
"reason" : "invalid"
} ],
"message" : "Load configuration must specify at least one source URI"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.io.bigquery.BigQueryHelper.insertJobOrFetchDuplicate(BigQueryHelper.java:306)
at com.google.cloud.hadoop.io.bigquery.BigQueryHelper.importFromGcs(BigQueryHelper.java:160)
at com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputCommitter.commitJob(IndirectBigQueryOutputCommitter.java:57)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:128)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:101)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
at jeam.BigQueryIO$.main(BigQueryIO.scala:115)
at jeam.BigQueryIO.main(BigQueryIO.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I think the problem is with the case class and parseStringWithCaseClass, but I don't know how to resolve it.
I don't think the problem is in the configuration, because I get the expected result when I try the word count example: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

Try working with Tuple4:
def parseStringWithTuple(str: String): Tuple4[Int, String, String, String] = {
  val id = str.split(",")(0).toInt
  val name = str.split(",")(1)
  val symbole = str.split(",")(2)
  val date = str.split(",")(3)
  (id, name, symbole, date)
}
val result1 = lignes.map(x => parseStringWithTuple(x))
But I tested your code and it works fine for me.

I have run some tests with your code against my own BigQuery tables and CSV files, and it worked for me without any additional modification.
I see that your code started working once you changed the case class to a Tuple4, as @jean-marc suggested, which is strange behavior, especially since the original code works for both him and me without further changes. The error Load configuration must specify at least one source URI usually appears when the BigQuery load job is not properly configured and does not receive a valid Cloud Storage object URL. However, if the exact same code works after only switching to Tuple4, and the CSV file you are using is unchanged (i.e. the URL is valid), it may have been a transient issue, possibly related to Cloud Storage or BigQuery rather than to the Dataproc job itself.
Finally, given that this issue seems specific to you (the same code has worked for at least two other users), once you have checked that there is no issue with the Cloud Storage object itself (permissions, wrong location, etc.), you may want to file a bug in the Public Issue Tracker.
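As a quick sanity check, you could also verify that JSON shards are actually being written under the temporary output path that the load job reads from. Here is a minimal sketch (reusing conf and outPutGcsPath from your code, and assuming the GCS connector is already on the job's classpath):
// Hypothetical debugging snippet: list whatever sits under the temp GCS output path.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val gcsFs = FileSystem.get(new URI(outPutGcsPath), conf)
val tmpPath = new Path(outPutGcsPath)
if (gcsFs.exists(tmpPath)) {
  gcsFs.listStatus(tmpPath).foreach(status => println(status.getPath))
} else {
  println(s"No files found under $outPutGcsPath")
}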

Related

Kotlin with spark create dataframe from POJO which has pojo classes within

I have a kotlin data class as shown below
data class Persona_Items(
    val key1: Int = 0,
    val key2: String = "Hello")

data class Persona(
    val persona_type: String,
    val created_using_algo: String,
    val version_algo: String,
    val createdAt: Long,
    val listPersonaItems: List<Persona_Items>)

data class PersonaMetaData(
    val user_id: Int,
    val persona_created: Boolean,
    val persona_createdAt: Long,
    val listPersona: List<Persona>)

fun main() {
    val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2 = "abcffffff"), Persona_Items(20, "rrr"))
    val persona1 = Persona("HelloWorld", "tttAlgo", "1.0", 10L, personalItemList1)
    val persona2 = Persona("HelloWorld", "qqqqAlgo", "1.0", 10L, personalItemList2)
    val personMetaData = PersonaMetaData(884, true, 1L, listOf(persona1, persona2))
    val spark = SparkSession
        .builder()
        .master("local[2]")
        .config("spark.driver.host", "127.0.0.1")
        .appName("Simple Application").orCreate
    val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
    val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
    df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a dataframe from a list of data classes is not supported? Please help me understand what is missing in the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
  val ds = dsOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
  // rest of the logic here
}
The thing is, Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take an extra step to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
The error means that Spark doesn't know how to serialize the Persona class.
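For comparison, here is a minimal Scala sketch of the spark.implicits._ mechanism mentioned above (the class and session names are illustrative):
// In Scala, importing the session's implicits brings case-class Encoders into scope,
// so a Dataset can be built directly from local objects.
import org.apache.spark.sql.SparkSession

case class PersonaItem(key1: Int = 0, key2: String = "Hello")

val session = SparkSession.builder().master("local[2]").appName("encoders").getOrCreate()
import session.implicits._ // Encoder derivation for case classes

val ds = Seq(PersonaItem(1), PersonaItem(key2 = "abc"), PersonaItem(10, "rrr")).toDS()
ds.show(false)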
Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration if you see something Scala-specific in a Kotlin project. You can check my build.gradle inside this project, or read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Invalid status code '400' from .. error payload: "requirement failed: Session isn't active

I am running PySpark scripts in a Jupyter notebook to write a dataframe to a CSV, as below:
df.coalesce(1).write.csv('Data1.csv',header = 'true')
After an hour of runtime I am getting the below error.
Error: Invalid status code from http://.....session isn't active.
My config is like:
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")
I have checked the container nodes and the error there is:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Not able to figure out the issue.
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
restart Livy to update the setting: sudo restart livy-server on the cluster's master node
test your code again
I am not well versed in PySpark, but in Scala the solution would involve something like this.
First, we need a method that creates a header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "dfheader.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  // write the header to HDFS, one column name after another
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  for (h <- colNames) {
    writer.write(h + ",")
  }
  writer.write("\n")
  writer.close()
}
You will also need a method that calls Hadoop to merge the part files written by the df.write method:
def mergeOutputFiles(sourcePaths: String, destLocation: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // in case of Array[String], use a for loop to iterate over the multiple source paths:
  // for (sourcePath <- sourcePaths) {
  // get the path under destination where the partitioned files are temporarily stored
  val pathText = sourcePaths.split("/")
  val destPath = "%s/%s".format(destLocation, pathText.last)
  // merge the part files into one; the fifth argument (true) deletes the source after merging
  // (note: FileUtil.copyMerge was removed in Hadoop 3.x, so this assumes Hadoop 2.x)
  FileUtil.copyMerge(hdfs, new Path(sourcePaths), hdfs, new Path(destPath), true, hadoopConfig, null)
  // }
  // delete the temp partitioned files once the merge is complete
  // (tempOutputFolder is assumed to be in scope; alternatively, pass it in as a parameter)
  val tempfilesPath = "%s%s".format(destLocation, tempOutputFolder)
  hdfs.delete(new Path(tempfilesPath), true)
}
Here is the method that generates the output files, i.e. the df.write step where you pass your large DataFrame to be written out to HDFS:
def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                        spark: SparkSession): String = {
  import spark.implicits._
  val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)
  // write as CSV to the output directory, then create the header file for it
  processedDf.write.mode("overwrite").csv(fileName)
  createHeaderFile(fileName, processedDf.columns)
  // keep the partitioned file path so it can be sent for merging
  val outputFilePathList = fileName
  // if the output needs to be split into multiple files based on some parameter,
  // change the return type to Array[String], collect each fileName in the array
  // (e.g. outputFilePathList(counter) = fileName inside a loop, incrementing counter) and return the array
  outputFilePathList
}
With all the methods defined, here is how you can use them:
def processyourlogic(/* your parameters, if any */): DataFrame = {
  // your logic to do whatever needs to be done to your data
}
Assuming the above method returns a DataFrame, here is how to tie everything together:
val yourbigDf = processyourlogic(/* your parameters */) // returns a DataFrame
yourbigDf.cache // caching just in case you need it
val outputPathFinal = "location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourbigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal)
Let me know if you have any other questions relating to that. If the answer you seek is different, then the original question should be updated to ask for it.

what kind of RDDs in Spark can be saved to BigQuery table using saveAsNewAPIHadoopDataset

In the example of Using the BigQuery Connector with Spark
// Perform word count.
val wordCounts = (tableData
.map(entry => convertToTuple(entry._2))
.reduceByKey(_ + _))
// Write data back into a new BigQuery table.
// IndirectBigQueryOutputFormat discards keys, so set key to null.
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))
If I remove the .reduceByKey(_ + _) part, I get the following error:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:107)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
... 46 elided
Caused by: java.io.IOException: Schema has no fields. Table: test_output_40b400dc_1bfe_454a_9aa8_bf9562d54c3f_source
at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:95)
at com.google.cloud.hadoop.io.bigquery.BigQueryHelper.importFromGcs(BigQueryHelper.java:164)
at com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputCommitter.commitJob(IndirectBigQueryOutputCommitter.java:57)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:128)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:101)
... 53 more
In some cases, I don't use reduceByKey and want to save my RDD in BigQuery.
Try working with a schema:
object Schema {
  def apply(record: JsonObject): Schema = Schema(
    word = record.get("word").getAsString,
    Count = record.get("Count").getAsInt
  )
}
case class Schema(word: String, Count: Int)
and pass this schema like this:
wordCounts.map(x => Schema(x))
Hope it helps you.
java.io.IOException: Schema has no fields is the error, which means BigQuery is unable to detect the schema automatically. If you explicitly specify a schema, like:
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("word").setType("STRING"));
fields.add(new TableFieldSchema().setName("word_count").setType("INTEGER"));
BigQueryOutputConfiguration.configure(conf, ..., new TableSchema().setFields(fields), ...);
You should no longer hit this issue.
I think the reason .reduceByKey(_ + _) hides this issue is because:
Schema auto-detection will select one random file and scan up to 100 rows of data.
tableData RDD is initially partitioned into many small shards, each of which is insufficient to let BigQuery infer schema automatically.
.reduceByKey(_ + _) repartitions the RDD into bigger shards.
My hunch is that if you replace .reduceByKey(_ + _) by .repartition(2), the job should also work without explicitly providing a schema.
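If that hunch is right, a minimal sketch of the no-aggregation variant (reusing convertToTuple, convertToJson, and conf from the linked example) could look like this:
// Hypothetical variant of the tutorial pipeline: skip the aggregation and just
// repartition into fewer, larger shards before writing to BigQuery.
val rows = tableData.map(entry => convertToTuple(entry._2))
(rows
  .repartition(2) // fewer (and non-empty) output shards for the load job
  .map(pair => (null, convertToJson(pair)))
  .saveAsNewAPIHadoopDataset(conf))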

Doing flatmap on a function returning RDD

I am trying to process multiple Avro files in the code below. The idea is to first get a series of Avro files in a list, then open each Avro file and generate a stream of tuples (string, int), and finally group the stream of tuples by key and sum the ints.
object AvroCopyUtil {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Leads Data Analysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fs = FileSystem.get(new Configuration())
    val avroList = GetAvroList(fs, args(0))
    avroList.flatMap(av =>
        sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av)
          .map(r => (r._1.datum.get("field").toString, 1)))
      .reduceByKey(_ + _)
      .foreach(println)
  }

  def GetAvroList(fs: FileSystem, input: String): List[String] = {
    // get all children
    val masterList: List[FileStatus] = fs.listStatus(new Path(input)).toList
    val (allFiles, allDirs) = masterList.partition(x => x.isDirectory == false)
    allFiles.map(_.getPath.toString) ::: allDirs.map(_.getPath.toString).flatMap(x => GetAvroList(fs, x))
  }
}
The compile error I get is:
[error] found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
[error] required: TraversableOnce[?]
[error] avroRdd.flatMap(av => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av))
[error] ^
[error] one error found
Edit: based on the suggestion below I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
but I got the error
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 2015-10-
15-00-1576041136-flumetracker.foo.com-FooAvroEvent.1444867200044.avro,hdfs:
Your function is unnecessary. You are also attempting to create an RDD within a transformation, which doesn't really make sense: a transformation (in this case, flatMap) runs on top of an RDD, and the records within that RDD are what get transformed. In the case of a flatMap, the expected output of the anonymous function is a TraversableOnce object, which is then flattened into multiple records by the transformation. Looking at your code, though, you don't really need a flatMap; a simple map will suffice. Keep in mind also that, due to the immutability of RDDs, you must always assign the result of a transformation to a new value.
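To illustrate the TraversableOnce point, here is a tiny self-contained sketch (assuming an existing SparkContext named sc; the data is made up):
// flatMap expects the function to return a collection; the results are then flattened.
val sampleLines = sc.parallelize(Seq("a,b,c", "d,e"))
val tokens = sampleLines.flatMap(line => line.split(",")) // RDD of "a", "b", "c", "d", "e"
val pairs = tokens.map(t => (t, 1))                       // map: exactly one output record per input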
Try something like:
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](filePath)
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
It seems as though you may need to take some time to grasp some of Spark's basic framework nuances. I would recommend fully reading the Spark Programming Guide. Lastly, if you want to use Avro, please also check out spark-avro, as much of the boilerplate around working with Avro is taken care of there (and DataFrames may be more intuitive and easier to use for your use case).
(EDIT:)
It seems like you may have misunderstood how to load data to be processed in Spark. The parallelize() method is used to distribute collections across an RDD and not data within files. To do the latter, you actually only need to provide a comma-separated list of input files to the newAPIHadoopFile() loader. So assuming your GetAvroList() function works, you can do:
val avroList = GetAvroList(fs, args(0))
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
countsRDD.foreach(println)
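On the spark-avro suggestion above, here is a rough DataFrame-based sketch of the same count (assuming Spark 2.4+ with the org.apache.spark:spark-avro module on the classpath; the field name is illustrative):
// Hypothetical DataFrame equivalent: read all the Avro files at once, then count by field.
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("avro-counts").master("local[*]").getOrCreate()
val avroDf = sparkSession.read.format("avro").load(avroList: _*) // load() accepts multiple paths
avroDf.groupBy("field").count().show()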

apache spark, NotSerializableException: org.apache.hadoop.io.Text

Here is my code:
val bg = imageBundleRDD.first() // bg: [Text, BundleWritable]
val res = imageBundleRDD.map(data => {
  val desBundle = colorToGray(bg._2) // lineA: NotSerializableException: org.apache.hadoop.io.Text
  //val desBundle = colorToGray(data._2) // lineB: everything is ok
  (data._1, desBundle)
})
println(res.count)
lineB works fine, but lineA throws: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I tried to use Kryo to solve my problem, but it seems nothing changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Text])
    kryo.register(classOf[BundleWritable])
  }
}

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text, VideoRecording>, String, VideoRecording>() {
  @Override
  public Tuple2<String, VideoRecording> call(
      Tuple2<Text, VideoRecording> kv) throws Exception {
    // Necessary to copy value as Hadoop chooses to reuse objects
    VideoRecording vr = new VideoRecording(kv._2);
    return new Tuple2(kv._1.toString(), vr);
  }
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
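In Scala, the same copy-before-caching idea from that note looks roughly like this (just a sketch; Text and BytesWritable are example Writable types here, and sc is an existing SparkContext):
// Hadoop reuses the same Writable instances per record, so copy to plain JVM types before caching.
import org.apache.hadoop.io.{BytesWritable, Text}

val raw = sc.sequenceFile("path/to/xyz", classOf[Text], classOf[BytesWritable])
val copied = raw.map { case (k, v) => (k.toString, v.copyBytes()) } // materialized copies
copied.cache()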
When dealing with sequence files in Apache Spark, we have to follow these techniques:
-- Use the Java-equivalent data types in place of the Hadoop data types.
-- Spark automatically converts the Writables into their Java-equivalent types.
Example: we have a sequence file "xyz" whose key type is Text and whose value is LongWritable. When we use this file to create an RDD, we need to use their Java-equivalent data types, i.e. String and Long respectively:
val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
  // ... set master, appname, etc, then:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)
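As a side note, if you'd rather not write a custom KryoRegistrator, Spark 1.2+ also lets you register classes directly on the SparkConf. A minimal sketch (the variable names here are just illustrative):
// Alternative to a custom KryoRegistrator: register the classes on the SparkConf itself.
val altConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Text], classOf[BundleWritable]))
val altSc = new SparkContext(altConf)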
