Read from Spark RDD a Kryo File - apache-spark

I'm a Spark and Scala newbie.
I need to read and analyze a file in Spark that was written from my Scala code using Kryo serialization:
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
val kryo:Kryo = new Kryo()
val output:Output = new Output(new FileOutputStream("filename.ext",true))
//kryo.writeObject(output, feed) (tested both line)
kryo.writeClassAndObject(output, myScalaObject)
This is pseudo-code for creating a file containing my serialized object (myScalaObject), which is a complex object.
The file seems to be written correctly, but I have a problem when I read it into a Spark RDD.
Pseudo-code in Spark:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("My application")
  .set("spark.executor.memory", "1g")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "myScalaObject")
val sc = new SparkContext(conf)
val file=sc.objectFile[myScalaObject]("filename.ext")
val counts = file.count()
When I try to execute it, I receive this error:
org.apache.spark.SparkException:
Job aborted: Task 0.0:0 failed 1 times (most recent failure:
Exception failure: java.io.IOException: file: filename.ext not a SequenceFile)
Is it possible to read this type of file in Spark?
If it is not possible, what is a good way to create a complex file structure that can be read in Spark?
Thank you.

If you want to read with objectFile, write out the data with saveAsObjectFile.
val myObjects: Seq[MyObject] = ...
val rddToSave = sc.parallelize(myObjects) // Or better yet: construct as RDD from the start.
rddToSave.saveAsObjectFile("/tmp/x")
val rddLoaded = sc.objectFile[MyObject]("/tmp/x")
Alternatively, as zsxwing says, you can create an RDD of the filenames and use map to read the contents of each. If you want each file to be read into a separate partition, parallelize the filenames into separate partitions:
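import java.io.FileInputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import org.apache.spark.rdd.RDD
def loadFiles(filenames: Seq[String]): RDD[Object] = {
  def load(filename: String): Object = {
    // Kryo is not java.io.Serializable, so create it inside the task instead of capturing it.
    val kryo = new Kryo()
    val input = new Input(new FileInputStream(filename))
    try kryo.readClassAndObject(input) finally input.close()
  }
  // One partition per file.
  sc.parallelize(filenames, filenames.length).map(load)
}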

Related

az synapse spark job submit

According to the documentation, using az synapse spark job submit, I can pass arguments using --arguments. So far so good.
However, I cannot figure out how to actually access those arguments in my code. Here's my current effort:
val conf = new SparkConf().setAppName("foo")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.appName("foo").getOrCreate()
val start_time = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").format(LocalDateTime.now)
val appID = sc.getConf.getAppId
//let's get some arguments
val inputArgs = spark.sqlContext.getConf("spark.driver.args").split("\\s+")
//val inputArgs = sc.getConf.get("spark.driver.args").split("\\s+")
Either of those lines throws the following exception:
22/03/25 19:07:45 ERROR ApplicationMaster: User class threw exception: java.util.NoSuchElementException: spark.driver.args
java.util.NoSuchElementException: spark.driver.args
So, how do I read the arguments in the Scala code?
OK, I was overcomplicating this. The values passed with --arguments arrive in the args array of main:
def main(args: Array[String]): Unit = {
  ...
  val foo = args(0)
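As a minimal sketch of reading positional arguments a little more defensively, the object below falls back to a default when an argument is missing instead of throwing on args(0). The names ArgsExample and argOrElse and the placeholder defaults are hypothetical and only for illustration; they are not part of Spark or the az CLI.

```scala
object ArgsExample {
  // Hypothetical helper: returns the argument at `index`,
  // or `default` when that position was not supplied.
  def argOrElse(args: Array[String], index: Int, default: String): String =
    if (index >= 0 && index < args.length) args(index) else default

  def main(args: Array[String]): Unit = {
    // Placeholder names and defaults; adapt them to your own job.
    val inputPath = argOrElse(args, 0, "input-not-set")
    val runDate = argOrElse(args, 1, "date-not-set")
    println(s"inputPath=$inputPath runDate=$runDate")
  }
}
```

Indexing args directly, as in args(0), is fine when the argument is mandatory; the helper only matters for optional trailing arguments.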

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:
case class Person(name: String, age: Long)
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream=ssc.textFileStream("file:///home/sa/testFiles")
smDstream.foreachRDD((rdd, time) => {
  val peopleDF = rdd.map(_.split(",")).map(attributes =>
    Person(attributes(0), attributes(1).trim.toInt)).toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql(
    "insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show
})
ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
Please help.
The question is less relevant now that DStreams are being deprecated / abandoned.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to think about it as well, as I am not a serialization expert.
You can find a few posts from people trying to write to a Hive table directly rather than to a path. My answer uses one approach, but you can keep your approach of running Spark SQL against a temp view; both are possible.
I simulated input from a QueueStream, so no split needs to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your temp view and then use spark.sql as per your initial approach.
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to
each RDD generated from the stream. This function should push the data
in each RDD to an external system, such as saving the RDD to files, or
writing it over the network to a database. Note that the function func
is executed in the driver process running the streaming application,
and will usually have RDD actions in it that will force the
computation of the streaming RDDs.
The documentation mentions saving to files, but foreachRDD can also do what you want, although I assumed the idea was to write to external systems. In my view, saving to files is quicker than going through the steps to write to a table directly. With streaming you want to offload data as soon as possible, as volumes are typically high.
Two steps:
First, in a class separate from the streaming class (run under Spark 2.4):
case class Person(name: String, age: Int)
Then the streaming logic you need to apply. I ran this in a Databricks notebook, so you may need some imports that my notebook already had:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
import spark.implicits._ // needed for toDF() when running outside a notebook
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  if (!q.isEmpty) {
    // Each queued RDD holds lists of (name, age) tuples; flatten them first.
    val q_flatMap = q.flatMap { x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.write
      .format("parquet")
      .mode(SaveMode.Append)
      .saveAsTable("SO_Quest_BigD")
  }
})
ssc.start()
// Simulate three micro-batches of input.
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()

Saving multiple hadoop datasets concurrently in Spark

I have a Spark app that looks like this:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = ...
rdd1.saveAsNewAPIHadoopDataset(output1)
val rdd2 = ...
rdd2.saveAsNewAPIHadoopDataset(output2)
val rdd3 = ...
rdd3.saveAsNewAPIHadoopDataset(output3)
Each call to saveAsNewAPIHadoopDataset blocks, and while some of my workers are doing IO, it would be nice if the job continued to run the next stages.
I tried to wrap each computation in a Future {} and await on all of them at the end but ran into this issue https://issues.apache.org/jira/browse/SPARK-13631
Is there a way in Spark to save to Hadoop dataset in a way that will queue other stages? FWIW, Hadoop's output configuration is BigQuery connector (https://cloud.google.com/hadoop/bigquery-connector)

Spark: Perform SQL function on RDD subsets

I need to perform a SQL function on RDD subsets of increasing size. For this I have to take subsets from the input RDD with the take function:
def main(args: Array[String]) {
  // set up environment
  val conf = new SparkConf()
    .setMaster("local[5]")
    .setAppName("Test")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)
  val cntPairsRdd = cntsRdd.map(n => {
    val sample = data0.take(n)
    val dataRDD = sc.parallelize(sample)
    val df = dataRDD.toDF()
    val result = df.select( ...)
    val xCnt = result.count
    (n, xCnt)
  })
}
cntsRdd is a set of increasing integers. The take function returns a list, not an RDD, so to make my SQL work I first need to convert the list to an RDD and then to a DataFrame. Unfortunately, Spark does not allow creating another RDD inside a map function; in other words, one cannot create an RDD inside another RDD. For the same reason, Spark does not support SparkContext serialization, and I get a serialization exception when calling sc.parallelize(sample).
Please advise a workaround for performing a SQL function on RDD subsets as defined in this scenario.

How to join a DStream with a non-stream file?

I'd like to join every RDD in a DStream with a non-streaming, unchanging reference file. Here is my code:
val sparkConf = new SparkConf().setAppName("LogCounter")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc = new SparkContext()
val geoData = sc.textFile("data/geoRegion.csv")
.map(_.split(','))
.map(line => (line(0), (line(1),line(2),line(3),line(4))))
val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val goodIPsFltrBI = lines.filter(...).map(...).filter(...) // details removed for brevity
val vdpJoinedGeo = goodIPsFltrBI.transform(rdd =>rdd.join(geoData))
I'm getting many, many errors, the most common being:
14/11/19 19:58:23 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: http://10.102.71.92:40764/broadcast_1
I think I should be broadcasting geoData instead of reading it in with each task (it's a 100MB file), but I'm not sure where to put the code that initializes geoData the first time.
Also I'm not sure if geoData is even defined correctly (maybe it should use ssc instead of sc?). The documentation I've seen just lists the transform and join but doesn't show how the static file was created.
Any ideas on how to broadcast geoData and then join it to each streaming RDD?
FileNotFoundException:
The geoData text file is loaded on all workers from the provided location ("data/geoRegion.csv"). Most probably this file is only available on the driver, so the workers cannot load it and throw a file-not-found exception.
Broadcast variable:
Broadcast variables are defined on the driver and used on the workers by unwrapping the broadcast container to get the content.
This means that the data contained in the broadcast variable should be loaded by the driver at the time the job is defined.
This might solve two problems in this case: assuming that the geoRegion.csv file is located on the driver node, it allows the data to be loaded properly on the driver and spread efficiently over the cluster.
In the code above, replace the geoData loading with a local file reading version:
import scala.io.Source
val geoData = Source.fromFile("data/geoRegion.csv").getLines()
  .map(_.split(','))
  .map(line => (line(0), (line(1),line(2),line(3),line(4)))).toMap
val geoDataBC = sc.broadcast(geoData)
To use it, you access the broadcast contents within a closure. Note that you will get access to the map previously wrapped in the broadcast variable: it's a simple object, not an RDD, so in this case you cannot use join to merge the two datasets. You could use flatMap instead:
val vdpJoinedGeo = goodIPsFltrBI.flatMap{ip => geoDataBC.value.get(ip).map(data => (ip, data))}
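The lookup-style join above can be tried without a cluster. The following is a minimal sketch using plain Scala collections in place of the DStream and the broadcast variable; LookupJoinSketch, parseGeo, joinWithLookup, and the sample rows are hypothetical names and data made up for illustration. It shows the same semantics as the flatMap: keys missing from the map are dropped, much like an inner join.

```scala
object LookupJoinSketch {
  // Four reference columns per key, mirroring the tuple built from geoRegion.csv.
  type GeoInfo = (String, String, String, String)

  // Parses CSV lines into key -> (four fields).
  // Rows that split into fewer than five columns are skipped.
  def parseGeo(lines: Iterator[String]): Map[String, GeoInfo] =
    lines
      .map(_.split(','))
      .collect { case Array(k, a, b, c, d, _*) => (k, (a, b, c, d)) }
      .toMap

  // Same shape as the flatMap over the broadcast value:
  // Map.get returns an Option, so unmatched keys yield no output.
  def joinWithLookup(ips: Seq[String], geo: Map[String, GeoInfo]): Seq[(String, GeoInfo)] =
    ips.flatMap(ip => geo.get(ip).map(data => (ip, data)))
}
```

In the streaming job, the map passed to joinWithLookup would be geoDataBC.value and the function body would run inside the DStream's flatMap.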

Resources