I've been experimenting with Spark's mapPartitionsWithIndex and I ran into problems when
trying to return an Iterator of a tuple that itself contained an empty iterator.
I tried several different ways of constructing the inner iterator [ via Iterator(), and List(...).iterator ], and
all roads let to my getting this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 0.0 (TID 2) had a not serializable result: scala.collection.LinearSeqLike$$anon$1
Serialization stack:
- object not serializable (class: scala.collection.LinearSeqLike$$anon$1, value: empty iterator)
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (1,empty iterator))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
My code example is given below. Note that as given it runs OK (an empty iterator is returned as the
mapPartitionsWithIndex value.) But when you run with the now commented-out version of
the mapPartitionsWithIndex invocations you will get the error above.
If anyone has a suggestion on how to this can be made to work, I'd be much obliged.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object ANonWorkingExample extends App {
val sparkConf = new SparkConf().setAppName("continuous").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val parallel: RDD[Int] = sc.parallelize(1 to 9)
val parts: Array[Partition] = parallel.partitions
val partRDD: RDD[(Int, Iterator[Int])] =
parallel.coalesce(3).
mapPartitionsWithIndex {
(partitionIndex: Int, inputiterator: Iterator[Int]) =>
val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
// Iterator((partitionIndex, mappedInput)) // FAILS
Iterator() // no exception.. but not really what i want.
}
val data = partRDD.collect
println("data:" + data.toList);
}
I am not sure what you are trying to achieve and I am a sort of novice compared to some of the expert folks here.
I present something that may give you an idea of how to do things I think correctly and make some comments:
You seem to get the partitions explicitly and call mapPartitions - a 1st for me.
RDD inside mapPartitions and the various SPARK SCALA thing will not fly; it is about iterables and I think you need to drop to SCALA only level.
The serializeable error come from doing List[Int].
Here is an example showing index partition along with those corresponding index values.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
// from your stuff, left in
val parallel: RDD[Int] = sc.parallelize(1 to 9, 4)
val mapped = parallel.mapPartitionsWithIndex{
(index, iterator) => {
println("Called in Partition -> " + index)
val myList = iterator.toList
myList.map(x => (index, x)).groupBy( _._1 ).mapValues( _.map( _._2 ) ).toList.iterator
}
}
mapped.collect()
This returns the following that resembles a little of what I think you seemed to want:
res38: Array[(Int, List[Int])] = Array((0,List(1, 2)), (1,List(3, 4)), (2,List(5, 6)), (3,List(7, 8, 9)))
Final note: the documentation and such is not so easy to follow, you don't get it all from word count example!
So, hope this helps.
I think it might get you on the right path to where you want to go, I could not quite see it, but may be you can now see the forest for the trees.
So, the dumb thing I was doing was trying to return an unserializable data structure: an Iterator, as clearly indicated by the stack trace I got.
And the solution is to not use an iterator. Rather, use a collection like a Seq, or List. The sample program below illustrates the correct way to do what I was trying to do.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object AWorkingExample extends App {
val sparkConf = new SparkConf().setAppName("batman").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val parallel: RDD[Int] = sc.parallelize(1 to 9)
val parts: Array[Partition] = parallel.partitions
val partRDD: RDD[(Int, List[Int])] =
parallel.coalesce(3).
mapPartitionsWithIndex {
(partitionIndex: Int, inputiterator: Iterator[Int]) =>
val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
Iterator((partitionIndex, mappedInput.toList)) // Note the .toList() call -- that makes it work
}
val data = partRDD.collect
println("data:" + data.toList);
}
By the way, what I was trying to do originally was to see concretely which chunks of data from my parallelized-to-RDD structure were assigned to which partition. Here is the output you get if you run the program:
data:List((0,List(2, 3)), (1,List(4, 5, 6)), (2,List(7, 8, 9, 10)))
Interesting that the data distribution could have been more optimally balanced, but wasn't. That's not the point of the question, but I thought it was interesting.
Related
We have code running in Apache Spark. After a detailed examination of the code, I've determined that one of our mappers is modifying an object that is in an RDD, rather than making a copy of the object for the output. That is, we have an RDD of dicts, and the map function is adding things to the dictionary, rather than returning new dictionaries.
RDDs are supposed to be immutable. Ours are being mutated.
We are also having memory errors.
Question: Will Spark be confused if the size of an RDD suddenly increases?
While it probably does not crash, it can cause some unspecified behaviour. For example this snippet
val rdd = sc.parallelize({
val m = new mutable.HashMap[Int, Int]
m.put(1, 2)
m
} :: Nil)
rdd.cache() // comment out to change behaviour!
rdd.map(m => {
m.put(2, 3)
m
}).collect().foreach(println) // "Map(2 -> 3, 1 -> 2)"
rdd.collect().foreach(println) // Either "Map(1 -> 2)" or "Map(2 -> 3, 1 -> 2)" depending if caching is used
the behaviour changes depending if the RDD gets cached or not. In the Spark API there is a bunch of functions that are allowed to mutate the data and that is clearly pointed out in the documentation, see this for example https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#aggregateByKey-U-scala.Function2-scala.Function2-scala.reflect.ClassTag-
Consider having a RDD[(K, V)] of map entries instead of maps i.e. RDD[Map[K, V]]. This would enable adding new entries in a standard way using flatMap or mapPartitions. If needed, the map representation can be eventually generating by grouping etc.
Okay, I developed some code to test out what happens if an object referred to in an RDD is mutated by the mapper, and I am happy to report that it is not possible if you are programming from Python.
Here is my test program:
from pyspark.sql import SparkSession
import time
COUNT = 5
def funnydir(i):
"""Return a directory for i"""
return {"i":i,
"gen":0 }
def funnymap(d):
"""Take a directory and perform a funnymap"""
d['gen'] = d.get('gen',0) + 1
d['id' ] = id(d)
return d
if __name__=="__main__":
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
dfroot = sc.parallelize(range(COUNT)).map(funnydir)
dfroot.persist()
df1 = dfroot.map(funnymap)
df2 = df1.map(funnymap)
df3 = df2.map(funnymap)
df4 = df3.map(funnymap)
print("===========================================")
print("*** df1:",df1.collect())
print("*** df2:",df2.collect())
print("*** df3:",df3.collect())
print("*** df4:",df4.collect())
print("===========================================")
ef1 = dfroot.map(funnymap)
ef2 = ef1.map(funnymap)
ef3 = ef2.map(funnymap)
ef4 = ef3.map(funnymap)
print("*** ef1:",ef1.collect())
print("*** ef2:",ef2.collect())
print("*** ef3:",ef3.collect())
print("*** ef4:",ef4.collect())
If you run this, you'll see that the id for the dictionary d is different in each of the dataframes. Apparently Spark is serializing deserializing the objects as they are passed from mapper to mapper. So each gets its own version.
If this were not true, then the first call to funnymap to make df1 would also change the generation in the dfroot data frame, and as a result ef4 would have different generation numbers that df4.
I can can create a DF inside foreachRDD if I do not try and use a Case Class and simply let default names for columns be made with toDF() or if I assign them via toDF("c1, "c2").
As soon as I try and use a Case Class, and having looked at the examples, I get:
Task not serializable
If I shift the Case Class statement around I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious as to the nth Serialization error that Spark can produce and if it carries over into Structured Streaming.
I have an RDD that need not be split, may be that is the issue? NO. Running in DataBricks?
Coding is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required
val spark = SparkSession.builder
.master("local[4]")
.config("spark.driver.cores", 2)
.appName("forEachRDD")
.getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
if(!q.isEmpty) {
import spark.implicits._
val q_flatMap = q.flatMap{x=>x}
val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
val df = q_withPerson.toDF()
df.show(false)
}
}
)
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Having not grown up with Java, but having looked around I found out what to do, but am not expert enough to explain.
I was running in a DataBricks notebook where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same DB Notebook. One needs to define the case class external to the current notebook - in a separate notebook - and thus separate to the class running the Streaming.
I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
// Do something, some custom logic
...
val y = create new RDD from this RDD element x
...
or something else
// UNION to TempView
...
}}
Something really basic that I do not get:
How can convert the nth entry (x) of the RDD to an RDD itself of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set based cases, but here I want to append when I meet a condition immediately for the sake of simplicity. I.e. at the level of the item entry in the RDD.
Now, before getting a -1 as SO 41356419, I am only suggesting this as I have a specific use case and to mutate a TempView in SPARK SQL, I do need such an approach - at least that is my thinking. Not a typical SPARK USE CASE, but that is what we are / I am facing.
Thanks in advance
First of all - you can't create RDD or DF inside foreach() of another RDD or DF/DS function. But you can get nth element from RDD and create new RDD with that single element.
EDIT:
The solution, however is much simplier:
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val n = 534 // This is input value (index of the element we'ŗe interested in)
sc.setLogLevel("ERROR")
// Creating dummy rdd
val rdd = sc.parallelize(0 to 999).cache()
val singletonRdd = rdd.zipWithIndex().filter(pair => pair._1 == n)
}
}
Hope that helps!
I am using the following code to write an RDD as a sequence file
#Test
def testSparkWordCount(): Unit = {
val words = Array("Hello", "Hello", "World", "Hello", "Welcome", "World")
val conf = new SparkConf().setMaster("local").setAppName("testSparkWordCount")
val sc = new SparkContext(conf)
val dir = "file:///" + System.currentTimeMillis()
sc.parallelize(words).map(x => (x, 1)).saveAsHadoopFile(
dir,
classOf[Text],
classOf[IntWritable],
classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[Text, IntWritable]]
)
sc.stop()
}
When I run it, it complains that
Caused by: java.io.IOException: wrong key class: java.lang.String is not class org.apache.hadoop.io.Text
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1373)
at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:76)
at org.apache.spark.internal.io.SparkHadoopWriter.write(SparkHadoopWriter.scala:94)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1139)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1137)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1137)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1360)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
Should I have to use sc.parallelize(words).map(x => (new Text(x), new IntWritable(1)) instead of sc.parallelize(words).map(x => (x, 1))? I don't think i have to wrap it explicitly since SparkContext has already provides the implicits that wrap the premitive types to their corresponding Writables.
So, what should I do to make this piece of code work
Yes, SparkContext provides implicits for conversion. But this conversion do not applied during saving, must be used in usual Scala way:
import org.apache.spark.SparkContext._
val mapperFunction: String=> (Text,IntWritable) = x => (x, 1)
... parallelize(words).map(mapperFunction).saveAsHadoopFile ...
I am trying to process multiple avro files in the code below. the idea is to first get a series of avro files in a list. then open each avro file and generate a steam of tuples (string, int). then finally group the stream of tuples by key and sum the ints.
object AvroCopyUtil {
def main(args: Array[String]) : Unit = {
val conf = new SparkConf().setAppName("Leads Data Analysis").setMaster("local[*]")
val sc = new SparkContext(conf)
val fs = FileSystem.get(new Configuration())
val avroList = GetAvroList(fs, args(0))
avroList.flatMap(av =>
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av)
.map(r => (r._1.datum.get("field").toString, 1)))
.reduceByKey(_ + _)
.foreach(println)
}
def GetAvroList(fs: FileSystem, input: String) : List[String] = {
// get all children
val masterList : List[FileStatus] = fs.listStatus(new Path(input)).toList
val (allFiles, allDirs) = masterList.partition(x => x.isDirectory == false)
allFiles.map(_.getPath.toString) ::: allDirs.map(_.getPath.toString).flatMap(x => GetAvroList(fs, x))
}
}
The compile error i get is
[error] found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
[error] required: TraversableOnce[?]
[error] avroRdd.flatMap(av => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av))
[error] ^
[error] one error found
Edit: based on the suggestion below I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
but I got the error
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 2015-10-
15-00-1576041136-flumetracker.foo.com-FooAvroEvent.1444867200044.avro,hdfs:
Your function is unnecessary. You are also attempting to create an RDD within a transformation which doesn't really make sense. The transformation (in this case, flatMap) runs on top of an RDD and the records within an RDD will be what is transformed. In the case of a flatMap, the expected output of the anonymous function is a TraversableOnce object which will then be flattened into multiple records by the transformation. Looking at your code though, you don't really need to do a flatMap as a simply map will suffice. Keep in mind also that due to the immutability of RDD's, you must always reassign your transformations into new values.
Try something like:
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](filePath)
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
It seems as though you may need to take some time to grasp some of Spark's basic framework nuances. I would recommend fully reading the Spark Programming Guide. Lastly, if you want to use Avro, please also check out spark-avro as much of the boiler plate around working with Avro is taken care of there (and DataFrames may perhaps be more intuitive and easier to use for your use case).
(EDIT:)
It seems like you may have misunderstood how to load data to be processed in Spark. The parallelize() method is used to distribute collections across an RDD and not data within files. To do the latter, you actually only need to provide a comma-separated list of input files to the newAPIHadoopFile() loader. So assuming your GetAvroList() function works, you can do:
val avroList = GetAvroList(fs, args(0))
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
flatMappedRDD.foreach(println)