How will Spark react if an RDD gets bigger?

We have code running in Apache Spark. After a detailed examination of the code, I've determined that one of our mappers is modifying an object that is in an RDD, rather than making a copy of the object for the output. That is, we have an RDD of dicts, and the map function is adding things to the dictionary, rather than returning new dictionaries.
RDDs are supposed to be immutable. Ours are being mutated.
We are also having memory errors.
Question: Will Spark be confused if the size of an RDD suddenly increases?

While it probably will not crash, it can cause unspecified behaviour. For example, in this snippet
val rdd = sc.parallelize({
  val m = new mutable.HashMap[Int, Int]
  m.put(1, 2)
  m
} :: Nil)

rdd.cache() // comment out to change behaviour!

rdd.map(m => {
  m.put(2, 3)
  m
}).collect().foreach(println) // "Map(2 -> 3, 1 -> 2)"

rdd.collect().foreach(println) // Either "Map(1 -> 2)" or "Map(2 -> 3, 1 -> 2)" depending on whether caching is used
the behaviour changes depending on whether the RDD gets cached or not. The Spark API does include a handful of functions that are allowed to mutate the data, and that is clearly pointed out in the documentation; see for example https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#aggregateByKey-U-scala.Function2-scala.Function2-scala.reflect.ClassTag-
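For instance, aggregateByKey explicitly allows the sequence and combine functions to modify and return their first argument instead of allocating a new object. A minimal sketch of that documented pattern (data and key names are made up):

import scala.collection.mutable
import org.apache.spark.sql.SparkSession

object AggregateByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Mutating the buffer (the first argument) is allowed here, because the
    // aggregateByKey documentation permits it for efficiency.
    val grouped = pairs.aggregateByKey(mutable.ArrayBuffer.empty[Int])(
      (buf, v) => { buf += v; buf },
      (b1, b2) => { b1 ++= b2; b1 }
    )

    grouped.collect().foreach(println) // e.g. (a,ArrayBuffer(1, 2)) and (b,ArrayBuffer(3))
    spark.stop()
  }
}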
Consider having an RDD[(K, V)] of map entries instead of an RDD of maps, i.e. RDD[Map[K, V]]. This would enable adding new entries in a standard way using flatMap or mapPartitions. If needed, the map representation can eventually be generated by grouping, as sketched below.
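To illustrate the entries-instead-of-maps idea, here is a rough sketch (the record id, keys and values are made up):

import org.apache.spark.sql.SparkSession

object EntriesInsteadOfMaps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("entries").getOrCreate()
    val sc = spark.sparkContext

    // One logical "map" per record id: RDD[(recordId, (K, V))]
    val entries = sc.parallelize(Seq(("rec1", (1, 2))))

    // Adding a new entry to every record's map becomes a plain flatMap
    val withNewEntries = entries.flatMap { case (id, kv) => Seq((id, kv), (id, (2, 3))) }

    // If the map representation is needed later, group the entries back up
    val asMaps = withNewEntries.groupByKey().mapValues(_.toMap)
    asMaps.collect().foreach(println) // (rec1,Map(1 -> 2, 2 -> 3))
    spark.stop()
  }
}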

Okay, I developed some code to test out what happens if an object referred to in an RDD is mutated by the mapper, and I am happy to report that it is not possible if you are programming in Python.
Here is my test program:
from pyspark.sql import SparkSession
import time

COUNT = 5

def funnydir(i):
    """Return a dictionary for i"""
    return {"i": i,
            "gen": 0}

def funnymap(d):
    """Take a dictionary and perform a funnymap"""
    d['gen'] = d.get('gen', 0) + 1
    d['id'] = id(d)
    return d

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    dfroot = sc.parallelize(range(COUNT)).map(funnydir)
    dfroot.persist()
    df1 = dfroot.map(funnymap)
    df2 = df1.map(funnymap)
    df3 = df2.map(funnymap)
    df4 = df3.map(funnymap)
    print("===========================================")
    print("*** df1:", df1.collect())
    print("*** df2:", df2.collect())
    print("*** df3:", df3.collect())
    print("*** df4:", df4.collect())
    print("===========================================")
    ef1 = dfroot.map(funnymap)
    ef2 = ef1.map(funnymap)
    ef3 = ef2.map(funnymap)
    ef4 = ef3.map(funnymap)
    print("*** ef1:", ef1.collect())
    print("*** ef2:", ef2.collect())
    print("*** ef3:", ef3.collect())
    print("*** ef4:", ef4.collect())
If you run this, you'll see that the id of the dictionary d is different in each of the mapped RDDs. Apparently Spark is serializing and deserializing the objects as they are passed from mapper to mapper, so each stage gets its own copy.
If this were not true, then the first call to funnymap to make df1 would also change the generation counters in the dfroot RDD, and as a result ef4 would have different generation numbers than df4.
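Whether or not the runtime happens to protect you through serialization, the safer pattern is to have the mapper return a new object instead of mutating its input. A minimal Scala sketch of that pattern (names and values are illustrative):

import org.apache.spark.sql.SparkSession

object NoMutationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("no-mutation").getOrCreate()
    val sc = spark.sparkContext

    // An RDD of immutable maps; the mapper returns a new map instead of
    // mutating the one it was given.
    val rdd = sc.parallelize(Seq(Map("i" -> 1, "gen" -> 0)))
    val bumped = rdd.map(m => m + ("gen" -> (m.getOrElse("gen", 0) + 1)))

    println(bumped.collect().toList) // List(Map(i -> 1, gen -> 1))
    println(rdd.collect().toList)    // original is untouched: List(Map(i -> 1, gen -> 0))
    spark.stop()
  }
}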

Related

Grouping by then applying custom function in Spark, using SparkSession in a Java stream?

Let's assume that I have a use case where I want to groupBy then apply a custom function to the grouped values. In Python, I could accomplish this through:
df.groupby("id").apply(custom_function)
and
@pandas_udf("id string, prediction double", PandasUDFType.GROUPED_MAP)
def custom_function(id, dataframe):
    rf = RandomForestRegressor(n_estimators=25, random_state=42)
    rf.fit(train_features, dataframe.quantity_sold)
    prediction = rf.predict(test_features)
    return pd.DataFrame({'id': id, 'prediction': prediction}, index=[0])
I could accomplish the same thing in Scala through:
input.rdd.groupBy(row => row.get(0)).collect().map(data => {
  val df = sparkSession.createDataFrame(sparkContext.parallelize(data._2.toSeq), input.schema)
  (data._1.toString, df)
}).foldLeft(sparkSession.createDataFrame(sparkContext.emptyRDD[Row], outputSchema))((acc, next) => {
  val assembler = new VectorAssembler()
    .setInputCols(modelColumns)
    .setOutputCol(features)
    .transform(next._2)
  val forest = oldForest
    .fit(assembler)
    .transform(testAssembler)
  acc.union(forest)
})
If we compare these two workarounds, the former runs much faster than the latter. I tried to do this without collect, but I get the error "RDD transformations and actions can only be invoked by the driver, not inside of other transformations".
I am aware that collect returns the results to the driver as a list, which is why I am forced to use the Scala collection API (map and flatMap) to continue my processing.
My questions are: is the job not supposed to be spread across the executors again once it has been collected to the driver (since I am continuing to use the Spark ML API)? Or is everything simply calculated on the driver (once collected) as the code returns to where the main method is executed? Basically, why is this run so slow, and is there any approach to make this process better without using Python?
Thank you!
EDIT: Managed to solve this (example below); say we have this dataset:
+---+-----+------+-----+
|id |first|second|third|
+---+-----+------+-----+
|1 |1.0 |1.0 |1.0 |
|1 |1.0 |2.0 |2.0 |
|1 |1.0 |3.0 |3.0 |
|1 |1.0 |4.0 |4.0 |
|1 |1.0 |5.0 |5.0 |
+---+-----+------+-----+
Our goal is to group by id, then for the grouped columns (first, second and third), we want to train the model then predict something (with column third being our label).
To group and apply the UDAF (as suggested by werner):
val myAggFct = udaf(MyAgg).apply(array("first", "second", "third"))
df.groupBy("id").agg(myAggFct)
The myAggFct aggregator is implemented as follows:
// Assumes a SparkSession named `spark` is in scope (for newSequenceEncoder)
// and that Weka is on the classpath.
import spark.implicits._
import java.util
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import weka.classifiers.trees.RandomForest
import weka.core.{Attribute, DenseInstance, Instances}

object MyAgg extends Aggregator[Seq[Double], Seq[Seq[Double]], String] {
  override def zero: Seq[Seq[Double]] = scala.collection.mutable.Seq[Seq[Double]]()
  override def reduce(b: Seq[Seq[Double]], a: Seq[Double]): Seq[Seq[Double]] = b :+ a
  override def merge(b1: Seq[Seq[Double]], b2: Seq[Seq[Double]]): Seq[Seq[Double]] = b1 ++ b2

  override def finish(allInts: Seq[Seq[Double]]): String = {
    // Defining the attributes (first, second and third, as in our dataset)
    val array: List[Attribute] = List() :+
      new Attribute("first") :+
      new Attribute("second") :+
      new Attribute("third")
    // Creating the Instances and defining what we want to predict; in our case the last attribute,
    // aka `third`
    val dataRaw = new Instances("train", new util.ArrayList[Attribute](array.asJava), 0)
    dataRaw.setClassIndex(dataRaw.numAttributes() - 1)
    // Converting our Seq[Seq[Double]] to DenseInstances so we can add them to `dataRaw`,
    // aka our training data
    dataRaw.addAll(allInts.map(v => new DenseInstance(1.0, v.toArray)).asJava)
    // We create a RandomForest object and train it on `dataRaw`
    val mlp = new RandomForest()
    mlp.buildClassifier(dataRaw)
    // Give it a test case; here we want to see where first = 1.0 and second = 2.0 fall
    val testInstance = new DenseInstance(1.0, Seq(1.0, 2.0).toArray)
    testInstance.setDataset(dataRaw)
    // We classify the instance and add some content for clearer output
    mlp.classifyInstance(testInstance).toString + ": " + allInts.mkString(", ")
  }

  override def bufferEncoder: Encoder[Seq[Seq[Double]]] = newSequenceEncoder[Seq[Seq[Double]]]
  override def outputEncoder: Encoder[String] = Encoders.STRING
}
Final result:
+---+------------------------------------------------------------------------------------------------------------+
|id |myagg$(array(first, second, third)) |
+---+------------------------------------------------------------------------------------------------------------+
|1 |2.2: List(1.0, 1.0, 1.0), List(1.0, 2.0, 2.0), List(1.0, 3.0, 3.0), List(1.0, 4.0, 4.0), List(1.0, 5.0, 5.0)|
+---+------------------------------------------------------------------------------------------------------------+
In this case, there might be some overhead while converting from/to Java and Scala.
This is a good use case for a User-Defined Aggregate Function.
What is a User-Defined Aggregate Function?
After grouping a dataframe with groupBy, one or more aggregation functions like min, max or sum are usually used to aggregate all values that belong to one group of rows into a single value. If none of Spark's built-in functions suits your needs, you can write your own function that takes the data from one of the groups and aggregates it into a new value.
Just as you can use
df.groupBy('myCol1).agg(sum('myCol2))
you can use
df.groupBy('myCol1).agg(customFunction('myCol2))
where customFunction does whatever you need it to do, for example applying a RandomForestRegressor to all elements of one group of data.
How to create a User-Defined Aggregate Function?
Here is an (arguably simplistic) example of a User-Defined Aggregate Function. This function collects all values of one group in a sequence and then concatenates all these values into a string.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

//some test data: 1,2,3,...,10
val df = (1 to 10).toDF()

//create the user defined aggregation function
object MyAgg extends Aggregator[Int, Seq[Int], String] {
  override def zero: Seq[Int] = scala.collection.mutable.Seq[Int]()
  override def reduce(b: Seq[Int], a: Int): Seq[Int] = b :+ a
  override def merge(b1: Seq[Int], b2: Seq[Int]): Seq[Int] = b1 ++ b2
  override def finish(allInts: Seq[Int]): String = allInts.foldLeft("START")((s, b) => s + "_" + b)
  override def bufferEncoder: Encoder[Seq[Int]] = newSequenceEncoder[Seq[Int]]
  override def outputEncoder: Encoder[String] = Encoders.STRING
}

val myAggFct = udaf(MyAgg).withName("myAgg")

//group the dataframe and apply myAggFct to each group separately
df.groupBy(expr("value % 3")).agg(myAggFct('value)).show
Output:
+-----------+--------------+
|(value % 3)| myagg(value)|
+-----------+--------------+
| 1|START_1_4_7_10|
| 2| START_2_5_8|
| 0| START_3_6_9|
+-----------+--------------+
How does the User-Defined Aggregate Function work?
The two functions reduce and merge combine all values of one group into a sequence, starting from the empty sequence created by the zero function.
The central function is finish. Here the sequence of all collected values (allInts) is transformed into the result of the aggregation operation. This would be the place to apply, for example, the RandomForestRegressor. As the finish function runs distributed on the executor nodes, all required additional data should be broadcast.
Note: the example above could also (better) be implemented using Dataset.reduce, because we do not need the values as a sequence. We could simply add the values to the string as soon as we see them. But for a regressor we need the complete list of values, and so the User-Defined Aggregate Function is reasonable here.
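For completeness, a rough sketch of that reduce-style alternative using groupByKey and reduceGroups; it will not reproduce the START prefix, and the concatenation order within a group is not guaranteed:

import org.apache.spark.sql.SparkSession

object ReduceGroupsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("reduce-groups").getOrCreate()
    import spark.implicits._

    val ds = (1 to 10).toDS()

    // Same grouping as the aggregator example, but folding the strings
    // directly instead of collecting a Seq[Int] first.
    val result = ds.groupByKey(_ % 3)
      .mapValues(_.toString)
      .reduceGroups((a, b) => a + "_" + b)

    result.show() // e.g. key 1 -> "1_4_7_10"
    spark.stop()
  }
}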
The Python version actually runs in the executors and hence distributes the load. collect requires all executors to send their data to the driver for processing, which means you only use the threads provided by the driver. You are also likely suffering from a lot of garbage collection, since you are creating a VectorAssembler over and over again (and immediately throwing it away).
If you want, you can do collect-like things inside an executor by using mapPartitions.
val df4 = df2.mapPartitions(iterator => { // Start executor code
  // Do the heavy initialization here,
  // like database connections etc.
  val util = new Util()
  val res = iterator.map(row => {
    val fullName = util.combine(row.getString(0), row.getString(1), row.getString(2))
    (fullName, row.getString(3), row.getInt(5))
  })
  res // End executor code
})
val df4part = df4.toDF("fullName", "id", "salary")
df4part.printSchema()
df4part.show(false)
The catch is that you cannot use any feature that relies on the SparkContext, as that only lives inside the driver. Said another way: you can only use pure Scala/Java features inside the executor code. But if you can find a Scala library for random forests, that would be the answer. The iterator used inside is very memory efficient and will run much faster than the collect you are doing.
You likely really want to use Spark's RandomForestRegressor?
It looks like you have a global oldForest, so I can't tell what you are using, but [a global variable] won't work with mapPartitions; instead, initialize it once per partition and reuse it for every row inside the executor code, as sketched below.
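A minimal sketch of that per-partition initialization pattern; ExpensiveModel is a hypothetical stand-in for whatever pure Scala/Java model or client you end up using:

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for an expensive-to-build, non-Spark model or client.
class ExpensiveModel extends Serializable {
  def score(x: Double): Double = x * 2.0 // placeholder logic
}

object PerPartitionInit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-partition-init").getOrCreate()
    import spark.implicits._

    val ds = Seq(1.0, 2.0, 3.0, 4.0).toDS()

    // Build the model once per partition (on the executor), then reuse it for every row.
    val scored = ds.mapPartitions { rows =>
      val model = new ExpensiveModel()
      rows.map(model.score)
    }

    scored.show()
    spark.stop()
  }
}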

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
  // Do something, some custom logic
  ...
  val y = create new RDD from this RDD element x
  ...
  or something else
  // UNION to TempView
  ...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD into an RDD of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set-based cases, but here, for the sake of simplicity, I want to append immediately when I meet a condition, i.e. at the level of the individual item entry in the RDD.
Now, before this gets a -1 like SO 41356419, I am only suggesting it because I have a specific use case: to mutate a TempView in Spark SQL, I do need such an approach - at least that is my thinking. Not a typical Spark use case, but that is what we are / I am facing.
Thanks in advance
First of all, you can't create an RDD or DF inside a foreach() of another RDD or DF/DS function. But you can get the nth element from an RDD and create a new RDD with that single element.
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val n = 534 // This is the input value (index of the element we're interested in)
    sc.setLogLevel("ERROR")
    // Creating a dummy RDD
    val rdd = sc.parallelize(0 to 999).cache()
    // zipWithIndex pairs each element with its index; filter on the index
    val singletonRdd = rdd.zipWithIndex().filter(pair => pair._2 == n)
  }
}
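To also cover the second part of the question (converting that single entry directly to a DF), a hedged continuation of main() above could look like this; it assumes a SparkSession is built from the same conf:

// Still inside main(): a SparkSession is needed for toDF.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._

// Drop the index, keep just the matching element, and turn it into a one-row DataFrame.
val singletonDf = singletonRdd
  .map { case (value, _) => value }
  .toDF("value")

singletonDf.show() // a single row containing element n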
Hope that helps!

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform1)
.distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform2)
.distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.cache
val foos1 = cached.map(transform1)
.distinct().collect().toSet
val foos2 = cached.map(transform2)
.distinct().collect().toSet
Second option - use RDD and make a single pass:
val foos = spark_session.read(foo_file)
.flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
.distinct
.collect
.groupBy(_._1)
.mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
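If the return types do differ, one way to keep the single pass is to tag the two branches, for example with Either. A rough sketch, with toy transforms standing in for the real ones:

import org.apache.spark.sql.SparkSession

object SplitPipelineSketch {
  // Toy stand-ins for the real transforms; note they return different types.
  def transform1(x: String): Int = x.length
  def transform2(x: String): String = x.toUpperCase

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-pipeline").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the already-parsed and processed records.
    val processed = sc.parallelize(Seq("foo", "bar", "foo"))

    // One pass: tag each result so both branches fit in a single RDD.
    val tagged = processed
      .flatMap(x => Seq[Either[Int, String]](Left(transform1(x)), Right(transform2(x))))
      .distinct()
      .collect()

    val foos1 = tagged.collect { case Left(a) => a }.toSet
    val foos2 = tagged.collect { case Right(b) => b }.toSet

    println(foos1) // Set(3)
    println(foos2) // Set(FOO, BAR)
    spark.stop()
  }
}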

Spark 1.6.2's RDD caching seems to do weird things with filters in some cases

I have an RDD:
avroRecord: org.apache.spark.rdd.RDD[com.rr.eventdata.ViewRecord] = MapPartitionsRDD[75]
I then filter the RDD for a single matching value:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
I now count how many distinct values I get for SiteId. Given the filter it should be "1". Here are two ways I do it, without cache and with cache:
val basic = siteFiltered.map(_.getSiteId).distinct.count
val cached = siteFiltered.cache.map(_.getSiteId).distinct.count
The result indicates that the cached version isn't filtered at all:
basic: Long = 1
cached: Long = 93
"93" isn't even the expected value if the filter was ignored completely (that answer is "522"). It also isn't a problem with "distinct" as the values are real ones.
It seems like the cached RDD has some odd partial version of the filter.
Anyone know what's going on here?
I suppose the problem is that you have to cache the result of your RDD before doing any action on it.
Spark builds a DAG that represents the execution of your program. Each node is a transformation or an action on your RDD. Without caching the RDD, each action forces Spark to execute the whole DAG from the beginning (or from the last cache invocation).
So, your code should work if you do the following changes:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
  .map(_.getSiteId).cache

val basic = siteFiltered.distinct.count
// Yes, I know, this way the second count makes no sense at all
val cached = siteFiltered.distinct.count
There is no issue with your code. It should work fine.
I tried out the same thing locally and it works fine, without any discrepancies across multiple runs.
I have the following data with me:
Event1,11.4
Event2,82.0
Event3,53.8
Event4,31.0
Event5,22.6
Event6,43.1
Event7,11.0
Event8,22.1
Event8,22.1
Event8,22.1
Event8,22.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event11,3.22
Event12,13.11
And I tried the same thing you did; following is my code, which works fine:
scala> var textrdd = sc.textFile("file:///data/pocs/blogs/eventrecords");
textrdd: org.apache.spark.rdd.RDD[String] = file:///data/pocs/blogs/eventrecords MapPartitionsRDD[123] at textFile at <console>:27
scala> var filteredRdd = textrdd.filter(_.split(",")(1).toDouble > 1)
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[124] at filter at <console>:29
scala> filteredRdd.map(x => x.split(",")(1)).distinct.count
res36: Long = 12
scala> filteredRdd.cache.map(x => x.split(",")(1)).distinct.count
res37: Long = 12

Doing flatmap on a function returning RDD

I am trying to process multiple Avro files in the code below. The idea is to first get a series of Avro files in a list, then open each Avro file and generate a stream of tuples (String, Int), and finally group the stream of tuples by key and sum the Ints.
object AvroCopyUtil {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Leads Data Analysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fs = FileSystem.get(new Configuration())
    val avroList = GetAvroList(fs, args(0))

    avroList.flatMap(av =>
      sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av)
        .map(r => (r._1.datum.get("field").toString, 1)))
      .reduceByKey(_ + _)
      .foreach(println)
  }

  def GetAvroList(fs: FileSystem, input: String): List[String] = {
    // get all children
    val masterList: List[FileStatus] = fs.listStatus(new Path(input)).toList
    val (allFiles, allDirs) = masterList.partition(x => x.isDirectory == false)
    allFiles.map(_.getPath.toString) ::: allDirs.map(_.getPath.toString).flatMap(x => GetAvroList(fs, x))
  }
}
The compile error I get is:
[error] found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
[error] required: TraversableOnce[?]
[error] avroRdd.flatMap(av => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av))
[error] ^
[error] one error found
Edit: based on the suggestion below I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
but I got the error
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 2015-10-
15-00-1576041136-flumetracker.foo.com-FooAvroEvent.1444867200044.avro,hdfs:
Your function is unnecessary. You are also attempting to create an RDD within a transformation, which doesn't really make sense. The transformation (in this case, flatMap) runs on top of an RDD, and the records within that RDD are what gets transformed. In the case of a flatMap, the expected output of the anonymous function is a TraversableOnce object, which is then flattened into multiple records by the transformation. Looking at your code though, you don't really need a flatMap, as a simple map will suffice. Keep in mind also that, due to the immutability of RDDs, you must always assign the result of a transformation to a new value.
Try something like:
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](filePath)
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
It seems as though you may need to take some time to grasp some of Spark's basic framework nuances. I would recommend fully reading the Spark Programming Guide. Lastly, if you want to use Avro, please also check out spark-avro, as much of the boilerplate around working with Avro is taken care of there (and DataFrames may perhaps be more intuitive and easier to use for your use case).
(EDIT:)
It seems like you may have misunderstood how to load data to be processed in Spark. The parallelize() method is used to distribute collections across an RDD and not data within files. To do the latter, you actually only need to provide a comma-separated list of input files to the newAPIHadoopFile() loader. So assuming your GetAvroList() function works, you can do:
val avroList = GetAvroList(fs, args(0))
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
countsRDD.foreach(println)
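As an aside, regarding the spark-avro suggestion above: with the Avro data source on the classpath (built into Spark 2.4+ under the format name "avro"; older versions used the external com.databricks.spark.avro package), the same count can be sketched against DataFrames. The column name "field1" is carried over from the example above and may differ in a real schema:

import org.apache.spark.sql.SparkSession

object AvroDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("avro-df").getOrCreate()

    // args(0) is a path or glob pointing at the Avro files.
    val df = spark.read.format("avro").load(args(0))

    df.groupBy("field1").count().show()
    spark.stop()
  }
}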
