Spark: How to merge shuffled RDDs efficiently?

I have 5 shuffled key-value RDDs: one big one (1,000,000 records) and 4 relatively small ones (100,000 records each). All RDDs were shuffled with the same number of partitions. I have two strategies for merging the 5 of them:
1. Merge the 5 RDDs together.
2. Merge the 4 small RDDs together and then join the result with the big one.
I thought strategy 2 would be more efficient, as it would not re-shuffle the big one. But the experiment shows that strategy 1 is more efficient. The code and output follow:
Code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
object MergeStrategy extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val bigRddSize = 1e6.toInt
val smallRddSize = 1e5.toInt
println(bigRddSize)
val bigRdd = sc.parallelize((0 until bigRddSize)
.map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
bigRdd.take(10).foreach(println)
val smallRddList = (0 until 4).map(i => {
val rst = sc.parallelize((0 until smallRddSize)
.map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
println(rst.count)
rst
}).toArray
// strategy 1
{
val begin = System.currentTimeMillis
val s1Rst = sc.union(Array(bigRdd) ++ smallRddList).distinct(100)
println(s1Rst.count)
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("S1 time count: %.1f s".format(timeCost))
}
// strategy 2
{
val begin = System.currentTimeMillis
val smallMerged = sc.union(smallRddList).distinct(100).cache
println(smallMerged.count)
val s2Rst = bigRdd.fullOuterJoin(smallMerged).flatMap({ case (key, (left, right)) => {
if (left.isDefined && right.isDefined) Array((key, left.get), (key, right.get)).distinct
else if (left.isDefined) Array((key, left.get))
else if (right.isDefined) Array((key, right.get))
else throw new Exception("Cannot happen")
}
})
println(s2Rst.count)
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("S2 time count: %.1f s".format(timeCost))
}
}
Output
1000000
(688282474,0)
(-255073127,0)
(872746474,0)
(-792516900,0)
(417252803,0)
(-1514224305,0)
(1586932811,0)
(1400718248,0)
(939155130,0)
(1475156418,0)
100000
100000
100000
100000
1399777
S1 time count: 39.7 s
399984
1399894
S2 time count: 49.8 s
Is my understanding of shuffled RDDs wrong? Can anybody give me some advice?
Thanks!

I found a way to merge RDDs more efficiently. Compare the following 2 merging strategies:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{HashPartitioner, SparkContext, SparkConf}
import scala.collection.mutable.ArrayBuffer
object MergeStrategy extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val rddCount = 20
val mergeCount = 5
val dataSize = 20000
val parts = 50
// generate data
scala.util.Random.setSeed(943343)
val testData = for (i <- 0 until rddCount)
yield sc.parallelize(scala.util.Random.shuffle((0 until dataSize).toList).map(x => (x, 0)))
.partitionBy(new HashPartitioner(parts))
.cache
testData.foreach(x => println(x.count))
// strategy 1: merge directly
{
val buff = ArrayBuffer[RDD[(Int, Int)]]()
val begin = System.currentTimeMillis
for (i <- 0 until rddCount) {
buff += testData(i)
if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
val merged = sc.union(buff).distinct
.partitionBy(new HashPartitioner(parts)).cache
println(merged.count)
buff.foreach(_.unpersist(false))
buff.clear
buff += merged
}
}
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("Strategy 1 Time Cost: %.1f".format(timeCost))
assert(buff.size == 1)
println("Strategy 1 Complete, with merged Count %s".format(buff(0).count))
}
// strategy 2: merge directly without repartition
{
val buff = ArrayBuffer[RDD[(Int, Int)]]()
val begin = System.currentTimeMillis
for (i <- 0 until rddCount) {
buff += testData(i)
if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
val merged = sc.union(buff).distinct(parts).cache
println(merged.count)
buff.foreach(_.unpersist(false))
buff.clear
buff += merged
}
}
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("Strategy 2 Time Cost: %.1f".format(timeCost))
assert(buff.size == 1)
println("Strategy 2 Complete, with merged Count %s".format(buff(0).count))
}
}
The result shows that strategy 1 (20.8 seconds) is more efficient than strategy 2 (34.3 seconds). My PC runs Windows 8 with a 4-core 2.0 GHz CPU and 8 GB of memory.
The only difference is that strategy 1 repartitions the merged result with a HashPartitioner and strategy 2 does not. As a result, strategy 1 produces a ShuffledRDD, while strategy 2 produces a MapPartitionsRDD. I think the RDD.distinct function processes a ShuffledRDD more efficiently than a MapPartitionsRDD.
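One way to test that guess is to look at each RDD's partitioner and lineage rather than at its class name. The following is a minimal sketch, assuming a local master and small synthetic data; the object name PartitionerCheck and the helper partitioned are made up for illustration. What it prints depends on the Spark version, so no output is shown.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerCheck {
  // Hypothetical helper: a small pair RDD hash-partitioned into `parts` partitions.
  def partitioned(sc: SparkContext, parts: Int): RDD[(Int, Int)] =
    sc.parallelize((0 until 1000).map(x => (x, 0)))
      .partitionBy(new HashPartitioner(parts))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-check")
    val sc = new SparkContext(conf)
    val parts = 8
    val a = partitioned(sc, parts)
    val b = partitioned(sc, parts)
    val merged = sc.union(Seq(a, b)).distinct(parts)

    // A defined partitioner means Spark knows how the keys are laid out.
    println(s"partitionBy:      ${a.partitioner}")
    println(s"union:            ${sc.union(Seq(a, b)).partitioner}")
    println(s"union + distinct: ${merged.partitioner}")
    // toDebugString prints the lineage, including any shuffle boundaries.
    println(merged.toDebugString)
    sc.stop()
  }
}
```

If the partitioner is still defined after a step, later key-based operations with the same partitioner can reuse the layout; if it is None, they have to shuffle again.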

Related

How to generate large word count file in Spark?

I want to generate a word-count input file with 10 million lines for a performance test (each line contains the same sentence), but I have no idea how to code it.
Could you give me some example code that saves the file directly to HDFS?
You can try something like this.
Generate one column holding the values 1 to 100k and another holding the values 1 to 100, then explode both of them with explode(column).
You can't generate a single column with 10 million values, because the Kryo buffer will throw an error.
I don't know if this is the best-performing way to do it, but it is the fastest way I can think of right now.
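import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._ // assumes an existing SparkSession named spark
// UDF that returns the list 1..s; it is only used to multiply rows via explode
val generateList = udf((s: Int) => {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- 1 to s) {
    buf += i
  }
  buf
})
val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
  .withColumn("genColumn2", generateList(lit(100)))
// Explode the generated columns; the original referred to columns "mil" and "10", which do not exist
val someDfWithMilColumn100k = someDfWithMilColumn
  .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")
val someDfWithMilColumn10mil = someDfWithMilColumn100k
  .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")
someDfWithMilColumn10mil.write.parquet(path) // path: your HDFS output location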
You can do it by joining the 2 DFs as below,
Also find the code explanation inline.
import org.apache.spark.sql.SaveMode
object GenerateTenMils {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
spark.conf.set("spark.sql.crossJoin.enabled","true") // Enable cross join
import spark.implicits._
//Create a DF with your sentence
val df = List("each line has the same sentence").toDF
//Create another Dataset with 10000000 records
spark.range(10000000)
.join(df) // Cross Join the dataframes
.coalesce(1) // Output to a single file
.drop("id") // Drop the extra column
.write
.mode(SaveMode.Overwrite)
.text("src/main/resources/tenMils") // Write as text file
}
}
You could follow this approach.
It uses tail-recursive functions to generate the list of sentences and the DataFrames, and union to build the big DataFrame.
import scala.annotation.tailrec
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession
  .builder()
  .appName("TenMillionsRows")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "TenMillionsRows") // To silence Metrics warning
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
// Defined before the helpers because getDataFrame refers to it
val sentence = "hope for the best but prepare for the worst"
/**
 * Returns a List of num sentences
 * @param sentence
 * @param num
 * @return
 */
def getList(sentence: String, num: Int): List[String] = {
  @tailrec
  def loop(st: String, n: Int, acc: List[String]): List[String] = {
    n match {
      case 0 => acc
      case _ => loop(st, n - 1, st :: acc)
    }
  }
  loop(sentence, num, List())
}
/**
 * Returns a DataFrame that is the union of num DataFrames
 * @param lst
 * @param num
 * @return
 */
def getDataFrame(lst: List[String], num: Int): DataFrame = {
  @tailrec
  def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
    n match {
      case 0 => acc
      case _ => loop(ls, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
    }
  }
  // The accumulator starts as a one-row DataFrame, which is why the count below has one extra row
  loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence, 100)
println(dfs.count())
// output: 10000001
dfs.write.orc("path_to_hdfs") // write the DataFrame to an ORC file
// you can save the file as parquet, txt, json .......
// with dataframe.write
Hope this helps.
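The answers above all multiply a single sentence into many rows. A shorter variant of the same idea, with no join, UDF or union, is to attach the sentence as a literal column to spark.range. This is a minimal sketch, assuming Spark 2.x or later; the names RepeatSentence and repeated and the default output path are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object RepeatSentence {
  // Returns a single-column DataFrame with n rows, each holding the same sentence.
  def repeated(spark: SparkSession, sentence: String, n: Long): DataFrame =
    spark.range(n).select(lit(sentence).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RepeatSentence")
      .master("local[*]")
      .getOrCreate()
    // Hypothetical default; pass an hdfs:// URI as the first argument to write to HDFS.
    val outputPath = args.headOption.getOrElse("/tmp/ten-million-lines")
    repeated(spark, "each line has the same sentence", 10000000L)
      .write.mode("overwrite").text(outputPath) // text output needs exactly one string column
    spark.stop()
  }
}
```

The output is split into one file per partition; add coalesce(1) before write if a single file is required, at the cost of writing through one task.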

Spark can not serialize the BufferedImage class

I am getting a not-serializable-class exception in Spark 2.2.0.
This is the procedure I am trying to implement in Scala:
1. Read a set of JPEG images from HDFS.
2. Build an array of java.awt.image.BufferedImage objects.
3. Extract the buffer of each java.awt.image.BufferedImage and store it in a 2D array, building an array of two-dimensional arrays (one Array[Array[Int]] per image) that holds the image buffer information.
4. Transform that array into an org.apache.spark.rdd.RDD[Array[Array[Int]]] using the sc.parallelize method.
5. Perform image processing operations in a distributed fashion by transforming the initial org.apache.spark.rdd.RDD[Array[Array[Int]]].
This is the code:
import org.apache.spark.sql.SparkSession
import javax.imageio.ImageIO
import java.io.ByteArrayInputStream
def binarize(image: Array[Array[Int]], threshold: Int) : Array[Array[Int]] = {
val height = image.size
val width = image(0).size
val result = Array.ofDim[Int](height, width)
for (i <- 0 until height) {
for (j <- 0 until width){
result(i)(j) = if (image(i)(j) <= threshold) 0 else 255
}
}
result
}
object imageTestObj {
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("imageTest2").getOrCreate()
val sc = spark.sparkContext
val saveToHDFS = false
val threshold: Int = 128
val partitions = 32
val inPathStr = "hdfs://192.168.239.218:9000/vitrion/input"
val outPathStr = if (saveToHDFS) "hdfs://192.168.239.54:9000/vitrion/output/" else "/home/vitrion/IdeaProjects/imageTest2/output/"
val files = sc.binaryFiles(inPathStr).collect
val AWTImageArray = files.map { binFile =>
val input = binFile._2.open()
val name = binFile._1
var buffer: Array[Byte] = Array.fill(input.available)(0)
input.readFully(buffer)
ImageIO.read(new ByteArrayInputStream(buffer))
}
val ImgBuffers = AWTImageArray.map { image =>
val height = image.getHeight
val width = image.getWidth
val buffer = Array.ofDim[Int](height, width)
for (i <- 0 until height) {
for (j <- 0 until width){
buffer(i)(j) = image.getRaster.getDataBuffer.getElem(0, i * width + j)
}
}
buffer
}
val inputImages = sc.parallelize(ImgBuffers, partitions).cache()
val op1 = inputImages.map(image => binarize(image, threshold))
}
}
This algorithm gets a very well-known exception:
org.apache.spark.SparkException: Task not serializable
...
Caused by: java.io.NotSerializableException: java.awt.image.BufferedImage
Serialization stack:
- object not serializable (class: java.awt.image.BufferedImage, ...
I do not understand why Spark attempts to serialize the BufferedImage class when it is only used before the first RDD in the application is created. Shouldn't BufferedImage need to be serialized only if I try to create an RDD[BufferedImage]?
Can somebody explain to me what is going on?
Thank you in advance.
What you are actually serializing in Spark is a function. This function cannot contain references to non-serializable classes. You can instantiate non-serializable classes inside the function (that is OK), but you cannot refer from the function to existing instances of non-serializable classes.
Most probably one of the functions you use references an instance of BufferedImage.
Check your code and see whether any function references a BufferedImage object.
By inlining some code so that no BufferedImage object is serialized, I guess you can overcome the exception. Can you try out this code (I did not execute it myself)?
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO
import org.apache.spark.sql.SparkSession
// binarize is the function defined in the question
object imageTestObj {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("imageTest2").getOrCreate()
    val sc = spark.sparkContext
    val saveToHDFS = false
    val threshold: Int = 128
    val partitions = 32
    val inPathStr = "hdfs://192.168.239.218:9000/vitrion/input"
    val outPathStr = if (saveToHDFS) "hdfs://192.168.239.54:9000/vitrion/output/" else "/home/vitrion/IdeaProjects/imageTest2/output/"
    val ImgBuffers = sc.binaryFiles(inPathStr).collect.map { binFile =>
      val input = binFile._2.open()
      val name = binFile._1
      // Named bytes rather than buffer: declaring buffer twice in this scope does not compile
      val bytes: Array[Byte] = Array.fill(input.available)(0)
      input.readFully(bytes)
      val image = ImageIO.read(new ByteArrayInputStream(bytes))
      // Inlining must be here, so that BufferedImage is not serialized.
      val height = image.getHeight
      val width = image.getWidth
      val buffer = Array.ofDim[Int](height, width)
      for (i <- 0 until height) {
        for (j <- 0 until width) {
          buffer(i)(j) = image.getRaster.getDataBuffer.getElem(0, i * width + j)
        }
      }
      buffer
    }
    val inputImages = sc.parallelize(ImgBuffers, partitions).cache()
    val op1 = inputImages.map(image => binarize(image, threshold))
  }
}
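The difference between referring to an existing instance and creating one inside the function can be reproduced without Spark, because Spark serializes closures with plain Java serialization. The following is a minimal sketch, assuming Scala 2.12 or later; NotSerializableThing is a hypothetical stand-in for a class such as BufferedImage, and the other names are made up as well.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureSerializationSketch {
  // Hypothetical stand-in for a non-serializable class such as BufferedImage.
  class NotSerializableThing(val value: Int)

  // Tries to Java-serialize an object and reports whether it worked.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The closure captures an existing instance, so that instance has to be
  // serialized together with the function.
  def capturing(): Int => Int = {
    val thing = new NotSerializableThing(10)
    (x: Int) => x + thing.value
  }

  // The closure creates the instance when it runs, so nothing
  // non-serializable travels with the function.
  def instantiating(): Int => Int =
    (x: Int) => x + new NotSerializableThing(10).value
}
```

Here isSerializable(capturing()) is false and isSerializable(instantiating()) is true, which mirrors the rule stated in the answer.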

spark dataframe null value count

I am new to Spark and I want to calculate the null rate of each column (I have 200 columns). My function is as follows:
def nullCount(dataFrame: DataFrame): Unit = {
val args = dataFrame.columns.length
val cols = dataFrame.columns
val d=dataFrame.count()
println("Follows are the null value rate of each columns")
for (i <- Range(0,args)) {
var nullrate = dataFrame.rdd.filter(r => r(i) == (-900)).count.toDouble / d
println(cols(i), nullrate)
}
}
But I find it too slow. Is there a more efficient way to do this?
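Adapted from this answer by zero323:
import org.apache.spark.sql.functions.{col, count, when}
// count(c) skips nulls, so this is the fraction of non-null values per column; subtract it from 1 for the null rate
df.select(df.columns.map(c => (count(c) / count("*")).alias(c)): _*)
With -900 as the missing-value marker:
df.select(df.columns.map(
  c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)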
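If missing values are stored as real nulls rather than as the -900 marker, the same single-pass aggregation can count them directly. This is a minimal sketch under that assumption; NullRate and nullRates are hypothetical names, and column names containing dots would need extra quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, when}

object NullRate {
  // One job for all columns: per column, the fraction of rows where the value is null.
  def nullRates(df: DataFrame): DataFrame =
    df.select(df.columns.map { c =>
      // when(...) yields 1 for null cells and null otherwise; count ignores nulls,
      // so the numerator is the number of null cells in column c.
      (count(when(col(c).isNull, lit(1))) / count(lit(1))).alias(c)
    }: _*)
}
```

The result is a one-row DataFrame with the same column names as the input, so nullRates(df).show() prints all rates at once instead of running one job per column.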

value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]

I am getting a compilation error when converting the pre-LDA transformation to a DataFrame using Scala in Spark 2.0. The specific code that throws the error is below:
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
The complete compilation error is:
Error:(132, 8) value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Here is the complete code:
import java.io.FileInputStream
import java.sql.{DriverManager, ResultSet}
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA => oldLDA}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object MPClassificationLDA {
/*Start: Configuration variable initialization*/
val props = new Properties
val fileStream = new FileInputStream("U:\\JIRA\\MP_Classification\\target\\classes\\mpclassification.properties")
props.load(fileStream)
val mpExtract = props.getProperty("mpExtract").toString
val shard6_db_server_name = props.getProperty("shard6_db_server_name").toString
val shard6_db_user_id = props.getProperty("shard6_db_user_id").toString
val shard6_db_user_pwd = props.getProperty("shard6_db_user_pwd").toString
val mp_output_file = props.getProperty("mp_output_file").toString
val spark_warehouse_path = props.getProperty("spark_warehouse_path").toString
val rf_model_file_path = props.getProperty("rf_model_file_path").toString
val windows_hadoop_home = props.getProperty("windows_hadoop_home").toString
val lda_vocabulary_size = props.getProperty("lda_vocabulary_size").toInt
val pre_lda_model_file_path = props.getProperty("pre_lda_model_file_path").toString
val lda_model_file_path = props.getProperty("lda_model_file_path").toString
fileStream.close()
/*End: Configuration variable initialization*/
val conf = new SparkConf().set("spark.sql.warehouse.dir", spark_warehouse_path)
def main(arg: Array[String]): Unit = {
//SQL Query definition and parameter values as parameter upon executing the Object
val cont_id = "14211599"
val top = "100000"
val start_date = "2016-05-01"
val end_date = "2016-06-01"
val mp_spark = SparkSession
.builder()
.master("local[*]")
.appName("MPClassificationLoadLDA")
.config(conf)
.getOrCreate()
MPClassificationLDACalculation(mp_spark, cont_id, top, start_date, end_date)
mp_spark.stop()
}
private def MPClassificationLDACalculation
(mp_spark: SparkSession
,cont_id: String
,top: String
,start_date: String
,end_date: String
): Unit = {
//DB connection definition
def createConnection() = {
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
DriverManager.getConnection("jdbc:sqlserver://" + shard6_db_server_name + ";user=" + shard6_db_user_id + ";password=" + shard6_db_user_pwd);
}
//DB Field Names definition
def extractvalues(r: ResultSet) = {
Row(r.getString(1),r.getString(2))
}
//Prepare SQL Statement with parameter value replacement
val query = """SELECT docId = audt_id, text = auction_title FROM brands6.dbo.uf_ds_marketplace_classification_listing(#cont_id, #top, '#start_date', '#end_date') WHERE ? < ? OPTION(RECOMPILE);"""
.replaceAll("#cont_id", cont_id)
.replaceAll("#top", top)
.replaceAll("#start_date", start_date)
.replaceAll("#end_date", end_date)
.stripMargin
//Connect to Source DB and execute the Prepared SQL Steatement
val mpDataRDD = new JdbcRDD(mp_spark.sparkContext
,createConnection
,query
,lowerBound = 0
,upperBound = 10000000
,numPartitions = 1
,mapRow = extractvalues)
val schema_string = "docId,text"
val fields = StructType(schema_string.split(",")
.map(fieldname => StructField(fieldname, StringType, true)))
//Create Data Frame using format identified through schema_string
val mpDF = mp_spark.createDataFrame(mpDataRDD, fields)
mpDF.collect()
val mp_listing_tmp = mpDF.selectExpr("cast(docId as long) docId", "text")
mp_listing_tmp.printSchema()
println(mp_listing_tmp.first)
val mp_listing_lda_df = mp_listing_tmp.withColumn("docId", mp_listing_tmp("docId"))
mp_listing_lda_df.printSchema()
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("rawTokens")
.setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
.setInputCol("rawTokens")
.setOutputCol("tokens")
val vocabSize = 4000
val countVectorizer = new CountVectorizer()
.setVocabSize(vocabSize)
.setInputCol("tokens")
.setOutputCol("features")
val PreLDApipeline = new Pipeline()
.setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
val PreLDAmodel = PreLDApipeline.fit(mp_listing_lda_df)
//comment out after saving it the first time
PreLDAmodel.write.overwrite().save(pre_lda_model_file_path)
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
//documents.printSchema()
val numTopics: Int = 20
val maxIterations: Int = 100
//note the FeaturesCol need to be set
val lda = new LDA()
.setOptimizer("em")
.setK(numTopics)
.setMaxIter(maxIterations)
.setFeaturesCol(("_2"))
val vocabArray = PreLDAmodel.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
}
}
I think it is related to conflicts in the imports section of the code. I would appreciate any help.
Two things need to be done:
1. Import implicits. Note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
2. Move the case class outside of the method. The case class that defines the schema of the DataFrame should be defined outside of the method that needs it. You can read more about it here: https://issues.scala-lang.org/browse/SI-6649
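The question builds a SparkSession rather than an SQLContext, and the same implicits can be imported from the session. The following is a minimal sketch of the failing conversion with that import in place, assuming the input DataFrame has a long docId column and an ML vector features column; ToDfSketch and toDocuments are hypothetical names.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object ToDfSketch {
  def toDocuments(spark: SparkSession, df: DataFrame): DataFrame = {
    // Brings toDF into scope for RDDs of tuples and case classes.
    // It must come after the session value exists, because it imports from that instance.
    import spark.implicits._
    df.select("docId", "features")
      .rdd
      .map { case Row(docId: Long, features: MLVector) => (docId, features) }
      .toDF("docId", "features")
  }
}
```

Naming the columns in toDF keeps them as docId and features instead of _1 and _2, so a later setFeaturesCol call can refer to "features".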

How to process a subset of input records in a batch, i.e. the first second in 3-sec batch time?

Suppose I set Seconds(1) as the batch time in StreamingContext, like this:
val ssc = new StreamingContext(sc, Seconds(1))
Over 3 seconds the stream receives 3 seconds of data, but I only need the first second of data and can discard the next 2 seconds. Can I spend 3 seconds processing only the first second of data?
You can do this via updateStateByKey if you keep track of a counter, for example like below:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamEveryThirdApp {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "Streaming Test")
implicit val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("./checkpoint")
// generate stream
val inputDStream = createConstantStream
// increase seconds counter
val accStream = inputDStream.updateStateByKey(updateState)
// keep only 1st second records
val firstOfThree = accStream.filter { case (key, (value, counter)) => counter == 1}
firstOfThree.print()
ssc.start()
ssc.awaitTermination()
}
def updateState: (Seq[Int], Option[(Option[Int], Int)]) => Option[(Option[Int], Int)] = {
  case (values, state) =>
    state match {
      // No previous state: this is the first second of a cycle
      case None => Some((Some(values.sum), 1))
      // The stored counter is 2, so this batch is the 3rd second: remove the state so the next batch starts a new cycle
      // (matching on 3 here would only remove the state during a 4th batch, giving a 4-second cycle)
      case Some((_, 2)) => None
      // Not the first second: increase the seconds counter, but don't calculate values
      case Some((_, counter)) => Some((None, counter + 1))
    }
}
def createConstantStream(implicit ssc: StreamingContext): ConstantInputDStream[(String, Int)] = {
val seq = Seq(
("key1", 1),
("key2", 3),
("key1", 2),
("key1", 2)
)
val rdd = ssc.sparkContext.parallelize(seq)
val inputDStream = new ConstantInputDStream(ssc, rdd)
inputDStream
}
}
If your data carries its own time information, you could also use a 3-second window, stream.window(Seconds(3), Seconds(3)), and filter the records by that time information. This is quite often the preferred approach.
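The counter logic of the updateStateByKey approach can be checked without a streaming context by feeding the update function a sequence of batches by hand. This is a pure-Scala sketch with the same state shape as updateState; the object EveryThirdBatchSketch, the parameter period and the helper keptBatches are hypothetical. It removes the state once the stored counter reaches period - 1, which makes a cycle exactly period batches long.

```scala
object EveryThirdBatchSketch {
  // (sum seen in the first batch of a cycle, batch counter within the cycle)
  type State = (Option[Int], Int)

  // Same shape as an updateStateByKey function, with a cycle of `period` batches.
  def updateState(period: Int)(values: Seq[Int], state: Option[State]): Option[State] = {
    require(period >= 2, "period must be at least 2")
    state match {
      case None => Some((Some(values.sum), 1))                  // first batch of a cycle
      case Some((_, counter)) if counter >= period - 1 => None  // last batch: drop the state
      case Some((_, counter)) => Some((None, counter + 1))      // middle batch: only count
    }
  }

  // Feeds `batches` identical batches through the function and returns the
  // 1-based numbers of the batches whose resulting state has counter == 1.
  def keptBatches(period: Int, batches: Int): Seq[Int] = {
    var state: Option[State] = None
    (1 to batches).filter { _ =>
      state = updateState(period)(Seq(1, 2, 2), state)
      state.exists(_._2 == 1)
    }
  }
}
```

keptBatches(3, 9) gives batches 1, 4 and 7, while keptBatches(4, 9) gives 1, 5 and 9, which shows how the counter value at which the state is removed determines the cycle length.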
