parallelize() method in SparkContext - apache-spark

I am trying to understand the effect of passing different numSlices values to the parallelize() method in SparkContext. Given below is the signature of the method:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
I ran spark-shell in local mode
spark-shell --master local
My understanding is that numSlices decides the number of partitions of the resultant RDD (after calling sc.parallelize()). Consider a few examples below.
Case 1
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Case 2
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Case 3
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Case 4
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Question 1: In Case 3 and Case 4, I was expecting the number of partitions to be 3 and 4 respectively, but both cases show a partition size of only 2. What is the reason for this?
Question 2: In each case there is a number associated with ParallelCollectionRDD[no], i.e. in Case 1 it is ParallelCollectionRDD[0], in Case 2 it is ParallelCollectionRDD[1], and so on. What exactly do those numbers signify?

Question 1: That's a typo on your part. You're calling res3.partitions.size, instead of res5 and res7 respectively. When I do it with the correct number, it works as expected.
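For example, repeating those two calls with the right result names from your session gives the expected sizes (the resN labels on the left depend on what else you have run):
scala> res5.partitions.size
res9: Int = 3
scala> res7.partitions.size
res10: Int = 4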
Question 2: That's the id of the RDD in the Spark Context, used for keeping the graph straight. See what happens when I run the same command three times:
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
There are now three different RDDs with three different ids. We can run the following to check:
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
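The same counter is shared by every RDD created in a SparkContext, not just the ones from parallelize; continuing the session above, a transformation simply takes the next id (3 here, assuming nothing else has run in between):
res0.map(_ + 1).id   // 3: the MapPartitionsRDD created by map gets the next id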

Related

How to combine several RDDs with different lengths into a single RDD with a specific order pattern?

I have several RDDs with different lengths:
RDD1 : [a, b, c, d, e, f, g]
RDD2 : [1, 3, 2, 44, 5]
RDD3 : [D, F, G]
I want to combine them into one RDD with the following order pattern:
every 5 rows: take 2 rows from RDD1, then 2 from RDD2, then 1 from RDD3
This pattern should repeat until all RDDs are exhausted.
For the data above, the output should be:
RDDCombine : [a,b,1,3,D, c,d,2,44,F, e,f,5,G, g]
How to achieve this? Thanks a lot!
Background: I am designing a recommender system. I have several RDD outputs from different algorithms, and I want to combine them in some order pattern to build a hybrid recommendation.
I would not say it's an optimal solution, but it may help you get started; this is not at all production-ready code. Also, I used a single partition because the data is small, but you can change that.
// imports needed (this method is assumed to live inside an object or App)
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("some")
  val sc = new SparkContext(conf)

  // one partition per RDD so zipPartitions sees every element, in order
  val rdd2 = sc.parallelize(Seq(1, 3, 2, 44, 5), 1)
  val rdd1 = sc.parallelize(Seq('a', 'b', 'c', 'd', 'e', 'f', 'g'), 1)
  val rdd3 = sc.parallelize(Seq('D', 'F', 'G'), 1)

  val groupingCount = 2
  val rdd = rdd1.zipPartitions(rdd2, rdd3)((a, b, c) => {
    // chunk each partition: 2 elements from rdd1, 2 from rdd2, 1 from rdd3
    val ag = a.grouped(groupingCount)
    val bg = b.grouped(groupingCount)
    val cg = c.grouped(1)
    // zip stops at the shortest iterator, so a leftover chunk (the trailing 'g') is dropped
    ag.zip(bg).zip(cg).map(x => x._1._1 ++ x._1._2 ++ x._2)
  })
  rdd.foreach(println)
  sc.stop()
}
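If you would rather end up with one flat RDD of elements instead of an RDD of groups, you can flatten the zipped result; a small follow-up sketch (note that, as written, zip stops at the shortest iterator, so the trailing 'g' from rdd1 is dropped):
val combined = rdd.flatMap(identity)        // RDD[AnyVal]
println(combined.collect().mkString(","))   // a,b,1,3,D,c,d,2,44,F,e,f,5,G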

How to solve this TopK problem in Spark?

I have a TopK problem to solve using Spark.
The source file looks like this:
baoshi,13
xinxi,80
baoshi,99
xinxi,32
baoshi,50
xinxi,43
baoshi,210
xinxi,100
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

object TopKTest {
  def main(args: Array[String]): Unit = {
    val file = "file:///home/hadoop/rdd-test/TopK3.txt"
    val conf = new SparkConf().setAppName("TopKTest").setMaster("local")
    val sc = new SparkContext(conf)
    val txt = sc.textFile(file)
    val rdd2 = txt.map(line => (line.split(",")(0), line.split(",")(1).trim))
    val rdd = rdd2.groupByKey()
    val rdd1 = rdd.map(line => {
      val f = line._1
      val s = line._2
      val t = s.toList.sortWith(_ > _).take(2)
      (f, t)
    })
    rdd1.foreach(println)
  }
}
The expected result is :
(xinxi,List(100, 80))
(baoshi,List(210, 99))
That's because you are comparing Strings, not numbers: sorted lexicographically, "99" > "210" because '9' > '2'.
Change
val rdd2 = txt.map(line => (line.split(",")(0), line.split(",")(1).trim))
to
val rdd2 = txt.map(line => (line.split(",")(0), line.split(",")(1).trim.toLong))
Here is another way, using topByKey from MLlib:
scala> import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
scala> val rdd = spark.sparkContext.textFile("D:\\test\\input.txt")
rdd: org.apache.spark.rdd.RDD[String] = D:\test\input.txt MapPartitionsRDD[1] at textFile at <console>:26
scala> rdd.foreach(println)
xinxi,43
baoshi,13
baoshi,210
xinxi,80
xinxi,100
baoshi,99
xinxi,32
baoshi,50
scala> val rdd1 = rdd.map(row => (row.split(",")(0), row.split(",")(1).toInt))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[2] at map at <console>:28
scala> val rdd2 = rdd1.topByKey(2)
rdd2: org.apache.spark.rdd.RDD[(String, Array[Int])] = MapPartitionsRDD[4] at mapValues at MLPairRDDFunctions.scala:50
scala> val rdd3 = rdd2.map(m => (m._1, m._2.toList))
rdd3: org.apache.spark.rdd.RDD[(String, List[Int])] = MapPartitionsRDD[5] at map at <console>:32
scala> rdd3.foreach(println)
(xinxi,List(100, 80))
(baoshi,List(210, 99))

Very slow writing of a dataframe to file on Spark cluster

I have a test program that writes a dataframe to a file. The dataframe is generated by filling each row with consecutive numbers, like:
1,2,3,4,5,6,7.....11
2,3,4,5,6,7,8.....12
......
There are 100,000 rows in the dataframe, which I don't think is too big.
When I submit the Spark job, it takes almost 20 minutes to write the dataframe to a file on HDFS. I am wondering why it is so slow and how to improve the performance.
import org.apache.spark.SparkContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// conf is a SparkConf defined earlier (not shown)
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val numCol = 11
val arraydataInt = (1 to 100000).toArray
val arraydata = arraydataInt.map(x => x.toDouble)
val slideddata = arraydata.sliding(numCol).toSeq
val rows = arraydata.sliding(numCol).map { x => Row(x: _*) }
val datasetsize = arraydataInt.size

// note: this creates arraydata.size - numCol = 99989 partitions
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()

val schemaString = "value1 value2 value3 value4 value5 " +
  "value6 value7 value8 value9 value10 label"
val schema =
  StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
val df = sqlContext.createDataFrame(myrdd, schema).cache()

val splitsH = df.randomSplit(Array(0.8, 0.1))
val trainsetH = splitsH(0).cache()
val testsetH = splitsH(1).cache()

println("now saving training and test samples into files")
trainsetH.write.save("TrainingSample.parquet")
testsetH.write.save("TestSample.parquet")
Turn
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()
To
val myrdd = sc.makeRDD(rows.toSeq, 100).persist()
You've made an RDD with arraydata.size - numCol partitions, and each partition leads to a task that carries extra scheduling overhead. Generally speaking, the number of partitions is a trade-off between the level of parallelism and that extra cost. Try 100 partitions and it should work much better.
BTW, the official guide suggests setting this number to 2 or 3 times the number of CPU cores in your cluster.
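If you want to confirm how many tasks the write will schedule, you can check the partition count directly; a quick sanity check, assuming the same variables as above are in scope:
val myrdd = sc.makeRDD(rows.toSeq, 100).persist()
println(myrdd.partitions.size)   // 100
// a DataFrame built from this RDD should keep the same partitioning,
// so the parquet write should run on the order of 100 tasks instead of ~100,000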

RDD lineage cache

I am having trouble understanding the lineage of an RDD. For instance,
let's say we have this lineage:
hadoopRDD(location) <-depends- filteredRDD(f:A->Boolean) <-depends- mappedRDD(f:A->B)
If we persist the first RDD and, after some actions, unpersist it, will this affect the other RDDs that depend on it? If yes, how can we avoid that?
My point is: if we unpersist a parent RDD, will this remove cached partitions from the child RDDs?
Let's walk through an example. This will create an RDD with a Seq of Ints in one partition. The reason for one partition is simply to keep ordering for the rest of the example.
scala> val seq = Seq(1,2,3,4,5)
seq: Seq[Int] = List(1, 2, 3, 4, 5)
scala> val rdd = sc.parallelize(seq, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:23
Now lets create two new RDDs which are mapped versions of the original:
scala> val firstMappedRDD = rdd.map { case i => println(s"firstMappedRDD calc for $i"); i * 2 }
firstMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25
scala> firstMappedRDD.toDebugString
res25: String =
(1) MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> val secondMappedRDD = firstMappedRDD.map { case i => println(s"secondMappedRDD calc for $i"); i * 2 }
secondMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:27
scala> secondMappedRDD.toDebugString
res26: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
We can see the lineages using toDebugString. I added printlns to each map step to make it clear when the map is called. Let's collect each RDD to see what happens:
scala> firstMappedRDD.collect()
firstMappedRDD calc for 1
firstMappedRDD calc for 2
firstMappedRDD calc for 3
firstMappedRDD calc for 4
firstMappedRDD calc for 5
res27: Array[Int] = Array(2, 4, 6, 8, 10)
scala> secondMappedRDD.collect()
firstMappedRDD calc for 1
secondMappedRDD calc for 2
firstMappedRDD calc for 2
secondMappedRDD calc for 4
firstMappedRDD calc for 3
secondMappedRDD calc for 6
firstMappedRDD calc for 4
secondMappedRDD calc for 8
firstMappedRDD calc for 5
secondMappedRDD calc for 10
res28: Array[Int] = Array(4, 8, 12, 16, 20)
As you would expect, the map for the first step is called once again when we call secondMappedRDD.collect(). So now let's cache the first mapped RDD.
scala> firstMappedRDD.cache()
res29: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25
scala> secondMappedRDD.toDebugString
res31: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> firstMappedRDD.count()
firstMappedRDD calc for 1
firstMappedRDD calc for 2
firstMappedRDD calc for 3
firstMappedRDD calc for 4
firstMappedRDD calc for 5
res32: Long = 5
scala> secondMappedRDD.toDebugString
res33: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| CachedPartitions: 1; MemorySize: 120.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
Once the result of the first map is in the cache, the lineage of the second mapped RDD shows the cached partitions of the first. Now let's call collect:
scala> secondMappedRDD.collect
secondMappedRDD calc for 2
secondMappedRDD calc for 4
secondMappedRDD calc for 6
secondMappedRDD calc for 8
secondMappedRDD calc for 10
res34: Array[Int] = Array(4, 8, 12, 16, 20)
And now let's unpersist and call collect again.
scala> firstMappedRDD.unpersist()
res36: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25
scala> secondMappedRDD.toDebugString
res37: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> secondMappedRDD.collect
firstMappedRDD calc for 1
secondMappedRDD calc for 2
firstMappedRDD calc for 2
secondMappedRDD calc for 4
firstMappedRDD calc for 3
secondMappedRDD calc for 6
firstMappedRDD calc for 4
secondMappedRDD calc for 8
firstMappedRDD calc for 5
secondMappedRDD calc for 10
res38: Array[Int] = Array(4, 8, 12, 16, 20)
So when we collect the result of the second mapped RDD after the first has been unpersisted, the map of the first gets called again.
If the source had been HDFS, or any other storage, the data would have been retrieved from the source again.
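As a side note, if you want to check programmatically whether an RDD is currently marked as cached, instead of reading toDebugString, you can inspect its storage level; a small check using the RDD from the example above:
firstMappedRDD.getStorageLevel.useMemory   // true between cache() and unpersist(), false afterwards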

big integer number in Spark

In Spark-shell, I run the following code:
scala> val input = sc.parallelize(List(1, 2, 4, 1881824400))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val result = input.map(x => 2*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
scala> println(result.collect().mkString(","))
2,4,8,-531318496
Why is the result of 2*1881824400 equal to -531318496 and not 3763648800?
Is that a bug in Spark?
Thanks for your help.
Thanks ccheneson and hveiga. The answer is that the mapping makes the result bigger than 2^31 - 1, overflowing the range of Int, so the value wraps around into the negative range: 3763648800 - 2^32 = -531318496.
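If you need the full result, use Long (or BigInt) values so the multiplication doesn't overflow a 32-bit Int; for example:
val input = sc.parallelize(List(1L, 2L, 4L, 1881824400L))   // RDD[Long]
val result = input.map(x => 2 * x)                          // 2 * Long is evaluated as a Long
println(result.collect().mkString(","))                     // prints 2,4,8,3763648800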
