RDD lineage cache - apache-spark

I am having trouble understanding the lineage of an RDD. For instance,
let's say we have this lineage:
hadoopRDD(location) <-depends- filteredRDD(f:A->Boolean) <-depends- mappedRDD(f:A->B)
If we persist the first RDD and, after some actions, unpersist it, will this affect the other RDDs that depend on it? If so, how can we avoid that?
My point is: if we unpersist a parent RDD, will this remove partitions from the child RDDs?

Let's walk through an example. This will create an RDD with a Seq of Ints in one partition. The reason for one partition is simply to keep ordering for the rest of the example.
scala> val seq = Seq(1,2,3,4,5)
seq: Seq[Int] = List(1, 2, 3, 4, 5)
scala> val rdd = sc.parallelize(seq, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:23
Now let's create two new RDDs which are mapped versions of the original:
scala> val firstMappedRDD = rdd.map { case i => println(s"firstMappedRDD calc for $i"); i * 2 }
firstMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25
scala> firstMappedRDD.toDebugString
res25: String =
(1) MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> val secondMappedRDD = firstMappedRDD.map { case i => println(s"secondMappedRDD calc for $i"); i * 2 }
secondMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:27
scala> secondMappedRDD.toDebugString
res26: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
We can see the lineages using toDebugString. I added printlns to each map step to make it clear when the map is called. Let's collect each RDD to see what happens:
scala> firstMappedRDD.collect()
firstMappedRDD calc for 1
firstMappedRDD calc for 2
firstMappedRDD calc for 3
firstMappedRDD calc for 4
firstMappedRDD calc for 5
res27: Array[Int] = Array(2, 4, 6, 8, 10)
scala> secondMappedRDD.collect()
firstMappedRDD calc for 1
secondMappedRDD calc for 2
firstMappedRDD calc for 2
secondMappedRDD calc for 4
firstMappedRDD calc for 3
secondMappedRDD calc for 6
firstMappedRDD calc for 4
secondMappedRDD calc for 8
firstMappedRDD calc for 5
secondMappedRDD calc for 10
res28: Array[Int] = Array(4, 8, 12, 16, 20)
As you would expect, the map for the first step is called once again when we call secondMappedRDD.collect(). So now let's cache the first mapped RDD.
scala> firstMappedRDD.cache()
res29: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25
scala> secondMappedRDD.toDebugString
res31: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> firstMappedRDD.count()
firstMappedRDD calc for 1
firstMappedRDD calc for 2
firstMappedRDD calc for 3
firstMappedRDD calc for 4
firstMappedRDD calc for 5
res32: Long = 5
scala> secondMappedRDD.toDebugString
res33: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| CachedPartitions: 1; MemorySize: 120.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
Once the result of the first map is in the cache, the lineage of the second mapped RDD shows the cached partitions of the first. Now let's call collect:
scala> secondMappedRDD.collect
secondMappedRDD calc for 2
secondMappedRDD calc for 4
secondMappedRDD calc for 6
secondMappedRDD calc for 8
secondMappedRDD calc for 10
res34: Array[Int] = Array(4, 8, 12, 16, 20)
And now let's unpersist and call collect again.
scala> firstMappedRDD.unpersist()
res36: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25
scala> secondMappedRDD.toDebugString
res37: String =
(1) MapPartitionsRDD[13] at map at <console>:27 []
| MapPartitionsRDD[12] at map at <console>:25 []
| ParallelCollectionRDD[11] at parallelize at <console>:23 []
scala> secondMappedRDD.collect
firstMappedRDD calc for 1
secondMappedRDD calc for 2
firstMappedRDD calc for 2
secondMappedRDD calc for 4
firstMappedRDD calc for 3
secondMappedRDD calc for 6
firstMappedRDD calc for 4
secondMappedRDD calc for 8
firstMappedRDD calc for 5
secondMappedRDD calc for 10
res38: Array[Int] = Array(4, 8, 12, 16, 20)
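So when we collect the result of the second mapped RDD after the first has been unpersisted, the map of the first gets called again. Unpersisting a parent only drops the parent's cached blocks; the child keeps its lineage and simply recomputes the parent when it needs it.
If the source had been HDFS, or any other storage, the data would have been retrieved from the source again.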
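As a rough mental model of the transcript above, here is a toy sketch in plain Scala that can be pasted into the Scala REPL. ToyRDD and all of its methods are hypothetical names made up for illustration; this is not how Spark is implemented, and it ignores partitions, executors and cache eviction. It only shows the idea that a child recomputes through its parent unless the parent has stored its result.

```scala
// Toy model only: ToyRDD is a hypothetical class, not part of Spark.
final class ToyRDD[A](compute: () => Seq[A]) {
  private var persistRequested = false
  private var cached: Option[Seq[A]] = None

  // Mark this node so that the next computation stores its result.
  def cache(): this.type = { persistRequested = true; this }

  // Drop the stored result; the lineage (the compute function) is untouched.
  def unpersist(): this.type = { persistRequested = false; cached = None; this }

  // Use the stored result if present, otherwise recompute from the parent.
  def collect(): Seq[A] = cached.getOrElse {
    val data = compute()
    if (persistRequested) cached = Some(data)
    data
  }

  // A child only remembers how to derive itself from this node.
  def map[B](f: A => B): ToyRDD[B] = new ToyRDD[B](() => collect().map(f))
}

var firstMapCalls = 0
val source = new ToyRDD[Int](() => Seq(1, 2, 3, 4, 5))
val first = source.map { i => firstMapCalls += 1; i * 2 }
val second = first.map(_ * 2)

first.cache()
first.collect()                          // runs the first map and stores the result
second.collect()                         // reuses the stored result of `first`
val callsWhileCached = firstMapCalls

first.unpersist()
val afterUnpersist = second.collect()    // `first` is recomputed, the result is unchanged
val callsAfterUnpersist = firstMapCalls
```

In this model the child's result is the same before and after unpersist; only the amount of recomputation changes, which mirrors what the println output in the transcript shows.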

Related

Loading Data into Spark Dataframe without delimiters in source

I have a dataset with no delimiters:
111222333444
555666777888
Desired output:
|_c1_|_c2_|_c3_|_c4_|
|111 |222 |333 |444 |
|555 |666 |777 |888 |
I have tried this to get the output:
val myDF = spark.sparkContext.textFile("myFile").toDF()
val myNewDF = myDF.withColumn("c1", substring(col("value"), 0, 3))
.withColumn("c2", substring(col("value"), 3, 6))
.withColumn("c3", substring(col("value"), 6, 9)
.withColumn("c4", substring(col("value"), 9, 12))
.drop("value")
.show()
But I need to manipulate c4 (multiply it by 100), and its datatype is string, not double.
Update: I ran into a problem
when I execute this:
val myNewDF = myDF.withColumn("c1", expr("substring(value, 0, 3)"))
.withColumn("c2", expr("substring(value, 3, 6"))
.withColumn("c3", expr("substring(value, 6, 9)"))
.withColumn("c4", (expr("substring(value, 9, 12)").cast("double") * 100))
.drop("value")
myNewDF.show(5,false) // it only shows "value" column (which i dropped) and "c1" column
myNewDF.printSchema // only showing 2 rows. why is it not showing all the newly created 4 columns?
Create test dataframe:
scala> var df = Seq(("111222333444"),("555666777888")).toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]
Split column s into an array of 3-character chunks:
scala> var res = df.withColumn("temp",split(col("s"),"(?<=\\G...)"))
res: org.apache.spark.sql.DataFrame = [s: string, temp: array<string>]
Map array elements to new columns:
scala> res = res.select((1 until 5).map(i => col("temp").getItem(i-1).as("c"+i)):_*)
res: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
scala> res.show(false)
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
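To see what the split pattern does outside of Spark, here is a small plain-Scala sketch, assuming the JVM's standard regex engine. The value names are illustrative only. It compares the regex from the answer with grouped(3), which is another way to cut a string into equal-sized chunks.

```scala
val line = "111222333444"

// \G anchors at the end of the previous match, so the lookbehind
// matches the empty position after every run of 3 characters.
val viaRegex = line.split("(?<=\\G...)").toList

// The same chunks without a regex.
val viaGrouped = line.grouped(3).toList
```

Both values hold the four chunks 111, 222, 333 and 444; the regex form is the one that can be passed to the split column function.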
Leaving a little for you to work out yourself (such as reading the file and naming your dataset / dataframe columns explicitly), this simulated approach with an RDD should help you on your way:
val rdd = sc.parallelize(Seq(("111222333444"),
("555666777888")
)
)
val df = rdd.map(x => (x.slice(0,3), x.slice(3,6), x.slice(6,9), x.slice(9,12))).toDF()
df.show(false)
returns:
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
Or, using DataFrames:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("111222333444"),
("555666777888"))
).toDF()
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)")).withColumn("c4", expr("substring(value, 10, 3)"))
df2.show(false)
returns:
+------------+---+---+---+---+
|value |c1 |c2 |c3 |c4 |
+------------+---+---+---+---+
|111222333444|111|222|333|444|
|555666777888|555|666|777|888|
+------------+---+---+---+---+
You can drop the value column; I leave that up to you.
I like the answer above, but it gets complicated if the chunks are not all 3 characters wide.
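For chunks of differing widths, one possible approach is to derive the start offsets from a list of widths. The sketch below is plain Scala; splitFixedWidth is a hypothetical helper name, and the second input line is made-up sample data. The same function could be used inside rdd.map in place of the hard-coded slice calls.

```scala
// Hypothetical helper: cut a line into fields of the given widths.
def splitFixedWidth(line: String, widths: Seq[Int]): Seq[String] = {
  // Running totals give each field's start offset: 0, w1, w1 + w2, ...
  val starts = widths.scanLeft(0)(_ + _)
  // zip drops the final total, leaving one (start, width) pair per field.
  starts.zip(widths).map { case (start, w) => line.slice(start, start + w) }
}

val equalWidths = splitFixedWidth("111222333444", Seq(3, 3, 3, 3))
val mixedWidths = splitFixedWidth("1222333344", Seq(1, 3, 4, 2))
```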
For your updated question (cast to double and multiply by 100):
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)"))
.withColumn("c4", (expr("substring(value, 10, 3)").cast("double") * 100))
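The SQL substring function takes a 1-based start position and a length, not a start and an end index, which is why the arguments in this answer are (1, 3), (4, 3), (7, 3) and (10, 3). The plain-Scala sketch below imitates that convention with a hypothetical helper, sqlSubstring, covering only positions of 1 or more; it is not Spark's implementation.

```scala
// Hypothetical helper imitating SQL-style substring(str, pos, len) for pos >= 1.
def sqlSubstring(s: String, pos: Int, len: Int): String = {
  require(pos >= 1, "this sketch only handles 1-based positions")
  val start = math.min(pos - 1, s.length)
  s.slice(start, start + len)
}

val row = "111222333444"

// Reading the arguments as (start, end) takes 6 characters from position 3.
val misread = sqlSubstring(row, 3, 6)

// Reading them as (position, length) gives the intended second chunk.
val secondChunk = sqlSubstring(row, 4, 3)

// Convert the last chunk to a number before multiplying.
val c4 = sqlSubstring(row, 10, 3).toDouble * 100
```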

Specify default value for rowsBetween and rangeBetween in Spark

I have a question concerning a window operation on Spark DataFrames (Spark 1.6).
Let's say I have the following table:
id | MONTH  | number
1  | 201703 | 2
1  | 201704 | 3
1  | 201705 | 7
1  | 201706 | 6
At the moment I'm using the rowsBetween function:
val window = Window.partitionBy("id")
.orderBy(asc("MONTH"))
.rowsBetween(-2, 0)
randomDF.withColumn("counter", sum(col("number")).over(window))
This gives me the following results:
id | MONTH  | number | counter
1  | 201703 | 2      | 2
1  | 201704 | 3      | 5
1  | 201705 | 7      | 12
1  | 201706 | 6      | 16
What I want to achieve is setting a default value (like in lag() and lead()) when there are not enough preceding rows, for example '0', so that I get results like:
id | MONTH  | number | counter
1  | 201703 | 2      | 0
1  | 201704 | 3      | 0
1  | 201705 | 7      | 12
1  | 201706 | 6      | 16
I've already looked in the documentation but Spark 1.6 does not allow this, and I was wondering if there was some kind of workaround.
Many thanks !
How about something like this, which adds an additional lag step and then substitutes values with a case expression?
Code:
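import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, expr, lag, lit, sum}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(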
Seq(
Row(1, 1, 201703, 2),
Row(2, 1, 201704, 3),
Row(3, 1, 201705, 7),
Row(4, 1, 201706, 6)))
val schema: StructType = new StructType()
.add(StructField("sortColumn", IntegerType, false))
.add(StructField("id", IntegerType, false))
.add(StructField("month", IntegerType, false))
.add(StructField("number", IntegerType, false))
val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)
val prevRows = 2
val window = Window.partitionBy("id")
.orderBy(col("month"))
.rowsBetween(-prevRows, 0)
val window2 = Window.partitionBy("id")
.orderBy(col("month"))
val df2 = df0.withColumn("counter", sum(col("number")).over(window))
val df3 = df2.withColumn("myLagTmp", lag(lit(1), prevRows).over(window2))
val df4 = df3.withColumn("counter", expr("case when myLagTmp is null then 0 else counter end")).drop(col("myLagTmp"))
df4.sort("sortColumn").show()
Thanks to astro_asz's answer, I came up with the following solution:
val numberRowsBetween = 2
val window1 = Window.partitionBy("id").orderBy("MONTH")
val window2 = Window.partitionBy("id")
.orderBy(asc("MONTH"))
.rowsBetween(-(numberRowsBetween - 1), 0)
randomDF.withColumn("counter", when(lag(col("number"), numberRowsBetween , 0).over(window1) === 0, 0)
.otherwise(sum(col("number")).over(window2)))
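This solution will put a '0' as the default value. Note that the check also yields 0 when the lagged number is itself 0, because the lag default and a real 0 are indistinguishable.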
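To pin down the intended result independently of the window API, here is a plain-Scala sketch of the rule from the question: sum the current row and the preceding rows, but return a default when fewer than the required number of preceding rows exist. rollingSumOrDefault is a hypothetical name, and the sketch works on an already ordered in-memory sequence for a single id rather than on a DataFrame.

```scala
// Hypothetical reference implementation of the desired counter column.
def rollingSumOrDefault(xs: Seq[Int], preceding: Int, default: Int = 0): Seq[Int] =
  xs.indices.map { i =>
    if (i < preceding) default                   // not enough preceding rows
    else xs.slice(i - preceding, i + 1).sum      // current row plus `preceding` rows
  }

// The "number" column from the question, ordered by MONTH.
val counter = rollingSumOrDefault(Seq(2, 3, 7, 6), preceding = 2)

// The decision depends on the row position, so genuine zeros are summed normally.
val withZeros = rollingSumOrDefault(Seq(0, 0, 0, 5), preceding = 2)
```

A sequence like this can serve as the expected value when checking either of the DataFrame solutions above on small test data.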

parallelize() method in SparkContext

I am trying to understand the effect of passing different numSlices values to the parallelize() method in SparkContext. The signature of the method is given below:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
I ran spark-shell in local mode
spark-shell --master local
My understanding is that numSlices determines the number of partitions of the resulting RDD (after calling sc.parallelize()). Consider the few examples below:
Case 1
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Case 2
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Case 3
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Case 4
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Question 1: In cases 3 and 4, I was expecting the number of partitions to be 3 and 4 respectively, but both cases show only 2. What is the reason for this?
Question 2: In each case there is a number associated with ParallelCollectionRDD[no], i.e. in case 1 it is ParallelCollectionRDD[0], in case 2 it is ParallelCollectionRDD[1], and so on. What exactly do those numbers signify?
Question 1: That's a typo on your part. You're calling res3.partitions.size, instead of res5 and res7 respectively. When I do it with the correct number, it works as expected.
Question 2: That's the id of the RDD in the Spark Context, used for keeping the graph straight. See what happens when I run the same command three times:
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
There are now three different RDDs with three different ids. We can run the following to check:
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
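To get a feel for what numSlices means for the data itself, the plain-Scala sketch below splits a collection into a requested number of contiguous, near-equal slices. sliceBounds is a hypothetical helper, and the exact boundaries Spark picks are an assumption here, so treat this as an illustration rather than a description of Spark internals. In spark-shell, rdd.partitions.size and rdd.glom().collect() show the real partitioning.

```scala
// Hypothetical helper: (start, end) index pairs for near-equal contiguous slices.
def sliceBounds(length: Int, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = ((i.toLong * length) / numSlices).toInt
    val end = (((i + 1).toLong * length) / numSlices).toInt
    (start, end)
  }

val data = (1 to 9).toVector

// Four slices over nine elements, analogous to sc.parallelize(1 to 9, 4).
val bounds = sliceBounds(data.length, 4)
val slices = bounds.map { case (start, end) => data.slice(start, end) }
val sizes = slices.map(_.length)
```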

big integer number in Spark

In Spark-shell, I run the following code:
scala> val input = sc.parallelize(List(1, 2, 4, 1881824400))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val result = input.map(x => 2*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
scala> println(result.collect().mkString(","))
2,4,8,-531318496
Why is the result of 2*1881824400 equal to -531318496 and not 3763648800?
Is that a bug in Spark?
Thanks for your help.
Thanks ccheneson and hveiga. The answer is that the mapping makes the result bigger than 2^31 - 1, which is outside the range of Int. Therefore, the number wraps around into the negative region.
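The wrap-around is ordinary 32-bit integer arithmetic on the JVM and can be reproduced without Spark. The sketch below, with illustrative value names, shows the overflow, the fix of widening to Long before multiplying, and Math.multiplyExact as a way to fail loudly instead of wrapping.

```scala
import scala.util.Try

val x = 1881824400

// Int * Int stays an Int, so the product wraps around modulo 2^32.
val wrapped = 2 * x

// Making one operand a Long widens the whole multiplication.
val widened = 2L * x

// multiplyExact throws ArithmeticException on Int overflow instead of wrapping.
val checked = Try(Math.multiplyExact(2, x))
```

In the Spark example, the equivalent change would be mapping with x => 2L * x, which yields an RDD[Long].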

Spark zipPartitions on the same RDD

I'm quite new to Spark and I have a problem doing something like a cartesian product, but only within the same partition. Maybe an example can show clearly what I want to do: let's suppose we have an RDD made with sc.parallelize(Seq(1,2,3,4,5,6)) and this RDD is split into three partitions which contain respectively: (1,2) ; (3,4) ; (5,6). Then I would like to obtain the following result: ((1,1),(1,2),(2,1),(2,2)) ; ((3,3),(3,4),(4,3),(4,4)) ; ((5,5),(5,6),(6,5),(6,6)).
What I have tried so far is doing:
partitionedData.zipPartitions(partitionedData)((aiter, biter) => {
var res = new ListBuffer[(Double,Double)]()
while(aiter.hasNext){
val a = aiter.next()
while(biter.hasNext){
val b = biter.next()
res+=(a,b)
}
}
res.iterator
})
but it doesn't work as aiter and biter are the same iterator...so I get only the first line of the result.
Can someone help me?
Thanks.
Use RDD.mapPartitions:
val rdd = sc.parallelize(1 to 6, 3)
val res = rdd.mapPartitions { iter =>
val seq = iter.toSeq
val res = for (a <- seq; b <- seq) yield (a, b)
res.iterator
}
res.collect
Prints:
res0: Array[(Int, Int)] = Array((1,1), (1,2), (2,1), (2,2), (3,3), (3,4), (4,3), (4,4), (5,5), (5,6), (6,5), (6,6))
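The reason the nested while loops produce only the first row can be shown in plain Scala: an Iterator is single-pass, so the inner iterator is already exhausted by the time the outer loop reaches its second element. The sketch below uses hypothetical helper names (nestedLoop, selfCartesian) and contrasts the loop from the question with materializing the partition once, which is what the mapPartitions answer does. Note that materializing assumes a partition fits in memory.

```scala
import scala.collection.mutable.ListBuffer

// The loop shape from the question: the inner iterator is consumed only once.
def nestedLoop(as: Iterator[Int], bs: Iterator[Int]): List[(Int, Int)] = {
  val out = ListBuffer.empty[(Int, Int)]
  while (as.hasNext) {
    val a = as.next()
    while (bs.hasNext) out += ((a, bs.next()))
  }
  out.toList
}

val onlyFirstRow = nestedLoop(Iterator(1, 2), Iterator(1, 2))

// Materialize the partition once, then iterate over it as often as needed.
def selfCartesian[A](iter: Iterator[A]): Iterator[(A, A)] = {
  val items = iter.toVector
  for (a <- items.iterator; b <- items.iterator) yield (a, b)
}

val allPairs = selfCartesian(Iterator(1, 2)).toList
```

A function shaped like selfCartesian could be passed directly to rdd.mapPartitions.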

Resources