compute new RDD from 2 original RDD - apache-spark

I have 2 RDDs of key-value type: RDD1 is [K, V] and RDD2 is [K, U].
The set of keys K is the same in both RDD1 and RDD2.
I need to map to a new RDD with [K, (U-V)/(U+V)].
My approach is to first join RDD1 with RDD2:
val newRDD = RDD1.join(RDD2)
Then map over the new RDD:
newRDD.map(line=> (line._1, (line._2._1-line._2._2)/(line._2._1+line._2._2)))
The problem is that RDD1 (and RDD2) has over 100 million elements, so the join between the two sets is very expensive and takes a long time (3 minutes) to execute.
Are there any better ways to reduce the time of this task?

Try converting them to DataFrame first:
val df1 = RDD1.toDF("v_key", "v")
val df2 = RDD2.toDF("u_key", "u")
val newDf = df1.join(df2, $"v_key" === $"u_key")
newDf.select($"v_key", ($"u" - $"v") / ($"u" + $"v")).rdd
Aside from being a lot faster (because Spark will do the optimizing for you), I think it reads better.
I should also note that if it were me, I wouldn't do the .rdd at the end -- I would leave it a DataFrame. But that's me.
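For reference, keeping the result as a DataFrame could look like the sketch below; the output column alias "ratio" is just illustrative, and spark.implicits._ is assumed to be imported for the $"..." syntax:
val newDf2 = df1.join(df2, $"v_key" === $"u_key")
  .select($"v_key", (($"u" - $"v") / ($"u" + $"v")).as("ratio"))

// Later transformations on newDf2 keep benefiting from the Catalyst optimizer.
newDf2.show()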

Related

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
// Then some more operations on subdf1 and subdf2, which won't trigger Spark evaluation yet
// Write out subdf1
// Write out subdf2
Suppose I start off with the main dataframe df1 (which I lazy-read from the CSV), do some operations on this dataframe (filter, groupby, join), and then come to a point where I split this dataframe based on a condition (e.g., id > 0 and id < 0). I then further operate on these sub-dataframes (let us name them subdf1, subdf2) and ultimately write out both of them.
Notice that the write function is the only command that triggers Spark evaluation; the rest of the functions (filter, groupby, join) result in lazy evaluations.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated, starting from reading the CSV to create df1.
My question comes when we start writing out subdf2. Does Spark recognize the divergence in the code at df4 and store this dataframe when the command for writing out subdf1 is encountered? Or will it again start from the first line, creating df1 and re-evaluating all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4(Assuming I have sufficient memory)?
I'm using Scala Spark, if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.
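A minimal sketch of that pattern, with placeholder transformations, column names, and output paths (none of these names come from the original post):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("cacheBeforeDiverging").getOrCreate()
import spark.implicits._

val df1 = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/path/in.csv")                                         // lazy read from CSV
val df2 = df1.filter($"amount" > 0)                            // placeholder filter
val df3 = df2.groupBy($"id").agg(sum($"amount").as("total"))   // placeholder group-by
val df4 = df3                                                  // the join with "some other df" is elided here

df4.cache()                                                    // mark df4; it is materialized by the first action below

val subdf1 = df4.where($"id" < 0)
val subdf2 = df4.where($"id" > 0)

subdf1.write.mode("overwrite").parquet("/path/out1")           // first action: evaluates from the CSV and fills the cache
subdf2.write.mode("overwrite").parquet("/path/out2")           // second action: reuses the cached df4

df4.unpersist()                                                // release the cached blocks once both writes are done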

Spark RDD Windowing using pyspark

There is a Spark RDD, called rdd1. It holds (key, value) pairs, and I have a list whose elements are tuples (key1, key2).
I want to get an rdd2 with rows ((key1, key2), (value of key1 in rdd1, value of key2 in rdd1)).
Can somebody help me?
rdd1:
key1, value1,
key2, value2,
key3, value3
array: [(key1,key2),(key2,key3)]
Result:
(key1,key2),value1,value2
(key2,key3),value2,value3
I have tried
spark.parallelize(array).map(lambda x: ...)
Sliding with Scala via mllib sliding - two implementations; a bit fiddly, but here it is:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd1 = sc.parallelize(Seq(
( "key1", "value1"),
( "key2", "value2"),
( "key3", "value3"),
( "key4", "value4"),
( "key5", "value5")
))
val rdd2 = rdd1.sliding(2)
val rdd3 = rdd2.map(x => (x(0), x(1)))
val rdd4 = rdd3.map(x => ((x._1._1, x._2._1),x._1._2, x._2._2))
rdd4.collect
Also the following, which is actually better of course:
val rdd5 = rdd2.map{case Array(x,y) => ((x._1, y._1), x._2, y._2)}
rdd5.collect
returns in both cases:
res70: Array[((String, String), String, String)] = Array(((key1,key2),value1,value2), ((key2,key3),value2,value3), ((key3,key4),value3,value4), ((key4,key5),value4,value5))
which I believe meets your needs, but not in pyspark.
On Stack Overflow you can find statements that pyspark does not have an equivalent for RDDs unless you "roll your own". You could look at How to transform data with sliding window over time series data in Pyspark. However, I would advise using DataFrames with pyspark.sql.functions.lead() and pyspark.sql.functions.lag(). Somewhat easier.
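For reference, a minimal sketch of that lead()-based DataFrame approach in Scala (the pyspark lead/lag API is analogous); the data and column names are just the toy example from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._   // assumes a SparkSession named spark

val df = Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3")
).toDF("key", "value")

// Ordering by key stands in for whatever defines "the next row" in real data;
// a window with no partitioning pulls everything onto one partition, which is
// fine for a small example but not for large datasets.
val w = Window.orderBy("key")

val pairs = df
  .withColumn("nextKey",   lead("key", 1).over(w))
  .withColumn("nextValue", lead("value", 1).over(w))
  .where($"nextKey".isNotNull)
  .select($"key", $"nextKey", $"value", $"nextValue")

pairs.show()  // rows: (key1, key2, value1, value2), (key2, key3, value2, value3)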

In Spark, caching a DataFrame influences execution time of previous stages?

I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of the later stages, it changes the execution time of earlier stages. Why? I've never encountered such a case in the literature when reading about caching.
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage 10. If the cache() is used in Stage 10, then Stage 8 takes nearly twice as long (20 min vs 11 min) as when there is no cache() in Stage 10. Why?
My Stage 8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages 8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/2018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my spark session.
val spark = SparkSession.builder
.appName("myJob")
.config("spark.executor.cores", 5)
.config("spark.driver.memory", "300g")
.config("spark.executor.memory", "15g")
.getOrCreate()
This creates a job with 21 executors with 5 CPU cores each.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 Tb in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~ 700 Mb
val dfF = spark.read.format("parquet").load(filePath3) // ~ 800 Mb
val dfP = spark.read.format("parquet").load(filePath4) // 38 Gb
Preprocessing on each of the DataFrames consists of column selection, dropDuplicates, and possibly a filter, like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-broadcast-join together first three DataFrames:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since dfP1 is large I need to do a merge join, but I can't afford to do it yet: I need to limit the size of dfTOF first. To do that, I add a new timestamp column with withColumn and a UDF that transforms a string into a timestamp:
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF)
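For context, such a UDF might be defined roughly as below; the timestamp format is an assumption, not taken from the original job:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

// Hypothetical string-to-timestamp UDF; the real format string and input column
// are not shown in the question.
val myStringToTimestampUDF = udf { s: String =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  new Timestamp(fmt.parse(s).getTime)
}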
Next I filter on a new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly, the column selection with the cache() that is puzzling me:
val dfTrain3 = dfTrain.select("columnsToSelect5").cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here, but there will be some further processing and modelling of the DataFrame, and the cache() will be necessary then.
Should I give more details? Would you like to see the execution plan for dfTrain3?

How to distribute dataset evenly to avoid a skewed join (and long-running tasks)?

I am writing an application using the Spark Dataset API on a Databricks notebook.
I have 2 tables. One has 1.5 billion rows and the second 2.5 million. Both tables contain telecommunication data, and the join is done using the country code and the first 5 digits of a number. The output has 55 billion rows. The problem is that I have skewed data (long-running tasks). No matter how I repartition the dataset, I get long-running tasks because of the uneven distribution of hashed keys.
I tried using broadcast joins, persisting the big table's partitions in memory, etc.
What are my options here?
Spark will repartition the data based on the join key, so repartitioning before the join won't change the skew (it only adds an unnecessary shuffle).
If you know which key is causing the skew (usually it will be something like null or 0 or ""), split your data into 2 parts: one dataset with the skewed key, and another with the rest.
Then do the join on the sub-datasets and union the results.
for example:
val df1 = ...
val df2 = ...
val skewKey = null // the skewed key value; if it really is null, filter with isNull / isNotNull instead of === / =!=
val df1Skew = df1.where($"key" === skewKey)
val df2Skew = df2.where($"key" === skewKey)
val df1NonSkew = df1.where($"key" =!= skewKey)
val df2NonSkew = df2.where($"key" =!= skewKey)
val dfSkew = df1Skew.join(df2Skew) //this is a cross join
val dfNonSkew = df1NonSkew.join(df2NonSkew, "key")
val res = dfSkew.union(dfNonSkew)

Understanding Spark's caching

I'm trying to understand how Spark's cache works.
Here is my naive understanding, please let me know if I'm missing something:
val rdd1 = sc.textFile("some data")
rdd1.cache() //marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume) and then read from the cache (assuming there is enough RAM) when rdd3 is saved.
Now here is my question. Let's say I want to cache rdd2 and rdd3 as they will both be used later on, but I don't need rdd1 after creating them.
Basically there is duplication, isn't there? Since once rdd2 and rdd3 are calculated I don't need rdd1 anymore, I should probably unpersist it, right? The question is: when?
Will this work? (Option A)
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
Does Spark add the unpersist call to the DAG, or is it done immediately? If it's done immediately, then basically rdd1 will not be cached when I read from rdd2 and rdd3, right?
Should I do it this way instead (Option B)?
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
So the question is this:
Is Option A good enough? i.e. will rdd1 still load the file only once?
Or do I need to go with Option B?
It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution.
This is relevant because a cache or persist call just adds the RDD to a Map of RDDs that marked themselves to be persisted during job execution. However, unpersist directly tells the blockManager to evict the RDD from storage and removes the reference in the Map of persistent RDDs.
(See the persist and unpersist functions in the RDD source.)
So you would need to call unpersist after Spark actually executed and stored the RDD with the block manager.
The comments for the RDD.persist method hint towards this.
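A quick shell-style sketch of that behaviour (assuming a live SparkContext sc; the file path is a placeholder):
val rdd1 = sc.textFile("/path/some-data.txt")
rdd1.cache()                        // only registers rdd1 in the map of RDDs to persist
val rdd2 = rdd1.filter(_.nonEmpty)

rdd1.unpersist()                    // runs immediately: the mark is removed before any job has executed

rdd2.count()                        // rdd1 is read from the file here; nothing ends up cached

println(rdd1.getStorageLevel)       // StorageLevel NONE: the cache mark is gone
println(sc.getPersistentRDDs.size)  // 0: no RDDs are currently registered as persistent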
In Option A, you have not shown when you are calling the action (the call to save):
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
If the sequence is as above, Option A should use the cached version of rdd1 for computing both rdd2 and rdd3.
Option B is an optimal approach with a small tweak: make use of less expensive action methods. In the approach in your code, saveAsTextFile is an expensive operation; replace it with the count method.
The idea here is to remove the big rdd1 from the DAG if it's not relevant for further computation (after rdd2 and rdd3 are created).
Updated approach:
val rdd1 = sc.textFile("some data").cache()
val rdd2 = rdd1.filter(...).cache()
val rdd3 = rdd1.map(...).cache()
rdd2.count
rdd3.count
rdd1.unpersist()
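With rdd2 and rdd3 materialized by the counts, the subsequent writes (a sketch; the paths are placeholders) are then served from the cache rather than recomputed from the file:
rdd2.saveAsTextFile("/path/out2")   // reads the cached rdd2
rdd3.saveAsTextFile("/path/out3")   // reads the cached rdd3

rdd2.unpersist()                    // optionally release them once the writes are done
rdd3.unpersist()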
