How to map one RDD to another with PySpark? - apache-spark

I have rdd1, which has labels (0, 1, 4), and another rdd2 that contains text. I want to map rdd1 to rdd2 so that row 1 of rdd1 is paired with row 1 of rdd2, and so on.
I have tried:
rdd2.join(rdd1.map(lambda x: (x[0], x[0:])))
It gives me the error:
RDD is empty.
Can someone please guide me here?
Sample data: rdd1 holds the labels and rdd2 holds the text.
rdd1 rdd2
0 i hate painting i have white paint all over my hands.
0 Bawww I need a haircut No1 could fit me in before work tonight. Sigh.
4 I had a great day
1 what is life.
4 He sings so good
1 i need to go to sleep ....goodnight

If you have rdd1 as
val rdd1 = sc.parallelize(List(0,0,4,1,4,1))
and rdd2 as
val rdd2 = sc.parallelize(List("i hate painting i have white paint all over my hands.",
"Bawww I need a haircut No1 could fit me in before work tonight. Sigh.",
"I had a great day",
"what is life.",
"He sings so good",
"i need to go to sleep ....goodnight"))
and you want to pair row 1 of rdd1 with row 1 of rdd2, and so on.
Using the zip function
A simple zip should meet your requirement:
rdd1.zip(rdd2)
which would give you output as
(0,i hate painting i have white paint all over my hands.)
(0,Bawww I need a haircut No1 could fit me in before work tonight. Sigh.)
(4,I had a great day)
(1,what is life.)
(4,He sings so good)
(1,i need to go to sleep ....goodnight)
Using zipWithIndex and join
This approach gives you the same output as zip above, but it is more expensive because the join introduces a shuffle:
rdd1.zipWithIndex().map(_.swap).join(rdd2.zipWithIndex().map(_.swap)).map(_._2)
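Since the question is tagged PySpark, here is a rough Python sketch of both approaches (untested, using the sample data from the question):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
labels = sc.parallelize([0, 0, 4, 1, 4, 1])
texts = sc.parallelize([
    "i hate painting i have white paint all over my hands.",
    "Bawww I need a haircut No1 could fit me in before work tonight. Sigh.",
    "I had a great day",
    "what is life.",
    "He sings so good",
    "i need to go to sleep ....goodnight"])

# zip pairs elements by position; it requires both RDDs to have the same
# number of partitions and the same number of elements per partition
pairs = labels.zip(texts)

# the index-based fallback when that partitioning assumption does not hold
pairs_by_index = (labels.zipWithIndex().map(lambda x: (x[1], x[0]))
                        .join(texts.zipWithIndex().map(lambda x: (x[1], x[0])))
                        .map(lambda x: x[1]))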
I hope the answer is helpful

Related

Caching in Spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
// Then some more operations on subdf1 and subdf2 which won't trigger Spark evaluation yet
// Write out subdf1
// Write out subdf2
Suppose I start off with the main dataframe df1 (which I lazily read from the CSV), do some operations on this dataframe (filter, groupBy, join), and then come to a point where I split this dataframe based on a condition (e.g., id > 0 and id < 0). Then I proceed to operate on these sub-dataframes (let us name them subdf1 and subdf2) and ultimately write out both of them.
Notice that write is the only command that triggers Spark evaluation; the rest of the functions (filter, groupBy, join) result in lazy evaluations.
Now, when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated, starting from reading the CSV to create df1.
My question concerns writing out subdf2. Does Spark understand the divergence in the code at df4 and store this dataframe when the command for writing out subdf1 is encountered? Or will it again start from the first line, creating df1 and re-evaluating all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4(Assuming I have sufficient memory)?
I'm using Scala Spark, if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.
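To make that concrete, here is a minimal PySpark sketch of the pattern (paths, column names, and the filter conditions are hypothetical placeholders; the Scala API is analogous):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical reads and transformations standing in for df1..df4
df1 = spark.read.option("header", "true").csv("/data/input.csv")
df_other = spark.read.option("header", "true").csv("/data/other.csv")
df4 = (df1.filter("amount > 0")
          .groupBy("id").count()
          .join(df_other, "id"))

df4.cache()  # mark the branch point so both writes below can reuse it

subdf1 = df4.filter("id < 0")
subdf2 = df4.filter("id > 0")

subdf1.write.parquet("/out/neg")  # first action: materializes df4 into the cache
subdf2.write.parquet("/out/pos")  # second action: reuses the cached df4

df4.unpersist()  # release the cached data once both writes are done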

Spark Scala: partition dataframes for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However, because of their size, a simple cross join is failing. I am looking to partition the data, perform the cross join, and am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read into dataframes: df1A, df1B, df1C. Manually split file f2 into four and read into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B, .., df1A X df2D, ..., df1C X df2D. Save each cross join to a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
Question
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces", and cross joining each partition of one dataframe with every partition of the other?
A DataFrame can be partitioned either by range or by hash:
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
NOTE: set the property spark.sql.crossJoin.enabled=true
You can convert this into an RDD and then use the cartesian operation on that RDD. You should then be able to save that RDD to a file. Hope that helps.
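A hedged sketch of that RDD route in PySpark (file names and the output path are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.csv("file1.txt")
df2 = spark.read.csv("file2.txt")

# cartesian pairs every row of one RDD with every row of the other,
# which is exactly a cross join at the RDD level
crossed = df1.rdd.cartesian(df2.rdd)
crossed.saveAsTextFile("/out/cross")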

My spark app is too slow, how can I increase the speed significantly?

This is the part of my Spark code that is very slow. By slow I mean that for 70 million rows of data it takes almost 7 minutes to run, but I need it to run in under 5 seconds if possible. I have a cluster with 5 Spark nodes, 80 cores, and 177 GB of memory, of which 33 GB are currently used.
from datetime import datetime, timedelta
from pyspark.sql.functions import col

# time_delta(...) is a helper defined elsewhere in the program
range_expr = col("created_at").between(
    datetime.now() - timedelta(hours=timespan),
    datetime.now() - timedelta(hours=time_delta(timespan))
)
article_ids = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load().where(range_expr).select('article','created_at').repartition(64*2)
axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load()
#article_ids.join(axes,article_ids.article==axes.article)
speed_df = article_ids.join(axes,article_ids.article==axes.article).select(axes.article,axes.at,axes.comments,axes.likes,axes.reads,axes.shares) \
.map(lambda x:(x.article,[x])).reduceByKey(lambda x,y:x+y) \
.map(lambda x:(x[0],sorted(x[1],key=lambda y:y.at,reverse = False))) \
.filter(lambda x:len(x[1])>=2) \
.map(lambda x:x[1][-1]) \
.map(lambda x:(x.article,(x,(x.comments if x.comments else 0)+(x.likes if x.likes else 0)+(x.reads if x.reads else 0)+(x.shares if x.shares else 0))))
I believe especially this part of the code is particularly slow:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load()
When run in Spark, it transforms into the following stage, which I think is what causes it to be slow:
javaToPython at NativeMethodAccessorImpl.java:-2
Any help would really be appreciated. Thanks
EDIT
The biggest speed problem seems to be JavaToPython. The attached picture shows only part of my data and is already very slow.
EDIT (2)
About len(x[1]) >= 2:
Sorry for the long elaboration, but I really hope to solve this problem, so making this fairly complex problem clear in detail is crucial.
This is my RDD example:
rdd1 = [(1,3),(1,5),(1,6),(1,9),(2,10),(2,76),(3,8),(4,87),(4,96),(4,109),(5,10),(6,19),(6,18),(6,65),(6,43),(6,81),(7,12),(7,96),(7,452),(8,59)]
After the Spark transformation, rdd1 has this form:
rdd_result = [(1,9),(2,76),(4,109),(6,81),(7,452)]
The result does not contain (3,8) or (5,10) because keys 3 and 5 occur only once; I don't want those keys to appear.
Below is my program.
First: rdd1 after reduceByKey; the result is:
rdd_reduceByKey = [(1,[3,5,6,9]),(2,[10,76]),(3,[8]),(4,[87,96,109]),(5,[10]),(6,[19,18,65,43,81]),(7,[12,96,452]),(8,[59])]
Second: rdd_reduceByKey filtered by len(x[1]) >= 2; the result is:
rdd_filter = [(1,[3,5,6,9]),(2,[10,76]),(4,[87,96,109]),(6,[19,18,65,43,81]),(7,[12,96,452])]
So the len(x[1]) >= 2 filter is necessary, but slow.
Any recommendation improvements would be hugely appreciated.
A few things I would do if I hit a performance issue:
Check the Spark web UI and find the slowest part.
The lambda functions are really suspicious.
Check the executor configuration.
Store some of the data in an intermediate table.
Check whether storing the data in Parquet helps.
Check whether using Scala helps.
EDIT:
Using Scala instead of Python could do the trick if JavaToPython is the slowest part.
Here is the code for finding the latest/largest value per key. It should be O(N log N), and most likely close to O(N), since the sorting is done on small per-key data sets.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import sqlContext.implicits._ // needed for the $"..." column syntax

val data = Seq((1,3),(1,5),(1,6),(1,9),(2,10),
  (2,76),(3,8),(4,87),(4,96),(4,109),
  (5,10),(6,19),(6,18),(6,65),(6,43),
  (6,81),(7,12),(7,96),(7,452),(8,59))
val df = sqlContext.createDataFrame(data)
val dfAgg = df.groupBy("_1").agg(collect_set("_2").alias("_2"))
// UDF that takes the head of the descending-sorted array, i.e. the largest value
val udfFirst = udf[Int, WrappedArray[Int]](_.head)
val dfLatest = dfAgg.filter(size($"_2") > 1).
  select($"_1", udfFirst(sort_array($"_2", asc = false)).alias("latest"))
dfLatest.show()
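Since the original pipeline is PySpark, a rough DataFrame equivalent of the snippet above might look like this (a sketch using the newer SparkSession API; like the example, it assumes the "latest" value per key is also the largest):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [(1,3),(1,5),(1,6),(1,9),(2,10),(2,76),(3,8),(4,87),(4,96),(4,109),
        (5,10),(6,19),(6,18),(6,65),(6,43),(6,81),(7,12),(7,96),(7,452),(8,59)]
df = spark.createDataFrame(data, ["k", "v"])

# keep only keys that occur at least twice, then take the largest value per key
latest = (df.groupBy("k")
            .agg(F.count("v").alias("n"), F.max("v").alias("latest"))
            .filter("n >= 2")
            .select("k", "latest"))
latest.show()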

Compute a new RDD from 2 original RDDs

I have 2 RDDs of key-value type. RDD1 is [K, V] and RDD2 is [K, U].
The set of keys K is the same for both RDD1 and RDD2.
I need to map them to a new RDD of [K, (U-V)/(U+V)].
My way is first to join RDD1 with RDD2:
val newRDD = RDD1.join(RDD2)
Then map the new RDD. Note that the join produces (K, (V, U)) pairs, so the second element of each tuple is U:
newRDD.map(line => (line._1, (line._2._2 - line._2._1) / (line._2._2 + line._2._1)))
The problem is that RDD1 (and RDD2) has over 100 million records, so the join between the two sets is very expensive and takes a long time (3 minutes) to execute.
Are there any better ways to reduce the time of this task?
Try converting them to DataFrame first:
val df1 = RDD1.toDF("v_key", "v")
val df2 = RDD2.toDF("u_key", "u")
val newDf = df1.join(df2, $"v_key" === $"u_key")
newDf.select($"v_key", ($"u" - $"v") / ($"u" + $"v")).rdd
Aside from being a lot faster (because Spark will do the optimizing for you) I think it reads better.
I should also note that if it were me, I wouldn't do the .rdd at the end -- I would leave it a DataFrame. But that's me.
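If you are working in PySpark rather than Scala, the same DataFrame approach might look like this (a sketch with made-up sample data):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# made-up [K, V] and [K, U] data for illustration
rdd1 = spark.sparkContext.parallelize([("a", 1.0), ("b", 2.0)])
rdd2 = spark.sparkContext.parallelize([("a", 3.0), ("b", 4.0)])

df1 = rdd1.toDF(["k", "v"])
df2 = rdd2.toDF(["k", "u"])

# join on the shared key and compute (U - V) / (U + V) per key
result = (df1.join(df2, "k")
             .select("k", ((F.col("u") - F.col("v")) /
                           (F.col("u") + F.col("v"))).alias("ratio")))
result.show()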

Understanding Spark's caching

I'm trying to understand how Spark's cache work.
Here is my naive understanding, please let me know if I'm missing something:
val rdd1 = sc.textFile("some data")
rdd1.cache() //marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume), and then read from the cache (assuming there is enough RAM) when rdd3 is saved.
Now here is my question. Let's say I want to cache rdd2 and rdd3 as they will both be used later on, but I don't need rdd1 after creating them.
Basically there is duplication, isn't there? Since once rdd2 and rdd3 are calculated I don't need rdd1 anymore, I should probably unpersist it, right? The question is: when?
Will this work? (Option A)
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
Does Spark add the unpersist call to the DAG, or is it done immediately? If it's done immediately, then rdd1 will effectively be uncached when I read from rdd2 and rdd3, right?
Should I do it this way instead (Option B)?
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
So the question is this:
Is Option A good enough? i.e. will rdd1 still load the file only once?
Or do I need to go with Option B?
It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution.
This is relevant because a cache or persist call just adds the RDD to a Map of RDDs that marked themselves to be persisted during job execution. However, unpersist directly tells the blockManager to evict the RDD from storage and removes the reference in the Map of persistent RDDs.
(See the persist and unpersist functions in the Spark source.)
So you would need to call unpersist after Spark actually executed and stored the RDD with the block manager.
The comments for the RDD.persist method in the Spark source hint towards this as well.
In Option A, you have not shown when you are calling the action (the call to save):
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
If the sequence is as above, Option A should use the cached version of rdd1 for computing both rdd2 and rdd3.
Option B is an optimal approach with a small tweak: make use of less expensive action methods. In the code you posted, saveAsTextFile is an expensive operation; replace it with the count method.
The idea here is to drop the big rdd1 from the cache once it's not relevant for further computation (i.e., after rdd2 and rdd3 have been created and materialized).
Updated approach:
val rdd1 = sc.textFile("some data").cache()
val rdd2 = rdd1.filter(...).cache()
val rdd3 = rdd1.map(...).cache()
rdd2.count
rdd3.count
rdd1.unpersist()
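A small PySpark sketch of that ordering (the path and the lambdas are placeholders), which makes the timing explicit:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.textFile("some data").cache()               # placeholder path
rdd2 = rdd1.filter(lambda line: "x" in line).cache()  # placeholder filter
rdd3 = rdd1.map(lambda line: line.upper()).cache()    # placeholder map

rdd2.count()  # first action: materializes rdd1 into the cache, then rdd2
rdd3.count()  # reuses the cached rdd1 while materializing rdd3

rdd1.unpersist()       # safe now: rdd2 and rdd3 are cached independently
print(rdd1.is_cached)  # False after unpersist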
