Performance benefit of separate actions in a series of transformations - apache-spark

Consider a series of heavy transformations:
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
val df3 = df2.join(...)
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(..,..)))
df5.count
Is there any performance benefit to adding an action after each transformation, instead of taking a single action after the last transformation?
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
df2.count // does it help to ease the performance impact of df5.count?
val df3 = df2.join(...)
df3.count // does it help to ease the performance impact of df5.count?
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
df4.count // does it help to ease the performance impact of df5.count?
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(..,..)))
df5.count
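If the intent is to reuse an intermediate result, the usual pattern is to cache it before running the action; the action then materialises the cached data and the later transformations read from it instead of recomputing it. A minimal sketch of that pattern, with hypothetical column names standing in for the elided ones in the question:

import org.apache.spark.sql.functions.sum

// Hypothetical column names; the question elides the real ones.
val df2Cached = df1.groupBy("customerId").agg(sum("amount").as("total")).cache()
df2Cached.count   // the action materialises the cached aggregate
// building df3 ... df5 on top of df2Cached then reuses the in-memory data
// instead of recomputing the aggregation from the big table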


Is Spark caching required on the last common part of 2 actions?

My code:
df1 = sql_context.sql("select * from table1") #should I cache here?
df2 = sql_context.sql("select * from table2") #should I cache here?
df1 = df1.where(df1.id == '5')
df1 = df1.where(df1.city == 'NY')
joined_df = df1.join(df2, on = "key") # should I cache here?
output_df = joined_df.where(joined_df.x == 5)
joined_df.write.format("csv").save(path1)
output_df.write.format("csv").save(path2)
So I have two actions in the code; both of them depend on the filters applied to df1 and on the join with df2.
Where is the right place to use cache() in this code?
Should I cache df1 and df2, because they are used in both of the actions?
Or should I cache only joined_df, which is the last common part between these two actions?
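For what it's worth, one common way to reason about this: caching df1 and df2 would still leave the join to be computed twice, whereas caching the last DataFrame both actions share (joined_df) lets the table scans, the filters on df1, and the join run only once. A minimal Scala sketch of that pattern (Scala to match the rest of this page; the PySpark calls cache() and write are the same, and the output paths here are hypothetical stand-ins for the question's path1/path2):

import org.apache.spark.sql.functions.col

val path1 = "output/joined_df"   // hypothetical output paths
val path2 = "output/output_df"

val df1 = spark.sql("select * from table1")
  .where(col("id") === "5")
  .where(col("city") === "NY")
val df2 = spark.sql("select * from table2")

// Cache the last common ancestor of the two actions
val joinedDf = df1.join(df2, Seq("key")).cache()

joinedDf.write.format("csv").save(path1)                       // first action computes and caches the join
joinedDf.where(col("x") === 5).write.format("csv").save(path2) // second action reuses the cached join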

How to combine several RDDs with different lengths into a single RDD with a specific order pattern?

I have several RDDs with different lengths:
RDD1 : [a, b, c, d, e, f, g]
RDD2 : [1, 3 ,2, 44, 5]
RDD3 : [D, F, G]
I want to combine them into one RDD with the following order pattern:
every 5 rows: take 2 rows from RDD1, take 2 from RDD2, then take 1 from RDD3.
This pattern should repeat until all the RDDs are exhausted.
The output for the example above should be:
RDDCombine : [a,b,1,3,D, c,d,2,44,F, e,f,5,G, g]
How can I achieve this? Thanks a lot!
Background: I'm designing a recommender system. I have several RDD outputs from different algorithms, and I want to combine them in some order pattern to make a hybrid recommendation.
I would not say it's an optimal solution, but it may help you get started. Again, this is not production-ready code. Also, I have used a single partition because the data is small, but you can change that.
def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("some")
  val sc = new SparkContext(conf)

  // One partition each, so zipPartitions sees every element in a single iterator
  val rdd2 = sc.parallelize(Seq(1, 3, 2, 44, 5), 1)
  val rdd1 = sc.parallelize(Seq('a', 'b', 'c', 'd', 'e', 'f', 'g'), 1)
  val rdd3 = sc.parallelize(Seq('D', 'F', 'G'), 1)

  val groupingCount = 2
  val rdd = rdd1.zipPartitions(rdd2, rdd3)((a, b, c) => {
    // Chunk the iterators: 2 elements from rdd1, 2 from rdd2, 1 from rdd3
    val ag = a.grouped(groupingCount)
    val bg = b.grouped(groupingCount)
    val cg = c.grouped(1)
    // Interleave the chunks; zip stops at the shortest iterator
    ag.zip(bg).zip(cg).map(x => x._1._1 ++ x._1._2 ++ x._2)
  })
  rdd.foreach(println)
  sc.stop()
}
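One caveat with the snippet above: zip stops at the shortest iterator, so the trailing g from RDD1 is silently dropped and the output ends at ...e,f,5,G. A hedged variant (a sketch, only checked against this toy input) pads the exhausted iterators with empty chunks via zipAll so the leftovers still come through:

val rddAll = rdd1.zipPartitions(rdd2, rdd3)((a, b, c) => {
  val ag = a.grouped(2)
  val bg = b.grouped(2)
  val cg = c.grouped(1)
  // zipAll keeps going until the longest iterator is exhausted,
  // substituting an empty chunk wherever an iterator has run out
  ag.zipAll(bg, Seq.empty[Char], Seq.empty[Int])
    .zipAll(cg, (Seq.empty[Char], Seq.empty[Int]), Seq.empty[Char])
    .map { case ((x, y), z) => x ++ y ++ z }
})
rddAll.foreach(println)   // the last chunk is just List(g)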

Generating multiple columns dynamically using a loop in a pyspark dataframe

I have a requirement to generate multiple columns dynamically in PySpark. I have written code like the below to accomplish this.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext()
sqlContext = SQLContext(sc)
cols = ['a', 'b', 'c']
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    df1 = df.withColumn(i, lit('hi'))  # each pass starts from df again, so only the last column survives
df1.show()
However, columns a and b are missing from the final result. Please help.
I changed the code as below. It works now, but I wanted to know whether there is a better way of handling it.
cols = ['a', 'b', 'c']
cols_add = []
flg_first = 'Y'
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    print('start' + str(df.columns))
    if flg_first == 'Y':
        df1 = df.withColumn(i, lit('hi'))
        cols_add.append(i)
        flg_first = 'N'
    else:
        # keep the columns added so far, then add the next one
        df1 = df1.select(df.columns + cols_add).withColumn(i, lit('hi'))
        cols_add.append(i)
    print('end' + str(df1.columns))
df1.show()
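As for a better way: the usual fix is simply to keep reassigning the same DataFrame inside the loop (df = df.withColumn(i, lit('hi'))), which removes the need for the flag and the column bookkeeping. The equivalent in Scala, the language used elsewhere on this page, is a single foldLeft over the column names; a minimal sketch, assuming a DataFrame df already read as in the question:

import org.apache.spark.sql.functions.lit

val cols = Seq("a", "b", "c")
// Fold the column names over the DataFrame, adding one literal column per name
val dfWithAll = cols.foldLeft(df)((acc, name) => acc.withColumn(name, lit("hi")))
dfWithAll.show()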

How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduces the handy function exceptAll, which subtracts one DataFrame from another while keeping duplicates.
Example
val df1 = Seq(
("a", 1L),
("a", 1L),
("a", 1L),
("b", 2L)
).toDF("id", "value")
val df2 = Seq(
("a", 1L),
("b", 2L)
).toDF("id", "value")
df1.exceptAll(df2).collect()
// will return
Seq(("a", 1L),("a", 1L))
However I can only use Spark 2.3.0.
What is the best way to implement this using only functions from Spark 2.3.0?
One option is to use row_number to generate a sequence number column and use it in a left join to keep the rows of df1 that have no match in df2.
A PySpark solution is shown here.
from pyspark.sql.functions import row_number
from pyspark.sql import Window
w1 = Window.partitionBy(df1.id).orderBy(df1.value)
w2 = Window.partitionBy(df2.id).orderBy(df2.value)
df1 = df1.withColumn("rnum", row_number().over(w1))
df2 = df2.withColumn("rnum", row_number().over(w2))
res_like_exceptAll = (
    df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left')
       .filter(df2.id.isNull())      # identifies the rows of df1 with no match in df2
       .select(df1.id, df1.value)
)
res_like_exceptAll.show()
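Since the question asks for Scala on Spark 2.3.0, here is a hedged Scala sketch of the same idea: number the duplicate rows, then use a left_anti join so each row in df2 cancels at most one matching row in df1. Partitioning the window by every column (here id and value) keeps the numbering tied to exact duplicates:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank identical rows within each (id, value) group: 1, 2, 3, ...
val w = Window.partitionBy(col("id"), col("value")).orderBy(col("value"))

val left  = df1.withColumn("rnum", row_number().over(w))
val right = df2.withColumn("rnum", row_number().over(w))

// left_anti keeps the df1 rows whose (id, value, rnum) has no counterpart in df2
val exceptAllLike = left.join(right, Seq("id", "value", "rnum"), "left_anti").drop("rnum")
exceptAllLike.show()   // two rows of (a, 1), matching exceptAll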

Very slow writing of a dataframe to file on Spark cluster

I have a test program that writes a DataFrame to a file. The DataFrame is generated as sliding windows of sequential numbers, one row per window, like:
1,2,3,4,5,6,7.....11
2,3,4,5,6,7,8.....12
......
There are 100,000 rows in the DataFrame, which I don't think is too big.
When I submit the Spark task, it takes almost 20 minutes to write the dataframe to file on HDFS. I am wondering why it is so slow, and how to improve the performance.
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val numCol = 11
val arraydataInt = (1 to 100000).toArray
val arraydata = arraydataInt.map(x => x.toDouble)
val slideddata = arraydata.sliding(numCol).toSeq
val rows = arraydata.sliding(numCol).map { x => Row(x: _*) }
val datasetsize = arraydataInt.size
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()
val schemaString = "value1 value2 value3 value4 value5 " +
"value6 value7 value8 value9 value10 label"
val schema =
StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
val df = sqlContext.createDataFrame(myrdd, schema).cache()
val splitsH = df.randomSplit(Array(0.8, 0.1))
val trainsetH = splitsH(0).cache()
val testsetH = splitsH(1).cache()
println("now saving training and test samples into files")
trainsetH.write.save("TrainingSample.parquet")
testsetH.write.save("TestSample.parquet")
Turn
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()
To
val myrdd = sc.makeRDD(rows.toSeq, 100).persist()
You've made an RDD with arraydata.size - numCol partitions, and each partition leads to a task that adds scheduling overhead. Generally speaking, the number of partitions is a trade-off between the level of parallelism and that extra cost. Try 100 partitions and it should work much better.
By the way, the official tuning guide suggests setting this number to 2 or 3 times the number of CPU cores in your cluster.
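A minimal sketch of sizing the partition count from the cluster rather than from the data, assuming the sc and rows from the question; sc.defaultParallelism is the total number of cores Spark sees, and the factor of 3 follows the 2-3 tasks-per-core suggestion:

// Derive the partition count from the cluster size instead of the row count
val numPartitions = 3 * sc.defaultParallelism
val myrdd = sc.makeRDD(rows.toSeq, numPartitions).persist()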
