I have a test program that writes a dataframe to a file. The dataframe is generated by filling each row with sequential numbers, like:
1,2,3,4,5,6,7.....11
2,3,4,5,6,7,8.....12
......
There are 100,000 rows in the dataframe, which I don't think is too big.
When I submit the Spark job, it takes almost 20 minutes to write the dataframe to a file on HDFS. I am wondering why it is so slow, and how to improve the performance.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val numCol = 11
val arraydataInt = (1 to 100000).toArray
val arraydata = arraydataInt.map(x => x.toDouble)
val slideddata = arraydata.sliding(numCol).toSeq
val rows = arraydata.sliding(numCol).map { x => Row(x: _*) }
val datasetsize = arraydataInt.size

// one partition per sliding window (arraydata.size - numCol partitions)
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()

val schemaString = "value1 value2 value3 value4 value5 " +
  "value6 value7 value8 value9 value10 label"
val schema =
  StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))

val df = sqlContext.createDataFrame(myrdd, schema).cache()

val splitsH = df.randomSplit(Array(0.8, 0.1))
val trainsetH = splitsH(0).cache()
val testsetH = splitsH(1).cache()

println("now saving training and test samples into files")
trainsetH.write.save("TrainingSample.parquet")
testsetH.write.save("TestSample.parquet")
Turn
val myrdd = sc.makeRDD(rows.toSeq, arraydata.size - numCol).persist()
To
val myrdd = sc.makeRDD(rows.toSeq, 100).persist()
You've made an RDD with arraydata.size - numCol partitions (nearly 100,000 of them), and each partition becomes a task that carries extra scheduling overhead. Generally speaking, the number of partitions is a trade-off between the level of parallelism and that per-task overhead. Try 100 partitions and it should work much better.
By the way, the official guide suggests setting this number to 2 or 3 times the number of CPU cores in your cluster.
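As a rough sketch of that rule of thumb (deriving the count from sc.defaultParallelism and using a factor of 3 are my own choices here, not something prescribed by the original code):

// Sketch: size the partition count from the cluster instead of from the data length.
// sc.defaultParallelism is roughly the total number of cores available to the
// application on most cluster managers; the factor of 3 follows the tuning
// guide's "2-3 tasks per CPU core" rule of thumb.
val numPartitions = 3 * sc.defaultParallelism
val myrdd = sc.makeRDD(rows.toSeq, numPartitions).persist()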
Related
Consider a series of heavy transformations:
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
val df3 = df2.join(...)
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count
Is there any performance benefit to adding an action after each transformation, instead of taking a single action on the last transformation?
val df1 = spark.sql("select * from big table where...")
val df2 = df1.groupBy(...).agg(..)
df2.count // does it help to ease the performance impact of df5.count?
val df3 = df2.join(...)
df3.count // does it help to ease the performance impact of df5.count?
val df4 = df3.columns.foldLeft(df3)((inputDF, column) => (...))
df4.count // does it help to ease the performance impact of df5.count?
val df5 = df4.withColumn("newColName", row_number().over(Window.orderBy(.., ..)))
df5.count
I have several RDDs with different lengths:
RDD1 : [a, b, c, d, e, f, g]
RDD2 : [1, 3 ,2, 44, 5]
RDD3 : [D, F, G]
I want to combine them into one RDD, following this ordering pattern:
every 5 rows: take 2 rows from RDD1, then 2 rows from RDD2, then 1 row from RDD3
This pattern should loop until all RDDs are exhausted.
The output for the example above should be:
RDDCombine : [a,b,1,3,D, c,d,2,44,F, e,f,5,G, g]
How can I achieve this? Thanks a lot!
Background: I'm designing a recommender system. I have several RDD outputs from different algorithms, and I want to combine them in some ordering pattern to build a hybrid recommendation.
I would not say it's an optimal solution, but it may help you get started; again, this is not at all production-ready code. Also, I have used 1 partition because the data is small, but you can adjust that.
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("some")
  val sc = new SparkContext(conf)

  // Single-partition RDDs, so zipPartitions sees each full sequence in one iterator.
  val rdd2 = sc.parallelize(Seq(1, 3, 2, 44, 5), 1)
  val rdd1 = sc.parallelize(Seq('a', 'b', 'c', 'd', 'e', 'f', 'g'), 1)
  val rdd3 = sc.parallelize(Seq('D', 'F', 'G'), 1)

  val groupingCount = 2

  // Group rdd1 and rdd2 into chunks of 2 and rdd3 into chunks of 1,
  // then zip the chunk iterators together and concatenate each triple.
  val rdd = rdd1.zipPartitions(rdd2, rdd3)((a, b, c) => {
    val ag = a.grouped(groupingCount)
    val bg = b.grouped(groupingCount)
    val cg = c.grouped(1)
    ag.zip(bg).zip(cg).map(x => x._1._1 ++ x._1._2 ++ x._2)
  })

  rdd.foreach(println)
  sc.stop()
}
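For what it's worth, here is a small check of what the combined RDD contains when this sketch is run locally. Note that Iterator.zip stops at the shortest of the three grouped iterators, so the trailing g from rdd1 is dropped rather than emitted on its own, which differs slightly from the output the question asks for:

// Collect to the driver and print in order (the concrete Seq type of each
// group may vary, so only the element groupings are shown):
rdd.collect().foreach(println)
//   (a, b, 1, 3, D)
//   (c, d, 2, 44, F)
//   (e, f, 5, G)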
When I attempt to generate the connected components using GraphFrames, it takes substantially longer than I expected. I am running Spark 2.1 and GraphFrames 0.5 on AWS EMR with 3 r4.xlarge instances. Generating the connected components for a graph of about 12 million edges takes around 3 hours.
The code is below. I am fairly new to Spark, so any suggestions would be awesome.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.graphframes.GraphFrame

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf()
    .setMaster("yarn-cluster")
    .setAppName("Connected Component")
  val sc = new SparkContext(sparkConf)
  sc.setCheckpointDir("s3a://......")
  AWSUtils.setS3Credentials(sc.hadoopConfiguration)

  implicit val sqlContext = SQLContext.getOrCreate(sc)
  import sqlContext.implicits._

  val historical = sqlContext
    .read
    .option("mergeSchema", "false")
    .parquet("s3a://.....")
    .map(x => (x(0).toString, x(2).toString, x(1).toString, x(3).toString, x(4).toString.toLong, x(5).toString.toLong))

  // Complete graph
  val g = GraphFrame(
    historical.flatMap(e => List((e._1, e._3, e._5), (e._2, e._4, e._5))).toDF("id", "type", "timestamp"),
    historical.toDF("src", "dst", "srcType", "dstType", "timestamp", "companyId")
  )

  val connectedComponents: DataFrame = g.connectedComponents.run()
  connectedComponents.toDF().show(100, false)

  sc.stop()
}
I'm using Spark v1.5.2. I wrote a program in Python and I don't understand why it reads the input files twice. The same program written in Scala only reads the input files once.
I use an accumulator to count the number of times that map() is called. From the accumulator value, I infer the number of times the input file is read.
The input file contains 3 lines of text.
Python:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import *
def createTuple(record):  # used with map()
    global map_acc
    map_acc += 1
    return (record[0], record[1].strip())
sc = SparkContext(appName='Spark test app') # appName is shown in the YARN UI
sqlContext = SQLContext(sc)
map_acc = sc.accumulator(0)
lines = sc.textFile("examples/src/main/resources/people.txt")
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple) #.cache()
fieldNames = 'name age'
fields = [StructField(field_name, StringType(), True) for field_name in fieldNames.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people_rdd, schema)
print 'record count DF:', df.count()
print 'map_acc:', map_acc.value
#people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 6 ##### why 6 instead of 3??
Scala:
import org.apache.spark._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object SimpleApp {
  def main(args: Array[String]) {
    def createTuple(record: Array[String], map_acc: Accumulator[Int]) = { // used with map()
      map_acc += 1
      Row(record(0), record(1).trim)
    }

    val conf = new SparkConf().setAppName("Scala Test App")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val map_acc = sc.accumulator(0)
    val lines = sc.textFile("examples/src/main/resources/people.txt")
    val people_rdd = lines.map(_.split(",")).map(createTuple(_, map_acc))
    val fieldNames = "name age"
    val schema = StructType(
      fieldNames.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val df = sqlContext.createDataFrame(people_rdd, schema)
    println("record count DF: " + df.count)
    println("map_acc: " + map_acc.value)
  }
}
$ spark-submit --class SimpleApp --master local[1] test.jar 2> err
record count DF: 3
map_acc: 3
If I uncomment the cache() and unpersist() calls in the Python program so that the RDD is cached, then the input file is not read twice. However, I don't think I should have to cache the RDD, right? The Scala version doesn't need to cache the RDD.
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple).cache()
...
people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 3
$ hdfs dfs -cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
It happens because in 1.5, createDataFrame eagerly validates the provided schema against a few elements:
elif isinstance(schema, StructType):
    # take the first few rows to verify schema
    rows = rdd.take(10)
    for row in rows:
        _verify_type(row, schema)
In contrast, current versions validate the schema for all elements, but this is done lazily, so you wouldn't see the same behavior. For example, this would fail instantly in 1.5:
from pyspark.sql.types import *
rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
but the 2.0 equivalent would fail only when you try to evaluate the DataFrame.
In general, you shouldn't expect Python and Scala code to behave the same way unless you strictly limit yourself to interactions with the SQL API. PySpark:
Implements almost all RDD methods natively, so the same chain of transformations can result in a different DAG.
Interactions with the Java API may require eager evaluation to provide type information for Java classes.
In Spark-shell, I run the following code:
scala> val input = sc.parallelize(List(1, 2, 4, 1881824400))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val result = input.map(x => 2*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
scala> println(result.collect().mkString(","))
2,4,8,-531318496
Why is the result of 2*1881824400 equal to -531318496 and not 3763648800?
Is that a bug in Spark?
Thanks for your help.
Thanks ccheneson and hveiga. The answer is that the mapping makes the result bigger than 2^31 - 1, which exceeds the range of Int, so the number wraps around into the negative range.
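As a minimal sketch of one way around this (my own addition, not from the original thread): widen the values to Long before multiplying, so the product is computed with 64-bit arithmetic instead of overflowing Int.

// Using the literal 2L promotes the multiplication to Long, so the
// product 3763648800 fits and no wrap-around occurs.
val result = input.map(x => 2L * x)       // RDD[Long]
println(result.collect().mkString(","))   // 2,4,8,3763648800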