I am working with Spark GraphX. I am building a graph from a file (around 620 mb, 50K vertices and almost 50 millions of edges). I am using a spark cluster with: 4 workers, each one with 8 cores and 13.4g of ram, 1 driver with the same specs. When I submit my .jar to the cluster, randomly one of the workers loads all the data on it. All the task needed for the computing are requested to that worker. While the computing the remaining three are without doing nothing. I have try everything and i do not found nothing that can force to compute in all of the workers.
When Spark build the graph and I look for the number of partitions of the RDD of vertices say 5, but if I repartition that RDD for example with 32 (number of cores in total) Spark load the data in every worker but gets slow the computation.
Im launching the spark submit by this way:
spark-submit --master spark://172.30.200.20:7077 --driver-memory 12g --executor-memory 12g --class interscore.InterScore /root/interscore/interscore.jar hdfs://172.30.200.20:9000/user/hadoop/interscore/network.dat hdfs://172.30.200.20:9000/user/hadoop/interscore/community.dat 111
The code is here:
object InterScore extends App{
val sparkConf = new SparkConf().setAppName("Big-InterScore")
val sc = new SparkContext(sparkConf)
val t0 = System.currentTimeMillis
runInterScore(args(0), args(1), args(2))
println("Running time " + (System.currentTimeMillis - t0).toDouble / 1000)
sc.stop()
def runInterScore(netPath:String, communitiesPath:String, outputPath:String) = {
val communities = sc.textFile(communitiesPath).map(x => {
val a = x.split('\t')
(a(0).toLong, a(1).toInt)
}).cache
val graph = GraphLoader.edgeListFile(sc, netPath, true)
.partitionBy(PartitionStrategy.RandomVertexCut)
.groupEdges(_ + _)
.joinVertices(communities)((_, _, c) => c)
.cache
val lvalues = graph.aggregateMessages[Double](
m => {
m.sendToDst(if (m.srcAttr != m.dstAttr) 1 else 0)
m.sendToSrc(if (m.srcAttr != m.dstAttr) 1 else 0)
}, _ + _)
val communitiesIndices = communities.map(x => x._2).distinct.collect
val verticesWithLValue = graph.vertices.repartition(32).join(lvalues).cache
println("K = " + communitiesIndices.size)
graph.unpersist()
graph.vertices.unpersist()
communitiesIndices.foreach(c => {
//COMPUTE c
}
})
}
}
Related
Why Spark creates multiple stages for wide transformation even if data is present in one partition?
val a = sc.parallelize(Array("This","is","a","This","is","file"),1)
val b = a.map(x => (x,1))
val c = b.reduceByKey(_+_)
c.collect
I trying to append some rows (5 million rows/ 2800 columns) in a Hive table through Spark/Scala, but the process seems to stuck after long hours. The logs don't show any errors.
How can I be sure the process is really running?
Is there something to do to optimize the job?
My submit configs:
--driver-memory 15 G
--executor-memory 30g
--num-executors 35
--executor-cores 5
Thanks!
def exprToAppend(myCols: Set[String], allCols: Set[String]) = {
import org.apache.spark.sql.functions._
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(0d).as(x)
})
}
val insert : DataFrame = tableFinal.select(exprToAppend(tableFinal.columns.toSet, historico.columns.toSet):_ *).select(historico.columns.map(x => col(x)) :_*);
insert.write.mode("append")
.format("parquet")
.insertInto(s"${Configuration.SIGLA}${Configuration.TABLE_HIST}")
I have spark job code as below. Which works fine with below configuration on cluster.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = spark.read()
.textFile(path)
.javaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset responseSet = sparkSession.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
Whereas, If I want to read binary file(same size as text) it takes much more time.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = sparkContext
.binaryRecords(path, 10000, new Configuration())
.toJavaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
Below is my configuration.
driver-memory 8g
executor-memory 6g
num-executors 16
Time taken by first code with 150 MB file is 1.30 mins.
Time taken by second code with 150 MB file is 4 mins.
Also, first code was able to run on all 16 executors whereas second uses only one.
ny suggestions why it is slow?
I found the issue. The textFile()method was creating 16 partitions(you can checknumOfPartitions using getNumPartitions() method on RDD) whereas binaryRecords() created only 1(Java binaryRecords doesn't provide overloaded method which specifies num of partitions to be created).
I increased numOfPartitions on RDD created by binaryRecords() by using repartition(NUM_OF_PARTITIONS) method on RDD.
I use spark streaming to receive Kafka's data like this:
val conf = new SparkConf()
conf.setMaster("local[*]").setAppName("KafkaStreamExample")
.setSparkHome("/home/kufu/spark/spark-1.5.2-bin-hadoop2.6")
.setExecutorEnv("spark.executor.extraClassPath","target/scala-2.11/sparkstreamexamples_2.11-1.0.jar")
val threadNum = 3
val ssc = new StreamingContext(conf, Seconds(2))
val topicMap = Map(consumeTopic -> 1)
val dataRDDs:IndexedSeq[InputDStream[(String, String)]] = approachType match {
case KafkaStreamJob.ReceiverBasedApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createStream(ssc, zkOrBrokers, "testKafkaGroupId", topicMap))
case KafkaStreamJob.DirectApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, Map[String, String]("metadata.broker.list" -> zkOrBrokers),
Set[String](consumeTopic)))
}
//dataRDDs.foreach(_.foreachRDD(genProcessing(approachType)))
val dataRDD = ssc.union(dataRDDs)
dataRDD.foreachRDD(genProcessing(approachType))
ssc.start()
ssc.awaitTermination()
the genProcessing generates a process to deal with the data, which will takes 5s(sleep 5s). Code are like this:
def eachRDDProcessing(rdd:RDD[(String, String)]):Unit = {
if(count>max) throw new Exception("Stop here")
println("--------- num: "+count+" ---------")
val batchNum = count
val curTime = System.currentTimeMillis()
Thread.sleep(5000)
val family = approachType match{
case KafkaStreamJob.DirectApproach => KafkaStreamJob.DirectFamily
case KafkaStreamJob.ReceiverBasedApproach => KafkaStreamJob.NormalFamily
}
val families = KafkaStreamJob.DirectFamily :: KafkaStreamJob.NormalFamily :: Nil
val time = System.currentTimeMillis().toString
val messageCount = rdd.count()
rdd.foreach(tuple => {
val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable,
KafkaStreamJob.zookeeper, families)
hBaseConn.openOrCreateTable()
val puts = new java.util.ArrayList[Put]()
val strs = tuple._2.split(":")
val row = strs(1) + ":" + strs(0) + ":" + time
val put = new Put(Bytes.toBytes(row))
put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
puts.add(put)
hBaseConn.puts(puts)
hBaseConn.close()
})
count+=1
println("--------- add "+messageCount+" messages ---------")
}
eachRDDProcessing
but the spark streaming doesn't start multi-thread.Tasks were processed one by one, and each task took around 5s. My machine has 8 cores, and the spark run on one node.
I don't spark streaming will start threads especially on driver. The point is if you have multiple nodes, your genProcessing will run on different nodes.
Further, if you call rdd.foreachPartition(...), suppose it should get better parallelism
Scenario: I am doing some testing with spark streaming. The files with around 100 records comes in every 25 seconds.
Problem: The processing is taking on average 23 seconds for 4 core pc using local[*] in the program. When i deploy the same app to server with 16 cores i was expecting an improvement in processing time. However, i see it is still taking same time in 16 cores as well (also checked cpu usages in ubuntu and cpu is being fully utilized). All the configurations are default provided by spark.
Question:
Should not processing time decrease with increase in number of cores available for the streaming job?
Code:
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getCanonicalName)
.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(25))
val sqc = new SQLContext(sc)
val gpsLookUpTable = MapInput.cacheMappingTables(sc, sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
val broadcastTable = sc.broadcast(gpsLookUpTable)
jsonBuilder.append("[")
ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
.foreachRDD { rdd =>
if (!rdd.partitions.isEmpty) {
val header = rdd.first().split(",")
val rowsWithoutHeader = Utils.dropHeader(rdd)
rowsWithoutHeader.foreach { row =>
jsonBuilder.append("{")
val singleRowArray = row.split(",")
(header, singleRowArray).zipped
.foreach { (x, y) =>
jsonBuilder.append(convertToStringBasedOnDataType(x, y))
// GEO Hash logic here
if (x.equals("GPSLat") || x.equals("Lat")) {
lattitude = y.toDouble
}
else if (x.equals("GPSLon") || x.equals("Lon")) {
longitude = y.toDouble
if (x.equals("Lon")) {
// This section is used to convert GPS Look Up to GPS LookUP with Hash
jsonBuilder.append(convertToStringBasedOnDataType("geoCode", GeoHash.encode(lattitude, longitude)))
}
else {
val selectedRow = broadcastTable.value
.filter("geoCode LIKE '" + GeoHash.subString(lattitude, longitude) + "%'")
.withColumn("Distance", calculateDistance(col("Lat"), col("Lon")))
.orderBy("Distance")
.select("TrackKM", "TrackName").take(1)
if (selectedRow.length != 0) {
jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", selectedRow(0).get(0)))
jsonBuilder.append(convertToStringBasedOnDataType("TrackName", selectedRow(0).get(1)))
}
else {
jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", "NULL"))
jsonBuilder.append(convertToStringBasedOnDataType("TrackName", "NULL"))
}
}
}
}
jsonBuilder.setLength(jsonBuilder.length - 1)
jsonBuilder.append("},")
}
sc.parallelize(Seq(jsonBuilder.toString)).repartition(1).saveAsTextFile("hdfs://localhost:9000/outputDirectory")
It sounds like you are using only one thread, whether the application runs on a machine with 4 or 16 cores won't matter if that is the case.
It sounds like 1 file comes in, that 1 file is 1 RDD partition with 100 rows. You iterate over the rows in that RDD and append the jsonBuilder. At the end you call repartition(1) which will make the writing of the file single threaded.
You could reparation your data-set to 12 RDD partitions after you pick up the file, to ensure that other threads work on the rows. But unless I am missing something you are lucky this isn't happening. What happens if two threads are calling jsonBuilder.append("{") at the same time? Won't they create invalid JSON. I could be missing something here.
You could test to see if I am correct about the single threaded-ness of your application by adding logging like this:
scala> val rdd1 = sc.parallelize(1 to 10).repartition(1)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at repartition at <console>:21
scala> rdd1.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-40 => 1
Executor task launch worker-40 => 2
Executor task launch worker-40 => 3
Executor task launch worker-40 => 4
Executor task launch worker-40 => 5
Executor task launch worker-40 => 6
Executor task launch worker-40 => 7
Executor task launch worker-40 => 8
Executor task launch worker-40 => 9
Executor task launch worker-40 => 10
scala> val rdd3 = sc.parallelize(1 to 10).repartition(3)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at repartition at <console>:21
scala> rdd3.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-109 => 1
Executor task launch worker-108 => 2
Executor task launch worker-95 => 3
Executor task launch worker-95 => 4
Executor task launch worker-109 => 5
Executor task launch worker-108 => 6
Executor task launch worker-108 => 7
Executor task launch worker-95 => 8
Executor task launch worker-109 => 9
Executor task launch worker-108 => 10