Spark GraphFrames High Shuffle read/write - apache-spark

Hi, I have created a graph from vertex and edge files. The size of the graph is 600 GB. I am querying this graph using the motif feature of Spark GraphFrames.
I have set up an AWS EMR cluster for querying the graph.
Cluster details: 1 master and 8 slaves
Master Node:
m5.xlarge
4 vCore, 16 GiB memory, EBS only storage
EBS Storage:64 GiB
Slave Node:
m5.4xlarge
16 vCore, 64 GiB memory, EBS only storage
EBS Storage:256 GiB (per instance)
I am facing very high shuffle read (3.4 TB) and write (2 TB). This is affecting performance, and it takes around 50 minutes to execute just 10 queries. Is there any way to reduce such high shuffle?
Following is my Spark code:
val spark = SparkSession.builder.appName("SparkGraph POC").getOrCreate()
val g: GraphFrame = GraphFrame(vertexDf, edgeDf)

// queries
val q1 = g.find(" (a)-[r1]->(b); (b)-[r2]->(c)")
q1.filter(
    " r1.relationship = 'knows' and" +
    " r2.relationship = 'knows'")
  .distinct()
  .createOrReplaceTempView("q1table")

spark.sql("select a.id as a_id, a.name as a_name, " +
    "b.id as b_id, b.name as b_name, " +
    "c.id as c_id, c.name as c_name from q1table")
  .write
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .csv(resFilePath + "/q1")

spark.catalog.uncacheTable("q1table")

val q2 = g.find(" (a)-[r1]->(b); (b)-[r2]->(c); (c)-[r3]->(d); (d)-[r4]->(e)")
q2.filter(
    " a.name = 'user1' and" +
    " e.name = 'user4' and" +
    " r1.relationship = 'knows' and" +
    " r2.relationship = 'knows' and" +
    " r3.relationship = 'knows' and" +
    " r4.relationship = 'knows'")
  .distinct()
  .createOrReplaceTempView("q2table")

spark.sql("select a.id as a_id, a.name as a_name, " +
    "e.id as e_id, e.name as e_name from q2table")
  .write
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .csv(resFilePath + "/q2")

spark.catalog.uncacheTable("q2table")
spark.stop()

The problem with the GraphFrames implementation is that it self-joins the internal DataFrames as many times as there are relationships in the motif. That means the shuffle grows more and more as the length of the chain increases.
You can see more details at https://www.waitingforcode.com/apache-spark-graphframes/motifs-finding-graphframes/read
I have also tried a similar approach and have seen that when the length of the chain is greater than 12, Spark stops responding and the connections with the executors are lost, even if I increase the resources.
If you are trying to do that, I would recommend using a graph database instead.
Hope this helps.
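If you have to stay on GraphFrames, one thing that sometimes reduces the shuffled volume is shrinking the inputs to those self-joins before the motif is expanded. Below is a minimal sketch, reusing vertexDf/edgeDf from the question and assuming the only relationship you ever query is 'knows'; the pruning step itself is my suggestion, not something GraphFrames does for you:

import org.graphframes.GraphFrame

// Keep only the edges that can ever satisfy the motif filters, so every
// self-join inside find() works on a much smaller DataFrame.
val knowsEdges = edgeDf.filter("relationship = 'knows'")

// Optionally drop vertices that no longer appear in any remaining edge.
val usedIds = knowsEdges.select("src").union(knowsEdges.select("dst")).distinct()
val prunedVertices = vertexDf.join(usedIds, vertexDf("id") === usedIds("src"), "leftsemi")

val gKnows = GraphFrame(prunedVertices, knowsEdges)

// The motif no longer needs the relationship predicates at all.
val q1 = gKnows.find("(a)-[r1]->(b); (b)-[r2]->(c)").distinct()

This does not change the quadratic nature of long chains, it only makes each join input smaller, so the graph-database recommendation above still stands for deep motifs.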

Related

Why does Spark partition all the data onto one executor?

I am working with Spark GraphX. I am building a graph from a file (around 620 MB, 50K vertices and almost 50 million edges). I am using a Spark cluster with 4 workers, each one with 8 cores and 13.4 GB of RAM, and 1 driver with the same specs. When I submit my .jar to the cluster, one of the workers (chosen at random) loads all the data onto itself. All the tasks needed for the computation are requested from that worker. While it computes, the remaining three do nothing. I have tried everything and have not found anything that can force the computation onto all of the workers.
When Spark builds the graph and I look at the number of partitions of the RDD of vertices, it says 5, but if I repartition that RDD, for example to 32 (the total number of cores), Spark loads the data onto every worker but the computation gets slower.
I am launching spark-submit this way:
spark-submit --master spark://172.30.200.20:7077 --driver-memory 12g --executor-memory 12g --class interscore.InterScore /root/interscore/interscore.jar hdfs://172.30.200.20:9000/user/hadoop/interscore/network.dat hdfs://172.30.200.20:9000/user/hadoop/interscore/community.dat 111
The code is here:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object InterScore extends App {
  val sparkConf = new SparkConf().setAppName("Big-InterScore")
  val sc = new SparkContext(sparkConf)

  val t0 = System.currentTimeMillis
  runInterScore(args(0), args(1), args(2))
  println("Running time " + (System.currentTimeMillis - t0).toDouble / 1000)
  sc.stop()

  def runInterScore(netPath: String, communitiesPath: String, outputPath: String) = {
    val communities = sc.textFile(communitiesPath).map(x => {
      val a = x.split('\t')
      (a(0).toLong, a(1).toInt)
    }).cache

    val graph = GraphLoader.edgeListFile(sc, netPath, true)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .groupEdges(_ + _)
      .joinVertices(communities)((_, _, c) => c)
      .cache

    val lvalues = graph.aggregateMessages[Double](
      m => {
        m.sendToDst(if (m.srcAttr != m.dstAttr) 1 else 0)
        m.sendToSrc(if (m.srcAttr != m.dstAttr) 1 else 0)
      }, _ + _)

    val communitiesIndices = communities.map(x => x._2).distinct.collect
    val verticesWithLValue = graph.vertices.repartition(32).join(lvalues).cache
    println("K = " + communitiesIndices.size)

    graph.unpersist()
    graph.vertices.unpersist()

    communitiesIndices.foreach(c => {
      // COMPUTE c
    })
  }
}
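One knob worth trying before repartitioning the vertex RDD afterwards is asking GraphLoader to split the edge list into more partitions at load time. A minimal sketch, assuming a Spark release where the fourth argument of edgeListFile is the number of edge partitions (numEdgePartitions in recent releases, minEdgePartitions in very old 1.x ones); 32 is chosen only to match the total core count mentioned in the question:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Read the edge list into 32 partitions up front so the edges are spread
// across all workers before any GraphX operation runs.
val graph = GraphLoader.edgeListFile(sc, netPath, true, 32)
  .partitionBy(PartitionStrategy.RandomVertexCut, 32) // keep the same parallelism after repartitioning the edges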

Spark performance is not improving

I am using Zeppelin to read Avro files that are GBs in size and contain billions of records. I have tried with 2 instances and with 7 instances on AWS EMR, but the performance seems equal. With 7 instances it is still taking a lot of time. The code is:
import com.databricks.spark.avro._ // the spark-avro package provides the .avro reader used below

val snowball = spark.read.avro(snowBallUrl + folder + prefix + "*.avro")
val prod = spark.read.avro(prodUrl + folder + prefix + "*.avro")
snowball.persist()
prod.persist()
val snowballCount = snowball.count()
val prodCount = prod.count()
val u = snowball.union(prod)
Output:
snowballCount: Long = 13537690
prodCount: Long = 193885314
The spark.executor.cores setting is 1. If I try to change this number, Zeppelin doesn't work and the Spark context shuts down. It would be great if someone could give a hint on how to improve the performance.
Edit:
I checked how many partitions it created:
snowball.rdd.partitions.size
prod.rdd.partitions.size
u.rdd.partitions.size
res21: Int = 55
res22: Int = 737
res23: Int = 792
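One thing these partition counts show is that the union simply inherits its parents' partitioning (55 + 737 = 792), regardless of the cluster size. A minimal sketch of matching the parallelism to the available cores instead; the instance and core counts below are assumptions for illustration, not values taken from the question:

// Hypothetical figures: substitute the real instance count and cores per executor.
val totalCores = 7 * 16

// A small multiple of the total core count keeps every core busy without
// creating thousands of tiny tasks.
val balanced = u.repartition(totalCores * 2)
balanced.persist()
println(balanced.rdd.partitions.size) // 224 partitions with the figures above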

SparkR - override default parameters in spark.conf

I am using SparkR (Spark 2.0.0, YARN) on a cluster with the following configuration: 5 machines (24 cores + 200 GB RAM each). I wanted to run sparkR.session() with additional arguments to assign only a percentage of the total resources to my job:
if(Sys.getenv("SPARK_HOME") == "") Sys.setenv(SPARK_HOME = "/...")
library(SparkR, lib.loc = file.path(Sys.getenv('SPARK_HOME'), "R", "lib"))
sparkR.session(master = "spark://host:7077",
               appName = "SparkR",
               sparkHome = Sys.getenv("SPARK_HOME"),
               sparkConfig = list(spark.driver.memory = "2g",
                                  spark.executor.memory = "20g",
                                  spark.executor.cores = "4",
                                  spark.executor.instances = "10"),
               enableHiveSupport = TRUE)
The weird thing is that the parameters seem to be passed to the sparkContext, but at the same time I end up with a number of x-core executors that use 100% of the resources (in this example 5 * 24 = 120 cores available; 120 / 4 = 30 executors).
I tried creating another spark-defaults.conf with no default parameters assigned (so the only defaults are those in the Spark documentation, which should be easy to override) by:
if(Sys.getenv("SPARK_CONF_DIR") == "") Sys.setenv(SPARK_CONF_DIR = "/...")
Again, when I look at the Spark UI on http://driver-node:4040, the total number of executors isn't correct ("Executors" tab), but at the same time all the config parameters in the "Environment" tab are exactly the same as those provided in my R script.
Does anyone know what might be the reason? Is the problem with the R API or some infrastructure/cluster-specific issue (like YARN settings)?
I found you have to use the spark.driver.extraJavaOptions, e.g.
spark <- sparkR.session(master = "yarn",
                        sparkConfig = list(
                          spark.driver.memory = "2g",
                          spark.driver.extraJavaOptions =
                            paste("-Dhive.metastore.uris=",
                                  Sys.getenv("HIVE_METASTORE_URIS"),
                                  " -Dspark.executor.instances=",
                                  Sys.getenv("SPARK_EXECUTORS"),
                                  " -Dspark.executor.cores=",
                                  Sys.getenv("SPARK_CORES"),
                                  sep = "")
                        ))
Alternatively, you can change the spark-submit args, e.g.
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn --driver-memory 10g sparkr-shell")

spark master goes down with out of memory exception

I have 1 Spark master and 2 slave nodes set up on AWS with 8 GB memory each. I have set up the Spark job to run every hour. It reads records from a Cassandra database each hour and processes them in Spark. There are around 5000 records every hour. My Spark master crashed in one of the runs saying
"15/12/20 11:04:45 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-4436] shutting down ActorSystem [sparkMaster]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.math.BigInt$.apply(BigInt.scala:82)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:16)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:793)
at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:734)
at org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:712)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$28.apply(Master.scala:445)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$28.apply(Master.scala:445)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1.applyOrElse(Master.scala:445)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.deploy.master.Master.aroundReceive(Master.scala:52)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
"
Can you please let me know why the Spark master crashed with out of memory? I have this setup for Spark:
_executorMemory=6G
_driverMemory=6G
and I create 8 partitions in my code.
Why does the master go down with out of memory?
Here is the code:
// create the spark context
_sparkContext = new SparkContext(_conf)

// load the cassandra table
val tabledf = _sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams")).load
val whereQuery = "addedtime >= '" + _from + "' AND addedtime < '" + _to + "'"
helpers.printnextLine("Where query to run on Cassandra : " + whereQuery)
val rdd = tabledf.filter(whereQuery)
rdd.registerTempTable("rdd")
val selectQuery = "lower(brandname) as brandname, lower(appname) as appname, lower(packname) as packname, lower(assetname) as assetname, eventtime, lower(eventname) as eventname, lower(client.OSName) as platform, lower(eventorigin) as eventorigin, meta.price as price"
val modefiedDF = _sqlContext.sql("select " + selectQuery + " from rdd")

// cache the dataframe
modefiedDF.cache

// perform the groupBy operation (filterrdd is defined elsewhere, not shown)
grprdd = filterrdd.groupBy("brandname", "appname", "packname", "eventname", "platform", "eventorigin", "price").count()

grprdd.foreachPartition { iter =>
  iter.foreach { element =>
    // write each row to the SQL Server table (con and insertQuery come from code not shown)
    val statement = con.createStatement()
    try {
      statement.executeUpdate(insertQuery)
    } finally {
      if (con != null)
        con.close
    }
  }
}

// clear the cache
_sqlContext.clearCache()
The problem may be that you are asking the Spark master to use 6 GB and the Spark executor to use another 6 GB (12 GB in total). However, the system only has 8 GB of RAM available in total.
Of these 8 GB you should also leave some memory for OS processes (say 1 GB). Thus the total RAM available to Spark (master and worker combined) is only about 7 GB.
Set executorMemory and driverMemory accordingly.
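As a rough illustration of that split (the sizes and the app name below are assumptions for 8 GB nodes, not values from the question):

import org.apache.spark.SparkConf

// Hypothetical sizing for 8 GB nodes: ~1 GB left for the OS, the rest split
// between driver and executor instead of 6 GB + 6 GB.
val _conf = new SparkConf()
  .setAppName("hourly-cassandra-job")
  .set("spark.executor.memory", "4g")
  // spark.driver.memory normally has to be set before the driver JVM starts,
  // e.g. via spark-submit --driver-memory 3g, rather than in code.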

Performance issue: Spark SQL spends 0.5 s to process an empty Kafka stream

I'm following the Spark KafkaWordCount.scala example to process a Kafka stream.
To get an easy way to write the calculation logic, I'm also using Spark SQL.
The issue is that each SQL query takes 300-400 ms, even for an empty stream!
When I have a 2-second time window, this cost is too much.
The same logic written in plain Scala code only takes 10-12 ms.
The Spark-SQL version:
def processBySQL(persons: RDD[Person], sqc: SQLContext) = {
  val ts = System.currentTimeMillis
  import sqc.implicits._
  val df = persons.toDF()
  df.registerTempTable("tb_person")
  sqc.cacheTable("tb_person")
  sqc.sql("SELECT count(1), age FROM tb_person GROUP BY age").collect().foreach(println)
  sqc.uncacheTable("tb_person")
  println("[SQL]: " + (System.currentTimeMillis - ts) + " ms") // 300 - 400 ms
}
The Scala version:
def processByCode(persons: RDD[Person]) = {
  val ts = System.currentTimeMillis
  persons.groupBy(_.age)
    .map(group => {
      val (age, items) = group
      val size = items.size
      (items.last.name, items.size)
    }).collect().foreach(println)
  println("[CODE]: " + (System.currentTimeMillis - ts) + " ms") // 10 - 12 ms
}
Full test code here: https://gist.github.com/nonlyli/e247a576b275cd7b3d88
Any idea for this issue?
Update 2015.09.21: Upgrade to spark 1.5.0, test results no big difference.
The SQL and non-SQL versions are not doing the same thing. For instance, in the SQL version you're calling cacheTable.
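To make the comparison fairer, the caching calls could simply be dropped so both paths do only the aggregation. A minimal sketch along those lines, reusing the names from the question (the change itself is only a suggestion):

def processBySQLNoCache(persons: RDD[Person], sqc: SQLContext) = {
  val ts = System.currentTimeMillis
  import sqc.implicits._
  // Same aggregation as before, but without cacheTable/uncacheTable, which
  // trigger extra work that the plain Scala version never performs.
  persons.toDF().registerTempTable("tb_person")
  sqc.sql("SELECT count(1), age FROM tb_person GROUP BY age").collect().foreach(println)
  println("[SQL, no cache]: " + (System.currentTimeMillis - ts) + " ms")
}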
