Spark pipe() function utilizes only 50% CPU - apache-spark

I have legacy C++ code that takes an HDFS file path as input, runs, and writes its output to the local HDD.
Here is how I call it:
val trainingRDD = pathsRdd.pipe(
  command = commandSeq,
  env = Map(),
  printPipeContext = _ => (),
  printRDDElement = (kV, printFn) => {
    val hdfsPath = kV._2
    printFn(hdfsPath)
  },
  separateWorkingDir = false)
I see CPU utilization of around 50% on Ganglia. The spark.task.cpus setting is 1, so each task gets one core. My question is: when I call the binary with pipe, does that binary get all the cores available on the host, just like any other executable, or is it restricted to the number of cores its pipe task has? So far, increasing spark.task.cpus to 2 has not increased usage.

Spark reads files in parallel, and parallelism is governed by partitioning: more partitions means more tasks can run at once. If you want to increase CPU utilization when running locally, you can configure your SparkContext to use all available cores:
val conf = new SparkConf().setMaster("local[*]")
val sc = new SparkContext(conf)
More details: https://spark.apache.org/docs/latest/configuration.html
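If the job runs on a cluster rather than in local mode, the same idea applies at the RDD level: the piped binary is launched once per task, so more input partitions means more concurrent copies of the binary. A minimal sketch, assuming pathsRdd currently has fewer partitions than you have cores (the factor of 4 is an arbitrary starting point, not a recommendation):
// Sketch only: spread the HDFS paths over more partitions so that more pipe
// tasks (and hence more copies of the external binary) run concurrently.
val targetPartitions = sc.defaultParallelism * 4
val widerPathsRdd = pathsRdd.repartition(targetPartitions)

val trainingRDD = widerPathsRdd.pipe(
  command = commandSeq,
  env = Map(),
  printPipeContext = _ => (),
  printRDDElement = (kV, printFn) => printFn(kV._2),
  separateWorkingDir = false)
Note that spark.task.cpus only tells the scheduler how many cores to reserve per task; it does not limit how many threads the external binary itself spawns.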

Related

Number of tasks in Apache Spark when writing into HDFS

I am trying to read a CSV file, add some columns, and then save the result in ORC format.
I could not understand how Spark decides the number of tasks for different stages.
Why is the number of tasks 1 for the CSV stage and 39 for the ORC stage?
val c1c8 = spark.read.option("header", true).csv("/user/DEEPAK_TEST/C1C6_NEW/")
val c1c8new = {
  c1c8.withColumnRenamed("c1c6_F","c1c8").withColumnRenamed("Network_Out","c1c8_network").withColumnRenamed("Access NE Out","c1c8_access_ne")
    .withColumn("c1c8_signalling", when(col("signalling_Out") === "SIP Cl4", "SIP CL4").when(col("signalling_Out") === "SIP cl4", "SIP CL4").when(col("signalling_Out") === "Other", "other").otherwise(col("signalling_Out")))
    .withColumnRenamed("access type Out","c1c8_access_type").withColumnRenamed("Type_of_traffic_C","c1c8_typeoftraffic")
    .withColumnRenamed("BOS traffic type Out","c1c8_bos_trafc_typ").withColumnRenamed("Scope_Out","c1c8_scope")
    .withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out","c1c8_join_indicator")
    .select("c1c8", "c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type", "c1c8_typeoftraffic",
      "c1c8_bos_trafc_typ", "c1c8_scope", "c1c8_join_indicator")
}
c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at the Spark 2.x source code.
Stage 0 is a file scan that creates a FileScanRDD, an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 equals the number of RDD partitions; in your case, c1c8new.rdd.getNumPartitions will be 39. This number is calculated from:
the config value spark.sql.files.maxPartitionBytes (128 MB by default)
sparkContext.defaultParallelism, as returned by the task scheduler (equal to the number of cores when running in local mode)
totalBytes, the total size of the selected input files
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
  fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
  s"open cost is considered as scanning $openCostInBytes bytes.")
You can see the actual calculated values in the above log message if you set the log level to INFO: spark.sparkContext.setLogLevel("INFO")
In your case, I think the split size is 128 MB, so the number of tasks/partitions is roughly 4.6 GB / 128 MB ≈ 37; the per-file open cost added to totalBytes accounts for the remaining difference from the 39 you observe.
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks because inserting repartition() or coalesce() between stages adds extra overhead.
For large Spark SQL workloads, setting an optimum value for spark.sql.shuffle.partitions for each stage has always been a pain point. Spark 3.x handles this better when Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.
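To make that concrete, here is a minimal sketch of how those settings might look on the SparkSession; the value 400 is only a placeholder, and the adaptive options assume Spark 3.x:
// Sketch only: fix the number of post-shuffle tasks explicitly
// (400 is a placeholder, not a recommendation for this workload)
spark.conf.set("spark.sql.shuffle.partitions", "400")

// ...or, on Spark 3.x, let Adaptive Query Execution coalesce shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// To control the partition count of this particular DataFrame right before writing:
// c1c8new.repartition(40).write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")  // 40 is arbitrary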

In Apache Spark, how to make a task always execute on the same machine?

In its simplest form, an RDD is merely a placeholder for chained computations that can be arbitrarily scheduled for execution on any machine:
val src = sc.parallelize(0 to 1000)
val rdd = src.mapPartitions { itr =>
  Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
  val vs = rdd.collect()
  println(vs.mkString)
}
/* yielding:
1230123012301230
0321032103210321
2130213021302130
*/
This behaviour can obviously be overridden by persisting any of the upstream RDDs, so that the Spark scheduler will minimise redundant computation:
val src = sc.parallelize(0 to 1000)
src.persist()
val rdd = src.mapPartitions { itr =>
  Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
  val vs = rdd.collect()
  println(vs.mkString)
}
/* yielding:
2013201320132013
2013201320132013
2013201320132013
each partition has a fixed executorID
*/
Now my problem is:
I don't like the vanilla caching mechanism (see this post: In Apache Spark, can I incrementally cache an RDD partition?) and have written my own (by implementing a new RDD). Since the new caching mechanism can only read existing values from local disk/memory, my cache for a partition will be missed whenever that partition is executed in a task on another machine.
So my question is:
How do I mimic the Spark RDD persistence implementation to ask the DAG scheduler to enforce/suggest locality-aware task scheduling, without actually calling the .persist() method, which is unnecessary here?
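One angle, offered as a sketch rather than a confirmed answer: the scheduler takes locality preferences from RDD.getPreferredLocations, so a custom RDD can report the host(s) where its cached data lives. The class below and its preferredHosts map are hypothetical bookkeeping; only the overridden hooks are standard Spark API, and they are hints, not hard constraints.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper RDD: it computes nothing new itself, but tells the DAG
// scheduler which hosts hold the locally cached data for each partition.
class LocalityHintRDD[T: ClassTag](
    prev: RDD[T],
    preferredHosts: Map[Int, Seq[String]]  // partition index -> host names (your own bookkeeping)
  ) extends RDD[T](prev) {

  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  // Locality preference only; Spark may still schedule elsewhere.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    preferredHosts.getOrElse(split.index, Nil)

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)
}
Whether tasks actually stay put still depends on delay scheduling (spark.locality.wait) and on the executors staying alive, so this suggests locality to the scheduler rather than enforcing it.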

Running large dataset causes timeout

I am building a Spark application that is relatively simple. Generally, the logic looks like this:
val file1 = sc.textFile("s3://file1/*")
val file2 = sc.textFile("s3://file2/*")
// map over files
val file1Map = file1.map(word => (word, "val1"))
val file2Map = file2.map(differentword => (differentword, "val2"))
val unionRdd = file1Map.union(file2Map)
val groupedUnion = unionRdd.groupByKey()
val output = groupedUnion.map(tuple => {
  // do something that requires all the values, return new object
  if (oneThingIsTrue) tuple._1 else "null"
}).filter(line => line != "null")
output.saveAsTextFile("s3://newfile/")
The question has to do with this not working when I run it with larger datasets. I can run it without errors when the dataset is around 700 GB. When I double it to 1.6 TB, the job gets halfway before timing out. Here is the error log:
INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker#172.31.4.36:39743)
ERROR MapOutputTrackerWorker: Error communicating with MapOutputTracker
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.network.timeout
I have tried increasing the network timeout to 800 seconds and then 1600 seconds, but all that does is delay the error. I am running the code on 10 r4.2xlarge instances, which have 8 cores and 62 GB of RAM each. I have EBS set up with 3 TB of storage. I am running this code via Zeppelin on Amazon EMR.
Can anyone help me debug this? The CPU usage of the cluster stays close to 90% until the job gets halfway, then drops to 0 completely. The other interesting thing is that it looks like it fails in the second stage, when it is shuffling. As you can see from the trace, it does the fetch and never gets it.
Here is a photo from Ganglia.
I'm still not sure what caused this, but I was able to get around it by coalescing the unionRdd and then grouping that result. Changing the above code to:
...
// union rdd is 30k partitions, coalesce into 8k
val unionRdd = file1Map.union(file2Map)
val col = unionRdd.coalesce(8000)
val groupedUnion = col.groupByKey()
...
It might not be efficient, but it works.
Replace groupByKey with reduceByKey, aggregateByKey, or combineByKey.
groupByKey must bring all values for a given key onto the same worker, which can cause an out-of-memory error. I'm not sure why there isn't a warning about using this function.
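As an illustration only (the real per-key logic is hidden behind the "do something that requires all the values" comment above, so the set-of-source-tags reduction here is an assumption), the same union could be combined without shipping every value for a key to one worker:
// Sketch: replace groupByKey with aggregateByKey so values are pre-combined on
// the map side. Assumes the final decision only needs which sources ("val1",
// "val2") a key appeared in, not every individual value.
val unionRdd = file1Map.union(file2Map)

val tagsByKey = unionRdd.aggregateByKey(Set.empty[String])(
  (tags, v) => tags + v,  // fold a value into the per-partition set
  (a, b) => a ++ b        // merge sets across partitions
)

val output = tagsByKey
  .collect { case (key, tags) if tags.size == 2 => key }  // hypothetical stand-in for "oneThingIsTrue"

output.saveAsTextFile("s3://newfile/")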

Method to get the number of cores for an executor on a task node?

E.g. I need to get a list of all available executors and their respective multithreading capacity (NOT the total multithreading capacity; sc.defaultParallelism already handles that).
Since this parameter is implementation-dependent (YARN and Spark standalone have different strategies for allocating cores) and situational (it may fluctuate because of dynamic allocation and long-running jobs), I cannot use other methods to estimate it. Is there a way to retrieve this information using the Spark API in a distributed transformation (e.g. TaskContext, SparkEnv)?
UPDATE: As of Spark 1.6, I have tried the following methods:
1) Run a single-stage job with many partitions ( >> defaultParallelism ) and count the number of distinct thread IDs per executor ID:
val n = sc.defaultParallelism * 16
sc.parallelize(1 to n, n).map(v => SparkEnv.get.executorId -> Thread.currentThread().getId)
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
This, however, leads to an estimate higher than the actual multithreading capacity, because each Spark executor uses an over-provisioned thread pool.
2) Similar to 1, except that n = defaultParallelism, and in every task I add a delay to prevent the resource negotiator from imbalanced sharding (a fast node completes its tasks and asks for more before slow nodes can start running):
val n = sc.defaultParallelism
sc.parallelize(1 to n, n)
  .map { v =>
    Thread.sleep(5000)
    SparkEnv.get.executorId -> Thread.currentThread().getId
  }
  .groupByKey()
  .mapValues(_.toSeq.distinct)
  .collect()
It works most of the time, but it is much slower than necessary and may be broken by a very imbalanced cluster or by task speculation.
3) I haven't tried this: use Java reflection to read BlockManager.numUsableCores. This is obviously not a stable solution; the internal implementation may change at any time.
Please tell me if you have found something better.
It is pretty easy with the Spark REST API. You have to get the application ID:
val applicationId = spark.sparkContext.applicationId
ui URL:
val baseUrl = spark.sparkContext.uiWebUrl
and query:
val url = baseUrl.map { url =>
s"${url}/api/v1/applications/${applicationId}/executors"
}
With the Apache HTTP library (already in Spark's dependencies; adapted from https://alvinalexander.com/scala/scala-rest-client-apache-httpclient-restful-clients):
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.client.methods.HttpGet
import scala.util.Try

val client = new DefaultHttpClient()
val response = url
  .flatMap(url => Try { client.execute(new HttpGet(url)) }.toOption)
  .flatMap(response => Try {
    val s = response.getEntity().getContent()
    val json = scala.io.Source.fromInputStream(s).getLines.mkString
    s.close
    json
  }.toOption)
and json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
case class ExecutorInfo(hostPort: String, totalCores: Int)
val executors: Option[List[ExecutorInfo]] = response.flatMap(json => Try {
  parse(json).extract[List[ExecutorInfo]]
}.toOption)
As long as you keep the application ID and UI URL at hand, and open the UI port to external connections, you can do the same thing from any task.
I would try to implement a SparkListener, in a way similar to what the web UI does. This code might be helpful as an example.
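A minimal sketch of that idea (the listener class name is mine; onExecutorAdded/onExecutorRemoved and ExecutorInfo.totalCores are part of the public listener API):
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Tracks per-executor core counts on the driver as executors come and go.
// Note: only executors registered after the listener is added will appear.
class ExecutorCoresListener extends SparkListener {
  val coresByExecutor = TrieMap.empty[String, Int]

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    coresByExecutor(event.executorId) = event.executorInfo.totalCores

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    coresByExecutor.remove(event.executorId)
}

// Register it on the driver, then read the map whenever you need a snapshot:
// val listener = new ExecutorCoresListener
// sc.addSparkListener(listener)  // addSparkListener is marked DeveloperApi
// listener.coresByExecutor.toMap
This runs on the driver, so unlike the REST approach it is not directly usable inside a distributed transformation unless you broadcast the snapshot yourself.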

PySpark Block Matrix multiplication fails with OOM

I'm trying to do a matrix multiplication chain of size 67584 x 67584 using PySpark, but it constantly runs out of memory (OOM errors). Here are the details:
The input is a MATLAB file (.mat) that holds the matrix in a single file. I load the file using scipy's loadmat, split it into multiple files of block size (1024 x 1024), and store them back in .mat format.
A mapper then loads each file from a file list and creates an RDD of blocks.
filelist = sc.textFile(BLOCKS_DIR + 'filelist.txt',minPartitions=200)
blocks_rdd = filelist.map(MapperLoadBlocksFromMatFile).cache()
MapperLoadBlocksFromMatFile is defined as follows:
import numpy as np
from scipy import sparse
from scipy.io import loadmat
from pyspark.mllib.linalg import Matrices

def MapperLoadBlocksFromMatFile(filename):
    data = loadmat(filename)
    G = data['G']
    id = data['block_id'].flatten()
    n = G.shape[0]
    if not isinstance(G, sparse.csc_matrix):
        sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
    else:
        sub_matrix = Matrices.dense(n, n, np.array(G.todense()).transpose().flatten())
    return ((id[0], id[1]), sub_matrix)
Now, once I have this RDD, I create a BlockMatrix from it and do a matrix multiplication with it.
adjacency_mat = BlockMatrix(blocks_rdd, block_size, block_size, adj_mat.shape[0], adj_mat.shape[1])
I'm using the multiply method from BlockMatrix implementation and it runs out of memory every single time.
Result = adjacency_mat.multiply(adjacency_mat)
Below are the cluster configuration details:
50 nodes with 64 GB of memory and 20-core processors each.
worker -> 60 GB and 16 cores
executors -> 15 GB and 4 cores each
driver.memory -> 60 GB and maxResultSize -> 10 GB
I even tried rdd.compress. In spite of having enough memory and cores, I run out of memory every time. Each time a different node runs out of memory, and I don't have the option of using VisualVM on the cluster. What am I doing wrong? Is the way the BlockMatrix is created wrong, or am I not accounting for enough memory?
OOM Error Stacktrace
