My Spark job consists of 3 workers, co-located with the data they need to read. I submit an RDD with some metadata and the job's tasks turn that metadata into real data. For instance, the metadata could contain a file to read from the local worker filesystem, and the first stage of the Spark job would read that file into an RDD partition.
In my environment the data may not be present on all 3 workers, and it is far too expensive to read across workers (i.e. if the data is on worker1 then worker2 cannot reach out and fetch it). For this reason I have to force partitions onto the appropriate worker for the data they read. I have a mechanism for achieving this where I check the worker against the expected worker in the metadata and fail the task with a descriptive error message if they don't match. Using blacklisting, I can ensure that the task is rescheduled on a different node until the right one is found. This works fine, but as an optimization I wanted to use preferredLocations to help the tasks get assigned to the right workers initially, without having to go through the try/reschedule process.
I use makeRDD to create my initial RDD (of metadata) with the correct preferredLocations, as per the answer here: How to control preferred locations of RDD partitions?, however it's not exhibiting the behaviour I expect. The call to makeRDD is below:
sc.makeRDD(taskAssignments)
where taskAssignments takes the form:
val taskAssignments = mutable.ArrayBuffer[(String, Seq[String])]()
metadataMappings.foreach { case (k, v) =>
  taskAssignments += (k + ":" + v.mkString(",") -> Seq(idHostnameMappings(k)))
}
idHostnameMappings is just a map of id -> hostname, and I've verified that it contains the correct information.
Given that my test Spark cluster is completely clean, with no other jobs running on it, and there is no skew in the input RDD (it has 3 partitions to match the 3 workers), I would have expected the tasks to be assigned to their preferredLocations. Instead I still see error messages indicating that tasks are going through the fail/reschedule process.
Is my assumption that tasks would be scheduled at their preferredLocations on a clean cluster correct, and is there anything further I can do to force this?
Follow up:
I was also able to create a much simpler test case. My 3 Spark workers are named worker1, worker2 and worker3, and I run the following:
import scala.collection.mutable
val someData = mutable.ArrayBuffer[(String, Seq[String])]()
someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))
val someRdd = sc.makeRDD(someData)
someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
I'd expect to see 1:worker1 etc., but in fact I see
1:worker3
2:worker1
3:worker2
Can anyone explain this behaviour?
It turned out the issue was with my environment, not Spark. Just in case anyone else is experiencing this: the problem was that the Spark workers did not use the machine hostname by default. Setting the following environment variable on each worker rectified it: SPARK_LOCAL_HOSTNAME: "worker1"
Related
Let's say for the operation
val a = 12 + 4, or something simple.
Will it still be distributed by the driver onto the cluster?
Let's say I have a map, say Map[String, String] (very large, say 1,000,000 key-value pairs) (hypothetical assumption).
Now when I do get("something"),
will this be distributed across the cluster to get that value?
If not, then what is the use of Spark if it doesn't compute even simple tasks like these on the cluster?
How does Spark determine the number of tasks, and also the number of jobs?
If there is a stream and some action is performed for each batch, is a new job created for each batch?
Answers:
No, this is still a driver-side computation.
If you create the map in a driver program then it remains on the driver. If you try to access a key, it is simply looked up in the map you created in driver memory and the value is returned to you.
If you create an RDD out of the collection (Reference) and run any transformation on it, then it will be run on the Spark cluster.
The number of partitions usually corresponds to the number of tasks. You can tell Spark explicitly how many partitions you want when you parallelize the collection (like the map in your case); see the sketch after these answers.
Yes, a job will be created for the action performed on each batch.
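To make the points about parallelizing the map and about partitions/tasks concrete, here is a minimal, hypothetical Scala sketch. It assumes an existing SparkContext named sc, and the map contents and partition count are made up for illustration: a plain get on the map runs entirely on the driver, and only once the collection is parallelized does Spark split it into partitions, with one task per partition.
// assumes an existing SparkContext named sc; data and partition count are illustrative
val bigMap: Map[String, String] = Map("k1" -> "v1", "k2" -> "v2")

val onDriver = bigMap.get("k1")              // plain map lookup: runs only on the driver

val rdd = sc.parallelize(bigMap.toSeq, 4)    // now the data lives in 4 partitions on the cluster
println(rdd.getNumPartitions)                // 4 partitions => 4 tasks per stage over this RDD
val matches = rdd.filter { case (k, _) => k == "k1" }.collect()   // transformation runs on executors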
I have a Spark cluster set up with one node and one executor with 8 executor cores. I am trying to use map to call requests.get in parallel. Here is the pseudo code:
from pyspark import SparkContext
import requests

sc = SparkContext()
url_list = ["a.com", "b.com", "c.com", .....]

def request(url):
    page = requests.get(url)
    return page.content

result = sc.parallelize(url_list).map(request).collect()
I am expecting the HTTP requests to happen on the executor in parallel, since I have 8 cores set up in the configuration. However, the requests run sequentially. I get that Spark is not really designed for a use case like this, but can anyone help me understand why this is not running in parallel given the core count? Also, how do I get what I want, which is to run the requests in parallel on the Spark executor, or across different executors?
Try sc.parallelize(url_list, 8).
Without specifying the number of slices, you may be getting only one partition in the RDD, so the map API will launch only one task to process that partition, and request() will therefore be called sequentially for each row of that partition.
You can check to see how many partitions you have with:
rdd = sc.parallelize(url_list)  # or sc.parallelize(url_list, 8)
print(rdd.getNumPartitions())
rdd.map(...)
I'm trying to run a mapToPair function on a JavaPairRDD of about 1.5 million entries. Outside of the call, I have a Java Map that is defined locally. If I access the Map inside the mapToPair function then my program runs out of memory. If I don't access the Map, then it executes successfully, even if I access the Map in the main loop of the code. Any thoughts on why this might be happening? My hypothesis is that accessing the Map inside the anonymous function is causing Spark to duplicate it many times.
I'm running Spark in Local mode with 16 threads. The issue occurs for anything from 16 to 4000 partitions of the data.
Code example:
Working Code:
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    System.out.println(bigLocalMap.size());
    pairRDD = pairRDD.mapToPair(pair -> {
        return pair;
    });
}
Not Working Code:
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    pairRDD = pairRDD.mapToPair(pair -> {
        System.out.println(bigLocalMap.size());
        return pair;
    });
}
How big is bigLocalMap? The way you are referencing it (via a closure) requires it to be serialized and sent to every executor, for every core. Instead, you should pass it around as a broadcast variable.
The general idea is that you can register data that you want to be accessible on all of the executors, and Spark will ensure that the data is efficiently transferred and stored only once per executor. With the closure approach you will end up with duplicates if you have configured executors with multiple cores.
Reference:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
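As a rough illustration of the broadcast idea, here is a minimal Scala sketch (the question's code is Java, but the Java API exposes the same broadcast call on JavaSparkContext). An existing SparkContext named sc is assumed, and bigLocalMap and pairRDD stand in for the structures from the question:
// ship the map once per executor instead of once per task
val bigLocalMapBc = sc.broadcast(bigLocalMap)

val result = pairRDD.map { pair =>
  // read through the broadcast handle rather than capturing bigLocalMap in the closure
  println(bigLocalMapBc.value.size)
  pair
}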
If you are still running out of memory, I would take a look at your memory settings. Some candidates for addressing it (see the sketch after this list) would be to:
Reduce the number of cores per executor (fewer simultaneous tasks using memory)
Increase the number of partitions, either by setting spark.default.parallelism and spark.sql.shuffle.partitions (which will only take effect after the first shuffle) or by explicitly calling repartition. Smaller tasks will have less memory pressure.
If you have the resources, increase the amount of RAM you are giving to your executors with the spark.executor.memory setting
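As referenced in the list above, here is a minimal sketch of where those knobs live, assuming you build your own SparkConf; all values are purely illustrative and need tuning for your workload:
import org.apache.spark.SparkConf

// illustrative values only
val conf = new SparkConf()
  .set("spark.executor.cores", "2")            // fewer simultaneous tasks per executor
  .set("spark.default.parallelism", "200")     // more, smaller partitions for RDD shuffles
  .set("spark.sql.shuffle.partitions", "200")  // the Spark SQL shuffle equivalent
  .set("spark.executor.memory", "8g")          // more RAM per executor, if available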
Normally when creating an RDD from a List you can just use the SparkContext.parallelize method, but you cannot use the Spark context from within a task as it isn't serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and says that I need to set spark.driver.allowMultipleContexts = true. According to the Apache user group, that setting does not yet seem to be supported.
As far as I am concerned it is not possible, and it is hardly a matter of serialization or of support for multiple Spark contexts. The fundamental limitation is the core Spark architecture. Since the Spark context is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from the workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a Spark context from inside a task looks even worse. First of all, it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context, which could operate only on data that is accessible on that worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task should simply be garbage collected after the task finishes.
If handling the problem using multiple jobs is not an option you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)

val tmp = rdd.mapPartitions(iter => {
  val results = Map(
    "odd"  -> scala.collection.mutable.ArrayBuffer.empty[Int],
    "even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
  )
  for (i <- iter) {
    if (i % 2 != 0) results("odd") += i
    else results("even") += i
  }
  Iterator(results)
})

val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))
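One note on this design: odd and even are both derived from tmp, so if you evaluate both it is worth persisting tmp first; otherwise the mapPartitions pass is recomputed for each of them.
tmp.cache()   // avoid recomputing the mapPartitions pass for both odd and even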
I have a two node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing:
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream             // batch of 500 ms as I would like 1 sec latency
val filteredDStream = inDStream.filter         // filtering unwanted tuples
val keyDStream = filteredDStream.map           // converting to a pair DStream
val stateStream = keyDStream.updateStateByKey  // updating state for history
stateStream.checkpoint(Milliseconds(2500))     // to remove long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with the input stream for further processing
val alertStream = withHistory.filter           // decision taken by comparing history state and current tuple data
alertStream.foreach                            // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, and this is causing high latency in the response; my input load is around 100,000 tuples per second.
I have tried the things below but nothing is working:
1) Set spark.locality.wait to 1 sec.
2) Reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it doesn't, even when it goes beyond the memory limit of the first node (m1) where the driver is also running.
3) Increased spark.streaming.concurrentJobs from 1 (default) to 3.
4) Checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000 then Spark is able to utilize the other node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows "Memory Serialized 1x Replicated".
Spark will not distribute stream data across the cluster automatically because it tends to make full use of data locality (launching a task on the node where its data lies is better; this is the default configuration). But you can use repartition to distribute the stream data and improve the parallelism. You can turn to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
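For example, a minimal sketch using the inDStream from the question (the partition count of 8 is illustrative; roughly match it to the total number of cores in your cluster):
// spread each batch across the cluster before the expensive stateful steps
val distributedDStream = inDStream.repartition(8)
// ...then build filteredDStream, keyDStream, etc. on distributedDStream instead of inDStream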
If you're not hitting the cluster and your jobs only run locally, it most likely means that the Spark master in your SparkConf is set to the local URI, not the master URI.
By default the value of the spark.default.parallelism property is set for "local mode", so all the tasks will be executed on the node receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
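As a sketch, the same property can also be set programmatically on the sparkConf from the question before the StreamingContext is created (the value of 8 is illustrative):
// equivalent to a "spark.default.parallelism  8" line in spark-defaults.conf
sparkConf.set("spark.default.parallelism", "8")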