PySpark task not running in parallel - apache-spark

I have a Spark cluster set up with one node and one executor, with 8 executor cores. I am trying to use the map feature to run "requests.get" calls in parallel. Here is the pseudo code:
import requests
from pyspark import SparkContext

sc = SparkContext()
url_list = ["a.com", "b.com", "c.com", .....]

def request(url):
    page = requests.get(url)
    return page.content

result = sc.parallelize(url_list).map(request).collect()
I expect the HTTP requests to happen in parallel on the executor, since I have 8 cores set up in the configuration. However, the requests run sequentially. I get that Spark is not really designed for a use case like this, but can anyone help me understand why this is not running in parallel given the core count? Also, how can I get what I want, which is to run the requests in parallel on a Spark executor or across different executors?

Try sc.parallelize(url_list, 8).
Without specifying the number of slices, you may be getting only one partition in the RDD, so the map API may launch only one task to process that partition, and request() would then be called sequentially for each row of that partition.
You can check to see how many partitions you have with:
rdd = sc.parallelize(url_list)  # or sc.parallelize(url_list, 8)
print(rdd.getNumPartitions())
result = rdd.map(request).collect()

Related

Is every operation you write in a Spark job performed on the Spark cluster?

Let's say I have an operation like
val a = 12 + 4, or something simple.
Will it still be distributed by the driver onto the cluster?
Let's say I have a map, say Map[String,String] (very large, say 1,000,000 key-value pairs, as a hypothetical assumption).
Now when I do a get("something"),
will this be distributed across the cluster to get that value?
If not, then what is the use of Spark if it doesn't compute simple tasks together?
How does Spark determine the number of tasks, and also the number of jobs?
If there is a stream and some action is performed for each batch, is a new job created for each batch?
Answers:
No, this is still driver-side computation.
If you create the map in the driver program, it stays on the driver. If you access a key, the value is simply looked up in the map held in driver memory and returned to you.
If you create an RDD out of the collection (for example with sc.parallelize) and run any transformation on it, that transformation will run on the Spark cluster.
The number of partitions usually corresponds to the number of tasks. You can explicitly specify how many partitions you want when you parallelize the collection (like the map in your case); see the sketch below.
Yes, a job is created for the action performed on each batch.
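To make the driver-versus-cluster distinction concrete, here is a minimal Scala sketch. The map contents, app name, and local[8] master are made up for illustration, not taken from the question:
import org.apache.spark.sql.SparkSession

object DriverVsCluster {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-vs-cluster").master("local[8]").getOrCreate()
    val sc = spark.sparkContext

    // Driver-side only: a plain Scala Map lives in driver memory,
    // so this lookup never touches the executors.
    val lookup: Map[String, String] = Map("k1" -> "v1", "k2" -> "v2")
    println(lookup.get("k1")) // evaluated entirely on the driver

    // Cluster-side: parallelizing the collection distributes it, and
    // transformations on the resulting RDD run as tasks on the executors.
    val rdd = sc.parallelize(lookup.toSeq, numSlices = 8) // 8 partitions -> up to 8 parallel tasks
    val upper = rdd.mapValues(_.toUpperCase)              // transformation, runs on executors
    println(upper.lookup("k1"))                           // action, triggers a job

    spark.stop()
  }
}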

why does a single core of a Spark worker complete each task faster than the rest of the cores in the other workers?

I have three nodes in a cluster, each with a single active core. I mean, I have 3 cores in the cluster.
Assuming that all partitions have almost the same number of records, why does a single core of a worker complete each task faster than the rest of the cores in the other workers?
Please observe this screenshot. The timeline shows that the latency of the worker core (x.x.x.230) is notably shorter than the latencies of the other two worker cores (x.x.x.210 and x.x.x.220).
This means that the workers x.x.x.210 and x.x.x.220 are doing the same job in a longer time compared to the worker x.x.x.230. This also happens when all the available cores in the cluster are used, but the delay is not so critical.
I submitted this application again. Look at this new screenshot. Now the fastest worker is x.x.x.210. Observe that tasks 0, 1 and 2 process partitions with almost the same number of records. This execution time discrepancy is not good, is it?
I don't understand!!!
What I'm really doing is creating a DataFrame and doing a mapping operation to get a new DataFrame, saving the result in a Parquet file.
val input: DataFrame = spark.read.parquet(...)
val result: DataFrame = input.map(row => /* ...operations... */)
result.write.parquet(...)
Any idea why this happens? Is that how Spark operates normally?
Thanks in advance.

ML tasks not running in parallel

EDITED FOR MORE DETAILS
I am testing parameters for classifiers, and have about 3k parameter combinations for SVM and 3k for MLP. Therefore, I want to run tests in parallel to streamline the results.
I have a server with 24 cores/48 threads and 128 GB of RAM. In order to run various jobs in parallel, I have already tried using multiple workers and multiple executors. I have even used GNU parallel. But I always get sequential results, i.e., in my results folder I can see that only one classifier is outputting results, and the time it takes to produce results matches a sequential profile.
I tried submitting the same jar multiple times (using spark-submit), each testing different parameters; I tried generating all 6k different command combinations to a file and then passing it to GNU parallel, and nothing worked. Except for adding more executors and changing the resources available per executor, I use the standard Spark settings as per the download of Spark pre-built with Hadoop.
As I read in the documentation, each execution should use the same Spark context. Is this correct?
Why don't I get parallelism in my tests? What is the ideal combination of workers, executors and resources for each?
EDIT 2: changed the scheduler to FAIR; will post results about this change
EDIT 3: the FAIR scheduler made no difference
PSEUDO-CODE
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("foo").getOrCreate()
  val sparkContext = spark.sparkContext
  // read properties
  // initiate classifier
  // cross-validation procedures
  // write results
  spark.stop()
}
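For what it's worth, here is a minimal sketch of one way to get concurrent jobs out of a single SparkSession, combining the FAIR scheduler mentioned in EDIT 2 with Scala Futures. The parameter grid and the trainAndEvaluate routine are placeholders for illustration, not the actual classifier code:
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelParamSearch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("param-search")
      .config("spark.scheduler.mode", "FAIR") // allow concurrent jobs to share the executors
      .getOrCreate()

    // Placeholder parameter grid; the real one would hold ~6k combinations.
    val paramGrid: Seq[Double] = Seq(0.01, 0.1, 1.0, 10.0)

    // Each Future submits its own Spark job from the same SparkSession; with
    // enough free cores and FAIR scheduling, their tasks can run at the same time.
    val runs: Seq[Future[(Double, Double)]] = paramGrid.map { param =>
      Future {
        val score = trainAndEvaluate(spark, param) // stand-in for "fit a classifier and evaluate it"
        (param, score)
      }
    }

    val results = Await.result(Future.sequence(runs), Duration.Inf)
    results.foreach { case (p, s) => println(s"param=$p -> score=$s") }
    spark.stop()
  }

  // Hypothetical training routine: any action it triggers becomes a Spark job.
  def trainAndEvaluate(spark: SparkSession, param: Double): Double = {
    val divisor = math.max(1, (param * 10).toInt)
    spark.range(0, 1000000).filter(s"id % $divisor = 0").count().toDouble
  }
}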

How to Dynamically Increase Active Tasks in Spark running on Yarn

I am running a Spark Streaming process where I get batches of 6000 events. But when I look at the executors, only one active task is running. I tried dynamic allocation as well as setting the number of executors, etc. Even if I have 15 executors, only one active task is running at a time. Can anyone please tell me what I am doing wrong here?
It looks like you have only one partition in your DStream. You should try to explicitly repartition your input stream:
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = 16)
This way you would have 16 partitions in your input DStream, and each of those partitions could be processed in a separate task (and each of those tasks could be executed on a separate executor).
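As a fuller illustration, here is a minimal self-contained sketch of the same idea, assuming a socket text stream as the source; the host, port, batch interval, and per-record work are made up for the example:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object RepartitionedStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartitioned-stream")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches, arbitrary for the example

    // Example source; a Kafka or Kinesis stream would be handled the same way.
    val input: DStream[String] = ssc.socketTextStream("localhost", 9999)

    // Spread each batch across 16 partitions so up to 16 tasks
    // (potentially on different executors) can work on it concurrently.
    val partitionedInput = input.repartition(numPartitions = 16)

    partitionedInput
      .map(line => line.length) // per-record work now runs in parallel
      .foreachRDD(rdd => println(s"processed ${rdd.count()} records in this batch"))

    ssc.start()
    ssc.awaitTermination()
  }
}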

how to make two Spark RDDs run in parallel

For example, I created two RDDs in my code as follows:
import org.elasticsearch.spark._ // provides sc.esRDD (elasticsearch-hadoop)

val rdd1 = sc.esRDD("userIndex1/type1")
val rdd2 = sc.esRDD("userIndex2/type2")
val rdd3 = rdd1.join(rdd2)
rdd3.foreachPartition{....}
I found they were executed serially. Why doesn't Spark run them in parallel?
The reason for my question is that the network is very slow: generating rdd1 takes 1 hour and generating rdd2 takes 1 hour as well. So I am asking why Spark doesn't generate the two RDDs at the same time.
Spark provides asynchronous actions to run jobs asynchronously, which may help in your use case to run the computations in parallel and concurrently. By default only one RDD at a time is computed on the Spark cluster, but you can make the actions asynchronous. You can check the Java docs for this API here: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/rdd/AsyncRDDActions.html
There is also a blog post about it here: https://blog.knoldus.com/2015/10/21/demystifying-asynchronous-actions-in-spark/
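For illustration, here is a minimal sketch of that asynchronous-action approach using countAsync from AsyncRDDActions. The esRDD calls are replaced with plain parallelized collections, since the Elasticsearch setup is not shown in the question:
import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.duration.Duration

object AsyncActionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("async-actions").getOrCreate()
    val sc = spark.sparkContext

    // Stand-ins for the two slow source RDDs (rdd1 and rdd2 in the question).
    val rdd1 = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val rdd2 = sc.parallelize(1 to 1000000).map(i => (i % 1000, i * 2))

    // countAsync returns a FutureAction, so both jobs are submitted right away
    // instead of the second one waiting for the first to finish.
    val f1 = rdd1.countAsync()
    val f2 = rdd2.countAsync()

    // Block only after both jobs have been submitted.
    val c1 = Await.result(f1, Duration.Inf)
    val c2 = Await.result(f2, Duration.Inf)
    println(s"rdd1: $c1 rows, rdd2: $c2 rows")

    spark.stop()
  }
}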
I have found similar behavior. Running the RDDs either serially or in parallel doesn't make any difference by itself; what matters is the number of executors and executor cores you set in your spark-submit.
Let's say we have 2 RDDs as you mentioned above, and each RDD takes 1 hour with 1 executor and 1 core. With 1 executor and 1 core in the Spark config, we cannot increase the performance even if Spark runs both RDDs in parallel, unless you increase the executors and cores.
So, running two RDDs in parallel is not, on its own, going to increase the performance.
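To make that last point concrete, here is a small sketch of how one might raise those settings when building the session; the values are arbitrary, and spark.executor.instances is only honored on resource managers such as YARN or Kubernetes:
import org.apache.spark.sql.SparkSession

object TwoRddsWithMoreCores {
  def main(args: Array[String]): Unit = {
    // Ask for 2 executors with 4 cores each so that two concurrent jobs
    // actually have separate cores to run their tasks on.
    val spark = SparkSession.builder()
      .appName("two-rdds-in-parallel")
      .config("spark.executor.instances", "2")
      .config("spark.executor.cores", "4")
      .getOrCreate()

    // ... build the two RDDs and submit their jobs here ...

    spark.stop()
  }
}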
