ML tasks not running in parallel - apache-spark

EDITED FOR MORE DETAILS
I am testing parameters for classifiers and have about 3k parameter combinations for SVM and 3k for MLP. I therefore want to run the tests in parallel to get through the results faster.
I have a server with 24 cores/48 threads and 128 GB of RAM. To run several jobs in parallel I have already tried multiple workers and multiple executors, and I have even used GNU parallel. But I always get sequential behaviour: in my results folder I can see that only one classifier is writing output at a time, and the time it takes to produce results matches a sequential profile.
I tried submitting the same jar multiple times (via spark-submit), each submission testing different parameters; I also tried writing all 6k command combinations to a file and passing it to GNU parallel, and nothing. Apart from adding more executors and changing the resources available per executor, I use the standard Spark settings as shipped with the Spark pre-built-for-Hadoop download.
From what I read in the documentation, each execution should use the same SparkContext. Is this correct?
Why don't my tests run in parallel? What is the ideal combination of workers, executors, and resources per executor?
EDIT 2: changed the scheduler to FAIR; will post results about this change
EDIT 3: the FAIR scheduler made no difference
PSEUDO-CODE
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("foo").getOrCreate()
  val sparkContext = spark.sparkContext
  // read properties
  // initiate classifier
  // cross-validation procedures
  // write results
  spark.stop()
}
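For reference, the kind of in-application parallelism discussed in the answers below (many jobs submitted to one SparkSession from driver threads, combined with the FAIR scheduler from EDIT 2) would look roughly like the following. This is only a sketch: paramCombinations and runTest are hypothetical stand-ins for the parameter grid and the cross-validation code above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("foo")
  .config("spark.scheduler.mode", "FAIR") // FAIR scheduling, as in EDIT 2
  .getOrCreate()

// Hypothetical stand-ins: the ~6k parameter settings and the per-combination test.
val paramCombinations: Seq[Map[String, Double]] = ???
def runTest(spark: SparkSession, params: Map[String, Double]): Unit = ???

// .par runs the loop body on a pool of driver threads; each thread submits
// its own Spark jobs to the shared context, so they can run concurrently.
paramCombinations.par.foreach { params =>
  runTest(spark, params)
}
spark.stop()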

Related

PySpark task not running in parallel

I have a Spark cluster set up with one node and one executor with 8 executor cores. I am trying to use the map feature to run "requests.get" in parallel. Here is the pseudo code:
sc = SparkContext()
url_list = ["a.com", "b.com", "c.com", .....]

def request(url):
    page = requests.get(url)
    return page.content

result = sc.parallelize(url_list).map(request).collect()
I am expecting the HTTP requests to happen on the executor in parallel, since I have 8 cores set up in the configuration. However, the requests run in sequence. I understand Spark is not really designed for a use case like this, but can anyone help me understand why this is not running in parallel given the core count? Also, how do I get what I want, which is to run the requests in parallel on the Spark executor, or across different executors?
Try sc.parallelize(url_list, 8).
Without specifying the number of slices, you may be getting only one partition in the RDD, so the map API may launch only one task to process that partition, and request() will be called in sequence for each row of that partition.
You can check to see how many partitions you have with:
rdd = sc.parallelize(url_list)  # or sc.parallelize(url_list, 8)
print(rdd.getNumPartitions())
rdd.map(...

How can I parallelize multiple Datasets in Spark?

I have a Spark 2.1 job where I maintain multiple Dataset objects/RDDs that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I simply iterate over the List of Datasets, they execute one at a time. Each individual query operates in parallel, but I feel we are not maximizing our resources by not also running the different datasets in parallel.
There doesn't seem to be a lot out there regarding doing this, as most questions appear to be around parallelizing a single RDD or Dataset, not parallelizing multiple within the same job.
Is this inadvisable for some reason? Can I just use an executor service, thread pool, or futures to do this?
Thanks!
Yes, you can use multithreading in the driver code, but normally this does not increase performance unless your queries operate on very skewed data and/or cannot be parallelized well enough to fully utilize the resources.
You can do something like this:
val datasets: Seq[Dataset[_]] = ???
datasets
  .par // convert to a parallel collection
  .foreach(ds => ds.write.saveAsTable(...))

Using Spark, how do I read multiple files in parallel from different folders in HDFS?

I have 3 folders in HDFS containing CSV files with 3 different schemas. All 3 files are huge (several GBs). I want to read the files in parallel and process the rows in them in parallel. How do I accomplish this on a YARN cluster using Spark?
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action -- voilà, if you have enough resources in the cluster, you'll have all files being read in parallel. At worst, Spark's job scheduler will shuffle the execution of certain tasks around to minimize wait times.
If you don't have enough workers/executors you won't gain much, but if you do, you can fully exploit those resources without having to wait for each job to finish before sending out the next.
Due to lazy evaluation this may happen anyway, depending on how you work with the data -- but you can force parallel execution of several actions/jobs by using parallel collections or Futures.
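A minimal sketch of that approach, assuming Scala on Spark 2.x, CSV input with a header row, and three hypothetical folder paths (the HDFS-client listing is replaced by a hard-coded Seq):
val folders = Seq("hdfs:///data/schema_a", "hdfs:///data/schema_b", "hdfs:///data/schema_c")

// .par turns the Seq into a parallel collection, so each read + write
// is submitted from its own driver thread and the jobs overlap.
folders.par.foreach { path =>
  val df = spark.read.option("header", "true").csv(path)
  df.write.mode("overwrite").parquet(path + "_out") // the write is the action that triggers the read
}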
If you want to process all the data separately, you can always write 3 Spark jobs to process them separately and execute them on the cluster in parallel. There are several ways to run all 3 jobs in parallel. The most straightforward is to have an Oozie workflow with 3 parallel sub-workflows.
If you want to process the 3 datasets in the same job, you need to read them sequentially. After that you can process them. When you process multiple datasets using Spark operations, Spark parallelizes the work for you: the closure of each operation is shipped to the executors and everything runs in parallel.
What do you mean by "read the files in parallel and process the rows in them in parallel"? Spark already processes your data in parallel according to your application configuration (num-executors, executor-cores, ...).
If you mean 'start reading the files at the same time and process them simultaneously', I'm pretty sure you can't get that explicitly. It would require the ability to affect the DAG of your application, but as far as I know the only way to do it is implicitly, by building your data processing as a sequence of transformations/actions.
Spark is also designed so that it can execute several stages simultaneously "out of the box", if your resource allocation allows.
I encountered a similar situation recently.
You can pass a list of file paths to the Spark read API, e.g. spark.read.json(input_file_paths) (source). This loads all the files into a single dataframe, and all the transformations eventually performed on it are done in parallel by multiple executors, depending on your Spark config.
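For completeness, a sketch of that multiple-paths variant; the paths and the header option are assumptions, and it only makes sense when the files share a compatible schema (with three different schemas, issue three separate reads as in the sketch further above):
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/part1.csv", "hdfs:///data/part2.csv") // spark.read.csv accepts several paths
println(df.count()) // any action; the read itself is distributed across the executors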

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run k-means on EC2 using Spark's MLlib KMeans. As I was reading through the tutorial I found the following code snippet at
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do the nodes communicate the partial result of each iteration? Is this handled inside the KMeans.train code, or does Spark core take care of it automatically?
Spark divides the data into many partitions. For example, if you read a file from HDFS, the partitions will match the partitioning of the data in HDFS. You can manually specify the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by, e.g., a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. The creation of partitions is hidden in the RDD.getPartitions methods.
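A small illustration of inspecting and changing the partitioning; the HDFS path and the partition count are assumptions:
val rdd = sc.textFile("hdfs:///data/points.txt") // partition count follows the HDFS block layout
println(rdd.getNumPartitions)                    // how many tasks a map over this RDD would launch
val repartitioned = rdd.repartition(48)          // manually raise (or lower) the partition count
println(repartitioned.getNumPartitions)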
Resource scheduling depends on the cluster manager. One could write a very long post about them ;) I think that for this question the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data - launching the closures on each partition.
Currently, after each iteration the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is adding the ability to create multiple jobs at once, so that data won't be sent back to the driver: the DAG is created once and, e.g., a job with 10 iterations is scheduled, with the shuffles between them limited or removed entirely. Currently, the DAG is created in each iteration and a new job is scheduled to the executors.
There is a very helpful presentation about resource scheduling in Spark and about Spark Drizzle.

What factors affect how many Spark jobs run concurrently

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found that our 20-node (8 cores/128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other operations that work on more than one RDD (e.g. union) may also change the number of partitions.
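As a rough illustration of matching partitions to the cluster (reusing the 8*20 = 160 figure from above; the paths are hypothetical):
val totalCores = 8 * 20 // maximum number of concurrent tasks in this cluster
val df = spark.read.parquet("hdfs:///some/input")
// Each partition becomes one task, so fewer partitions than cores leaves slots idle.
val tuned = if (df.rdd.getNumPartitions < totalCores) df.repartition(totalCores) else df
tuned.write.parquet("hdfs:///some/output")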
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data it will only spawn 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.
