Ad hoc multithreading and Spark multithreading

I have a data processing pipeline consisting of 3 methods (let's say A(), B(), C(), run sequentially) for an input text file, and I have to repeat this pipeline for 10000 different files. So far I have used ad hoc multithreading: create 10000 threads and add them to a thread pool. Now I want to switch to Spark to achieve this parallelism. My questions are:
If Spark can do a better job, please guide me through the basic steps, because I'm new to Spark.
If I stick with ad hoc multithreading and deploy it on a cluster, how can I manage resources so the threads are allocated evenly across nodes? I'm new to HPC systems too.
I hope I'm asking the right questions, thanks!
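Assuming Spark is a good fit, a minimal Scala sketch of the idea (the input/output paths are placeholders, and A(), B(), C() here are stand-ins for the real methods) could look like this:

import org.apache.spark.sql.SparkSession

object PipelineOnSpark {
  // Placeholder stand-ins for the asker's A(), B(), C() steps.
  def A(text: String): String = text
  def B(text: String): String = text
  def C(text: String): String = text

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PipelineOnSpark").getOrCreate()
    val sc = spark.sparkContext

    // One record per input file: (path, full file content).
    val files = sc.wholeTextFiles("hdfs:///input/dir", 100)

    // Spark distributes the 10000 files across executors; each file still goes
    // through A -> B -> C sequentially, exactly as in the single-machine pipeline.
    val results = files.mapValues(content => C(B(A(content))))

    results.saveAsTextFile("hdfs:///output/dir")
    spark.stop()
  }
}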

Related

How to use pipeline for Apache Spark jobs?

I'm learning how to use Kubeflow Pipelines for Apache Spark jobs and have a question. I'd appreciate it if you could share your thoughts!
It is my understanding that data cannot be shared between SparkSessions, and that in each pipeline step/component you need to instantiate a new SparkSession (please correct me if I'm wrong). Does that mean that in order to use the output of Spark jobs from previous pipeline steps, we need to save it somewhere? I suspect this will cause a disk read/write burden and slow down the whole process. Can you please share with me how helpful it is, then, to use a pipeline for Spark work?
I'm imagining a potential use case where one would like to ingest data in pyspark, preprocess it, select features for an ML job, then try different ML models and select the best one. In a non-Spark situation, I probably would set up separate components for each step of "loading data", "preprocessing data", and "feature engineering". Due to the aforementioned issue, however, would it be better to complete all of these within one step in the pipeline, save the output somewhere, and then dedicate separate pipeline components to each model and train them in parallel?
Can you share any other potential use cases? Thanks a lot in advance!
Spark is, in general, an in-memory processing framework, so you want to avoid unnecessary writing/reading of files. I believe it's better to have one Spark job done in one task so you don't need to share a SparkSession and the "middle" results between tasks. The outputs of loading data/pre-processing/feature engineering are better serialised/stored somewhere, with or without Kubeflow, anyway (think bronze/silver/gold layers).
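A minimal sketch of that pattern, assuming two pipeline components and a shared storage location both can reach (the bucket paths, the filter expression and the app names are placeholders):

import org.apache.spark.sql.SparkSession

// Component 1 (its own SparkSession): ingest, preprocess and feature-engineer
// in one Spark job, then persist the "silver" output for downstream components.
val prepSpark = SparkSession.builder().appName("prepare-features").getOrCreate()
prepSpark.read.json("s3://my-bucket/raw/")         // placeholder input location
  .filter("label IS NOT NULL")                     // placeholder preprocessing
  .write.mode("overwrite").parquet("s3://my-bucket/silver/features/")
prepSpark.stop()

// Component 2 (a fresh SparkSession, e.g. one per candidate model): read the
// persisted features back instead of trying to share in-memory state.
val trainSpark = SparkSession.builder().appName("train-model-a").getOrCreate()
val training = trainSpark.read.parquet("s3://my-bucket/silver/features/")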

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

// One large dataframe built from hundreds of input paths
Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);

// Submit one Hudi write per element; each write triggers its own Spark job
List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());

// Wait for each submitted write to complete, with a 30 minute timeout per poll
// (InterruptedException handling omitted in this synopsis)
IntStream.range(0, futures.size()).forEach(i -> ecs.poll(30, TimeUnit.MINUTES));
pool.shutdownNow();
At some point, as nThreads increases, Spark no longer seems to be scheduling any jobs, as evidenced by:
ecs.poll(...) eventually timing out
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to the AWS Elastic MapReduce cluster's HDFS (to leverage the almost instantaneous renames and better file IO of local HDFS) and add a distcp step to move the files to S3, to save yourself all the trouble of handling the innards of an object store trying to be a filesystem. Also, writing to local HDFS will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until we have improved consistency from S3A, we have to work with these workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reversed date-times to get around this. The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names, due to DNS issues.
Enabling fs.s3a.fast.upload uploads parts of a single file to Amazon S3 in parallel.
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
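A minimal sketch of applying those settings when building the session (Scala; the app name is a placeholder, and passing them via spark-submit --conf works equally well):

import org.apache.spark.sql.SparkSession

// Sketch only: applies the committer algorithm and disables speculation
// for S3 output, as recommended above.
val spark = SparkSession.builder()
  .appName("s3-output-job")  // placeholder app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .getOrCreate()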
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're likely not going to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap the number of concurrent queries at 10. I highly suspect that you have 300 queries with only a single task out of several hundred actually running. Analytics-oriented data processing systems intentionally keep the number of concurrent queries fairly low compared to more traditional OLTP/RDS systems for this reason.
Also, Apache Hudi has a default parallelism of several hundred, FYI.
Why don't you just partition based on your filter column?
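A minimal sketch of the "cap the concurrency" suggestion above, in Scala; the input path, the partition values and the output format (plain Parquet here rather than Hudi) are placeholders standing in for the real ones in the question:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("capped-concurrent-writes").getOrCreate()

// Placeholder stand-ins for the asker's data and partition values.
val aHugeDf = spark.read.json("hdfs:///input/huge/")
val things: Seq[String] = Seq("a", "b", "c")

// Bounded pool: at most 10 write jobs are submitted concurrently instead of hundreds.
val boundedPool = Executors.newFixedThreadPool(10)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(boundedPool)

val futures = things.map { aThing =>
  Future {
    aHugeDf
      .filter(col("some_column") === aThing)
      .write
      .mode("overwrite")
      .parquet(s"hdfs:///output/$aThing")   // placeholder output path and format
  }
}

futures.foreach(f => Await.ready(f, 30.minutes))
boundedPool.shutdown()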
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it be something to do with the ExecutorService implementation having some limitation at the scale you are trying to achieve? Could it be Hudi? With just the snippet, that's hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics/monitoring on the cluster or any logs that point to the problem?
If it is to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? Use dedicated resource pools, one per module; the Spark UI then shows each pool and the jobs running in it.
Check that the unit of parallelism (number of partitions) is correct for the data frames you use, via the Spark admin UI. This is Spark's native way of using parallelism.
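A minimal sketch of enabling the FAIR scheduler with dedicated pools (the allocation file path and pool names are placeholders; the pools themselves must be defined in fairscheduler.xml):

import org.apache.spark.sql.SparkSession

// Enable the FAIR scheduler and point Spark at the pool definitions.
val spark = SparkSession.builder()
  .appName("fair-scheduling-example")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // placeholder path
  .getOrCreate()

// Each module/thread assigns its jobs to its own pool before submitting work.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "module_a_pool")
spark.read.parquet("/data/module_a").count()   // this job runs in module_a_pool

// Reset to the default pool when done.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)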

How to submit many jobs to Spark in one application

I have a report stats project which uses Spark 2.1 (Scala); here is how it works:
object PtStatsDayApp extends App {
  Stats A...
  Stats B...
  Stats C...
  .....
}
Someone put many stat computations (mostly unrelated) in one class and submits it using a shell script. I find it has two problems:
if one stat gets stuck, then the other stats below cannot run
if one stat fails, then the application reruns from the beginning
I have two refactoring options:
put every stat in its own class, but then many more scripts are needed. Does this approach add a lot of overhead from submitting so many applications?
run these stats in parallel. Does this cause resource contention, or can Spark handle it appropriately?
Any other ideas or best practices? Thanks.
There are several free third-party Spark schedulers like Airflow, but I suggest using the Spark Launcher API and writing the launching logic programmatically. With this API you can run your jobs in parallel, sequentially, or however you want.
Link to doc: https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/launcher/package-summary.html
The efficiency of running your jobs in parallel mostly depends on your Spark cluster configuration. In general, Spark supports this kind of workload.
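A minimal sketch of launching one job through the Launcher API (the jar path, main class and master are placeholders; launching several jobs in parallel just means calling this from several threads):

import org.apache.spark.launcher.SparkLauncher

// Launch one stats application programmatically.
val handle = new SparkLauncher()
  .setAppResource("/path/to/pt-stats.jar")    // placeholder jar
  .setMainClass("com.example.StatsAJob")      // placeholder main class
  .setMaster("yarn")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .startApplication()

// handle.getState and handle.addListener(...) can be used to track progress.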
First you can set the scheduler mode to FAIR. Then you can use parallel collections to launch simultaneous Spark jobs on a multithreaded driver.
A parallel collection, let's say a parallel sequence (ParSeq) of ten of your Stats queries, can use a foreach to fire off each of the Stats queries. How many threads you can use simultaneously depends on how many cores the driver has; by default, the global execution context has that many threads.
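A minimal sketch of that approach, assuming the stats are plain SQL strings (the queries and output paths below are placeholders; on Scala 2.13+ parallel collections also need the separate scala-parallel-collections module):

import org.apache.spark.sql.SparkSession
import scala.collection.parallel.immutable.ParSeq

val spark = SparkSession.builder()
  .appName("parallel-stats")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Independent stats queries, fired concurrently from driver-side threads.
val statsQueries: ParSeq[(String, String)] = ParSeq(
  ("statsA", "SELECT dt, count(*) FROM events GROUP BY dt"),        // placeholder SQL
  ("statsB", "SELECT user, sum(amount) FROM orders GROUP BY user")  // placeholder SQL
)

statsQueries.foreach { case (name, sql) =>
  spark.sql(sql).write.mode("overwrite").parquet(s"/output/$name")  // placeholder output
}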
Check out these posts; they are examples of launching concurrent Spark jobs with parallel collections:
Cache and Query a Dataset In Parallel Using Spark
Launching Apache Spark SQL jobs from multi-threaded driver

Using Spark, how do I read multiple files in parallel from different folders in HDFS?

I have 3 folders in HDFS containing CSV files in 3 different schemas. All 3 are huge (several GBs). I want to read the files in parallel and process the rows in them in parallel. How do I accomplish this on a YARN cluster using Spark?
Assuming you are using Scala: create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action -- voilà, if you have enough resources in the cluster, you'll have all files being read in parallel. At worst, Spark's job scheduler will shuffle the execution of certain tasks around to minimize wait times.
If you don't have enough workers/executors, you won't gain much, but if you do, you can fully exploit those resources, without having to wait for each job to finish, before you send out the next.
Due to lazy evaluation this may happen anyway, depending on how you work with the data -- but you can force parallel execution of several actions/jobs by using parallelism or Futures.
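A minimal sketch of that idea, assuming three known folder paths (the paths, the CSV options, the grouping column and the output locations are placeholders; listing the folders via the HDFS client is skipped for brevity):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-folder-reads").getOrCreate()

val folders = Seq(
  "hdfs:///data/schema_a",   // placeholder paths
  "hdfs:///data/schema_b",
  "hdfs:///data/schema_c"
)

// .par turns the Seq into a parallel collection, so the three read+process
// jobs are submitted from separate driver threads and can overlap on the cluster.
folders.par.foreach { path =>
  spark.read.option("header", "true").csv(path)           // placeholder read options
    .groupBy("some_key").count()                          // placeholder processing
    .write.mode("overwrite").parquet(path + "_counts")    // placeholder output
}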
If you want to process all the data separately, you can always write 3 Spark jobs to process them separately and execute them on the cluster in parallel. There are several ways to run all 3 jobs in parallel. The most straightforward is to have an Oozie workflow with 3 parallel sub-workflows.
Now if you want to process the 3 datasets in the same job, you need to read them sequentially. After that you can process the datasets. When you process multiple datasets using Spark operations, Spark parallelizes them for you. The closure of each operation is shipped to the executors, and everything runs in parallel.
What do you mean by "read the files in parallel and process the rows in them in parallel"? Spark deals with your data in parallel by itself, according to your application configuration (num-executors, executor-cores...).
If you mean "start reading files at the same time and process them simultaneously", I'm pretty sure you can't explicitly get that. It would demand some ability to affect the DAG of your application, but as far as I know, the only way to do it is implicitly, by building your data processing as a sequence of transformations/actions.
Spark is also designed in such a way that it can execute several stages simultaneously "out of the box", if your resource allocation allows it.
I encountered a similar situation recently.
You can pass a list of file paths to the Spark read API, e.g. spark.read.json(input_file_paths) (source). This will load all the files into a single dataframe, and all the transformations eventually performed will be done in parallel by multiple executors, depending on your Spark config.
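A minimal sketch of that in Scala for the CSV case (the paths and options are placeholders; only files sharing a schema should be combined into one read):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-path-read").getOrCreate()

val inputFilePaths = Seq(
  "hdfs:///data/schema_a/part1.csv",   // placeholder paths
  "hdfs:///data/schema_a/part2.csv"
)

// All listed files are loaded into one dataframe; Spark then splits the work
// across executors according to the cluster configuration.
val df = spark.read.option("header", "true").csv(inputFilePaths: _*)
df.count()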

Spark-java multithreading vs running individual spark jobs

I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: as part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially.
- First approach
All the queries can be stored in a Hive table, and I can write a Spark driver to read all queries at once and run them in parallel (with HiveContext) using Java multithreading.
Pros: easy to maintain
Cons: all resources may get occupied, and performance tuning can be tough for each query.
- Second approach
Using Oozie Spark actions, run each query individually.
Pros: optimization can be done at query level
Cons: tough to maintain
I couldn't find any documentation about how Spark will process the queries internally in the first approach. From a performance point of view, which approach is better?
The only thing on Spark multithreading I could find is:
"within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
Thanks in advance
Since your requirement is to run Hive queries in parallel with the condition
"Some can run in parallel and some sequentially"
this kind of workflow is best handled by a DAG processor, which Apache Oozie is. This approach will be cleaner than managing your queries in code, i.e. building your own DAG processor instead of using the one provided by Oozie.

Resources