spark embedded (same JVM) for <400k text file process - apache-spark

Can we do a mini spark job embedded in our app? Any example? Reason : want to process part of a file and give results quicker than submitting a regular job. File is only 500 lines. But do not want to keep 2 code bases - just the one that is used for the large files too. File size is less than an MB.
I want to process the file in the same JVM that my client code is running. Want a single executor kicked off from within same JVM, via a flag in the config. (so a few jobs will have this flag set and others wont. Those that do not will run as usual on the cluster.)

Related

How to make subsection of sbatch jobs run at a time

I have a bash script (essentially what it does is align all files in a directory to all the files of another specified directory). The number of jobs gets quite large if there are 100 files being aligned individually to another 100 files (10,000 jobs), which all get submitted to slurm individually.
I have been doing them in batches manually but I think there must be a way to include it in a script so that, for example, only 50 jobs are running at a time.
I tried
$ sbatch --array [1-1000]%50
but it didn't work

Getting the load for each job

Where can I find the load (used/claimed CPUs) per job? I know to get it per host using sinfo, but that does not directly give information on which job causes a possible 'incorrect' load of anything unequal to 1.
(I want to get this for all jobs, i.e. logging in to the node and running top is not my objective.)
You can use
sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
and compare Elapsed with AveCPU. The latter will only be available for job steps, not for the whole job.

How to prevent eventLog file of Spark stream jobs eating up space?

We have multiple run-forever streaming jobs generating huge eventLogs. These in-progress logs won't be removed until reach the the max age config (spark.history.fs.cleaner.maxAge).
Based on the Spark source code, "Only completed applications older than the specified max age will be deleted." https://github.com/apache/spark/blob/a45647746d1efb90cb8bc142c2ef110a0db9bc9f/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
So, in-progress eventLog will never be removed before completion and they are eating up space. Anyone have idea how to prevent it?
We have option like script periodically removes old files, but it will be our last resort, and we cannot modify the source code, but configuration.

Performance analysis of U-SQL script

When I run a U SQL script from portal/visual studio it follows stages like preparing,queued,running,finalizing. What exactly happens behind the scenes in all these stages?Will there be any execution time difference when the job is run from visual studio/portal in dev and production environment? We need to clock the speeds and record the time the script would take in production.Ultimately, the goal is to run these scripts as Data Factory activities in production.
I assume that there would be differences since I assume your dev environment would probably run at lower resource usage (lower degree of parallelism both between jobs and inside a job) than your production environment. Otherwise there should be no difference.
Note that we are still working on performance so if you are running into particular issues, please let us know.
The phases roughly do the following (I am probably missing some parts):
preparing: includes compilation, optimization, Codegen, preparing the execution graph and required resources and putting the job into the queue.
queueing: The job sits in the queue to get executed once the job is at the top of the queue and resources are available to start the job. This can be impacted by setting the maximal number of jobs that can run in parallel (a setting you can set by "calling" support/us).
running: Actual job execution. This will be affected by resources: Maximal number of parallelism that is specified on the job, network bandwidth, store access (throttling, bandwidth).
finalizing: Cleanup and stitching results into files, "sealing" table files. This can be more expensive depending on where you write the data (ADL is faster than WASB for example).

How does the Apache Spark scheduler split files into tasks?

In spark-summit 2014, Aaron gives the speak A Deeper Understanding of Spark Internals , in his slide, page 17 show a stage has been splited into 4 tasks as bellow:
Here I wanna know three things about how does a stage be splited into tasks?
in this example above, it seems that tasks' number are created based on the file number, am I right?
if I'm right in point 1, so if there was just 3 files under directory names, will it just create 3 tasks?
If I'm right in point 2, what if there is just one but very large file? Does it just split this stage into 1 task? And what if when the data is coming from a streaming data source?
thanks a lot, I feel confused in how does the stage been splited into tasks.
You can configure the # of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*) . Large files will be broken down accordingly - you can see the actual size by:
rdd.partitions.size
So no you will not end up with single Worker chugging away for a long time on a single file.
(*) If you have very small files then that may change this processing. But in any case large files will follow this pattern.
The split occurs in two stages:
Firstly HDSF splits the logical file into 64MB or 128MB physical files when the file is loaded.
Secondly SPARK will schedule a MAP task to process each physical file.
There is a fairly complex internal scheduling process as there are three copies of each physical file stored on three different servers, and, for large logical files it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between hadoop distributions.
When all the MAP tasks have run the collectors, shuffle and reduce tasks can then be run.
Stage: New stage will get created when a wide transformation occurs
Task: Will get created based on partitions in a worker
Attaching the link for more explanation: How DAG works under the covers in RDD?
Question 1: in this example above, it seems that tasks' number are created based on the file number, am I right?
Answer : its not based on the filenumber, its based on your hadoop block(0.gz,1.gz is a block of data saved or stored in hdfs. )
Question 2:
if I'm right in point 1, so if there was just 3 files under directory names, will it just create 3 tasks?
Answer : By default block size in hadoop is of 64MB and that block of data will be treated as partition in spark.
Note : no of partitions = no of task, because of these it has created 3tasks.
Question 3 :
what if there is just one but very large file? Does it just split this stage into 1 task? And what if when the data is coming from a streaming data source?
Answer : No, the very large file will be partitioned and as i answered for ur question 2 based on the no of partitions , no of task will be created

Resources