Is it possible to know the resources used by a specific Spark job? - apache-spark

I'm drawing on ideas of using a multi tenant Spark cluster. The cluster execute jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for payment reasons)? E.g. if a job requires that several nodes in kubernetes is automatically allocated is it then possible to track which Spark jobs (and tenant at the end) that initiated these resource allocations? Or, jobs are always evenly spread out on allocated resources?
Tried to find information at the Apache Spark site and else where on the internet without success.

See https://spark.apache.org/docs/latest/monitoring.html
You can save data from Spark History Server as json and then write your own resource calc stuff.
It is Spark App you mean.

Related

Running a spark job in local mode inside an Openshift pod

I have a pyspark batch job scheduled on YARN. There is now a requirement to put the logic of the spark job into a web service.
I really don't want there to be 2 copies of the same code, and therefore would like to somehow reuse the spark code inside the service, only replacing the IO parts.
The expected size of the workloads per request is small so I don't want to complicate the service by turning it into a distributed application. I would like instead to run the spark code in local mode inside the service. How do I do that? Is that even a good idea? Are there better alternatives?

Spark job as a web service?

A peer of mine has created code that opens a restful api web service within an interactive spark job. The intent of our company is to use his code as a means of extracting data from various datasources. He can get it to work on his machine with a local instance of spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web-service would need to act as a Spark Driver. Just like you'd run spark-shell, run some commands , and then use collect() methods to bring all data to be shown in the local environment, that all runs in a singular JVM environment. It would submit executors to a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation for a REST Spark submission server.
It can be done, but depending on the process, it would be very asynchronous, and it is not suggested for large datasets, which Spark is meant for. Depending on the data that you need (e.g. highly using SparkSQL), it'd be better to query a database directly.

Mesos implementation

I have two Django websites that create a Spark Session to a Cluster which is running on Mesos.
The problem is that whatever Django starts first will create a framework and take 100% the resources permanently, it grabs them and doesn't let them go even if idle.
I am lost on how to make the two frameworks use only the neede resources and have them concurrently access the Spark cluster.
Looked into spark schedulres, dynamic resources for spark and mesos but nothing seems to work.
Is it even possible or should I change the approach?
Self solved using dynamic allocation.

how does resource offer works with mesos for a spark streaming application?

As I understand reading through mesos documentation, is a resource offer is done to a application/framework and it is upto the application to accept/reject offer.
I have a "never-ending" spark streaming app where I configured the executors/cores/memory I need parallelism. Aren't these resources acquired only once when my spark-app starts-up. That is, lets say, if my executors are idle are they handed back to mesos?
Does resource offer and acceptance happens only once in case of spark-streaming?
The same question can be extended for other long-running framework such as cassandra or YARN on mesos.
my understanding is that when spark-streaming is run on coarse-grained model resource exchanges happens once and resources are dedicated to executors for lifetime of spark app
The best source for Spark on Mesos would be the spark docs site here.
In the coarse-grained section you can see the following which answers your question:
The benefit of coarse-grained mode is much lower startup overhead, but at the cost of reserving Mesos resources for the complete duration of the application. To configure your job to dynamically adjust to its resource requirements, look into Dynamic Allocation.
If you look into Dynamic resource allocation you can potentially move around executor resources via a Spark Shuffle service. This can be achieved via the provided script by the Spark service, or via Marathon.

How to ensure that DAG is not recomputed after the driver is restarted?

How can I ensure that an entire DAG of spark is highly available i.e. not recomputed from scratch when the driver is restarted (default HA in yarn cluster mode).
Currently, I use spark to orchestrate multiple smaller jobs i.e.
read table1
hash some columns
write to HDFS
this is performed for multiple tables.
Now when the driver is restarted i.e. when working on the second table the first one is reprocessed - though it already would have been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task orchestration tool. While it has built-in scheduler and some fault tolerance mechanisms built-in, it as suitable for granular task management, as for example server orchestration (hey, we can call pipe on each machine to execute bash scripts, right).
If you want granular recovery choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application and use external orchestration to submit the jobs.
You can build poor man's alternative, by checking if expected output exist and skipping part of the job in that case, but really don't - we have variety of battle tested tools which can do way better job than this.
As a side note Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also independent jobs (read -> transform -> write) create independent DAGs - there is no global DAG and proper checkpoint of the application would require full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark and both claim to be offering the feature to outlive a Spark application.
There has been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).

Resources