Can I run multiple Spark History Servers in parallel? - apache-spark

We use Spark History Server to analyze our Spark runs, and we have a lot of them. This causes a high load on the server, since logs are analyzed lazily (i.e. on the first request).
I was wondering if we can scale this service out, or just scale it up? Is it version-dependent?
Since it's a concurrency issue, I'd rather get a trustworthy answer than run it and hope for the best.
Thanks!

Related

Spark Structured Streaming best VMs

I was hoping to ask if anyone has found the best VM to use for Databricks clusters when running Spark Streaming.
I was testing out the Fv2 series (F32_v2), but I found that most of the jobs have an issue with memory spill. With that said, would it make sense to use memory-optimized clusters or add more compute VMs?
We are looking to see how we can improve the code, but as a general rule, have you found that some VM types work better with streaming jobs and some that do not (for example, the L-series vs. E-series vs. F-series)?
Thank you in advance
It might depend on your use case. If you need more parallel processing - let's say you have more partitions on the message queue you pull data from - you can go for compute-optimized nodes and have more cores running in parallel, pulling data from the message queue. If your workload is memory-intensive, you can go for memory-optimized VMs.
This page has details on the benchmarking tests conducted on Databricks, and it might help you get a fair idea:
https://www.databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
GitHub repo with .dbc files for benchmarking: https://github.com/databricks/benchmarks
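To make the parallelism point above concrete, here is a minimal sketch (cluster size and names are hypothetical, not taken from the question): with most streaming sources the input parallelism is bounded by the source's partition count, so compute-optimized nodes with extra cores typically only pay off once CPU-bound work is spread across that many partitions, whereas memory spill usually calls for more memory per core rather than more machines.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelismSketch {
  // CPU-bound transforms can use every core only if the data is split into at least
  // that many partitions; one task runs per partition.
  def widenForCpuBoundWork(input: DataFrame, totalCores: Int): DataFrame =
    input.repartition(totalCores)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-sketch")
      .master("local[*]") // stand-in for a real cluster
      .getOrCreate()

    val microBatch = spark.range(1000000).toDF("id") // stand-in for one streamed micro-batch
    val widened = widenForCpuBoundWork(microBatch, spark.sparkContext.defaultParallelism)
    println(s"partitions after repartition: ${widened.rdd.getNumPartitions}")
    spark.stop()
  }
}
```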

Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake, for that matter.
I tried this with two approaches.
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute to reading from the same Kafka topic and dumping the data into the data lake. This works pretty quickly for the volume of workload I have.
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of the Kubernetes approach was at least on par with that of a Spark job running in cluster mode.
So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine on much heavier workloads, say on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your use case is only to archive/snapshot/dump records, I would recommend looking into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into play. For this case you may also look into Kafka Streams.
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for this rather than developing your own consumers. These frameworks already handle most failure scenarios, scaling, and configurable semantics, and they have enough config options to tune the behaviour to most of the cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open source bugs ;).
Hope it helps.
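For the "read Kafka, dump to a data lake" pipeline discussed above, a minimal Structured Streaming sketch; the broker address, topic name, and output paths are placeholders, not values from the question:

```scala
import org.apache.spark.sql.SparkSession

object KafkaToLakeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-lake-sketch").getOrCreate()

    // Read the raw records from Kafka (hypothetical broker and topic).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

    // Append the records untouched to a data-lake path; the checkpoint lets the
    // query resume where it left off after a restart.
    raw.writeStream
      .format("parquet")
      .option("path", "s3a://my-lake/raw/events/")
      .option("checkpointLocation", "s3a://my-lake/checkpoints/events/")
      .start()
      .awaitTermination()
  }
}
```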
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really spark when your load becomes more processing-heavy. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
Seems like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming answer, and referencing others based on my own reading/research, here is a limited response:
The following covers logging and progress checking of stages, tasks, and jobs:
Global logging via log4j, tailored via the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining logging requirements for one's own purposes, but at the Spark level.
Programmatically, by using a Logger via import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used; see the sketch after this list.
http://<host>:8080 to see progress via the Web UI.
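As a hedged illustration of the programmatic options above (the job at the end is just a placeholder so the listener has something to report), a minimal sketch combining a log4j Logger with a SparkListener that logs task and stage completion:

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object ProgressMonitoringSketch {
  def main(args: Array[String]): Unit = {
    val log = Logger.getLogger(getClass)
    log.setLevel(Level.INFO)

    val spark = SparkSession.builder().appName("progress-sketch").getOrCreate()

    // Register a listener that reports progress as tasks and stages finish.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        log.info(s"Task ${taskEnd.taskInfo.taskId} finished in stage ${taskEnd.stageId}")

      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        log.info(s"Stage ${stage.stageInfo.stageId} completed " +
          s"(${stage.stageInfo.numTasks} tasks)")
    })

    // Placeholder job standing in for the real ETL work.
    spark.range(0, 1000000).selectExpr("sum(id)").collect()
    spark.stop()
  }
}
```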
The second question depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects or memory usage issues, and things like serious database duplicate-key errors, depending on the API used.
See "How does Apache Spark handle system failure when deployed in YARN?". Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean it's over, e.g. memory issues that may result from exceeding various parameters on large-scale computations, a DataFrame JDBC write against a store hitting a duplicate-key error, or JDBC connection outages. These mean re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate-key inserts on some Hadoop storage managers.
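On the caching/checkpointing part of the question, a minimal sketch of checkpointing; the S3 paths, JDBC URL, and table name are placeholders, and the MySQL JDBC driver is assumed to be on the classpath. Checkpointing materializes the intermediate result to reliable storage and truncates the lineage, so later task failures are recomputed from that copy rather than by re-reading the database:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-checkpoint-sketch").getOrCreate()

    // Reliable storage for checkpoints (hypothetical path).
    spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints")

    // Hypothetical JDBC read standing in for the 500+ GB MySQL source.
    val rows = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("dbtable", "events")
      .load()

    // Checkpoint the expensive intermediate result; tasks that fail afterwards are
    // recomputed from this materialized copy, not by re-reading MySQL.
    val checkpointed = rows.checkpoint()

    checkpointed.write.json("s3a://my-bucket/aggregated-json/")
    spark.stop()
  }
}
```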

Drawbacks of using embedded Spark in Application

I have a use case wherein I launch local Spark (embedded) inside an application server rather than going for a Spark REST job server or kernel, because the former (embedded Spark) has much lower latency than the others. I am interested in:
The drawbacks of this approach, if there are any.
Whether the same can be used in production.
P.S. Low latency is the priority here.
EDIT: The size of the data being processed will, in most cases, be less than 100 MB.
I don't think it is a drawback at all. If you have a look at the implementation of the Hive Thriftserver within the Spark project itself, they also manage the SQLContext etc. in the Hive server process. This is especially true if the amount of data is small and the driver can handle it easily. So I would also see this as a hint that this is okay for production use.
But I totally agree: the documentation, and advice in general on how to integrate Spark into interactive customer-facing applications, lags behind the information for big data pipelines.
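As a small illustration of the embedded approach discussed above (the app name, the request payload shape, and the config choices are hypothetical), a sketch of a local SparkSession living inside the application server's JVM:

```scala
import org.apache.spark.sql.SparkSession

object EmbeddedSparkSketch {
  // Build the session once at application startup and reuse it for every request.
  // local[*] runs the driver and executors inside this JVM, which keeps latency low
  // but means the application now carries Spark's memory and CPU footprint.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("embedded-spark-sketch")
    .master("local[*]")
    .config("spark.ui.enabled", "false") // optional: avoid binding an extra port in the server
    .getOrCreate()

  // Hypothetical request handler working on a small, request-scoped dataset
  // (under 100 MB, as in the question), processed entirely in-process.
  def handleRequest(payload: Seq[(String, Long)]): Long = {
    import spark.implicits._
    payload.toDF("key", "value").groupBy("key").count().count()
  }
}
```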

How can I retrieve worker information for a running application in Spark?

I want to get information about the workers that are being used by an application in a Spark cluster. I need to get their IP addresses, CPU cores, available memory, etc.
Is there any API in Spark for this purpose?
The Spark UI shows the same info, but I am not able to figure out how to get it from Java code.
This is specific to Java.
I want information for all worker nodes.
Thanks.
There are multiple ways to do this:
Parse the output log messages and see what workers are started on each machine in your cluster. You can get the names/IPs of all the hosts, when tasks are started and where, how much memory each worker gets, etc. If you want to see the exact HW configuration, you will then need to log in to the worker nodes or use different tools.
The same information as in the web frontend is contained in the event logs of the Spark applications (this is actually where the data you see comes from). I prefer to use the event log, as it is very easy to parse in Python, rather than the log messages.
If you want real-time monitoring of the cluster, you can use either Ganglia (which gives nice graphical displays of CPU/memory/network/disks) or colmux, which gives you the same data in text format. I personally prefer colmux (easier to set up, immediate stats, etc.).
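Beyond parsing driver logs or event logs, another programmatic option is Spark's monitoring REST API served from the application UI (usually port 4040 on the driver): /api/v1/applications/<app-id>/executors returns per-executor fields such as hostPort, totalCores, maxMemory, and memoryUsed. The sketch below is in Scala with a placeholder host and application id; the same endpoints can be called from Java with any HTTP client.

```scala
import scala.io.Source

object ExecutorInfoSketch {
  def main(args: Array[String]): Unit = {
    val driverUiBase = "http://driver-host:4040" // hypothetical driver UI address

    // List the running applications and pick the id out of the JSON.
    val appsJson = Source.fromURL(s"$driverUiBase/api/v1/applications").mkString
    println(appsJson)

    val appId = "app-20240101000000-0000" // hypothetical id taken from the list above

    // One JSON object per executor: hostPort, totalCores, maxMemory, memoryUsed, ...
    val executorsJson =
      Source.fromURL(s"$driverUiBase/api/v1/applications/$appId/executors").mkString
    println(executorsJson)
  }
}
```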
