Spark structured streaming best VMs - apache-spark

I was hoping to ask whether anyone has found the best VM type to use for Databricks clusters when running Spark streaming jobs.
I was testing out the Fv2 series (F32_v2), but I found that most of the jobs have an issue with memory spill. With that said, would it make more sense to use memory-optimized clusters, or to add more compute VMs?
We are looking at how we can improve the code, but as a general rule, have you found that some VM types work better with streaming jobs and some do not (for example, the L-series vs E-series vs F-series)?
Thank you in advance

It might depend on your use case. If you need more parallel processing - let's say you have many partitions on the message queue from which you pull the data - you can go for compute-optimized nodes and have more cores running in parallel, pulling data from the message queue. If you feel your workload is memory-intensive, you can go for memory-optimized VMs.
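As a rough illustration of the parallelism side, here is a minimal Structured Streaming read from Kafka. This is a sketch, not a tuned config: the broker address, topic name, and partition counts are placeholders. `minPartitions` is the knob that lets more cores than topic partitions share the pull work, and `spark.sql.shuffle.partitions` is a common lever when stateful stages spill:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch; broker/topic names and numbers are assumptions.
val spark = SparkSession.builder()
  .appName("streaming-sizing-probe")
  // Stateful aggregations spill when partitions are too large for the heap;
  // raising this spreads state across more tasks, lowering it avoids many
  // tiny tasks on compute-heavy nodes.
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("subscribe", "events")                    // placeholder topic
  // By default one Spark task reads one Kafka partition; minPartitions
  // lets extra cores share the pull work beyond the topic partition count.
  .option("minPartitions", "64")
  .load()
```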
This page has details around the benchmarking tests conducted on Databricks, and it might help you get a fair idea -
https://www.databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
GitHub repo with .dbc files for benchmarking - https://github.com/databricks/benchmarks

Related

Can I run multiple Spark History Servers in parallel?

We use the Spark History Server to analyze our Spark runs, and we have a lot of them. This causes a high load on the server, since logs are analyzed lazily (i.e. on the first request).
I was wondering if we can scale out this service, or just scale it up? Is it version-dependent?
Since it's a concurrency issue, I'd rather get a trustworthy answer than run it and hope for the best.
Thanks!

Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake.
I tried two approaches:
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dump the data into the data lake. This works pretty quickly for the volume of workload I have (see the sketch below).
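For reference, a consumer-group reader of this shape is only a few lines. A hedged Scala sketch; the broker, group id, and topic names are placeholders, and the data-lake write is left as a hypothetical stub:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder
props.put("group.id", "raw-dump")             // same group id in all three pods
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("records")) // placeholder topic

// Kafka assigns each pod in the group a disjoint subset of the topic's
// partitions, so the three containers split the load automatically.
while (true) {
  for (record <- consumer.poll(Duration.ofMillis(500)).asScala) {
    // writeToDataLake(record.value()) // hypothetical sink call
  }
}
```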
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of the Kubernetes approach was, if not better, equal to that of a Spark job running in cluster mode.
So my question is, what is the real use case for using spark over kubernetes the way I am using it or even spark on kubernetes?
Is Spark only going to rise and shine on much, much heavier workloads, say something on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your use case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game. For this case you may also look into Kafka Streams.
Each of these frameworks has its own trade-offs and performance overheads, but in any case you save a lot of development effort by using the tools made for the job rather than developing your own consumers. These frameworks already support most failure handling, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of the open-source bugs ;).
Hope it helps.
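For scale, the "Spark just dumping raw records" case is itself only a few lines of Structured Streaming. A sketch with placeholder broker, topic, and data-lake paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-raw-dump").getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "records")                   // placeholder topic
  .load()
  // Keep the raw payload; no transformation beyond casting bytes to strings.
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
  .writeStream
  .format("parquet")                                // or another lake/NoSQL sink
  .option("path", "/datalake/raw/records")          // hypothetical path
  .option("checkpointLocation", "/datalake/_chk/records")
  .start()
  .awaitTermination()
```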
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to rely on.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really spark when your load becomes more processing-intensive. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

How to do performance tuning on a production cluster for a Spark job?

Let's assume we have a Spark job where we do all the performance tuning and make it run in a development environment, which has a limited configuration (1 node, 32 GB RAM, 500 GB hard disk).
Obviously our production cluster is going to be bigger; how can the tuning parameters measured in the development environment be helpful on the production cluster? Is it advisable to tune jobs directly on the production cluster?
How is this done in real life?
Shameless plug (I'm the author): try Sparklens, https://github.com/qubole/sparklens. Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications, the answer is: only up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
One of the best features of Sparklens is that it simulates and tells you how your Spark application will perform with different executor counts. It looks perfect for your problem.
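A minimal sketch of wiring Sparklens into a job, assuming the Sparklens jar is already on the classpath (e.g. added via --packages):

```scala
import org.apache.spark.sql.SparkSession

// Attaching the Sparklens listener makes it record the stage/task metrics
// it later uses to simulate runs with different executor counts.
val spark = SparkSession.builder()
  .appName("job-under-test")
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .getOrCreate()
```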

Drawbacks of using embedded Spark in an application

I have a use case where I launch local Spark (embedded) inside an application server rather than going for a Spark REST job server or a kernel, because the former (embedded Spark) has much lower latency than the others. I am interested in:
Drawbacks of this approach, if there are any.
Can the same be used in production?
P.S. Low latency is the priority here.
EDIT: The size of the data being processed will in most cases be less than 100 MB.
I don't think it is a drawback at all. If you have a look at the implementation of the Hive Thriftserver within the Spark project itself, they also manage the SQLContext etc. in the Hive server process. This is especially the case if the amount of data is small and the driver can handle it easily. So I would also see this as a hint that this is okay for production use.
But I totally agree that the documentation, and advice in general on how to integrate Spark into interactive, customer-facing applications, lags behind the information available for big-data pipelines.
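A minimal sketch of the embedded pattern, assuming the application server builds one long-lived local session at startup and shares it across requests:

```scala
import org.apache.spark.sql.SparkSession

object EmbeddedSpark {
  // One session for the whole application; building a session per request
  // would dominate the latency you are trying to avoid.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("embedded-spark")
    .master("local[*]")                  // all work stays in this JVM
    .config("spark.ui.enabled", "false") // assumption: no UI inside an app server
    .getOrCreate()
}
```

The main thing to watch with this design is that the driver shares the application server's JVM heap, so one unexpectedly large job can affect unrelated requests.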

How to benchmark Kafka Spark Streaming?

I have to benchmark Spark Streaming processing. My process pulls messages from Kafka, processes them, and loads them into Elasticsearch. The upstream generates 100k records per second, so I would like to calculate how many messages are processed in one second and the latency. Are there any tools available to monitor this, or is there a process to calculate it?
The Spark UI can help you, providing the necessary details you need.
By default, the Spark UI is available at http://<driver-node>:4040 in a web browser (for a single SparkContext).
For more help, you can use this link: http://spark.apache.org/docs/latest/monitoring.html
Beyond the Spark UI, which is useful for determining the rate of processing of your data, you can also use third-party tools like spark-perf to load-test your cluster and obtain benchmark data that way as well.
You could also try Yahoo's streaming-benchmarks; I found that Databricks used that tool to benchmark Spark Streaming against Flink.
https://github.com/yahoo/streaming-benchmarks
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
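If you are on Structured Streaming, the per-trigger throughput and latency the question asks about are also exposed programmatically. A minimal listener sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("stream-bench").getOrCreate()

// Prints the rates Spark itself reports after every micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"input rows/s: ${p.inputRowsPerSecond}, " +
            s"processed rows/s: ${p.processedRowsPerSecond}, " +
            s"trigger latency ms: ${p.durationMs.get("triggerExecution")}")
  }
})
```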
