What is the proper setup for spark with cassandra - apache-spark

After using and playing around with the spark connector, I want to utilize it in the most efficient way, for our batch processes.
is the proper approach to set up a spark worker on the same host where Cassandra node is on? does the spark connector ensure data locality?
I am a bit concerned that a memory intensive spark worker will cause the entire machine to stop, then I will lose a Cassandra node, so I'm a bit confused whether I should place the workers on the Cassandra nodes, or separate (which means no data locality). what is the common way and why?

This depends on your particular use case. Some things to be aware of
1) CPU Sharing, while memory will not be shared (heaps will be separate) between Spark and Cassandra. There is nothing stopping spark executors from stealing time on C* cpu cores. This can lead to load and slowdowns in C* if the spark process is very cpu intensive. If it isn't then this isn't much of a problem.
2) Your network speed, if your network is very fast then there is much less value to locality than if you are on a slower network.
So you have to ask yourself, do you want a simpler setup (everything in one place) or do you want a complicated setup but more isolated.
For instance DataStax (the company I work for) ships Spark running colocated with Cassandra by default, but we also offer the option of having it run separately. Most of our users colocate possibly because of this default, those who don't usually do so because of easier scaling.

Related

Drawbacks of using embedded Spark in Application

I have a use case where in I launch local spark (embedded) inside an application server rather than going for spark rest job server or kernel. Because former(embedded spark) has very low latency compared to other. I am interested in
Drawbacks of this approach if there are any.
Can same be used in production?
P.S. Low latency is priority here.
EDIT: Size of the data being processed for most of the cases will be less than 100mb.
I don't think it is a drawback at all. If you have a look at the implementation of the Hive Thriftserver within the Spark project itself, they also manage SQLContext etc, in the Hive Server process. This is especially the case, if the amount of data is small and the driver can handle it easily. So I would also see this as a hint, that this okay for production use.
But I totally agree, the documentation or advice in general how to integrate spark into interactive customer-facing application is lacking behind the information for BigData pipelines.

Do multiple spark applications running on yarn have any impact on each other?

Do multiple spark jobs running on yarn have any impact on each other?
e.g. If the traffic on one streaming job increases too much does it have any effect on second job? Will it slow it down or any other consequences?
I have enough resources for both of the applications to run concurrently.
Yes they do. Depending on how your scheduler is set up (static vs dynamic) they either share just the network output (important for shuffles) and disk throughput (important for shuffles, reading in of data locally or on HDFS, writing away data locally or on HDFS) or also the memory and CPUs if it's on dynamic allocation. Still, running your two jobs on parallel as opposed to sequentially will benefit on average, due to the network and disk resources not being used constantly. This mostly depends on the amount of shuffling necessary in your jobs.

hazelcast map-reduce use only one cpu core

I run a map-reduce job on a single-node hazelcast cluster and it consumes only about one CPU (120-130%). I can't find how to configure hazelcast to eat all available CPU up, is it possible at all?
EDIT:
While Hazelcast does not support in-node parallelism another competing opensource in-memory datagrid (IMDG) solution does - Infinispan. See this article to learn more about that.
The current implementation of Mapping and Reducing is single threaded. Hazelcast is not meant to run as a single node environment and the map-reduce framework is designed in a way to support scale-out and not exhaust the whole CPU. You can start up multiple nodes on your machine to parallelize processing and utilize the CPU that way but it seems to me that you might use Hazelcast for a problem that it is not meant to solve. Can you elaborate your use case?

Presto configuration

As I set up a cluster of Presto and try to do some performance tuning, I wonder if there's a more comprehensive configuration guide of Presto, e.g. how can I control how many CPU cores a Presto worker can use. And is it good practice if I start multiple presto workers on a single server (in which case I don't need a dedicated server to run the coordinator)?
Besides, I don't quite understand the task.max-memory argument. Will the presto worker start multiple tasks for a single query? If yes, maybe I can use task.max-memory together with the -Xmx JVM argument to control the level of parallelism?
Thanks in advance.
Presto is a multithreaded Java program and works hard to use all available CPU resources when processing a query (assuming the input table is large enough to warrant such parallelism). You can artificially constrain the amount of CPU resources that Presto uses at the operating system level using cgroups, CPU affinity, etc.
There is no reason or benefit to starting multiple Presto workers on a single machine. You should not do this because they will needlessly compete with each other for resources and likely perform worse than a single process would.
We use a dedicated coordinator in our deployments that have 50+ machines because we found that having the coordinator process queries would slow it down while it performs the query coordination work, which has a negative impact on overall query performance. For small clusters, dedicating a machine to coordination is likely a waste of resources. You'll need to run some experiments with your own cluster setup and workload to determine which way is best for your environment.
You can have a single Presto process act as both a coordinator and worker, which can be useful for tiny clusters or testing purposes. To do so, add this to the etc/config.properties file:
coordinator=true
node-scheduler.include-coordinator=true
Your idea of starting a dedicated coordinator process on a machine shared with a worker process is interesting. For example, on a machine with 16 processors, you could use cgroups or CPU affinity to dedicate 2 cores to the coordinator process and restrict the worker process to 14 cores. We have never tried this, but it could be a good option for small clusters.
A task is a stage in a query plan that runs on a worker (the CLI shows the list of stages while the query is running). For a query like SELECT COUNT(*) FROM t, there will be a task on every work that performs the table scan and partial aggregation, and another task on a single worker for the final aggregation. More complex queries that have joins, subqueries, etc., can result in multiple tasks on every worker node for a single query.
-Xmx must be higher than task.max-memory, or at least equal.
otherwise you will be likely to see OOM issue as I have experienced that before.
and also, since Presto-0.113 they have changed the way Presto manages the query memory and according configurations.
please refer to this link:
https://prestodb.io/docs/current/installation/deployment.html
For your question regarding "many CPU cores a Presto worker can use", I think it's controlled by the parameter task.concurrency, which by default is 16

Using the same computer as a Cassandra node and a Cassandra client

If you are using the Cassandra distributed key-value store, you will have several Cassandra nodes, and thus several computers. The data doesn't just sit there, of course, you also have one or more clients programs that communicate with the Cassandra nodes. Computationally intensive work done by the clients might also be distributed over several computers. Should the clients and the Cassandra nodes be separate computers? Is it OK to use the same computer as a Cassandra node and as a Cassandra client? I expect it would work, in the sense of performing correctly, but would there be unacceptable performance problems?
The Cassandra documentation I've seen talks in terms that suggest Cassandra nodes and clients should be separate computers, but I've not seen an explicit recommendation.
Why do I ask? Why might I want to do that? The application I have in mind does not require that the clients store any data locally, they use Cassandra for all persistent data. Their job is computationally intensive, so the bottleneck is likely to be client CPU processing rather than Cassandra processing. Not also using them as Cassandra nodes seems wasteful.
Also, if each computation (client) node is also a Cassandra node, I can use the Cassandra token of each node (used for distributing Cassandra's data) to distribute the client computations.
This is a valid setup for certain types of deployments. The most common case where people do this is when running Hadoop jobs against Cassandra. The Cassandra Wiki recommends you run one Hadoop TaskTracker on each node in your cluster. That type of deployment is similar to what you are describing.

Resources