Apache Cassandra overwhelming bandwidth overhead

While testing Apache Cassandra, I inserted 1000 rows of data and let it propagate to the other machine on the LAN. This is a 2-machine cluster, and I monitored the network connection between the two machines. The total data I expected to flow between the two servers was around 25 MB (including all column names, column values, and timestamps), but the actual data sent and received between them was a whopping 362 MB! Does anybody know why there is such overwhelming overhead? Thank you.

That's interesting. It's probably easier to figure out what's going on if you look at a single operation at a time, though.

It is probably due to the gossip protocol, which Cassandra uses to handle things like cluster membership, management, and replication.
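To act on the "single operation at a time" suggestion, here is a minimal sketch of measuring the wire cost around one insert, assuming the DataStax Python driver and psutil. The interface name, contact point, keyspace, and table are placeholders, and background gossip traffic will still be counted (which is part of what the measurement would reveal):

```python
# Rough sketch, not the original poster's setup: count bytes crossing the
# cluster-facing NIC around a single insert. "eth0", the contact point,
# keyspace and table are all hypothetical placeholders.
import time

import psutil
from cassandra.cluster import Cluster

def nic_bytes(iface="eth0"):
    io = psutil.net_io_counters(pernic=True)[iface]
    return io.bytes_sent + io.bytes_recv

cluster = Cluster(["192.168.1.10"])          # placeholder contact point
session = cluster.connect("test_keyspace")   # placeholder keyspace

before = nic_bytes()
session.execute(
    "INSERT INTO users (id, name) VALUES (%s, %s)",   # placeholder table
    (1, "alice"),
)
time.sleep(2)  # give replication to the second node a moment to finish
print("bytes on the wire around one insert:", nic_bytes() - before)
cluster.shutdown()
```

Comparing this per-operation figure against the steady-state traffic when no writes are happening separates the cost of your data from the cost of gossip and other background chatter.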

Related

How to determine infra needs for a spark cluster

I am looking for suggestions or resources on how to size servers for a Spark cluster. We have enterprise requirements that force us to use on-prem servers only, so I can't try the task on a public cloud (and even if I used fake data for a PoC, I would still need to buy the physical hardware later). The org also doesn't have a shared distributed compute environment that I could use, and I wasn't able to get good internal guidance on what to buy. I'd like to have some idea of what we need before I talk to a vendor who would try to up-sell me.
Our workload
We currently have a data preparation task that is very parallel. We implement it in Python/pandas/sklearn plus the multiprocessing package on a set of servers with 40 Skylake cores / 80 threads and ~500 GB RAM each. We're able to complete the task in about 5 days by manually running it over 3 servers (each one working on a separate part of the dataset). The tasks are CPU-bound (100% utilization on all threads) and memory usage is usually low-ish (in the 100-200 GB range). Everything is scalable to a few thousand parallel processes, and some subtasks are even more parallelizable. A single chunk of data is in the 10-60 GB range (different keys can have very different sizes, and a single chunk of data has multiple things that can be done to it in parallel). All of this parallelism is currently very manual and clearly should be done using a real distributed approach. Ideally we would like to complete this task in under 12 hours.
Potential of using existing servers
The servers we use for this processing workload are often used on an individual basis. They each have dual V100s and do (single-node, multi-GPU) GPU-accelerated training for a big portion of their workload. They are operated bare metal (no VMs). We don't want to lose the ability to use the servers individually.
Looking at typical Spark requirements, they also have the issues that (1) there is only a 1 Gb Ethernet connection/switch between them and (2) their SSDs are configured into a giant 11 TB RAID 10, and we probably don't want to change what the file system looks like when the servers are used individually.
Is there a software solution that could transform our servers into a cluster and back on demand, or do we need to reformat everything into some underlying Hadoop cluster (or something else)? (A rough sketch of what the on-demand approach could look like appears after this question.)
Potential of buying new servers
With the target of completing the workload in 12 hours, how do we go about selecting the correct number of nodes/node size?
For compute nodes:
- How do we choose the number of nodes?
- CPU/RAM/storage?
- Networking between nodes (our DC provides 1 Gb switches, but we can buy custom)?
- Other considerations?
For storage nodes:
- Are they the same as compute nodes?
- If not, how do we choose what is appropriate (our raw dataset is actually small, <1 TB)?
- We extensively use a NAS as shared storage between the servers; are there special considerations for how this needs to work with a cluster?
I'd also like to understand how I can scale up/down these numbers while still being able to viably complete the parallel processing workload. This way I can get a range of quotes => generate a budget proposal for 2021 => buy servers ~Q1.
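As a rough illustration of the "cluster on demand" idea mentioned above: Spark's standalone mode can be brought up and torn down with the start-master / start-worker scripts shipped with Spark, and the existing per-chunk pandas/sklearn code could then be driven from PySpark along these lines. process_chunk, the master URL, and the resource settings below are hypothetical placeholders, not a recommendation:

```python
# Hypothetical sketch: drive the existing per-chunk pandas/sklearn job from
# PySpark on a standalone cluster that is started only when the batch runs.
# process_chunk(), the master URL, and the resource settings are placeholders.
from pyspark.sql import SparkSession

def process_chunk(chunk_key):
    # ...existing pandas/sklearn logic for one chunk, reading from the NAS...
    return chunk_key, "done"

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")      # standalone master started on demand
    .appName("data-prep")
    .config("spark.executor.memory", "64g")   # leave headroom for the GPU workloads
    .getOrCreate()
)

chunk_keys = ["key-001", "key-002"]  # placeholder; the real job has many keys
results = (
    spark.sparkContext
    .parallelize(chunk_keys, numSlices=len(chunk_keys))
    .map(process_chunk)
    .collect()
)

spark.stop()
```

Because standalone mode is just JVM processes on the existing machines, it can be stopped when the servers are needed individually, without reformatting the RAID 10 volumes into HDFS.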

Multiple instances of Cassandra on each node in the cluster

Is it possible to have a cluster in Cassandra where each of the servers runs multiple instances of Cassandra (each instance being part of the same cluster)?
I'm aware that if there's a single server in the cluster, then it's possible to run multiple instances of Cassandra on it, but is it also possible to have multiple such servers in the cluster? If yes, what would the configuration look like (listen address, ports, etc.)?
Even if it is possible, I understand that there might not be any performance benefit at all; I just wanted to know whether it's theoretically possible.
Yes, it's possible, and such a setup is often used for testing, for example with CCM, although it runs the extra instances on loopback addresses (127.0.0.2, ...). DataStax Enterprise also has a so-called Multi-Instance feature.
You need to configure your instances carefully, giving each one its own listen address (or distinct ports), its own data and commitlog directories, and its own JMX port. These days, using Docker is potentially the simpler way to implement it.
But why do you need to do it? Unless you have a really beefy machine with a lot of RAM and multiple SSDs, this won't bring you additional performance.
Yes, it is possible; I have even worked with 5 instances running on one server in a production cluster.
Trust me, it is still running, but the recurring issues I had were high GC all the time, dropped mutations, and high latency, so of course it is not good to have this kind of setup.
But to answer your question: yes, it is possible, and it can be run in production as well.

Need help in setting production environment for 300 million records elasticsearch

I am new to Elasticsearch. I am doing a PoC on setting up a production environment and need help with this.
1) What production parameters do we need to consider when setting up the environment?
2) What watermarks need to be set up for a production-ready environment?
There are two kinds of processes: live servers (optimized for latency, e.g. responses in 20 to 40 milliseconds) and batch process servers (optimized for throughput, e.g. in 1 hour one server will serve 200 transactions).
The live tier will have 8 dedicated server nodes. The batch tier will have 12 servers.
How do we distribute requests between the live servers and the batch nodes so that live server performance is not compromised while a batch run is in progress? How do we scale up the application without compromising performance?
Live server: 250K transactions/hour on a single server (we have 8 online servers).
Batch process: 1M transactions/hour on a single server (we have 8 batch servers).
What requirements do we need to meet to set up a production environment for the above scenario?
Firstly I would say definitely read the Elasticsearch: The Definitive Guide chapter on Production Deployment:
https://www.elastic.co/guide/en/elasticsearch/guide/current/deploy.html
There are far too many factors that contribute to needed settings and deployment topology to provide an exhaustive answer. Indexing and serving 300 million tweets is a lot different than 300 million science papers. Fulltext search is very different from numeric aggregations and analytics. Really the only way to know for sure is for you to test it (and monitor it!), using as close to real data and access patterns as you can.
Also, I'm a little bit confused between your "batch" and "live" servers, but you have several options for mixed workloads. For the best isolation, use two completely separate clusters. Alternatively, if you have separate indices for "batch" and "live" but want to be able to move an index from "live" to "batch", you can use Shard Allocation Filtering to control which servers each index's shards go on. This separates the data, and in many scenarios it is sufficient. The nice thing is that those rules can be changed dynamically, and ES will move your shards around to match.
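For instance, here is a minimal sketch of moving an index between tiers with the Python client, assuming a custom box_type node attribute has been set in each node's elasticsearch.yml; the endpoint and index name are placeholders:

```python
# Hypothetical sketch of Shard Allocation Filtering. Assumes elasticsearch.yml
# on each node defines a custom attribute, e.g. node.attr.box_type: live on the
# "live" nodes and node.attr.box_type: batch on the "batch" nodes.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# Pin a (hypothetical) index to the "live" nodes while it serves low-latency traffic.
es.indices.put_settings(
    index="transactions-current",
    body={"index.routing.allocation.require.box_type": "live"},
)

# Later, relocate the same index to the "batch" nodes; ES moves the shards itself.
es.indices.put_settings(
    index="transactions-current",
    body={"index.routing.allocation.require.box_type": "batch"},
)
```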
Depending on your workloads, the client (coordinator) role might have trouble if a "live" request hits a "batch" client node. In that case it can be useful to have one or more client-only (no data) nodes allocated to just serving the "live" requests (direct your "live" application to these nodes), and another set allocated to just serving the "batch" requests (direct your "batch" application to these nodes). Your options here also depend on the type of client.
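A correspondingly minimal sketch of keeping the two applications on separate coordinating-only nodes, with hypothetical hostnames:

```python
# Hypothetical sketch: each application connects only to its own coordinating-only
# (no-data) nodes, so batch traffic never coordinates on the live-facing nodes.
from elasticsearch import Elasticsearch

live_es = Elasticsearch(["http://live-coord-1:9200", "http://live-coord-2:9200"])
batch_es = Elasticsearch(["http://batch-coord-1:9200", "http://batch-coord-2:9200"])
```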
Unfortunately your question is pretty generic, and the answer has to be "test it with your specific workload"

What would be the fastest way to connect to a Cassandra cluster?

I have an HTTP server receiving new client connections all the time. Each time, I have to reconnect to the Cassandra cluster (each client is attached to a new process via a fork() call.)
I have two problems:
Speed: I'd like to get a usable connection as fast as possible;
Robustness: any one of the Cassandra nodes could be down.
I would imagine that the best mechanism will work with any cluster, not just Cassandra.
We use thrift to connect, although we may change that later. Either way, as far as network connections are concerned, we just do the regular socket(), bind(), and connect() call sequence.
Most of the code I have seen dealing with similar problems is very serial: it tries to connect to host 1, if it times out, try again with host 2, etc. until all hosts are exhausted.
I was thinking I could instead create one thread per connection attempt (with some sort of limit like 3, 4, or 5 parallel attempts--the number will depend on the size of the Cassandra cluster.) However, I am also thinking that if all connections succeed, I am probably going to waste a lot of time on the cluster side...
Is there a specific way this sort of a thing is generally resolved?
Most (if not all) of these features (failover, smart request routing, retries) are available in the DataStax drivers for Cassandra.
If possible, you should migrate away from Thrift.
If you really, really have to create your own (please consider the time needed to develop and maintain such a solution on top of a protocol that has been deprecated for a long while), you could take some inspiration from the DataStax drivers.
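For reference, a minimal sketch of what this looks like with the DataStax Python driver; the contact points, data center name, and keyspace are placeholders. The driver pools connections to the whole cluster, routes requests to live replicas, and retries according to policy, which covers the failover behaviour described above:

```python
# Hypothetical sketch using the DataStax Python driver (cassandra-driver).
# Contact points, local_dc and keyspace are placeholders.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy, RetryPolicy

profile = ExecutionProfile(
    # Route each request to a replica that owns the data, skipping down nodes.
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="DC1")),
    retry_policy=RetryPolicy(),
    request_timeout=10,
)

cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")
rows = session.execute("SELECT release_version FROM system.local")
print(rows.one().release_version)
cluster.shutdown()
```

One caveat for the fork()-per-client design: the driver's connections and background threads do not survive a fork, so the Cluster object should be created in the child process after forking.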

Collecting high volume DNS stats with libpcap

I am considering writing an application to monitor DNS requests for approximately 200,000 developer and test machines. Libpcap sounds like a good starting point, but before I dive in I was hoping for feedback.
This is what the application needs to do:
Inspect all DNS packets.
Keep aggregate statistics, including:
  - DNS name
  - DNS record type
  - Associated IP(s)
  - Requesting IP
  - Count
If the number of requesting IPs for one DNS name is > 10, then stop keeping the client IPs.
The stats would hopefully be kept in memory, with disk writes only when a new, "suspicious" DNS exchange occurs, or every few hours to flush the accumulated stats to disk for consumption by other processes.
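As a rough illustration of the bookkeeping only (not the performance target), here is a sketch using Python and Scapy; a production tool on a loaded 1 Gbit link would need libpcap (or AF_PACKET/PF_RING) from C, and the response-side "associated IPs" would be parsed from DNS answer records in the same way:

```python
# Hypothetical sketch of the aggregation logic. Scapy will not keep up with a
# busy 1 Gbit link; it is only meant to show the in-memory bookkeeping.
from collections import defaultdict
from scapy.all import sniff, IP, DNS, DNSQR

MAX_CLIENTS_PER_NAME = 10
stats = defaultdict(lambda: {"count": 0, "rtypes": set(), "clients": set()})

def handle_packet(pkt):
    if not pkt.haslayer(DNS) or not pkt.haslayer(DNSQR):
        return
    name = pkt[DNSQR].qname.decode(errors="replace")
    entry = stats[name]
    entry["count"] += 1
    entry["rtypes"].add(pkt[DNSQR].qtype)
    # Cap the number of recorded requesting IPs per name, per the requirement.
    if pkt.haslayer(IP) and len(entry["clients"]) < MAX_CLIENTS_PER_NAME:
        entry["clients"].add(pkt[IP].src)

sniff(filter="udp port 53", prn=handle_packet, store=False)
```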
My questions are:
1. Do any applications exist that can do this? The link will be either 100 Mbit or 1 Gbit.
2. Performance is the #1 consideration by a large margin. I have experience writing C for other one-off security applications, but I am not an expert. Do any tips come to mind?
3. How much of an effort would this be for a good C developer, in man-hours?
Thanks!
Jason
I suggest you try something like DNSCAP or even Snort for capturing DNS traffic.
BTW, I think this is more of a superuser.com question than a Stack Overflow one.
