I am running a project that requires loading millions of records into Cassandra.
I am using Kafka Connect with partitioning and 24 workers, but I only get around 4,000 rows per second.
I did a test with Pentaho PDI inserting straight into Cassandra with the JDBC driver and got slightly fewer rows per second: 3,860 on average.
The Cassandra cluster has 24 nodes. What is the expected insertion rate by default? How can I fine-tune the ingestion of large data loads?
There is no magical "default" rate at which a Cassandra cluster can ingest data. One cluster can take 100K ops/sec, another can do 10M ops/sec. In theory, it can be limitless.
A cluster's throughput is determined by a lot of moving parts, which include (but are NOT limited to):
hardware configuration
number of cores, type of CPU
amount of memory, type of RAM
disk bandwidth, disk configuration
network capacity/bandwidth
data model
client/driver configuration
access patterns
cluster topology
cluster size
The only way you can determine the throughput of your cluster is by running your own tests against loads as close to production as you can simulate. Cheers!
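For what it's worth, here is a minimal sketch of such a test using the DataStax Python driver's execute_concurrent_with_args; the contact points, keyspace, table, and schema are placeholders, and the concurrency value is just a starting point to sweep upward while you watch the cluster's metrics:

```python
# Minimal client-side write-throughput probe (sketch, not a full benchmark).
# Contact points, keyspace, table, and schema below are hypothetical.
import time
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["10.0.0.1", "10.0.0.2"])      # adjust to your cluster
session = cluster.connect("my_keyspace")          # assumed keyspace

insert = session.prepare(
    "INSERT INTO my_table (id, payload) VALUES (?, ?)"   # assumed schema
)

rows = [(i, "x" * 200) for i in range(100_000)]   # synthetic ~200-byte payloads

start = time.time()
results = execute_concurrent_with_args(
    session, insert, rows,
    concurrency=200,                  # sweep this up and re-measure
    raise_on_first_error=False,
)
elapsed = time.time() - start

failures = sum(1 for ok, _ in results if not ok)
written = len(rows) - failures
print(f"{written} rows in {elapsed:.1f}s ({written / elapsed:.0f} rows/sec), "
      f"{failures} failures")

cluster.shutdown()
```

Run it from machines comparable to your real clients, and keep adding client processes until the cluster, not the client, becomes the bottleneck.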
Related
I am using DataStax Cassandra 4.8.16, with a cluster of 8 DCs and 5 nodes in each DC, running on VMs. For the last couple of weeks we have observed the performance issues below:
1) Increased drop counts on the VMs.
2) LOCAL_QUORUM not achieved for some write operations.
3) Frequent compactions of OpsCenter.rollup_state and system.hints are visible in OpsCenter.
I would appreciate any help finding the root cause of this.
The presence of dropped mutations means that the cluster is heavily overloaded. It could be an increase in the main load, which together with the load from OpsCenter overloads the system. You need to look into statistics about the number of requests, latencies, etc., per node and per table, to see where the increase happened. Please also check the I/O statistics on the machines (for example, with iostat): queue sizes, read/write latencies, etc.
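As a rough illustration (not part of the original answer), a small script like the one below can collect the dropped-message counters that nodetool tpstats prints on each node; the host list is a placeholder, and it assumes nodetool can reach each node's JMX port from the machine you run it on:

```python
# Sketch: gather non-zero "Dropped" message counters from `nodetool tpstats`.
# Host names are placeholders; assumes JMX is reachable from this machine.
import subprocess

HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # your Cassandra nodes

for host in HOSTS:
    out = subprocess.run(
        ["nodetool", "-h", host, "tpstats"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(f"--- {host} ---")
    # The tail of the tpstats output is a two-column "Message type / Dropped" table.
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > 0:
            print(f"{parts[0]:<20} dropped: {parts[1]}")
```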
It is also recommended to use a dedicated cluster to store OpsCenter metrics - it can be smaller, and it doesn't require an additional DSE license. As the OpsCenter documentation says:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs: it's usually not a recommended setup, but it heavily depends on the underlying hardware - number of CPUs, RAM, disk subsystem.
This may sound like an elementary question, but I am confused by all the literature.
I have a 3-node Cassandra cluster on 3.11.x, with 1 seed node.
We are testing brute-force write throughput in this setup with a single-threaded client sitting outside the cluster.
With nodetool and cqlsh at my disposal, how do I go about realistically assessing the following:
How much volume was processed by each node.
How much of the total time was consumed
a) by the actual flush + compaction at each node
b) by the cluster resolving the node/partition (hashing)
c) by network latency in shepherding the data to the node
The best way would be to monitor Cassandra's JMX port with a tool like OpenNMS or Zabbix and compare metrics such as mutations, compactions, etc. (there are hundreds of metrics there):
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsMonitoring.html
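If all you have is nodetool and cqlsh (as in the question), one rough way to answer "how much volume was processed by each node" is to record each node's local write count and latency from nodetool tablestats before and after the run and diff them; the sketch below assumes a hypothetical keyspace/table and that nodetool can reach every node:

```python
# Sketch: per-node local write count / latency for one table via `nodetool tablestats`.
# Run once before and once after the load test, then compare the numbers.
import subprocess

HOSTS = ["node1", "node2", "node3"]      # your 3 nodes (placeholders)
TABLE = "my_keyspace.my_table"           # hypothetical keyspace.table

for host in HOSTS:
    out = subprocess.run(
        ["nodetool", "-h", host, "tablestats", TABLE],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = {}
    for line in out.splitlines():
        line = line.strip()
        # Lines of interest look like "Local write count: 123456"
        # and "Local write latency: 0.041 ms".
        if line.startswith("Local write"):
            key, _, value = line.partition(":")
            stats[key] = value.strip()
    print(host, stats)
```

Flush and compaction activity shows up in the same JMX metrics (and in nodetool compactionstats), which is why a JMX-based monitor as suggested above gives the more complete picture.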
I would like to know about hardware limitations in cluster planning (in TB) specific to my use case. I have read a few threads and documents related to it, but some of the content seems to be over 5 years old. Thought I'd give it a shot again:
Use case: building a time-series Cassandra cluster with occasional bulk loading from data sources that are gigabytes in size. However, end users will mostly be reading data from the cluster; updates or deletes of rows will be quite rare.
I have an initial hardware configuration in mind to set up the Cassandra cluster:
2*12 Cores
128 GB RAM
HDD SAS 3.27 TB
This is the initial plan that I came up with:
When I now reconsider the setup, and after reading the post:
Should I further divide my nodes, with less RAM, fewer vCPUs, and smaller HDDs?
If yes, what would be a good fit for my case?
I am trying to load around 2 million records into Cassandra through Spark. Spark has 4 executors and Cassandra has 4 nodes in the cluster, but it takes around 20 minutes to save all the data to Cassandra. Can anyone please help me make this a bit faster?
OK, so I can see several issues with your configuration:
Running Cassandra in a VM for a performance benchmark
Spark NOT co-located with Cassandra (so no data locality ...)
In general, installing Cassandra inside a virtual machine is not recommended for performance benchmarks; it is an anti-pattern. So your slow insertion rate is normal - you can't expect better performance while running in VMs ...
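Beyond those two points, the write path itself is usually tuned through the spark-cassandra-connector's output options; the snippet below is a hedged sketch (the contact point, keyspace/table names, source path, and option values are illustrative, not recommendations):

```python
# Sketch: writing a DataFrame to Cassandra with tunable connector output options.
# All names and values below are placeholders to experiment with.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-bulk-load")
    .config("spark.cassandra.connection.host", "10.0.0.1")      # contact point (placeholder)
    .config("spark.cassandra.output.concurrent.writes", "10")   # async writes in flight per task
    .config("spark.cassandra.output.batch.size.bytes", "4096")  # batch size to experiment with
    .getOrCreate()
)

df = spark.read.parquet("/data/records.parquet")   # hypothetical source

(
    df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")   # hypothetical target
    .mode("append")
    .save()
)
```

Co-locating executors with the Cassandra nodes (or at least on the same network) and making sure the DataFrame has enough partitions to keep all four executors busy usually matters more than any single option.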
I suspect the answer is "it depends", but is there any general guidance about what kind of hardware to plan to use for Presto?
Since Presto uses a coordinator and a set of workers, and workers run with the data, I imagine the main issues will be having sufficient RAM for the coordinator, sufficient network bandwidth for partial results sent from workers to the coordinator, etc.
If you can supply some general thoughts on how to size for this appropriately, I'd love to hear them.
Most people are running Trino (formerly PrestoSQL) on the Hadoop nodes they already have. At Facebook we typically run Presto on a few nodes within the Hadoop cluster to spread out the network load.
Generally, I'd go with the industry standard ratios for a new cluster: 2 cores and 2-4 gig of memory for each disk, with 10 gigabit networking if you can afford it. After you have a few machines (4+), benchmark using your queries on your data. It should be obvious if you need to adjust the ratios.
In terms of sizing the hardware for a cluster from scratch some things to consider:
Total data size will determine the number of disks you will need. HDFS has a large overhead so you will need lots of disks.
The ratio of CPU speed to disks depends on the ratio between hot data (the data you are working with) and cold data (archive data). If you are just starting your data warehouse, you will need lots of CPUs since all the data will be new and hot. On the other hand, most physical disks can only deliver data so fast, so at some point more CPUs don't help.
The ratio of CPU speed to memory depends on the size of aggregations and joins you want to perform and the amount of (hot) data you want to cache. Currently, Presto requires the final aggregation results and the hash table for a join to fit in memory on a single machine (we're actively working on removing these restrictions). If you have larger amounts of memory, the OS will cache disk pages which will significantly improve the performance of queries.
In 2013 at Facebook we ran our Presto processes as follows:
We ran our JVMs with a 16 GB heap to leave most memory available for OS buffers
On the machines where we ran Presto, we didn't run MapReduce tasks.
Most of the Presto machines had 16 real cores and used processor affinity (eventually cgroups) to limit Presto to 12 cores (so the Hadoop data node process and other things could run easily).
Most of the servers were on a 10 gigabit networks, but we did have one large old crufty cluster using 1 gigabit (which worked fine).
We used the same configuration for the coordinator and the workers.
In recent times, we ran the following:
The machines had 256 GB of memory and we ran a 200 GB Java heap
Most of the machines had 24-32 real cores and Presto was allocated all cores.
The machines had only minimal local storage for logs, with all table data remote (in a proprietary distributed file system).
Most servers had a 25 gigabit network connection to a fabric network.
The coordinators and workers had similar configurations.
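As a rough illustration of how those numbers translate into a deployment, here is a hedged sketch of a worker's etc/config.properties in the style of the standard Presto/Trino deployment docs (the matching etc/jvm.config would carry the heap flag, e.g. -Xmx200G); the port, memory limits, and discovery URI are placeholders that must be sized to your own machines:

```
coordinator=false
http-server.http.port=8080
query.max-memory=1TB
query.max-memory-per-node=120GB
discovery.uri=http://coordinator.example.com:8080
```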