I am trying to load around 2 million records into Cassandra through Spark. Spark has 4 executors and Cassandra has 4 nodes in the cluster, but it takes around 20 minutes to save all the data to Cassandra. Can anyone please help me make this a bit faster?
OK, so I can see several issues with your configuration:
Running Cassandra in a VM for a performance benchmark
Spark NOT co-located (so no data locality ...)
In general, installing Cassandra inside a virtual machine is not recommended for performance benchmarking; it is an anti-pattern. So your slow insertion rate is normal: you can't expect great performance while running in VMs ...
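Beyond the VM issue, the connector's write-side settings are usually the first knobs to check. Below is a hedged sketch of spark-cassandra-connector write-tuning properties; the property names come from the connector's reference configuration, but the host address and all values are placeholder assumptions, not recommendations for your cluster:

```python
# Sketch: spark-cassandra-connector write-tuning properties. These would be
# passed via SparkConf or spark-submit --conf; values below are illustrative.
write_tuning = {
    # contact point for the Cassandra cluster (placeholder address)
    "spark.cassandra.connection.host": "10.0.0.1",
    # batches each Spark task writes in parallel (connector default is low, ~5)
    "spark.cassandra.output.concurrent.writes": "16",
    # how rows are grouped into batches; "auto" lets the connector decide
    "spark.cassandra.output.batch.size.rows": "auto",
    # optional cap so a bulk load does not overwhelm the cluster
    "spark.cassandra.output.throughput_mb_per_sec": "50",
}

def as_submit_flags(conf):
    """Render the settings dict as spark-submit --conf flags."""
    return " ".join(f"--conf {k}={v}" for k, v in conf.items())
```

Raising concurrent writes helps most when the executors, not the Cassandra nodes, are the bottleneck; if the C* nodes are already saturated, the throughput cap is the safer lever.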
Related
I am running a project that requires loading millions of records into Cassandra.
I am using Kafka Connect with partitioning and 24 workers, and I only get around 4,000 rows per second.
I did a test with Pentaho PDI inserting straight into Cassandra with the JDBC driver and I get slightly fewer rows per second: 3,860 (avg).
The Cassandra cluster has 24 nodes. What is the expected insertion pace by default? How can I fine-tune the ingestion of big loads of data?
There is no magical "default" rate at which a Cassandra cluster can ingest data. One cluster can take 100K ops/sec, another can do 10M ops/sec. In theory, it can be limitless.
A cluster's throughput is determined by a lot of moving parts which include (but are NOT limited to):
hardware configuration
number of cores, type of CPU
amount of memory, type of RAM
disk bandwidth, disk configuration
network capacity/bandwidth
data model
client/driver configuration
access patterns
cluster topology
cluster size
The only way you can determine the throughput of your cluster is by doing your own test on as close to production loads as you can simulate. Cheers!
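To make "do your own test" concrete, here is a minimal sketch of a load-test harness. `measure_insert_rate` and the stub sink are hypothetical names; in a real test, `insert_batch` would be your Kafka Connect sink or a driver-level batch write against the actual cluster:

```python
import time

def measure_insert_rate(insert_batch, rows, batch_size=1000):
    """Time a bulk load and report rows/sec.

    insert_batch is whatever writes one batch to your cluster; here we
    only require that it accepts a list of rows.
    """
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        insert_batch(rows[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(rows) / elapsed if elapsed > 0 else float("inf")

# Stub writer so the harness runs anywhere; swap in a real Cassandra write.
sink = []
rate = measure_insert_rate(sink.extend, list(range(10_000)))
```

Run it with increasing batch sizes and worker counts and plot rows/sec; the knee of that curve is your cluster's practical ceiling, which no default setting can tell you in advance.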
We are running Spark/Hadoop on a different set of nodes than Cassandra. We have 10 Cassandra nodes and multiple Spark cores, but Cassandra is not running on Hadoop. Performance when fetching data from Cassandra through Spark (in YARN client mode) is not very good, and bulk data reads from HDFS are faster (6 minutes in Cassandra vs. 2 minutes in HDFS). Changing Spark-Cassandra parameters is not helping much either.
Will deploying Hadoop on top of Cassandra solve this issue and significantly improve read performance?
Without looking at your code: bulk reads in an analytics/Spark context are always going to be faster when going directly to files vs. reading from a database. The database offers other advantages such as schema enforcement, availability, distribution control, etc., but I think the performance difference you're seeing is normal.
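That said, if the Cassandra read path has to stay, the connector's read-side split and fetch settings control scan parallelism and are worth checking before restructuring the cluster. A sketch with property names from the spark-cassandra-connector (values here are illustrative assumptions, not tuned recommendations):

```python
# Sketch: spark-cassandra-connector read-tuning properties, passed via
# SparkConf or spark-submit --conf. Values are illustrative only.
read_tuning = {
    # target size of each Spark partition read from Cassandra;
    # smaller splits mean more tasks and more read parallelism
    "spark.cassandra.input.split.size_in_mb": "32",
    # rows fetched per round trip while paging through a partition
    "spark.cassandra.input.fetch.size_in_rows": "2000",
}
```

If halving the split size doesn't move the needle, the bottleneck is likely Cassandra's read path itself (disk seeks, compaction pressure), not Spark-side parallelism.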
After using and playing around with the Spark connector, I want to utilize it in the most efficient way for our batch processes.
Is the proper approach to set up a Spark worker on the same host as each Cassandra node? Does the Spark connector ensure data locality?
I am a bit concerned that a memory-intensive Spark worker will bring down the entire machine, and then I will lose a Cassandra node. So I'm confused whether I should place the workers on the Cassandra nodes or on separate machines (which means no data locality). What is the common way, and why?
This depends on your particular use case. Some things to be aware of:
1) CPU sharing: memory will not be shared (the heaps are separate) between Spark and Cassandra, but there is nothing stopping Spark executors from stealing time on C* CPU cores. This can lead to load and slowdowns in C* if the Spark process is very CPU-intensive; if it isn't, this isn't much of a problem.
2) Your network speed: if your network is very fast, there is much less value to locality than if you are on a slower network.
So you have to ask yourself: do you want a simpler setup (everything in one place), or a more complicated but better-isolated one?
For instance, DataStax (the company I work for) ships Spark running colocated with Cassandra by default, but we also offer the option of running it separately. Most of our users colocate, possibly because of this default; those who don't usually separate for easier scaling.
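If you do colocate, the usual mitigation for the CPU-sharing concern above is to cap Spark's share of each box so Cassandra keeps headroom. A minimal sketch, assuming a hypothetical 24-core / 128 GB node; the reservation numbers are illustrative assumptions, not a sizing recommendation:

```python
# Sketch: cap Spark's resources on a colocated node so Cassandra keeps
# CPU headroom and the OS keeps page cache. All numbers are illustrative.
NODE_CORES, NODE_RAM_GB = 24, 128
CASSANDRA_CORES, CASSANDRA_RAM_GB = 8, 40  # reserved for C* + OS page cache

spark_cores = NODE_CORES - CASSANDRA_CORES
spark_ram_gb = NODE_RAM_GB - CASSANDRA_RAM_GB

# Standard Spark settings that would enforce the cap per executor.
colocated_conf = {
    "spark.executor.cores": str(spark_cores),
    "spark.executor.memory": f"{spark_ram_gb}g",
}
```

Leaving a large slice of RAM unallocated to either JVM is deliberate: Cassandra's read performance depends heavily on the OS page cache, which a greedy Spark heap would otherwise evict.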
I would like to know about hardware limitations in cluster planning (in TB) specific to my use case. I have read a few threads and documents related to it, but some of the content seems to be over 5 years old, so I thought I'd give it another shot:
Use case: building a time-series Cassandra cluster with occasional bulk loads, gigabytes at a time, from data sources. However, end users will mostly be reading data from the cluster; updates or deletes on rows will be quite rare.
I have an initial hardware configuration with me to set up the Cassandra cluster:
2 × 12 cores
128 GB RAM
3.27 TB SAS HDD
This is the initial plan that I came up with:
When I now think over the setup, and after reading the post:
Should I further divide my nodes into smaller ones with less RAM, fewer vCPUs, and less HDD?
If yes, what would be a good fit with respect to my case?
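One way to sanity-check any split is simple arithmetic on usable data density per node. A hedged sketch, assuming replication factor 3 and the common rule of thumb of keeping roughly half the disk free for compaction headroom; both numbers are assumptions you should adjust to your replication settings and compaction strategy:

```python
import math

def nodes_needed(dataset_tb, disk_per_node_tb=3.27, rf=3, usable_fraction=0.5):
    """Rough node count: total replicated data / usable disk per node.

    usable_fraction reflects the rule of thumb of leaving ~50% disk free
    for compaction; rf is the replication factor (assumed 3 here).
    """
    usable_per_node = disk_per_node_tb * usable_fraction
    return math.ceil(dataset_tb * rf / usable_per_node)
```

For example, 10 TB of raw data at RF=3 on 3.27 TB disks would need on the order of 19 nodes under these assumptions, which is why smaller, more numerous nodes are often favored: each node carries less data, so repairs, bootstraps, and compactions finish faster.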
I currently have a cluster of 4 Spark nodes and 1 Solr node, and I use Cassandra as my database. I want to grow to 20 nodes in the medium term and 100 in the long term, but DataStax doesn't seem to support Mesos or YARN. How would I best manage the CPU, memory, and storage of all these nodes? Is Mesos even necessary with 20 or 100 nodes? So far I couldn't find any example of this using DataStax. I usually don't have jobs that need to be completed; rather, I'm running a continuous stream of data. That's why I'm even thinking of dropping DataStax, because in my opinion I couldn't manage this many nodes efficiently without YARN or Mesos, but maybe there is a better solution I haven't thought of? Also, I am using Python, so apparently YARN is my only option.
If you have any suggestions or best practice examples let me know.
Thanks!
If you want to run DSE with a supported Hadoop/YARN environment, you need to use BYOH (read about it HERE). With BYOH you can either run the internal Hadoop platform in DSE, or run a Cloudera or HDP platform with YARN and anything else that is available.