I’m using Cassandra 1.2.1, and I am using COPY command to insert millions of rows. Each row is 100 bytes long. The issue is that the insertion happens rather slowly, at rate of 1500 rows per second. We have 3 node cluster with 50 GB disk space each, and 4 GB RAM each. Cassandra process is running with max heap size of 1 GB. We are storing commit logs and data files on the same disk. What could be the cause of this behaviour? Any help would be appreciated.
Apparently as of now, they are not planning to improve the speed of COPY.
See https://issues.apache.org/jira/browse/CASSANDRA-4588
Related
I am importing data from Kafka server using VoltDB Kafka Importer, my current setup is in AWS and all the 9 nodes has following config
8 vCPUs
200 GB HDD
8 GB RAM
The import rate is 10,000 recs per second.
My problem is the cluster enters into in Read-only mode after importing 42 million records, even if other 7 nodes are using only 20-30% of memory.
My table and stored procedure are partitioned. I enabled auto snapshot as well with 1 Hr frequency.
I am expecting 144 million records
What changes I should do to my configuration and how I can move data from memory to disk.
Please help.
I have a cluster of 5 MemSQL Nodes and 2 Aggregators running this on 300 GB of memory.
Our size of row stores is fairly small less than 20 GB to be precise. The problem that we are facing is after few days the memsql memory is at 100% and at that time the application has to be restarted.
Is there a way to force the memory clean up
I have 4 cassandra nodes in cluster and one column family which has 10 columns where row cannot grow very wide (maybe max 1000 columns).
I have "peak" writes where I insert up to 500 000 records in 5-10 minutes range.
I use node.js driver: node-cassandra-cql
3 nodes are working fine but one node crashes every time on heavy writes.
All nodes currently have around 1.5 GB data size and problematic node has 1.9 GB data size.
All nodes have max heap space at 1GB (I have 4 GB RAM on machines so default cassandra config file calculated this amount of heap)
I use default cassandra configuration except I increased write/read timeouts.
Question: Does anyone knows what could be reason for this?
Is heap size really that small?
What and how to configure cassandra cluster for this use case (heavy writes at small time range and other time actually doing nothing or just small writes)
I haven't tried to increase heap size manually, first I would like to know if maybe there is something other to configure instead just increasing it.
We have a 32 node Cassandra cluster with around 100Gb per node using Murmur3 partitioner. It has time series data and we have build secondary indexes on two columns to perform range queries. Currently, the cluster is stable with all the data bulk loaded and all the secondary indexes rebuilt. The issue occurs when we are performing range queries using cql client or hector, just the query for count of rows takes a huge amount of time and it most cases causes nodes to fail due to memory issues. The nodes have 8gb memory, Cassandra MAX Heap is allotted to 4 GB. Has anyone else faced such an issue ? Is there a better way to do count queries ?
I've had similar issues and most often this can be solved by redesigning the schema bearing in mind the queries that you plan to execute against the data in Cassandra. For a timeseries data it is better to have wide tables with granularity depending on your queries. If your query requires data at a granularity of 1 hour, then it is best to have a wide table with all timestamped data points stored within a single row for every hour so you can get all the required data for 1 hour by reading just 1 row.
Since you say the data is bulk loaded, I am assuming that you may have put all the data into a single table which is why the get_count query is taking an enormous amount of time. We have a a cluster with 8GB RAM but have set the heap size to 3 GB because at 4GB, the RAM utilization is almost always at 8GB [full utilization].
We are experimenting a bit with Cassandra by trying out some of the long running test cases (stress test) and we are experiencing some memory issues on one node of the cluster at any given time (It could be any machine on the cluster!)
We am running DataStax Community with Cassandra 1.1.6 on a machine with Windows Server 2008 and 8 GB of RAM. Also, we have configured the Heap size to be 2GB as against the default value of 1GB.
A snippet from the logs:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid2440.hprof ...
Heap dump file created [1117876234 bytes in 11.713 secs]
ERROR 22:16:56,756 Exception in thread Thread[CompactionExecutor:399,1,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.FastByteArrayOutputStream.expand(FastByteArrayOutputStream.java:104)
at org.apache.cassandra.io.util.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:220)
at java.io.DataOutputStream.write(Unknown Source)
Any pointers/help to investigate and fix this.??
You're doing the right thing by long-running your load tests but in a production use case you wouldn't be writing data like this.
Your rows are probably growing too big to fit in RAM when it comes time to compact them. A compaction requires the entire row to fit in RAM.
There's also a hard limit of 2 billion columns per row but in reality you shouldn't ever let rows grow that wide. Bucket them by adding a day or server name or some other value common across your dataset to your row keys.
For a "write-often read-almost-never" workload you can have very wide rows but you shouldn't come close to the 2 billion column mark. Keep it in millions with bucketing.
For a write/read mixed workload where you're reading entire rows frequently even hundreds of columns may be too much.
If you treat Cassandra right you'll easily handle thousands of reads and writes per second per node. I'm seeing about 2.5k reads and writes concurrently per node on my main cluster.