Issues with Cassandra when performing a large number of writes - cassandra

We are trying to write a large number of records (upwards of 5 million at a time) into Cassandra. These are being read from tab delimited files and are being imported into Cassandra using executeAsync.
So far we have been using much smaller datasets (~330k records), which will be the more common case. Until recently, our script silently stopped importing at around 65k records. Since we upgraded the RAM from 2 GB to 4 GB, the number of records imported has doubled, but we are still not importing all of them.
This is an example of the process we are running at present:
$cluster = \Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session = $cluster->connect('example_data');
$statement = $session->prepare("INSERT INTO example_table (example_id, column_1, column_2, column_3, column_4, column_5, column_6) VALUES (uuid(), ?, ?, ?, ?, ?, ?)");
$futures = array();
$data = array();
foreach ($results as $row) {
    $data = array($row['column_1'], $row['column_2'], $row['column_3'], $row['column_4'], $row['column_5'], $row['column_6']);
    // Append each future; a plain assignment would overwrite the previous one.
    $futures[] = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
        'arguments' => $data
    )));
}
We suspect that this might be down to the heap running out of space:
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,105 ColumnFamilyStore.java:1153 - Flushing largest CFS(Keyspace='dev', ColumnFamily='example_data') to free up room. Used total: 0.67/0.00, live: 0.33/0.00, flushing: 0.33/0.00, this: 0.20/0.00
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,133 ColumnFamilyStore.java:854 - Enqueuing flush of example_data: 89516255 (33%) on-heap, 0 (0%) off-heap
The table we are inserting this data into is as follows:
CREATE TABLE example_data (
    example_id uuid PRIMARY KEY,
    column_1 int,
    column_2 varchar,
    column_3 int,
    column_4 varchar,
    column_5 int,
    column_6 int
);
CREATE INDEX column_5 ON example_data (column_5);
CREATE INDEX column_6 ON example_data (column_6);
We have attempted to use the batch method but believe it is not appropriate here as it causes the Cassandra process to run at a high level of CPU usage (~85%).
We are using the latest version of DSE/Cassandra available from the repository.
Cassandra 3.0.11.1564 | DSE 5.0.6

2 GB (and really 4 GB as well) is not even the recommended minimum for Cassandra in development or production. Running on it is possible, but it requires more tweaking since it's below what the defaults are tuned for. Even when tweaked, you shouldn't expect much performance before it starts having trouble keeping up (the errors you're getting) and you need to add more nodes.
https://docs.datastax.com/en/landing_page/doc/landing_page/planning/planningHardware.html
Production: 32 GB to 512 GB; the minimum is 8 GB for Cassandra only and 32 GB for DataStax Enterprise analytics and search nodes.
Development in non-loading testing environments: no less than 4 GB.
DSE Graph: 2 to 4 GB in addition to your particular combination of DSE Search or DSE Analytics. If you want a large dedicated graph cache, add more RAM.
Also, you're spamming writes with executeAsync and not applying any backpressure. Eventually you will overrun any system like that. You need to add some kind of throttling or feedback, or just use synchronous requests.
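One common way to apply backpressure is to cap the number of requests in flight. The following is a minimal Python sketch of that idea, not the PHP driver's API: `fake_write` is a hypothetical stand-in for an asynchronous call such as executeAsync, and the limit of 128 is an arbitrary assumption to tune, not a recommended value.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 128  # assumed cap; tune for your cluster


def fake_write(row):
    """Hypothetical stand-in for one asynchronous INSERT."""
    time.sleep(0.001)
    return row


def throttled_import(rows, write=fake_write, max_in_flight=MAX_IN_FLIGHT):
    """Submit writes, blocking the producer once max_in_flight are pending."""
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0, "ok": 0, "failed": 0}

    def task(row):
        succeeded = True
        try:
            write(row)
        except Exception:
            succeeded = False
        finally:
            with lock:
                state["in_flight"] -= 1
                state["ok" if succeeded else "failed"] += 1
            slots.release()  # free a slot so the producer can continue

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for row in rows:
            slots.acquire()  # backpressure: wait here until a slot is free
            with lock:
                state["in_flight"] += 1
                state["peak"] = max(state["peak"], state["in_flight"])
            pool.submit(task, row)
    # Leaving the "with" block waits for all submitted writes to finish.
    return state


print(throttled_import(range(1000), max_in_flight=8))
```

The producer never gets more than `max_in_flight` writes ahead of the database, so memory use stays bounded regardless of how many rows the input file contains.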

Related

DSE Cluster node disk gets filled

I have a 6-node cluster with 1000 GB of disk per node. The disk usage on one node unexpectedly reached 1000 GB. On analysis I found that only one keyspace is growing, and that only one table in this keyspace increased from 200 GB to 800 GB (in 24 hours), which suggests someone is running operations against this table only. How can I figure out which operations were performed on this node that led to this growth?
Are there any logs which can be looked at to see what operations were performed?
I would start with "nodetool tablehistograms" to confirm that the table has large partitions. Then I would go to the table directory and run "sstablemetadata" on some of the data files, looking for ones that display large partition sizes.
One trick you could do once you find sstables that have larger partitions is:
sstabledump <sstable> | grep -n "\"key\" :"
That shows the line number every time the key switches. The larger the gap between consecutive line numbers, the more rows the partition has.
Here is an example:
sstabledump aa-483-bti-Data.db | grep -n "\"key\" :"
4: "key" : [ "PROCESSING" ],
65605: "key" : [ "PENDING" ],
8552007: "key" : [ "COMPLETED" ],
As you can see, the gap between PENDING and COMPLETED is much larger than the gap between PROCESSING and PENDING (about 8.5M lines vs. about 65k lines). Since each gap is the length of the partition that starts it, the PROCESSING partition is small relative to PENDING. The only mystery is how large the COMPLETED one is, as there is no "ending" line. To get the total line count, run:
sstabledump aa-483-bti-Data.db | wc -l
16316029
Total line count is 16M. So COMPLETED goes from 8M to 16M, or about 8M lines. So the COMPLETED partition is large as well, about as large as the PENDING partition.
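The same arithmetic can be scripted. This is a rough Python sketch that takes the `grep -n` lines and the `wc -l` total from the example and returns the number of dump lines per partition; the function name is made up for illustration, and the last partition's count includes any closing lines at the end of the dump.

```python
import re


def partition_line_spans(grep_lines, total_lines):
    """Return (key, line_count) pairs from `grep -n` output of sstabledump."""
    starts = []
    for line in grep_lines:
        match = re.match(r'\s*(\d+):\s*"key"\s*:\s*\[\s*(.*?)\s*\]', line)
        if match:
            starts.append((int(match.group(1)), match.group(2)))
    spans = []
    for i, (start, key) in enumerate(starts):
        # A partition runs until the next key line, or to the end of the dump.
        end = starts[i + 1][0] if i + 1 < len(starts) else total_lines + 1
        spans.append((key, end - start))
    return spans


grep_output = [
    '4: "key" : [ "PROCESSING" ],',
    '65605: "key" : [ "PENDING" ],',
    '8552007: "key" : [ "COMPLETED" ],',
]
for key, lines in partition_line_spans(grep_output, 16316029):
    print(key, lines)
```

For the numbers in the example this gives roughly 65.6k lines for PROCESSING, 8.49M for PENDING and 7.76M for COMPLETED.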
Looking at sstablemetadata to see if that matches up with the output, I see that it does:
sstablemetadata aa-483-bti-Data.db
Partition Size:
Size (bytes) | Count (%) Histogram
943127 (921.0 kB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
129557750 (123.6 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
155469300 (148.3 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
I see two relatively large partitions and one small one. Bingo.
Maybe some of those can help you get to the bottom of your large partition(s).
With DataStax Enterprise, you should be able to turn on the Database Auditing feature. In fact, by configuring a logger class of CassandraAuditWriter, all activity gets written to the audit_log table in the dse_audit keyspace.
The data is organized by this PRIMARY KEY: ((date, node, day_partition), event_time), and has columns such as username, table_name, keyspace_name, operation and others.
Check out the DataStax docs on that for configuration and query options.
As for (open source) Apache Cassandra, we use Ericsson's Cassandra Audit plugin for this functionality. By adding in the project's JAR, and making a couple of adjustments to the cassandra.yaml file, you can view the audit.logs for records like:
15:42:41.655 - client:'10.0.110.1'|user:'flynn'|status:'ATTEMPT'|operation:'DELETE FROM ecks.ectbl WHERE partk = ?'

Cassandra 'bad state', cannot run compaction?

We are using 72% of the hard drive and have deleted about half of the rows (using cqlsh). However, Cassandra (3.9.0) cannot complete compaction and throws java.lang.RuntimeException: Not enough space for compaction, estimated sstables = 1, expected write size = 799429448428
Compaction triggers every 24 hrs and fails.
Note that this is a single-node setup and 'gc_grace_seconds=0';
Is there any other way to force removal of deleted data?
Thanks
You can try splitting the large SSTable (with sstablesplit) into smaller ones, so that compaction requires less space (this requires stopping the node).
http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableSplit.html
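To see why the compaction is refused, it can help to compare the "expected write size" from the exception with the free space on the data volume. A small Python sketch, assuming you pass the path of your own data directory (the check itself is a simplification of what Cassandra does internally):

```python
import shutil

# Taken from the exception message in the question.
EXPECTED_WRITE_BYTES = 799_429_448_428


def free_bytes(path):
    """Free space, in bytes, on the volume that holds `path`."""
    return shutil.disk_usage(path).free


def has_room_for_compaction(free, expected_write_bytes):
    """True when the volume could hold the rewritten SSTable."""
    return free >= expected_write_bytes


print(f"expected write size: {EXPECTED_WRITE_BYTES / 1e9:.1f} GB")
```

Calling `has_room_for_compaction(free_bytes(data_dir), EXPECTED_WRITE_BYTES)` with your data directory shows whether a rewrite of that single ~799 GB SSTable can fit. With 72% of the disk in use it cannot, which is why splitting into smaller SSTables helps.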

Cassandra sstables accumulating

I've been testing out Cassandra to store observations.
All "things" belong to one or more reporting groups:
CREATE TABLE observations (
    group_id int,
    actual_time timestamp, /* 1 second granularity */
    is_something int, /* 0/1 bool */
    thing_id int,
    data1 text, /* JSON encoded dict/hash */
    data2 text, /* JSON encoded dict/hash */
    PRIMARY KEY (group_id, actual_time, thing_id)
)
WITH compaction={'class': 'DateTieredCompactionStrategy',
                 'tombstone_threshold': '.01'}
AND gc_grace_seconds = 3600;
CREATE INDEX something_index ON observations (is_something);
All inserts are done with a TTL, and should expire 36 hours after
"actual_time". Something that is beyond our control is that duplicate
observations are sent to us. Some observations are sent in near real
time, others delayed by hours.
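Because delayed observations arrive with part of their 36 hours already used up, the TTL has to be computed per insert. A minimal Python sketch of that calculation (the function name is hypothetical, and this is not necessarily how the original program does it):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=36)


def ttl_seconds(actual_time, now):
    """Seconds until actual_time + 36 hours, or None if already expired.

    None means "skip the insert": in CQL a TTL of 0 disables expiry,
    so it must not be used for an observation that is already too old.
    """
    remaining = int((actual_time + RETENTION - now).total_seconds())
    return remaining if remaining > 0 else None


now = datetime(2015, 9, 1, 12, 0, tzinfo=timezone.utc)
print(ttl_seconds(now, now))                        # near real time
print(ttl_seconds(now - timedelta(hours=6), now))   # delayed by 6 hours
print(ttl_seconds(now - timedelta(hours=37), now))  # too old to store
```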
The "something_index" is an experiment to see if we can slice queries
on a boolean property without having to create separate tables, and
seems to work.
"data2" is not currently being written-- it is meant to be written by
a different process than writes "data1", but will be given the same
TTL (based on "actual_time").
Particulars:
Three nodes (EC2 m3.xlarge)
Datastax ami-ada2b6c4 (us-east-1) installed 8/26/2015
Cassandra 2.2.0
Inserts from Python program using "cql" module
(had to enable "thrift" RPC)
Running "nodetool repair -pr" on each node every three hours (staggered).
Inserting between 1 and 4 million rows per hour.
I'm seeing large numbers of data files:
$ ls *Data* | wc -l
42150
$ ls | wc -l
337201
Queries don't return expired entries,
but files older than 36 hours are not going away!
The large number of SSTables is probably caused by the frequent repairs you are running. Repair would normally only be run once a day or once a week, so I'm not sure why you are running repair every three hours. If you are worried about missing writes during short-term downtime, then you could set the hint window to three hours instead of running repair so frequently.
You might have a look at CASSANDRA-9644. This sounds like it is describing your situation. Also CASSANDRA-10253 might be of interest.
I'm not sure why your TTL isn't working to drop old SSTables. Are you setting the TTL on a whole row insert, or individual column updates? If you run sstable2json on a data file, I think you can see the TTL values.
Full disclosure: I have a love/hate relationship with DTCS. I manage a cluster with hundreds of terabytes of data in DTCS, and one of the things it does absolutely horribly is streaming of any kind. For that reason, I've recommended replacing it ( https://issues.apache.org/jira/browse/CASSANDRA-9666 ).
That said, it should mostly just work. However, there are parameters that come into play, such as timestamp_resolution, that can throw things off if set improperly.
Have you checked the sstable timestamps to ensure they match timestamp_resolution (default: microseconds)?
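A quick way to sanity-check this is to look at the magnitude of a write timestamp, for example the minimum or maximum timestamp reported for an SSTable. The Python sketch below guesses the unit by comparing the value with a reference time; it is a heuristic under the assumption that the timestamps are epoch-based, and the function name is made up for illustration.

```python
import math
from datetime import datetime, timezone


def guess_resolution(timestamp, reference=None):
    """Guess the unit of a positive epoch timestamp from its magnitude."""
    reference = reference or datetime.now(timezone.utc)
    seconds = reference.timestamp()
    candidates = {
        "SECONDS": seconds,
        "MILLISECONDS": seconds * 1e3,
        "MICROSECONDS": seconds * 1e6,
    }
    # Pick the unit whose value for the reference time is closest
    # to the timestamp on a logarithmic scale.
    return min(
        candidates,
        key=lambda unit: abs(math.log10(timestamp / candidates[unit])),
    )


ref = datetime(2015, 9, 1, tzinfo=timezone.utc)
print(guess_resolution(1441065600000000, ref))
print(guess_resolution(1441065600000, ref))
```

If the client writes millisecond timestamps while the table expects microseconds, the first call style would report MILLISECONDS for your data and point at the mismatch.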

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
    ID uuid,
    Time timestamp,
    Value double,
    Date timestamp,
    PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row keys/partitions is 10000 (i.e. n), of which 100 partitions are taken at a time (which means 100 * 1000 = 100000 key-value pairs), and of those 50000 key-value pairs are processed at a time. (This is because of select: fixed(1)/2, i.e. 50%.)
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
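The numbers in that message follow directly from the profile settings. A short Python sketch of the arithmetic, using only the values quoted above (the function name is made up for illustration):

```python
from fractions import Fraction


def batch_shape(partitions_per_batch, rows_per_partition, select_ratio):
    """Return (rows in the selected partitions, rows written per batch)."""
    total_rows = partitions_per_batch * rows_per_partition
    rows_written = int(total_rows * select_ratio)
    return total_rows, rows_written


# partitions: fixed(100), 1000 rows per partition, select: fixed(1)/2
total, written = batch_shape(100, 1000, Fraction(1, 2))
print(total, written)

# n=10000 partitions with 1000 rows each
print(10000 * 1000)
```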
The results that I get are the following for consecutive runs with the same configuration as above:
Run  Total_ops  Op_rate  Partition_rate  Row_Rate  Time
1    56         19       1885            943246    3.0
2    46         46       4648            2325498   1.0
3    27         30       2982            1489870   0.9
4    59         19       1932            966034    3.1
5    100        17       1730            865182    5.8
Now what I need to understand are as follows:
Which of these metrics is the throughput, i.e. the number of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it's the Row_rate, can I safely conclude that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why does Total_ops vary so drastically between runs? Does the number of threads have anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of partitions C* had to construct. A partition is the data structure that holds and orders data in Cassandra; data with the same partition key ends up on the same node. The Partition Rate equals the number of unique partition key values that were inserted in the time window. For your table, these are the unique values of (ID, Date).
Op Rate is the number of actual CQL operations that had to be done. With your settings, the tool runs unlogged batches to insert the data. Each batch contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
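These relationships can be checked against the results table in the question. The Python sketch below divides the reported rates for each run; the figures are copied from that table and nothing else is assumed.

```python
# (op_rate, partition_rate, row_rate) for runs 1 to 5
RUNS = [
    (19, 1885, 943246),
    (46, 4648, 2325498),
    (30, 2982, 1489870),
    (19, 1932, 966034),
    (17, 1730, 865182),
]


def ratios(op_rate, partition_rate, row_rate):
    """Return (partitions per operation, rows per partition)."""
    return partition_rate / op_rate, row_rate / partition_rate


for run in RUNS:
    per_op, per_partition = ratios(*run)
    print(f"{per_op:.1f} partitions/op, {per_partition:.1f} rows/partition")
```

Every run comes out at about 100 partitions per operation and about 500 rows per partition, which matches partitions: fixed(100) and half of the 1000 rows selected per partition.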
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keep it at 1 depending on your actual database usage. This should provide a more realistic scenario. Also it's important to run much longer than just 100 total operations to really get a sense of your system's capabilities. Some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.

Severe degradation in Cassandra Write performance with continuous streaming data over time

I notice a severe degradation in Cassandra write performance with continuous writes over time.
I am inserting time series data with time stamp (T) as the column name in a wide column that stores 24 hours worth of data in a single row.
Streaming data is written from data generator (4 instances, each with 256 threads) inserting data into multiple rows in parallel.
Additionally, data is also inserted into a column family that has indexes over DateType and UUIDType.
CF1:
Col1 | Col2 | Col3(DateType) | Col4(UUIDType) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V3) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
The no. of data points inserted/sec decreases over time until no further inserts are possible. The initial performance is of the order of 60000 ops/sec for ~6-8 hours and then it gradually tapers down to 0 ops/sec. Restarting the DataStax_Cassandra_Community_Server on all nodes helps restore the original throughput, but the behaviour is observed again after a few hours.
OS: Windows Server 2008
No.of nodes: 5
Cassandra version: DataStax Community 1.2.3
RAM: 8GB
HeapSize: 3GB
Garbage collector: default settings [ParNewGC]
I also notice a phenomenal increase in the no. of Pending write requests as reported by the OpsCenter (~of magnitude 200,000) when the performance begins to degrade.
I fail to understand what is preventing the write operations from completing and why they pile up over time. I do not see anything suspicious in the Cassandra logs.
Do the OS settings have anything to do with this?
Any suggestions to probe this issue further?
Do you see an increase in pending compactions (nodetool compactionstats)? Or are you seeing blocked flush writers (nodetool tpstats)? I'm guessing you're writing data to Cassandra faster than it can be consumed.
Cassandra won't block on writes, but that doesn't mean that you won't see an increase in the amount of heap used. Pending writes have overhead, as do blocked memtables. In addition, each SSTable has some memory overhead. If compactions fall behind this is magnified. At some point you probably don't have enough headroom in your heap to allocate the objects required for a single write, and you end up spending all your time waiting for an allocation that the GC can't provide.
With increased total capacity, or more IO on the machines consuming the data you would be able to sustain this write rate, but everything indicates you don't have enough capacity to sustain that load over time.
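A deliberately simple model shows how even a small shortfall adds up. The Python sketch below assumes constant write and drain rates; the drain rate of 59,990 ops/sec is a hypothetical number chosen for illustration, not a measurement from the cluster.

```python
def seconds_until_backlog(backlog, write_rate, drain_rate):
    """Seconds until `backlog` writes are pending, or None if the node keeps up."""
    deficit = write_rate - drain_rate
    if deficit <= 0:
        return None  # writes are consumed at least as fast as they arrive
    return backlog / deficit


# 60,000 ops/sec incoming (from the question), 59,990 ops/sec drained (assumed)
seconds = seconds_until_backlog(200_000, 60_000, 59_990)
print(f"{seconds / 3600:.1f} hours to reach 200,000 pending writes")
```

Under these assumed rates a shortfall of only 10 ops/sec is enough to build a backlog of 200,000 pending writes within a few hours.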
Bringing your write timeout in line with the new default in 2.0 (of 2s instead of 10s) will help with your write backlog by allowing load shedding to kick in faster: https://issues.apache.org/jira/browse/CASSANDRA-6059
