Cassandra data query problems with PDI 5.3

I have a Cassandra installation containing a table with no more than 110k records.
I'm having a lot of trouble querying the data using PDI 5.3 (the latest version): I constantly get out-of-memory errors on the Cassandra side.
Granted, the server Cassandra is installed on is not the greatest (4 GB RAM and only 2 cores), but I would still expect this simple task to work without issues.
In Cassandra's conf/cassandra-env.sh, I've configured:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="200M"
and now the maximum number of rows I can query is 80k.
The documentation suggests setting MAX_HEAP_SIZE to a quarter of the machine's RAM, but for me that meant 1G and only about 20k queryable rows.
I can tell how many rows I'm able to query by limiting the SELECT with the LIMIT keyword inside the Cassandra Input step in PDI.
Are there any other parameters I can tweak to get better performance? This is a development server; in production I'll be expecting queries over 1 million+ rows.
Server on which Cassandra is installed: Red Hat Enterprise Linux Server release 6.6 (Santiago)
Cassandra version: apache-cassandra-2.1.2
Edit: versions updated.

Sacrifice I/O for memory (since memory is killing you):
Lower the key / row caches if they are enabled (the key cache is on by default).
If you carry out lots of deletes, you can lower gc_grace_seconds to remove tombstones sooner (assuming you do many range scans, which you do if you fetch 80k rows, this can help); a sketch of that schema change follows below.
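A minimal sketch of that gc_grace_seconds change through the DataStax Java driver; the keyspace/table names and the 3600-second value are placeholders, and keep in mind that setting gc_grace_seconds below your hinted-handoff window risks resurrecting deleted data if a node stays down longer than that:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LowerGcGrace {
    public static void main(String[] args) {
        // Connect to one contact point; the driver discovers the rest of the cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Default is 864000 seconds (10 days); 3600 is only an example value.
            session.execute("ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 3600");
        }
    }
}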
Some other ideas:
Paginate (select rows 0-10k of the 80k, then 10k-20k, and so on); see the paging sketch after this list.
Check the sizes of your memtables; if they are too large, lower them.
Use tracing to verify what you are actually retrieving (tombstones can cause a lot of overhead).
This thread suggests lowering the commit log size, but the commit log was heavily revamped and moved off-heap in 2.1, so it shouldn't be such an issue anymore.
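PDI's Cassandra Input step issues the query in one shot, but if you can drop down to code, or just want to sanity-check the pagination idea, the DataStax Java driver pages results automatically with a configurable fetch size, so only a slice of the result set sits in memory at any time. A minimal sketch, with mykeyspace/mytable and the column names as made-up placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            // Fetch 5,000 rows per page instead of materializing all rows at once.
            Statement stmt = new SimpleStatement("SELECT id, payload FROM mytable")
                    .setFetchSize(5000);
            ResultSet rs = session.execute(stmt);
            long count = 0;
            for (Row row : rs) {   // the driver fetches the next page transparently
                count++;           // process the row here, e.g. row.getString("id")
            }
            System.out.println("Rows read: " + count);
        }
    }
}

Automatic paging needs the native protocol of Cassandra 2.0+, which the 2.1.2 server here supports; with Thrift-based access the rough equivalent is repeated LIMIT queries that restart from the last key seen.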

Related

Cassandra vs Cassandra+Ignite

(Single-node cluster) I've got a table with two columns, one of type 'text' and the other a 'blob'. I'm using DataStax's C++ driver to perform read/write requests against Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting less-than-desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, hoping for a significant performance improvement now that the data would be read from RAM instead of hard disks.
However, it turned out that after adding Ignite the performance dropped even further (by roughly 50%!).
Read throughput when using only Cassandra: 21,000 rows/second.
Read throughput with Cassandra + Ignite: 9,000 rows/second.
Since I am storing a C++ structure in Cassandra's blob, the Ignite API serializes/deserializes the data on every write/read. Is this the reason for the drop in performance (considering the size of the structure, i.e. 7 KB), or is this drop not expected at all and something is wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
Configurations for Ignite are the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. The throughput has now increased from 9,000 rows/second to 23,000 rows/second, but it is still not significantly better than Cassandra alone. I'm still hopeful to find more tweaks that will improve this further.
I've added more details about the configuration and client code on GitHub.
It looks like this benchmark does one get per key against Ignite and you didn't invoke loadCache beforehand. In that case, on each get, Ignite goes to Cassandra to fetch the value and only then stores it in the cache. So I'd recommend invoking loadCache before benchmarking, or at least testing gets on the same keys, so that Ignite has a chance to keep the keys in the cache. If you think you already have all the data in the caches, please also share the code where you write data to Ignite.
Also, you invoke grid.GetCache in each thread. It won't take a lot of time, but you should definitely avoid such calls inside the benchmark, where you are already measuring time.
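To illustrate both points, here is a minimal warm-up sketch with the Ignite Java API (the question uses the C++ driver, so treat this purely as the shape of the fix; the config path and cache name are made up): obtain the cache once, call loadCache to pull the data from the underlying Cassandra store into memory, and only then start timing the gets.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class WarmUpThenBenchmark {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("ignite-cassandra-config.xml")) {
            // Obtain the cache once, not inside every benchmark thread.
            IgniteCache<String, byte[]> cache = ignite.cache("blobCache");

            // Warm-up: bulk-load entries from the Cassandra store before timing
            // starts (null = no filter, load everything the store can provide).
            cache.loadCache(null);

            long totalBytes = 0;
            long start = System.nanoTime();
            for (String key : keysUnderTest()) {
                totalBytes += cache.get(key).length;   // served from memory after warm-up
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Read " + totalBytes + " bytes in " + elapsedMs + " ms");
        }
    }

    // Placeholder for whatever key set the benchmark iterates over.
    private static Iterable<String> keysUnderTest() {
        return java.util.Collections.emptyList();
    }
}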

Cassandra 3.0 latency statistic incorrect

I have set up a new Cassandra 3.3 cluster and use jvisualvm to monitor Cassandra read/write latency via MBeans (JMX metrics).
The reported read/write latency has been completely flat on all nodes for many weeks, even though the read/write request rate in that cluster varies normally (heavier on some days, lighter on others).
When I use jvisualvm to monitor a Cassandra 2.0 cluster, the read/write latency behaves normally: it moves with the read/write requests.
I wonder why the read/write latency statistics of Cassandra 3.0+ are always flat; I think the result is incorrect. (I have load tested Cassandra v3.3 and v3.7.)
[Updated]
I have found a bug related to this issue:
Cassandra metrics flat: https://issues.apache.org/jira/browse/CASSANDRA-11752
The ticket says the problem was fixed in C* versions 2.2.8, 3.0.9 and 3.8, but after testing version 3.0.9 the latency still shows a flat line.
Any ideas?
Thanks.
I have not found any metrics problem when using C* 3.3.
First, try monitoring with jconsole; do you see the same issue there?
Second, which attribute are you looking at, the average value or a percentile? Those values are always accumulated from the time the node came up, so it is common for a percentile value to stay the same, but that does not usually happen to the average value. Try restarting the Cassandra node and check the value again.
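For reference, the latency figures jvisualvm shows come from MBeans such as org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency, exposed through the Dropwizard JMX reporter, so attributes like Count, Mean and 99thPercentile should be available (verify the exact attribute names in jconsole first). A minimal sketch for polling them directly, assuming the default JMX port 7199 and no JMX authentication:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadLatencyProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName readLatency = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
            // Count is cumulative since the node started; Mean and the percentiles
            // are the values that should move with the workload if metrics are healthy.
            System.out.println("Count: " + mbs.getAttribute(readLatency, "Count"));
            System.out.println("Mean:  " + mbs.getAttribute(readLatency, "Mean"));
            System.out.println("p99:   " + mbs.getAttribute(readLatency, "99thPercentile"));
        }
    }
}

If the cumulative Count keeps growing while Mean and the percentiles never change, that matches the flat-line symptom described in CASSANDRA-11752.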

Datastax Cassandra repair service weird estimation and heavy load

I have a 5-node cluster with around 1 TB of data and vnodes enabled, running OpsCenter 5.12 and DSE 4.6.7. I would like to complete a full repair within 10 days, using the Repair Service in OpsCenter so that I don't put unnecessary load on the cluster.
The problem I'm facing is that the Repair Service puts too much load on the cluster and works too fast: its progress is around 30% (according to OpsCenter) in 24 hours. I even tried changing the time-to-completion to 40 days, without any difference.
Questions,
Can I trust the percent-complete number in OpsCenter?
The suggested number is something like 0.000006 days. Could that estimate be related to the problem?
Are there any settings/tweaks that could be useful to lower the load?
You can use OpsCenter as a guideline about where data is stored and what's going on in the cluster, but it's really more of a dashboard. The real 'tale of the tape' comes from nodetool, run from the command line on the server nodes, for example:
#shell> nodetool status
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID                               Rack
UN  10.xxx.xxx.xx  43.95 GB  256     33.3%  b1e56789-8a5f-48b0-9b76-e0ed451754d4  RAC1
What type of compaction are you using?
You've asked a sort of 'magic bullet' question, as there could be several factors in play. These are examples, but not an exhaustive list:
A. The size of the data and of whole rows in Cassandra (you can see these in the compacted-row-size entries of nodetool cfstats). Rows with a binary size larger than 16 MB will be seen as "ultra" wide rows, which might be an indicator that your data model needs a 'compound' or 'composite' row key.
B. The type of setup you have with respect to replication and network strategy.
C. The data entry point, i.e. how Cassandra gets its data. Are you using Python? PHP? What inputs the data? You can get funky behavior from a cluster with a bad PHP driver (for example).
D. Vnodes are good, but can be bad. What version of Cassandra are you running? You can find out via cqlsh: run cqlsh -3, then type 'show version'.
E. The type of compaction is a big killer. Are you using SizeTieredCompaction or LeveledCompaction? (A sketch for changing this follows after this list.)
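Checking or changing the compaction strategy is a per-table schema operation; here is a minimal sketch through the DataStax Java driver (the keyspace and table names are placeholders, and switching strategies kicks off recompaction, so plan for the extra load):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ChangeCompaction {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // The current strategy is visible via 'DESCRIBE TABLE my_table' in cqlsh.
            // Example: move a read-heavy table to leveled compaction.
            session.execute("ALTER TABLE my_keyspace.my_table "
                    + "WITH compaction = {'class': 'LeveledCompactionStrategy'}");
        }
    }
}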
Start by running 'nodetool cfstats' from the command line on the server that a given node is running on. The particular areas of interest (at this point) would be:
Compacted row minimum size:
Compacted row maximum size:
Values larger than X bytes here, on systems with Y amount of RAM, can be a significant problem. Be sure Cassandra has enough RAM and that the stack is tuned.
The default Cassandra performance configuration should normally be enough, so the next step would be to open a CQLSH session to the node with 'cqlsh -3 hostname' and issue 'describe keyspaces'. Take the keyspace name you are using, issue 'describe keyspace FOO', and look at your schema. Of particular interest are your primary keys: are you using "composite rowkeys" or a "composite primary key" (as described here: http://www.datastax.com/dev/blog/whats-new-in-cql-3-0)? If not, you probably need to, depending on the expected read/write load.
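For what a composite primary key looks like in CQL 3, here is a minimal sketch executed through the DataStax Java driver; the keyspace, table and column names are invented, and the point is the two-part partition key plus a clustering column, which keeps any single partition from growing into the ultra-wide rows mentioned above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CompositeKeyExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // (sensor_id, day) together form the partition key, so one sensor's data
            // is split into daily partitions; event_time orders the rows inside each.
            session.execute("CREATE TABLE my_keyspace.readings ("
                    + "  sensor_id text,"
                    + "  day text,"
                    + "  event_time timestamp,"
                    + "  payload blob,"
                    + "  PRIMARY KEY ((sensor_id, day), event_time))");
        }
    }
}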
Also check how your application layer inserts data into Cassandra. Using PHP? Python? Which drivers are being used? There are significant bugs in Cassandra versions < 1.2.10 with certain Thrift connectors such as the Java driver or the PHPcassa driver, so you might need to upgrade Cassandra and make some driver changes.
In addition to these steps also consider how your nodes were created.
Note that a migration from static nodes to virtual nodes (vnodes) has to be handled carefully. You can't simply switch configs on a node that has already been populated. You will want to check your initial_token: settings in /etc/cassandra/cassandra.yaml. The questions I ask myself here are: what initial tokens are set (there are no initial tokens for vnodes)? Were the tokens changed after the data was populated? For the static nodes I typically run, I calculate the tokens using a tool like http://www.geroba.com/cassandra/cassandra-token-calculator/, as I've run into complications with vnodes (though they are much more reliable now than before).
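For static (non-vnode) nodes, the initial_token values such a calculator produces are just evenly spaced points in the partitioner's token range. A quick sketch of that arithmetic for Murmur3Partitioner, whose range is -2^63 to 2^63 - 1 (the node count is an arbitrary example):

import java.math.BigInteger;

public class TokenCalculator {
    public static void main(String[] args) {
        int nodes = 5;                                      // example cluster size
        BigInteger range = BigInteger.valueOf(2).pow(64);   // size of the Murmur3 token space
        BigInteger start = BigInteger.valueOf(2).pow(63).negate();
        for (int i = 0; i < nodes; i++) {
            // Evenly spaced: token_i = -2^63 + i * (2^64 / nodes)
            BigInteger token = start.add(
                    range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodes)));
            System.out.println("node " + (i + 1) + ": initial_token = " + token);
        }
    }
}

For the older RandomPartitioner the range is 0 to 2^127 - 1 instead, which is why token calculators ask which partitioner you run; with vnodes you leave initial_token unset and set num_tokens instead.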

Cassandra 1.1 - Setup for 24GB RAM and Row Cache

I would like to tune Cassandra for a heavy-read scenario with skinny rows (5-50 columns). The idea is to use the row cache, and to enable the key cache just in case, for when the data is too large for the row cache.
I have dual Intel Xeon servers with 24 GB RAM (3 per ring in each of two data centers, giving 6 machines in total).
Those are changes that I've made to default configuration:
cassandra-env.sh
#JVM_OPTS="$JVM_OPTS -ea"
MAX_HEAP_SIZE="6G"
HEAP_NEWSIZE="500M"
cassandra.yaml
# do not persist caches to disk
key_cache_save_period: 0
row_cache_save_period: 0
key_cache_size_in_mb: 512
row_cache_size_in_mb: 14336
row_cache_provider: SerializingCacheProvider
The idea is to dedicate 6 GB to the Cassandra JVM, 0.5 GB to the key cache (taken out of the 6 GB heap), and 14 GB to the row cache as off-heap memory.
The OS still has 4 GB left, which should be enough, since only a single JVM process is running and its overhead should be at most 2 GB.
Is this setup optimal? Any hints?
Thanks,
Maciej
I'm using version 1.1.6.
SerializingCacheProvider stores the cached data in native (off-heap) memory.
That area is not inspected by the GC, so it does not add GC work.
Your row_cache_size_in_mb setting only covers the SerializingCache's reference objects.
Those references are held via FreeableMemory (that is how it works in 1.1.x; it changed after 1.2).
In other words, the actual cached values are not counted towards row_cache_size_in_mb.
As a result, if you want to tune row_cache_size_in_mb, start from a minimal size.
In my case, when I set it to 500 MB, each node ended up using about 2 GB of old gen (depending on the data set).
Run the heapspace_calculator and use the suggested value as an initial heap configuration. Monitor your heap usage with "nodetool info".
Try to use short column names and merge columns when possible.
This setup works just fine - I've tested it.
