How do I force a memtable flush for every write? - cassandra

I would like to flush the memtable to disk after every update/write operation (or in any case, as frequently as possible). My sole purpose is to stress test the underlying disk using a production-level database software.
It seems like memtable_cleanup_threshold is the way to go, but it's deprecated, is there another way to accomplish this? How about memtable_heap_space_in_mb and memtable_offheap_space_in_mb? I'm no Java Programmer, which one should I tune without compromising the rest of the functionalities?

You can definitely try setting both memtable_heap_space_in_mb and memtable_offheap_space_in_mb to a really low value.
Additionally, you can also configure commitlog_total_space_in_mb. If the occupied space goes above this property, it will cause more frequent flushes.
But since your goal is to stress-test the disk, my suggestion is to do the following:
Configure both data_file_directories and commitlog_directory to be mounted on the same disk.
Use NoSQLBench to stress test with heavy writes.
This way, you don't have to muck around with the memtable settings. Have a look at the NoSQLBench Beginner's Guide blog post for details. Cheers!

You also could just trigger a flush on the table by issuing nodetool flush or run the according JMX op after each write. However, Cassandra stores data distributed over many nodes and a flush is always a node bound operation. To find out which nodes you need to flush you would need to query the list of endpoints the written data is stored on (also available via JMX or with nodetool), otherwise you would need to flush all nodes.
While this is fine for testing purposes I would not recommend that for production.

Related

Cassandra vs Cassandra+Ignite

(Single Node Cluster)I've got a table having 2 columns, one is of 'text' type and the other is a 'blob'. I'm using Datastax's C++ driver to perform read/write requests in Cassandra.
The blob is storing a C++ structure.(Size: 7 KB).
Since I was getting lesser than desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, in the hope that there will be significant improvement in the performance as now the data will be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance dropped even more(roughly around 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since, I am storing a C++ structure in Cassandra's Blob, the Ignite API uses serialization/de-serialization while writing/reading the data. Is this the reason, for the drop in the performance(consider the size of the structure i.e. 7K) or is this drop not at all expected and maybe something's wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
Configurations for Ignite are same as given here.
I got significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. Now the throughput has increased from 9000 rows/second to 23000 rows/second. But still, it's not significantly superior to Cassandra. I'm still hopeful to find some more tweaks which will improve this further.
I've added some more details about the configurations and client code on github.
Looks like you do one get per each key in this benchmark for Ignite and you didn't invoke loadCache before it. In this case, on each get, Ignite will go to Cassandra to get value from it and only after it will store it in the cache. So, I'd recommend invoking loadCache before benchmarking, or, at least, test gets on the same keys, to give an opportunity to Ignite to store keys in the cache. If you think you already have all the data in caches, please share code where you write data to Ignite too.
Also, you invoke "grid.GetCache" in each thread - it won't take a lot of time, but you definitely should avoid such things inside benchmark, when you already measure time.

Cassandra High client read request latency compared to local read latency

We have a 20 nodes Cassandra cluster running a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS Page Cache). Our datamodel is quite simple (just a key/value) and all of our reads are performed with consistency level one (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace=“<keyspace>”,name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As it can be seen, the reads in each one of the nodes are really fast, the 99.9% is less than 0.5 ms. However, the client request latency is way higher, going above 4ms on the 99%. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar to each other, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
there are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
these can be anything that manifests in between the coordinator code to the replica's storage engine code in the read path.
examples can be:
read repairs (not directly related to a particular request, as is asynchronous to the read the triggered it, but can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (dynamic snitch simply not keeping up),
GC pauses,
look for metrics anomalies per host, overlaps with GC, and even try to capture traces for some of the slower requests and investigate if they're doing everything you expect from C* (eg token awareness).
well-tuned and spec'd clusters may also witness the dynamic snitch simply not being able to keep up and do its intended job. in such situations disabling the dynamic snitch can fix the high latencies for top-end read percentiles. see https://issues.apache.org/jira/browse/CASSANDRA-6908
be careful though, measure and confirm hypotheses, as mis-applied solutions easily have negative effects!
Even if using TokenAwarePolicy, the driver can't work with the policy when the driver doesn't know which partition key is.
If you are using simple statements, no routing information is provided. So you need additional information to the driver by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If TokenAware is perfectly working, CoordinatorReadLatency value is mostly same value with ReadLatency. You should check it too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
thanks for your reply and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full gcs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have it set to 10% chance), so when the system is handling 800k rec/sec, we have ~80k read repairs/sec happening in background.
It also seems that we’re asking too much for the 20 machines cluster. From the client point of view, latency is quite stable until 800k qps, after that it starts to spike a little bit, but still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!

Is Cassandra able to detect corrupted data that doesn't get used often?

Is there anything like HDFS's DataBlockScanner for Cassandra, ie. an automatic mechanism that checks for corrupted data that doesn't get read often?
No.
Cassandra doesn't do that automatically - it can guarantee consistency on
read or write via ConsistencyLevel on each query, and it can run active
(AntiEntropy) repairs. But active repairs must be scheduled (by human or
cron or by third party script like http://cassandra-reaper.io/), and to
be pedantic, repair only fixes consistency issue, there's some work to be
done to properly address/support fixing corrupted replicas (for example,
repair COULD send a bit flip from one node to all of the others)
http://mail-archives.apache.org/mod_mbox/cassandra-user/201709.mbox/%3CCABNXB2CWXqvR_zkGSGfw7DJjU+Emer3a0Dcv0YkHUtKBEc1e+A#mail.gmail.com%3E
Big data as a trash can. Cool.
Best bet is to use nodetool verify to compare the hash of the sstable with the content. Particularly with nodetool verify -e to walk the individual cells.
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsVerify.html

how to extend metrics of cassandra

there is a multicenter cassandra environment.
and set the consistency-level=local_quorum.
I want to know the latency of the local datacenter and other datacenter.
What I mean is when a data is writen successfully,and what's the time that other datacenter can have the replica.
this metrics is not exposed by cassandra.
Have found that writelatency is collected in org.apache.cassandra.service.StorageProxy.mutate method.
and want to add code in there to achieve collecting the latency of datacenter.
but the problem is cassandra write finish when the num of write consistency-level success,I cannot block the write transaction.
how to keep the sync between
write memtable and
write merics
have no idea going on.anybody have idea on achieving this,pls help a look.
There isnt anything available at this time directly, there is a ticket with patch available at CASSANDRA-11569 though.
There are some tricks you can try in mean time.
If you enable trace on a query (CL.ALL) you can check the trace events table to see the time that the mutations left coordinator and when it arrives on the replica.
You can make a local quorum write query, then a each quorum write query and track difference.
Theres a problem with some of these metrics in tracking mutations. Cassandra will piggyback all the writes in that DC over a single proxy write (vs coordinator actually sending to each node). If that node hits a GC it is likely to get a spike. Speculative retry will help with that affecting latency in an extreme case but then your not really tracking your raw cross dc latency. May want to just consider "ping".

fake fsync calls to improve performance [duplicate]

I am switching to PostgreSQL from SQLite for a typical Rails application.
The problem is that running specs became slow with PG.
On SQLite it took ~34 seconds, on PG it's ~76 seconds which is more than 2x slower.
So now I want to apply some techniques to bring the performance of the specs on par with SQLite with no code modifications (ideally just by setting the connection options, which is probably not possible).
Couple of obvious things from top of my head are:
RAM Disk (good setup with RSpec on OSX would be good to see)
Unlogged tables (can it be applied on the whole database so I don't have change all the scripts?)
As you may have understood I don't care about reliability and the rest (the DB is just a throwaway thingy here).
I need to get the most out of the PG and make it as fast as it can possibly be.
Best answer would ideally describe the tricks for doing just that, setup and the drawbacks of those tricks.
UPDATE: fsync = off + full_page_writes = off only decreased time to ~65 seconds (~-16 secs). Good start, but far from the target of 34.
UPDATE 2: I tried to use RAM disk but the performance gain was within an error margin. So doesn't seem to be worth it.
UPDATE 3:*
I found the biggest bottleneck and now my specs run as fast as the SQLite ones.
The issue was the database cleanup that did the truncation. Apparently SQLite is way too fast there.
To "fix" it I open a transaction before each test and roll it back at the end.
Some numbers for ~700 tests.
Truncation: SQLite - 34s, PG - 76s.
Transaction: SQLite - 17s, PG - 18s.
2x speed increase for SQLite.
4x speed increase for PG.
First, always use the latest version of PostgreSQL. Performance improvements are always coming, so you're probably wasting your time if you're tuning an old version. For example, PostgreSQL 9.2 significantly improves the speed of TRUNCATE and of course adds index-only scans. Even minor releases should always be followed; see the version policy.
Don'ts
Do NOT put a tablespace on a RAMdisk or other non-durable storage.
If you lose a tablespace the whole database may be damaged and hard to use without significant work. There's very little advantage to this compared to just using UNLOGGED tables and having lots of RAM for cache anyway.
If you truly want a ramdisk based system, initdb a whole new cluster on the ramdisk by initdbing a new PostgreSQL instance on the ramdisk, so you have a completely disposable PostgreSQL instance.
PostgreSQL server configuration
When testing, you can configure your server for non-durable but faster operation.
This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.
Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If and only if you're doing to turn fsync off can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.
For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data corruption risk. You do have a small window of loss of recent data if you enable async commit - but that's it.
If you have the option of slightly altering the DDL, you can also use UNLOGGED tables in Pg 9.1+ to completely avoid WAL logging and gain a real speed boost at the cost of the tables getting erased if the server crashes. There is no configuration option to make all tables unlogged, it must be set during CREATE TABLE. In addition to being good for testing this is handy if you have tables full of generated or unimportant data in a database that otherwise contains stuff you need to be safe.
Check your logs and see if you're getting warnings about too many checkpoints. If you are, you should increase your checkpoint_segments. You may also want to tune your checkpoint_completion_target to smooth writes out.
Tune shared_buffers to fit your workload. This is OS-dependent, depends on what else is going on with your machine, and requires some trial and error. The defaults are extremely conservative. You may need to increase the OS's maximum shared memory limit if you increase shared_buffers on PostgreSQL 9.2 and below; 9.3 and above changed how they use shared memory to avoid that.
If you're using a just a couple of connections that do lots of work, increase work_mem to give them more RAM to play with for sorts etc. Beware that too high a work_mem setting can cause out-of-memory problems because it's per-sort not per-connection so one query can have many nested sorts. You only really have to increase work_mem if you can see sorts spilling to disk in EXPLAIN or logged with the log_temp_files setting (recommended), but a higher value may also let Pg pick smarter plans.
As said by another poster here it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions is pretty pointless, you really want separate drives. This separation has much less benefit if you're running with fsync=off and almost none if you're using UNLOGGED tables.
Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, etc. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.
AFAIK there's no way to set an entire database or cluster as UNLOGGED. It'd be interesting to be able to do so. Consider asking on the PostgreSQL mailing list.
Host OS tuning
There's some tuning you can do at the operating system level, too. The main thing you might want to do is convince the operating system not to flush writes to disk aggressively, since you really don't care when/if they make it to disk.
In Linux you can control this with the virtual memory subsystem's dirty_* settings, like dirty_writeback_centisecs.
The only issue with tuning writeback settings to be too slack is that a flush by some other program may cause all PostgreSQL's accumulated buffers to be flushed too, causing big stalls while everything blocks on writes. You may be able to alleviate this by running PostgreSQL on a different file system, but some flushes may be device-level or whole-host-level not filesystem-level, so you can't rely on that.
This tuning really requires playing around with the settings to see what works best for your workload.
On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues with NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.
Query and workload tuning
These are things that DO require code changes; they may not suit you. Some are things you might be able to apply.
If you're not batching work into larger transactions, start. Lots of small transactions are expensive, so you should batch stuff whenever it's possible and practical to do so. If you're using async commit this is less important, but still highly recommended.
Whenever possible use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection then the temp table goes away, and no other connection can see the contents of a session's temp table(s).
If you're using PostgreSQL 9.1 or newer you can use UNLOGGED tables for data you can afford to lose, like session state. These are visible across different sessions and preserved between connections. They get truncated if the server shuts down uncleanly so they can't be used for anything you can't re-create, but they're great for caches, materialized views, state tables, etc.
In general, don't DELETE FROM blah;. Use TRUNCATE TABLE blah; instead; it's a lot quicker when you're dumping all rows in a table. Truncate many tables in one TRUNCATE call if you can. There's a caveat if you're doing lots of TRUNCATES of small tables over and over again, though; see: Postgresql Truncation speed
If you don't have indexes on foreign keys, DELETEs involving the primary keys referenced by those foreign keys will be horribly slow. Make sure to create such indexes if you ever expect to DELETE from the referenced table(s). Indexes are not required for TRUNCATE.
Don't create indexes you don't need. Each index has a maintenance cost. Try to use a minimal set of indexes and let bitmap index scans combine them rather than maintaining too many huge, expensive multi-column indexes. Where indexes are required, try to populate the table first, then create indexes at the end.
Hardware
Having enough RAM to hold the entire database is a huge win if you can manage it.
If you don't have enough RAM, the faster storage you can get the better. Even a cheap SSD makes a massive difference over spinning rust. Don't trust cheap SSDs for production though, they're often not crashsafe and might eat your data.
Learning
Greg Smith's book, PostgreSQL 9.0 High Performance remains relevant despite referring to a somewhat older version. It should be a useful reference.
Join the PostgreSQL general mailing list and follow it.
Reading:
Tuning your PostgreSQL server - PostgreSQL wiki
Number of database connections - PostgreSQL wiki
Use different disk layout:
different disk for $PGDATA
different disk for $PGDATA/pg_xlog
different disk for tem files (per database $PGDATA/base//pgsql_tmp) (see note about work_mem)
postgresql.conf tweaks:
shared_memory: 30% of available RAM but not more than 6 to 8GB. It seems to be better to have less shared memory (2GB - 4GB) for write intensive workloads
work_mem: mostly for select queries with sorts/aggregations. This is per connection setting and query can allocate that value multiple times. If data can't fit then disk is used (pgsql_tmp). Check "explain analyze" to see how much memory do you need
fsync and synchronous_commit: Default values are safe but If you can tolerate data lost then you can turn then off
random_page_cost: if you have SSD or fast RAID array you can lower this to 2.0 (RAID) or even lower (1.1) for SSD
checkpoint_segments: you can go higher 32 or 64 and change checkpoint_completion_target to 0.9. Lower value allows faster after-crash recovery

Resources