I'm asking this question because I would like to understand how to run RethinkDB better: what kind of hardware it should be running on, what kind of filesystem it should be running on, and which other system configurations would maximize its throughput.
I'm trying to fill a table as fast as I can with documents that look like {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process doing this and I can insert on average 10k documents per second, which is pretty good.
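Roughly, each client loop looks like the following sketch (shown here with the Python driver rather than my actual Node.js code, and with placeholder database/table names):

```python
# Rough sketch of the benchmark loop; "test"/"docs" are placeholder names.
import random
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015)

counter = 0
for _ in range(1000):                      # 1000 batches of 200 documents
    batch = [{"n": counter + i, "rand": random.random()} for i in range(200)]
    counter += 200
    # durability="soft" acknowledges the write before it hits disk
    r.db("test").table("docs").insert(batch, durability="soft").run(conn)
```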
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of activity there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try mounting it with noatime to see if that helps). The RethinkDB configuration is all defaults, and the inserts go to a table that has only one shard and one replica. Please feel free to ask anything else relevant that I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
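In recent RethinkDB versions you can reshard an existing table from a client driver; in older versions you would do it from the web UI or when creating the table. A minimal sketch with the Python driver, using the same placeholder names as above:

```python
# Split the table into 32 shards on a single server; replicas stays at 1.
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015)
r.db("test").table("docs").reconfigure(shards=32, replicas=1).run(conn)
```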
You can try not batching writes and instead writing one document at a time (which generally approximates real-world use cases a little better). Then start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
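A hedged sketch of that setup, using Python processes as the parallel clients (the process count, connection details, and table names are all placeholders to tune):

```python
# Many parallel clients, each inserting one small document at a time.
import random
from multiprocessing import Process
from rethinkdb import RethinkDB

def writer(num_docs=10000):
    r = RethinkDB()
    conn = r.connect(host="localhost", port=28015)
    for i in range(num_docs):
        r.db("test").table("docs").insert(
            {"n": i, "rand": random.random()}, durability="soft"
        ).run(conn)

if __name__ == "__main__":
    procs = [Process(target=writer) for _ in range(200)]   # "hundreds" of clients
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```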
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect. A configuration sketch combining this with the cache size is shown after this list.
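For reference, a configuration-file sketch combining the last two suggestions might look like this (the values are illustrative, not a recommendation for every workload, and the file path depends on how you installed RethinkDB):

```
# rethinkdb instance configuration (illustrative values)
cache-size=20000    # MB; generous on a 32 GB machine
io-threads=128      # default is 64
```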
What is the difference between the Scylla read path and the Cassandra read path? When I stress-test Cassandra and Scylla, Scylla's read performance is about 5 times worse than Cassandra's on 16 cores and a normal HDD.
I expected better read performance on Scylla compared to Cassandra on a normal HDD, because my company doesn't provide SSDs.
Can someone please confirm whether it is possible to achieve better read performance using a normal HDD or not?
If yes, what changes are required in the Scylla config? Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because each read from disk requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core machine can do orders of magnitude more CPU work than 1000 requests per second requires, you'll likely see idle CPU in both the Scylla and the Cassandra case. So Scylla's main benefit, using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) to be roughly the same.
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client mis-configuration issues raised in other responses:
If you have a small amount of data that fits in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. Scylla works differently and has a row cache that only caches the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance - if in one setup you have more sstables to read, it can increase the number of reads and lower the performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction was run last). You can check if this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
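If you want to try the LCS suggestion, the change is a one-line CQL statement and works the same against Scylla or Cassandra. A minimal sketch with the Python driver, using placeholder keyspace/table names:

```python
# Switch an existing table to LeveledCompactionStrategy.
# "my_ks"/"my_table" are placeholders for your keyspace and table.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute(
    "ALTER TABLE my_ks.my_table "
    "WITH compaction = {'class': 'LeveledCompactionStrategy'}"
)
```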
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: as you may have heard, Scylla subdivides each node into shards (each shard is a single CPU). This is fantastic for CPU-bound work, but can be worse for scanning tables that aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences (or something else) is causing the performance of your use case to be lower in Scylla, but please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SSDs, we have in the past measured more than a million random-access read requests per second on a single node. HDDs cannot come anywhere close. If you really need optimum performance or performance per dollar, SSDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla cluster:
The number of concurrent connections from your clients/loaders may not be high enough, or you may not be using a sufficient number of loaders. In that case, some shards will be doing all the work while others are mostly idle. You want to keep your parallelism high.
Scylla likes to have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf).
What's the size of your dataset? Are you reading a large number of partitions or just a few? You might be hitting a hot-partition situation.
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
@Sateesh, I want to add to the answer by @TomerSan that both Cassandra and ScyllaDB utilize the same disk storage architecture (LSM). That means they have roughly the same disk access patterns, because the algorithms are largely the same. LSM trees were built with the idea in mind that instant in-place updates are not necessary. An LSM tree consists of immutable data buckets that are large contiguous pieces of data on disk. That means less random IO and more sequential IO, for which HDDs work great (not counting the parallelism exploited by modern database implementations).
All of the above means that the difference you see is not caused by a difference in how those databases use the disk. It must be related to configuration differences and what happens underneath. Maybe ScyllaDB tries to utilize more parallelism or does compaction more aggressively. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
Both databases use an LSM tree, but Scylla has a thread-per-core architecture on top, plus we use O_DIRECT while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk, and thus scylla_setup automatically runs a benchmark to tune it. Check its output in io.conf.
There are far more things to review, better to send your data to the mailing list. In general, Scylla should perform better in this case as well but your disk is likely to be the bottleneck in both cases.
As a summary I would say ScyllaDB and Cassandra have the same read/write path:
memtable, commitlog, sstable.
However, the implementation is very different:
- Cassandra relies on the OS for low-level IO and networking (as most DBMSs do)
- ScyllaDB relies on its own library (Seastar) to handle IO and networking at a low level, independently of the OS page cache etc. This is why it can provide features such as workload scheduling within the same cluster, which would be very hard to implement in Cassandra.
I am currently working on a framework for the analysis application of a large-scale experiment. The experiment involves about 40 instruments, each generating about 1 GB/s of data with nanosecond timestamps. The data is intended to be analysed in time chunks.
For the implementation, I would like to know how big such a "chunk" (a.k.a. batch) can get before Flink or Spark stops processing the data. I think it goes without saying that I intend to collect the processed data afterwards.
For live data analysis
In general, there is no hard limit on how much data you can process with the systems. It all depends on how many nodes you have and what kind of a query you have.
Since it sounds like you would mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then the question becomes how big your time chunks are and how complex the aggregations get. Assuming that your aggregation requires all data of a window to be present, the system needs to hold 1 GB per second. So if your window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you instead only need to calculate small values (like sums or averages), main memory shouldn't become an issue.
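To make the aggregation scenario concrete, here is a rough PySpark sketch of a per-instrument average over one-hour windows; the column names, input path, and output path are all assumptions:

```python
# Per-instrument aggregation over one-hour windows (illustrative sketch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("instrument-aggregation").getOrCreate()

df = spark.read.parquet("/data/experiment/")          # placeholder input path

hourly = (
    df.groupBy(window(col("ts"), "1 hour"), col("instrument"))
      .agg(avg(col("value")).alias("avg_value"))
)
hourly.write.parquet("/data/experiment-hourly/")      # placeholder output path
```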
For old data analysis
When analyzing old data, the system can do batch processing and has many more options for handling the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and which one suits you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.
My company and I have purchased about $80,000 in hardware to accomplish a goal. We have about 22,000 writes/sec across multiple application databases for our Cassandra cluster. We built 2 nodes with dual 3.5 GHz Xeons, 128 GB RAM, and an Areca 1883, all top-of-the-line, high-throughput parts. We also have an SSD RAID 10 array for the commitlog/saved_caches so that is not delayed.
The issue we have is the amount of data. In about 4 days we collected 1.8 TB of data. We have no intention of ever releasing data. We then got a JBOD enclosure and put in 6 TB platter drives, 10 per machine, 20 total, for about 110 TB of space. We run fine with single replication; the issue is when we move to double replication.
We would love to add more nodes, and we know that is the correct way, but at $20,000 a node it's costly. My question is: if our write speed is the issue, is it true to say that adding 10 more drives in each machine should allow for double the write speed?
Does anyone have something similar going on, and do you have some tweaks you made to cassandra.yaml?
We did run htop for a while when we were in double replication, and the CPU did seem to get a bit intensive (it read 24% average, but it looks pretty close to maxed). All 128 GB of RAM is being used.
ANY thoughts on the matter will be considered and investigated.
Thanks,
Ken
It is not generally true that you can increase write speed simply by increasing disks, unless you are sure that you are IO bound. Cassandra batches writes (mutations go to the commitlog first, then a table in RAM, then are batch written to sstables when that table reaches a certain threshold - linear writes, so it's generally fast, even on spinning disks). At some point, you will max out the commitlog drive, fill the memtable faster than you can flush, or simply get to the point where GC can't keep up.
There are fairly large users of Cassandra who run multiple Cassandra instances on a given server simply to get the benefits of additional nodes without "just" adding disk. By running two JVMs, you can mitigate the pause times of a single node, and still take advantage of your (oversized) hardware. This is easiest if you can assign multiple IPs to your individual servers, but running on different ports also works. This is fairly atypical, and you'll need to pay close attention to your configs to avoid stepping on each other, but it will work, and will make more efficient use of your hardware than simply running huge nodes.
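If you go that route, the second instance needs its own addresses/ports and its own data directories. An illustrative cassandra.yaml excerpt (values are examples only; you'd also give each instance its own JMX port in cassandra-env.sh):

```
# cassandra.yaml for a second instance on the same host (illustrative)
cluster_name: 'MyCluster'            # must match the first instance
listen_address: 192.168.1.11         # second IP assigned to the host
rpc_address: 192.168.1.11
native_transport_port: 9042          # change if both instances share one IP
storage_port: 7000                   # likewise
data_file_directories:
    - /data2/cassandra/data          # separate data/commitlog per instance
commitlog_directory: /data2/cassandra/commitlog
saved_caches_directory: /data2/cassandra/saved_caches
```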
If I'm reading that correctly, you only have 2 nodes total?
If you only have 2 nodes I doubt that disk bandwidth would be the problem. Cassandra is usually CPU limited more than anything else.
Writes generally go to memory, so the disk only comes into play when memtables are flushed to disk as SSTables. Now, the thing that will probably kill your performance is when those SSTables need to be compacted. When compaction starts happening, guess which part of the system it will stress: yup, the CPU.
You will also have a problem running repairs with huge disks like that. Usually I find that sustained transaction throughput is limited by compactions and repairs more than raw write performance.
With two nodes and single replication, you'd be splitting the load between the two nodes, with half going to one and half going to the other. If you set the replication factor to two, now every write would be going to both nodes, which is like going back in time to having a single machine database.
So I think it was a bad call to buy a small number of high end machines. You would have had much better performance with more machines where each machine was less expensive. You need more machines to spread out the load and get more CPUs into the equation.
Also you mention a disk enclosure. I hope you are not trying to use network storage with Cassandra. It needs the disks to be local.
I am switching to PostgreSQL from SQLite for a typical Rails application.
The problem is that running specs became slow with PG.
On SQLite it took ~34 seconds, on PG it's ~76 seconds which is more than 2x slower.
So now I want to apply some techniques to bring the performance of the specs on par with SQLite with no code modifications (ideally just by setting the connection options, which is probably not possible).
A couple of obvious things off the top of my head are:
RAM disk (a good setup with RSpec on OS X would be good to see)
Unlogged tables (can they be applied to the whole database so I don't have to change all the scripts?)
As you may have understood I don't care about reliability and the rest (the DB is just a throwaway thingy here).
I need to get the most out of the PG and make it as fast as it can possibly be.
Best answer would ideally describe the tricks for doing just that, setup and the drawbacks of those tricks.
UPDATE: fsync = off + full_page_writes = off only decreased time to ~65 seconds (~-16 secs). Good start, but far from the target of 34.
UPDATE 2: I tried to use RAM disk but the performance gain was within an error margin. So doesn't seem to be worth it.
UPDATE 3:
I found the biggest bottleneck and now my specs run as fast as the SQLite ones.
The issue was the database cleanup that did the truncation. Apparently SQLite is just much faster there.
To "fix" it I open a transaction before each test and roll it back at the end.
Some numbers for ~700 tests.
Truncation: SQLite - 34s, PG - 76s.
Transaction: SQLite - 17s, PG - 18s.
2x speed increase for SQLite.
4x speed increase for PG.
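For anyone curious what the transaction-per-test trick looks like outside Rails (where DatabaseCleaner's :transaction strategy or transactional fixtures typically do it for you), here is a language-neutral sketch in Python with psycopg2; the connection string and test callback are placeholders:

```python
# Wrap each test in a transaction and roll it back instead of truncating.
import psycopg2

conn = psycopg2.connect("dbname=app_test user=app")   # placeholder DSN

def run_test(test_fn):
    cur = conn.cursor()        # psycopg2 starts a transaction implicitly
    try:
        test_fn(cur)           # the test inserts/updates whatever it likes
    finally:
        conn.rollback()        # discard everything; no TRUNCATE needed
        cur.close()
```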
First, always use the latest version of PostgreSQL. Performance improvements are always coming, so you're probably wasting your time if you're tuning an old version. For example, PostgreSQL 9.2 significantly improves the speed of TRUNCATE and of course adds index-only scans. Even minor releases should always be followed; see the version policy.
Don'ts
Do NOT put a tablespace on a RAMdisk or other non-durable storage.
If you lose a tablespace the whole database may be damaged and hard to use without significant work. There's very little advantage to this compared to just using UNLOGGED tables and having lots of RAM for cache anyway.
If you truly want a ramdisk-based system, initdb a whole new cluster on the ramdisk so you have a completely disposable PostgreSQL instance.
PostgreSQL server configuration
When testing, you can configure your server for non-durable but faster operation.
This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.
Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If, and only if, you're going to turn fsync off, you can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.
For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data corruption risk. You do have a small window of loss of recent data if you enable async commit - but that's it.
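Collected into a postgresql.conf excerpt, the non-durable options discussed above might look like this (for a throwaway test instance only; the commented-out lines are the safer production-leaning alternative):

```
# postgresql.conf -- throwaway test instance ONLY
fsync = off                  # no crash safety; data loss on power failure
full_page_writes = off       # only makes sense once fsync is off

# Safer alternative instead of fsync = off:
# synchronous_commit = off
# commit_delay = 10000       # microseconds
```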
If you have the option of slightly altering the DDL, you can also use UNLOGGED tables in Pg 9.1+ to completely avoid WAL logging and gain a real speed boost at the cost of the tables getting erased if the server crashes. There is no configuration option to make all tables unlogged, it must be set during CREATE TABLE. In addition to being good for testing this is handy if you have tables full of generated or unimportant data in a database that otherwise contains stuff you need to be safe.
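A minimal sketch of such a table, created from Python with psycopg2 (the table name and columns are placeholders):

```python
# UNLOGGED table for disposable data (PostgreSQL 9.1+).
import psycopg2

conn = psycopg2.connect("dbname=app_test user=app")   # placeholder DSN
cur = conn.cursor()
cur.execute("""
    CREATE UNLOGGED TABLE session_cache (
        session_id text PRIMARY KEY,
        payload    text,
        updated_at timestamptz DEFAULT now()
    )
""")
conn.commit()
```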
Check your logs and see if you're getting warnings about too many checkpoints. If you are, you should increase your checkpoint_segments. You may also want to tune your checkpoint_completion_target to smooth writes out.
Tune shared_buffers to fit your workload. This is OS-dependent, depends on what else is going on with your machine, and requires some trial and error. The defaults are extremely conservative. You may need to increase the OS's maximum shared memory limit if you increase shared_buffers on PostgreSQL 9.2 and below; 9.3 and above changed how they use shared memory to avoid that.
If you're using just a couple of connections that do lots of work, increase work_mem to give them more RAM to play with for sorts etc. Beware that too high a work_mem setting can cause out-of-memory problems, because it's per-sort, not per-connection, so one query can have many nested sorts. You only really have to increase work_mem if you can see sorts spilling to disk in EXPLAIN or being logged with the log_temp_files setting (recommended), but a higher value may also let Pg pick smarter plans.
As said by another poster here it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions is pretty pointless, you really want separate drives. This separation has much less benefit if you're running with fsync=off and almost none if you're using UNLOGGED tables.
Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, etc. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.
AFAIK there's no way to set an entire database or cluster as UNLOGGED. It'd be interesting to be able to do so. Consider asking on the PostgreSQL mailing list.
Host OS tuning
There's some tuning you can do at the operating system level, too. The main thing you might want to do is convince the operating system not to flush writes to disk aggressively, since you really don't care when/if they make it to disk.
In Linux you can control this with the virtual memory subsystem's dirty_* settings, like dirty_writeback_centisecs.
The only issue with making the writeback settings too slack is that a flush by some other program may cause all of PostgreSQL's accumulated buffers to be flushed too, causing big stalls while everything blocks on writes. You may be able to alleviate this by running PostgreSQL on a different file system, but some flushes may be device-level or whole-host-level rather than filesystem-level, so you can't rely on that.
This tuning really requires playing around with the settings to see what works best for your workload.
On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues with NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.
Query and workload tuning
These are things that DO require code changes; they may not suit you. Some are things you might be able to apply.
If you're not batching work into larger transactions, start. Lots of small transactions are expensive, so you should batch stuff whenever it's possible and practical to do so. If you're using async commit this is less important, but still highly recommended.
Whenever possible use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection then the temp table goes away, and no other connection can see the contents of a session's temp table(s).
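A sketch of that pattern with psycopg2; the staging and destination table names, columns, and the sample rows are all placeholders:

```python
# Load into a temp table, transform, then copy to the final table in one go.
import psycopg2

conn = psycopg2.connect("dbname=app_test user=app")   # placeholder DSN
cur = conn.cursor()

# Temp tables generate no WAL and vanish when the session ends.
cur.execute("CREATE TEMP TABLE staging (id int, raw_value text)")

cur.executemany(
    "INSERT INTO staging (id, raw_value) VALUES (%s, %s)",
    [(1, "42"), (2, "17")],        # stand-in for the real bulk load
)

# Massage the data, then copy it into the final table with one statement.
cur.execute("""
    INSERT INTO measurements (id, value)
    SELECT id, raw_value::int FROM staging
""")
conn.commit()
```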
If you're using PostgreSQL 9.1 or newer you can use UNLOGGED tables for data you can afford to lose, like session state. These are visible across different sessions and preserved between connections. They get truncated if the server shuts down uncleanly so they can't be used for anything you can't re-create, but they're great for caches, materialized views, state tables, etc.
In general, don't DELETE FROM blah;. Use TRUNCATE TABLE blah; instead; it's a lot quicker when you're dumping all rows in a table. Truncate many tables in one TRUNCATE call if you can. There's a caveat if you're doing lots of TRUNCATES of small tables over and over again, though; see: Postgresql Truncation speed
If you don't have indexes on foreign keys, DELETEs involving the primary keys referenced by those foreign keys will be horribly slow. Make sure to create such indexes if you ever expect to DELETE from the referenced table(s). Indexes are not required for TRUNCATE.
Don't create indexes you don't need. Each index has a maintenance cost. Try to use a minimal set of indexes and let bitmap index scans combine them rather than maintaining too many huge, expensive multi-column indexes. Where indexes are required, try to populate the table first, then create indexes at the end.
Hardware
Having enough RAM to hold the entire database is a huge win if you can manage it.
If you don't have enough RAM, the faster storage you can get the better. Even a cheap SSD makes a massive difference over spinning rust. Don't trust cheap SSDs for production though, they're often not crashsafe and might eat your data.
Learning
Greg Smith's book, PostgreSQL 9.0 High Performance remains relevant despite referring to a somewhat older version. It should be a useful reference.
Join the PostgreSQL general mailing list and follow it.
Reading:
Tuning your PostgreSQL server - PostgreSQL wiki
Number of database connections - PostgreSQL wiki
Use different disk layout:
different disk for $PGDATA
different disk for $PGDATA/pg_xlog
different disk for temp files (per database $PGDATA/base//pgsql_tmp) (see the note about work_mem)
postgresql.conf tweaks (a combined sketch follows this list):
shared_buffers: 30% of available RAM, but not more than 6 to 8 GB. It seems to be better to have less shared memory (2 GB to 4 GB) for write-intensive workloads
work_mem: mostly for select queries with sorts/aggregations. This is a per-connection setting, and a query can allocate that value multiple times. If the data can't fit, then disk is used (pgsql_tmp). Check "explain analyze" to see how much memory you need
fsync and synchronous_commit: the default values are safe, but if you can tolerate data loss then you can turn them off
random_page_cost: if you have SSD or fast RAID array you can lower this to 2.0 (RAID) or even lower (1.1) for SSD
checkpoint_segments: you can go higher, to 32 or 64, and change checkpoint_completion_target to 0.9. A lower value allows faster after-crash recovery
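Put together, and with illustrative numbers only, those tweaks might look like this in postgresql.conf:

```
# postgresql.conf excerpt -- illustrative values, adjust for your hardware
shared_buffers = 4GB                 # ~30% of RAM, capped; less for write-heavy loads
work_mem = 64MB                      # per sort/hash operation, per connection
fsync = off                          # only if you can tolerate data loss
synchronous_commit = off
random_page_cost = 1.1               # SSD; ~2.0 for a fast RAID array
checkpoint_segments = 64             # pre-9.5 setting
checkpoint_completion_target = 0.9
```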
I'm working on a hobby project involving a rather CPU-intensive calculation. The problem is embarrassingly parallel. This calculation will need to happen on a large number of nodes (say 1000-10000). Each node can do its work almost completely independently of the others. However, the entire system will need to answer queries from outside the system. Approximately 100000 such queries per second will have to be answered. To answer the queries, the system needs some state that is sometimes shared between two nodes. The nodes need at most 128MB RAM for their calculations.
Obviously, I'm probably not going to be able to afford to actually build this system at the scale described above, but I'm still interested in the engineering challenge of it, and thought I'd set up a small number of nodes as a proof of concept.
I was thinking about using something like Cassandra or CouchDB to have scalable persistent state across all nodes. If I run a distributed database server on each node, it would be very lightly loaded, but it would be very nice from an ops perspective to have all nodes be identical.
Now to my question:
Can anyone suggest a distributed database implementation that would be a good fit for a cluster of a large number of nodes, each with very little RAM?
Cassandra seems to do what I want, but http://wiki.apache.org/cassandra/CassandraHardware talks about recommending at least 4G RAM for each node.
I haven't found a figure for the memory requirements of CouchDB, but given that it is implemented in Erlang, I figure maybe it isn't so bad?
Anyway, recommendations, hints, suggestions, and opinions are welcome!
You should be able to do this with Cassandra, though depending on your reliability requirements, an in-memory database like Redis might be more appropriate.
Since the data set is so small (100 MBs of data), you should be able to run with less than 4 GB of RAM per node. Adding in Cassandra overhead, you probably need 200 MB of RAM for the memtable, another 200 MB of RAM for the row cache (to cache the entire data set, turn off the key cache), plus another 500 MB of RAM for Java in general, which means you could get away with 2 GB of RAM per machine.
Using a replication factor of three, you probably only need a cluster on the order of tens of nodes to serve the number of reads/writes you require (especially since your data set is so small and all reads can be served from the row cache). If you need the computing power of thousands of nodes, have them talk to the tens of Cassandra nodes storing your data rather than trying to split Cassandra to run across thousands of nodes.
I've not used CouchDB myself, but I am told that Couch will run in as little as 256 MB with around 500K records. At a guess that would mean that each of your nodes might need ~512 MB, taking into account the extra 128 MB they need for their calculations. Ultimately you should download both and give each a test inside a VPS, but it does sound like Couch will run in less memory than Cassandra.
Okay, after doing some more reading after posting the question, and trying some things out, I decided to go with MongoDB.
So far I'm happy. I have very little load, and MongoDB is using very little system resources (~200MB at most). However, my dataset isn't nearly as large as described in the question, and I am only running 1 node, so this doesn't mean anything.
CouchDB doesn't seem to support sharding out of the box, so it is not (it turns out) a good fit for the problem described in the question (I know there are add-ons for sharding).