apache-spark-Cache table in Memory is spilling over to disk - apache-spark

How to pin the table in cache so it would not swap out of memory?
Situation: We are using Microstrategy BI reporting. Semantic layer is built. We wanted to Cache highly used tables into CACHE using Spark SQL CACHE Table ; we did cache for SPARK context( Thrift server). Initially it was all in cache , now some in cache and some in disk. That disk may be local disk relatively more expensive reading than from s3. Queries may take longer and inconsistent times from user experience perspective. If More queries running using Cache tables, copies of the cache table images are copied and copies are not staying in memory causing reports to run longer. so how to pin the table so would not swap to disk. Spark memory management is dynamic allocation, and how to use those few tables to Pin in memory .

There's a couple different ways to tackle this. First keep in mind that in memory storage is shared between storage and execution. so doing big joins/etc that may require temp storage may be competing for memory space. You may want to look at "spark.memory.storageFraction" which currently defaults to 0.5 Consider 0.75 but this will likely slow down your queries. Also consider applying good data engineering to the problem. Reduce the amount of data that needs to be stored. Create a temp view with old records removed and unneeded columns pruned, then cache that. Consider using smaller datatypes for improved storage. Ex ints are more space efficient than big strings. Lastly considering switching to an instance type that has more memory available or switching to an instance with fast local disks. In some situations disk storage isn't that much slower than in memory. This is particularly true if you're running big complicated analytical queries where the cluster is cpu bound and not io bound.

Related

What is the difference between scylla read path and cassandra read path?

What is the difference between Scylla read path and Cassandra read path? When I stress Cassandra and Scylla then Scylla read performance poor by 5 times than Cassandra using 16 core and normal HDD.
I expect better read performance on Scylla compared to Cassandra using normal HDD, because my company doesn't provide SSD's.
Can someone please confirm, is it possible to achieve better read performance using normal HDD or not?
If yes, what changes required scylla config?. Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because reads from disk each requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core can do orders of magnitude more CPU work than 1000 requests per second, in both Scylla and Cassandra cases, you'll likely see free CPU. So Scylla's main benefit, of using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) should be roughly the same.
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client mis-configuration issues raised in other responses:
If you have low amounts of data, that can fit in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. While Scylla works differently, and has a row cache - only caching the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance - if in one setup you have more sstables to read, it can increase the number of reads and lower the performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction was run last). You can check if this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: As you may have heard, Scylla subdivides each nodes into shards (each shard is a single CPU). This is fantastic for CPU-bounded work, but could be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences - or something else - is causing performance of your use-case to be lower in Scylla, but I please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SDDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anything close. If you really need optimum performance or performance per dollar, SDDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla Cluster.
Number of concurrent connections from your clients/loaders is not high enough, or you're not using sufficient amount of loaders. In such case, some shards will be doing all the work, while others will be mostly idle. You want to keep your parallelism high.
Scylla likes have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf)
What's the size of your dataset? Are you reading a large amount of partitions or just a few? You might be hitting a hot partition situation
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
#Sateesh, I want to add to the answer by #TomerSan that both Cassandra and ScyllaDB utilize the same disk storage architecture (LSM). That means that they have relatively the same disk access patterns because the algorithms are largely the same. The LSM trees were built with the idea in mind that it is not necessary to do instant in-place updates. It consists of immutable data buckets that are large continuous pieces of data on disk. That means less random IO, more sequential IO for which the HDD works great (not counting utilized parallelism by modern database implementations).
All the above means that the difference that you see, is not induced by the difference in how those databases use a disk. It must be related to the configuration differences and what happens underneath. Maybe ScyllaDB tries to utilize more parallelism or more aggressively do compaction. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
Both databases use LSM tree but Scylla has thread-per-core architecture on top plus we use O_Direct while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk and thus scylla_setup runs a benchmark automatically to tune. Check your output of it in io.conf.
There are far more things to review, better to send your data to the mailing list. In general, Scylla should perform better in this case as well but your disk is likely to be the bottleneck in both cases.
As a summary I would say Scylladb and cassandra have the same read / write path
memtable, commitlog, sstable.
However implementation is very different:
- cassandra rely on OS for low level IO and network (most DBMS does)
- scylladb rely on its own lib (seastar) to handle IO and network at a low level independently from OS page cache etc. This is why they can provide feature such as workload scheduling within the same cluster that would be very hard to implement in cassandra.

Dedicated commitlog storage vs Read/Write ratio?

As we are using SSD disks to provide storage for our cluster on servers with 30 GB of memory.
There is an argument about the commitlog directory, whether to dedicate an individual disk or having it on the same data disk.
As we already using SSD disks, performance should be fine having both commitlogs and data on the same disk, as there is no mechanical moving head for writing.
However, there is another factor, that is the read/write ratio. How would such a ratio affect the performance of writing or reading when we have both commitlogs and data on the same disk?
Using SSD, when would it become important to dedicate a high performance disk for the commitlog directory?
A dedicated commitlog device usually makes a lot of sense when you have HDDs, but is less obvious if you're using SSDs.
Even if you asked only if it makes sense with SSDs setups, I will try to give some general hints about the subject, primarily based on my understandings and my own experience. I admit the focus is probably too much on HDDs, but HDDs allow a deep insight on how Cassandra stuff works and why backing a commitlog/data directory with an SSD can be a life saver.
Background: IOPS and OPS are not the same thing.
I will start from a (very) far point: Device Performance. Here's a start-point lecture about storage device performances in general. Even if the article's neutrality is under discussion, it can provide some insights about the general metrics and performance you can expect from some systems. Of course, your mileage may vary, depending on what device (type/brand/model etc...) and how much stress (intended as type of workload) you put on the device, but I think it is a good starting point for our discussion here.
The reason I prefer to start from IOPS is because it is the very starting point for understanding storage performance. The C* literature speaks about OPS, Operations Per Second, because people usually don't think in terms of IOPS, especially when looking at stats. This really hides a lot of details, the operation size for starters.
A Cassandra Operation usually consists of multiple IOPS. Cassandra documentation usually refers to spinning disks (even if SSDs are referenced too), clearly states what happens when performing reads/writes, and people tend to ignore the fact that when their software stack (that spans from up the application down to Cassandra and its data files on the storage) hit the disks the performance decreases by a huge amount just because they have failed to recognize a random workload, and even if "Cassandra is an high-performance etc... etc.. etc...".
As an example, looking at the picture in the read path documentation, you can clearly see what data structures are in memory/on disk, and how the SSTable data is accessed. Further, the row cache paragraph says:
... If row cache is enabled, desired partition data is read from the row cache, potentially saving two seeks to disk for the data...
And here's where the catch starts: these two seeks are potentially saved from Cassandra's point of view. This simply means that Cassandra won't make two requests to the storage system: it will avoid to request the partition index, and the data because everything is already in RAM, but it doesn't really translates to "the storage system will save two IO operations". Indeed, how (generic) data is retrieved from the storage device is a very different thing, and of course depends on how the files are layed-out on the disk itself: are you using EXT4, XFS, or what? Assuming no cache is available (eg for very big data set sizes you can't really cache everything...), looking for a file is IOPS consuming, and this tends to amplify the potentially saved seeks when you have data in RAM, and tends to amplify the penalty you perceive when your data is not.
You can't escape physics: HDDs pay some taxes, SSDs no.
As you already know, the main "problem" (performance-wise) of HDDs is the average seek time, that is the time the HDD needs to wait on average in order to have a target sector under the heads. Once the sector is under the heads, if the system have to read a bunch of sequential bits everything is smooth and the throughput is proportional to the rotational speed of the HDD (to be precise to the tangential speed of the platters under the head, which depends also on the track etc...).
In other terms, HDDs have an average fixed performance tax (the average seek time), and everything after is almost "free". If an application requests a bunch of sectors that are not "contiguous" (from the disk point of view, eg a fragmented file is splitted across multiple sectors, but an application can't really know this), the disk will have to wait the average seek time on average multiple times, and this fixed tax influences its maximum throughput.
The strongest argument about storage is: every device have its own maximum magic average IOPS number. This number express the number of random IOPS the device can perform. You can't force an HDD to have more IOPS on average, it's a physical problem. The OS is usually smart enough to "enqueue" sector requests in the attempt to reduce the seek times, eg ordering by ascending requested sector number (trying to exploit some sequential operations), but nothing will save the performances from a random IO workload. You have X allotted available IOPS and must face your problems with that. No matter what.
You need to take advantage of the allotted IOPS of your device, and you must be wise on how you use them.
Suppose you have an HDD that maxes out at 100 IOPS on average. If your application performs a bunch of small (say 4KB) file reads, you have an application that performs 100 * 4KB reads every second: the throughput will be around 400KB/s (unless some caching is involved, and in that case the cache saved you precious IOPS). Astonishing. This is simply because you keep paying the seek time multiple times. If you change your access pattern to something that reads 16MB (contiguous) files, you get an higher throughput because you won't pay the seek time so much, you are exploiting a sequential pattern. What changes under the hood is the Request Size of each operation.
Now an interesting question is: how are "IOPS" and "Request Size" related to? Does one request size of 16MB can be considered one IOPS? And what about a 128MB request size? This is indeed a good question. At lower level, the Request Size spans from 512 bytes (the minimum sector size) to 128KB (32*4K sectors in one request). If the operation has small size, its transfer time, the time the disk needs to fetch the data, is also small. Higher request sizes have higher transfer times obviously. However, if you are able to perform 100 4KB IOPS, you will probably be able to perform around 80 IOPS #8KB. The relation can't be linear, because the transfer time depends on the rotational speed of the disks only (the transfer time is negligible compared to the seek time), and since you are actually reading from two adjacent sectors, you'll hit the seek time penalty once per request. This translates to a throughput of around 400KB/s for 4K requests and 1.6MB/s for 8K requests. And so on.... The larger the request size, the longer it takes to transfer data, the lesser IOPS you have, the higher throughput you have. (These are random numbers, pun intended, no measurements done! Just to let you understand. I however think they are in the ballpark).
SSDs don't suffer mechanical penalties and that's why they are capable of performing much better than HDDs. They have much more IOPS, and their limits come from the onboard electronics, bus connection etc.... Having an higher IOPS device is a big plus, these can be consumed by applications that are not IOPS friendly, and the user won't notice that the applications suck. However, with SSDs, the Request Size linearly influences the number of IOPS you can perform. When you look at some device that have 100k IOPS, these are usually referred at 4K. You'll be able to perform only 6.2k requests if you perform 64K requests.
Why Cassandra has a such good read performances even with HDDs then?
Speaking from a single node point of view (because given the performance of a cluster Cassandra scales linearly with the number of nodes in the cluster), the problem lies in the question itself. This is only true if you model your data in this particular way:
You must fetch all your data with one query only.
Your data must be ordered.
If you can't fetch your data with one data, denormalize in order to retrieve it with one query only.
You fetch a relative good amount of data on every read
These are well-known Cassandra modeling rules, but the key point is that these rules do really have a reason to be applied IOPS-wise. Indeed, these rules allow Cassandra to:
Be a super fast database because it will just require the partition index and the SSTable offset index of the data: two IOPS in the best case, much more IOPS in the worst case.
Be a super fast database because it will exploit the sequential capabilities of the HDDs and will not stress the IO subsystem by issuing other IO (random) seeks.
Be a super fast database because it will just fetch more data like the point number 1.
Be a super fast database because it will exploit longer the sequential capabilities of the HDDs.
In other terms, following these basic data modeling rules allows Cassandra to be IOPS friendly when reading data back.
What happens if you screw-up your data model? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
What happens if you read/write a small amount of data (eg due to misconfigured flush sizes, small commit log etc...)? Cassandra won't be IOPS friendly, and as a consequence the performances will be predictably horrible. Unless you use an SSD, which has greater IOPS and then you won't notice slowness too much.
How a read/write ratio pattern can influence performance in a Cassandra node?
Cassandra is a complex system, with different components that interact each other. I will try to explain from my point of view what are the main points when you put everything on one device only.
Writes/Deletes/Updates in Cassandra are fast because they are simply append-only writes to the CommitLog device. Reads, on the contrary, can be very IOPS consuming. When both CommitLog and Data are on the same physical disk (either HDD or SSD), the read/write paths interact, and they both consume IOPS.
Two important questions are:
How many IOPS a read (using the read path) consumes?
How many IOPS a write consumes?
These are important question because you have to remember that your device can perform at most X IOPS, and your system will have to split these X IOPS among these operations.
It is quite difficult to answer to the "read" question because, when you request some data, Cassandra needs to locate all the SSTables needed to satisfy the request. Assuming a very big dataset size, where caching is not effective, this imply that the Cassandra read path can be very IOPS hungry. Indeed, if your data is spread into 3 different SSTables, Cassandra will have to locate all of them, and for each SSTable will follow the read path: will read the partition index, and then will read the data in the SSTable. These are at least two IOPS, because if your filesystem is not "collaborative" enough, locating a file and/or pointing at a file offset could require some more IOPS. In the end, in this example Cassandra is consuming at least six IOPS per read.
Answering the "write" question is also tricky, because compactions and flushes can be triggered. They will consume a lot of IOPS. Flushes are easy to understand: they write data from memtables to disk with a sequential pattern. Instead, compactions read data back from different SSTables on disk, and while reading the tables they flush the result out to a new disk file. This is a mixed read/write pattern, and on HDDs this is very disruptive, because will force the disk to perform multiple seeks.
Mixing percentages: TL;DR
If you have a R/W ratio of 95% reads and 5% writes, having a separate CommitLog device can be a waste of resources, because writes will hardly impact your read performances, and you write so rarely that write performance may be considered not critical.
If you have a R/W ratio of 5% reads and 95% writes, having a separate CommitLog device can be again a waste of resources, because reads will hardly impact your write performances, and your read performances will hardly suffer from a bunch of sequential appends on the commitlog.
And finally, if you have a R/W ratio of 50% reads and 50% writes, having a separate CommitLog device is NOT a waste of resources, because every write performed on the CommitLog device won't produce at least two IOPS on the data drive (one for writing, and one for going back to read).
Please note that I didn't mention compactions, because independently on your workload, when compaction triggers in, your workload will be disrupted by mixed read/write background operations on different files (consuming disk IOPS all the way), and you will suffer both on reads and writes.
All this should be clear enough for HDDs because you run out of IOPS very fast, and when you do you notice it immediately. On SSDs, however, you don't run out of IOPS that fast, but you could do if your data consists of a lot small data rows.
The reality is that getting out of IOPS on an SSD is very hard because you'll get out (by a far amount) of CPU resources, But once you do you will see your performance slowly decrease. The effect however won't be such dramatic as in the cases of HDDs. As an example, if you have a 100 IOPS HDD and you run-out of IOPS by trying to issue 500 random IO stuff, you cleary get a penalty. By calling this penalty P, if you have an SSD with 100k IOPS, to get the same penalty P you should issue 500k IOPS, which can be very difficult to do without exhausting CPU or RAM.
In general, when you run out of some type of resource in your system, you need to increase its quantity. The most important thing (to me) is not to run out of IOPS in the "Data" part of your Cassandra cluster. In the case of SSDs IOPS, it's rare enough that you'll get the limit. You'll burn your CPU well before I think. But you will if you don't tune your system, or if your workload put too much stress on the disk subsystem (eg Leveled Compaction). I'd suggest to put an ordinary HDD instead of an high performance SSD for the commitlog, saving money. But if you have a lot of very small commitlog flushes an SSD is a completely life saver, because your writers won't suffer the latency of HDDs.
Finally, in my opition, you should go in pre-production with some sort of real data, and check your IOPS requirements. If you have enough room to put the SSD there don't worry. Go and save money. If your system gets too much pressure due to compaction then having a separate device is suggested. Analyze your commitlog pattern, and if its not IOPS demanding put it on a separate disk. Moreover, if you have a virtual environment you can provision a relatively small commitlog device regardless of other factors. It won't rise the cost of your solution too much.
The actual numbers will depend highly on the type of workload you have the configuration you have etc. You can have a look at Netflix tech blog posts for ballpark numbers, e.g. #1, #2.
Dedicating a disk for commitlog directory is a sort of scale up strategy. Cassandra works well with scale out approach. You just add more nodes into the cluster to spread the load - 2nd from the linked articles has a nice graph showing near linear scalability.

Cassandra In-Memory option

There is an In-Memory option introduced in the Cassandra by DataStax Enterprise 4.0:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But with 1GB size limited for an in-memory table.
Anyone know the consideration why limited it as 1GB? And possible extend to a large size of in-memory table, such as 64GB?
To answer your question: today it's not possible to bypass this limitation.
In-Memory tables are stored within the JVM Heap, regardless the amount of memory available on single node allocating more than 8GB to JVM Heap is not recommended.
The main reason of this limitation is that Java Garbage Collector slow down when dealing with huge memory amount.
However if you consider Cassandra as a distributed system 1GB is not the real limitation.
(nodes*allocated_memory)/ReplicationFactor
allocated_memory is max 1GB -- So your table may contains many GB in memory allocated in different nodes.
I think that in future something will improve but dealing with 64GB in memory it could be a real problem when you need to flush data on disk. One more consideration that creates limitation: avoid TTL when working with In-Memory tables. TTL creates tombstones, a tombstone is not deallocated until the GCGraceSeconds period passes -- so considering a default value of 10 days each tombstone will keep the portion of memory busy and unavailable, possibly for long time.
HTH,
Carlo

fake fsync calls to improve performance [duplicate]

I am switching to PostgreSQL from SQLite for a typical Rails application.
The problem is that running specs became slow with PG.
On SQLite it took ~34 seconds, on PG it's ~76 seconds which is more than 2x slower.
So now I want to apply some techniques to bring the performance of the specs on par with SQLite with no code modifications (ideally just by setting the connection options, which is probably not possible).
Couple of obvious things from top of my head are:
RAM Disk (good setup with RSpec on OSX would be good to see)
Unlogged tables (can it be applied on the whole database so I don't have change all the scripts?)
As you may have understood I don't care about reliability and the rest (the DB is just a throwaway thingy here).
I need to get the most out of the PG and make it as fast as it can possibly be.
Best answer would ideally describe the tricks for doing just that, setup and the drawbacks of those tricks.
UPDATE: fsync = off + full_page_writes = off only decreased time to ~65 seconds (~-16 secs). Good start, but far from the target of 34.
UPDATE 2: I tried to use RAM disk but the performance gain was within an error margin. So doesn't seem to be worth it.
UPDATE 3:*
I found the biggest bottleneck and now my specs run as fast as the SQLite ones.
The issue was the database cleanup that did the truncation. Apparently SQLite is way too fast there.
To "fix" it I open a transaction before each test and roll it back at the end.
Some numbers for ~700 tests.
Truncation: SQLite - 34s, PG - 76s.
Transaction: SQLite - 17s, PG - 18s.
2x speed increase for SQLite.
4x speed increase for PG.
First, always use the latest version of PostgreSQL. Performance improvements are always coming, so you're probably wasting your time if you're tuning an old version. For example, PostgreSQL 9.2 significantly improves the speed of TRUNCATE and of course adds index-only scans. Even minor releases should always be followed; see the version policy.
Don'ts
Do NOT put a tablespace on a RAMdisk or other non-durable storage.
If you lose a tablespace the whole database may be damaged and hard to use without significant work. There's very little advantage to this compared to just using UNLOGGED tables and having lots of RAM for cache anyway.
If you truly want a ramdisk based system, initdb a whole new cluster on the ramdisk by initdbing a new PostgreSQL instance on the ramdisk, so you have a completely disposable PostgreSQL instance.
PostgreSQL server configuration
When testing, you can configure your server for non-durable but faster operation.
This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.
Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If and only if you're doing to turn fsync off can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.
For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data corruption risk. You do have a small window of loss of recent data if you enable async commit - but that's it.
If you have the option of slightly altering the DDL, you can also use UNLOGGED tables in Pg 9.1+ to completely avoid WAL logging and gain a real speed boost at the cost of the tables getting erased if the server crashes. There is no configuration option to make all tables unlogged, it must be set during CREATE TABLE. In addition to being good for testing this is handy if you have tables full of generated or unimportant data in a database that otherwise contains stuff you need to be safe.
Check your logs and see if you're getting warnings about too many checkpoints. If you are, you should increase your checkpoint_segments. You may also want to tune your checkpoint_completion_target to smooth writes out.
Tune shared_buffers to fit your workload. This is OS-dependent, depends on what else is going on with your machine, and requires some trial and error. The defaults are extremely conservative. You may need to increase the OS's maximum shared memory limit if you increase shared_buffers on PostgreSQL 9.2 and below; 9.3 and above changed how they use shared memory to avoid that.
If you're using a just a couple of connections that do lots of work, increase work_mem to give them more RAM to play with for sorts etc. Beware that too high a work_mem setting can cause out-of-memory problems because it's per-sort not per-connection so one query can have many nested sorts. You only really have to increase work_mem if you can see sorts spilling to disk in EXPLAIN or logged with the log_temp_files setting (recommended), but a higher value may also let Pg pick smarter plans.
As said by another poster here it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions is pretty pointless, you really want separate drives. This separation has much less benefit if you're running with fsync=off and almost none if you're using UNLOGGED tables.
Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, etc. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.
AFAIK there's no way to set an entire database or cluster as UNLOGGED. It'd be interesting to be able to do so. Consider asking on the PostgreSQL mailing list.
Host OS tuning
There's some tuning you can do at the operating system level, too. The main thing you might want to do is convince the operating system not to flush writes to disk aggressively, since you really don't care when/if they make it to disk.
In Linux you can control this with the virtual memory subsystem's dirty_* settings, like dirty_writeback_centisecs.
The only issue with tuning writeback settings to be too slack is that a flush by some other program may cause all PostgreSQL's accumulated buffers to be flushed too, causing big stalls while everything blocks on writes. You may be able to alleviate this by running PostgreSQL on a different file system, but some flushes may be device-level or whole-host-level not filesystem-level, so you can't rely on that.
This tuning really requires playing around with the settings to see what works best for your workload.
On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues with NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.
Query and workload tuning
These are things that DO require code changes; they may not suit you. Some are things you might be able to apply.
If you're not batching work into larger transactions, start. Lots of small transactions are expensive, so you should batch stuff whenever it's possible and practical to do so. If you're using async commit this is less important, but still highly recommended.
Whenever possible use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection then the temp table goes away, and no other connection can see the contents of a session's temp table(s).
If you're using PostgreSQL 9.1 or newer you can use UNLOGGED tables for data you can afford to lose, like session state. These are visible across different sessions and preserved between connections. They get truncated if the server shuts down uncleanly so they can't be used for anything you can't re-create, but they're great for caches, materialized views, state tables, etc.
In general, don't DELETE FROM blah;. Use TRUNCATE TABLE blah; instead; it's a lot quicker when you're dumping all rows in a table. Truncate many tables in one TRUNCATE call if you can. There's a caveat if you're doing lots of TRUNCATES of small tables over and over again, though; see: Postgresql Truncation speed
If you don't have indexes on foreign keys, DELETEs involving the primary keys referenced by those foreign keys will be horribly slow. Make sure to create such indexes if you ever expect to DELETE from the referenced table(s). Indexes are not required for TRUNCATE.
Don't create indexes you don't need. Each index has a maintenance cost. Try to use a minimal set of indexes and let bitmap index scans combine them rather than maintaining too many huge, expensive multi-column indexes. Where indexes are required, try to populate the table first, then create indexes at the end.
Hardware
Having enough RAM to hold the entire database is a huge win if you can manage it.
If you don't have enough RAM, the faster storage you can get the better. Even a cheap SSD makes a massive difference over spinning rust. Don't trust cheap SSDs for production though, they're often not crashsafe and might eat your data.
Learning
Greg Smith's book, PostgreSQL 9.0 High Performance remains relevant despite referring to a somewhat older version. It should be a useful reference.
Join the PostgreSQL general mailing list and follow it.
Reading:
Tuning your PostgreSQL server - PostgreSQL wiki
Number of database connections - PostgreSQL wiki
Use different disk layout:
different disk for $PGDATA
different disk for $PGDATA/pg_xlog
different disk for tem files (per database $PGDATA/base//pgsql_tmp) (see note about work_mem)
postgresql.conf tweaks:
shared_memory: 30% of available RAM but not more than 6 to 8GB. It seems to be better to have less shared memory (2GB - 4GB) for write intensive workloads
work_mem: mostly for select queries with sorts/aggregations. This is per connection setting and query can allocate that value multiple times. If data can't fit then disk is used (pgsql_tmp). Check "explain analyze" to see how much memory do you need
fsync and synchronous_commit: Default values are safe but If you can tolerate data lost then you can turn then off
random_page_cost: if you have SSD or fast RAID array you can lower this to 2.0 (RAID) or even lower (1.1) for SSD
checkpoint_segments: you can go higher 32 or 64 and change checkpoint_completion_target to 0.9. Lower value allows faster after-crash recovery

Cassandra in-memory configuration

We currently evaluate the use of Apache Cassandra 1.2 as a large scale data processing solution. As our application is read-intensive and to provide users with the fastest possible response time we would like to configure Apache Cassandra to keep all data in-memory.
Is it enough to set the storage option caching to rows_only on all column families and giving each Cassandra node sufficient memory to hold its data portion? Or are there other possibilities for Cassandra ?
Read performance tuning is much complex than write. Base on my experiences, there are some factors you can take into consideration. Some point of view are not memory related, but they also help improve the read performance.
1.Row Cache: avoid disk hit, but enable it only if the rows are not updated frequently. You could also enable the off-heap row cache to reduce the JVM heap usage.
2.Key Cache: enable by default, no need to disable it. It avoid disk searching when row cache is not hit.
3.Reduce the frequency of memtable flush: adjust memtable_total_space_in_mb, commitlog_total_space_in_mb, flush_largest_memtables_at
4.Using LeveledCompactionStrategy: avoid a row spread across multiple SSTables.
DataStax has added an in-memory computing feature in the latest version of its Apache Cassandra-based NoSQL database, as part of a drive to increase the performance of online applications.
Reference :
http://www.datastax.com/2014/02/welcome-to-datastax-enterprise-4-0-and-opscenter-4-1

Resources