I have a 4-node Cassandra 2.1.13 cluster with the configuration below.
32 GB RAM
Max heap size: 8 GB
250 GB hard disk each (not SSD)
I am trying to run a load test on writes and reads. I wrote a multi-threaded program to insert 50 million records; each row has 30 columns.
I was able to insert the 50 million records in 84 minutes, at a rate of about 9.5K inserts per second.
Next I tried to read those 50 million records randomly using 32 clients, and I was able to read at 28K per second.
The problem is that after some time memory fills up, with most of it (almost 20 GB) shown as cached, and eventually the system hangs because it is out of memory.
If I drop the cache, my read throughput falls to 100 reads per second.
How should I manage my cache memory without affecting read performance?
Let me know if you need any more information.
What you noticed is the Linux disk cache, which serves data from RAM instead of going to disk in order to speed up read access. Please make sure you understand how it works, e.g. see here.
As you're already using top, I'd recommend adding "cache misses" to the overview as well (hit F and select nMaj). This shows you whenever a disk read cannot be served by the cache; you should see the number of misses increase once the page cache becomes saturated.
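If you prefer plain commands over top's field selection, a couple of standard tools give the same picture (nothing here is specific to Cassandra):
free -h                        # the "buff/cache" column is the page cache the kernel reclaims under memory pressure
vmstat 1                       # "bi"/"bo" show blocks read from / written to disk; rising "bi" during reads means cache misses
grep -i dirty /proc/meminfo    # dirty pages still waiting to be written back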
How should I manage my cache memory without affecting read performance?
The cache is fully managed by Linux and does not require any action on your side.
Related
I wish to estimate how many Cassandra storage nodes I would need to serve a specific number of reads per second.
My node specs are 32 cores, 256 GB RAM, a 10 Gbps NIC, and 10 x 6 TB HDDs. Obviously SSDs would be much preferable, but they are not available in this instance.
I have around 5x10^11 values of 1 kB each = 500 TB of values to serve, at a rate of 100,000 read requests per second. The distribution of these requests is completely even, i.e. caching in RAM will have no effect.
If we assume that each HDD can sustain ~100 IOPS, then I can expect to need at least ~100 nodes to serve this read load - correct?
I also estimate that I would need at least ~20 machines for the total storage with a replication factor of 2, plus overhead.
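For reference, the arithmetic behind those two estimates (assuming one random seek per read and effectively no cache hits):
reads: 100,000 req/s / (10 HDDs x ~100 IOPS per HDD) = 100,000 / 1,000 = ~100 nodes
storage: 500 TB x RF 2 = 1,000 TB; 1,000 TB / (10 x 6 TB per node) = ~17 nodes, so ~20 with overhead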
It's a really broad question: you need to test your machines with a tool like NoSQLBench, which was built specifically for such tasks.
The typical recommendation is to store ~1 TB of data per Cassandra node (including replication). You also need to take other factors into account, such as how long it would take to replace a node in the cluster or to add a new one; streaming speed is directly proportional to the amount of data on disk.
HDDs are really not recommended if you want low-latency answers. I have a client with ~150 TB spread over ~30 machines with HDDs (admittedly with a heavy write load), and read latencies regularly go above 0.5 seconds and higher. Keep in mind that Cassandra requires random access to data, and an HDD head simply cannot move fast enough to serve such requests.
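If NoSQLBench feels heavyweight for a first pass, cassandra-stress (bundled with Cassandra under tools/bin) gives a quick per-node sanity check; the node address, row count, and thread count below are placeholders:
cassandra-stress write n=10000000 -rate threads=200 -node 10.0.0.1
cassandra-stress read duration=10m -rate threads=200 -node 10.0.0.1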
We have continuously growing data, arriving at roughly 1 GB every 10 minutes. We want to save this data to disk for further processing, such as SQL queries.
Can we do this via Hazelcast, e.g. spill the data to disk once it grows beyond 500 MB and keep 500 MB of data in memory for in-memory computing, and so on?
Which big data technology is the right solution for such a usage? We use 32-bit Windows XP.
You can use an IMap and configure a map store and eviction for it.
With a map store, entries in the IMap are persisted to a database, and with eviction you can control how much memory is used.
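A minimal declarative sketch of that setup, assuming a Hazelcast 3.x XML configuration; the map name and the MapStore class are placeholders for your own implementation that writes to your database:
<map name="events">
    <map-store enabled="true">
        <!-- your MapStore implementation persisting entries to the database -->
        <class-name>com.example.EventMapStore</class-name>
        <!-- write-behind: entries are persisted asynchronously after 5 seconds -->
        <write-delay-seconds>5</write-delay-seconds>
    </map-store>
    <!-- evict least-recently-used entries once the map uses ~500 MB of heap -->
    <eviction-policy>LRU</eviction-policy>
    <max-size policy="USED_HEAP_SIZE">500</max-size>
</map>
Setting write-delay-seconds to 0 makes the store write-through instead of write-behind.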
I have a hive table that is of 2.7 MB (which is stored in a parquet format). When I use impala-shell to convert this hive table to kudu, I notice that the /tserver/ folder size increases by around 300 MB. Upon exploring further, I see it is the /tserver/wals/ folder that holds the majority of this increase. I am facing serious issues due to this. If a 2.7 MB file generates a 300 MB WAL, then I cannot really work on bigger data. Is there a solution to this?
My kudu version is 1.1.0 and impala is 2.7.0.
I have never used Kudu, but I was able to Google a few keywords and read some documentation.
From the Kudu configuration reference section "Unsupported flags"...
--log_preallocate_segments  Whether the WAL should preallocate the entire segment before writing to it. Default: true
--log_segment_size_mb  The default segment size for log roll-overs, in MB. Default: 64
--log_min_segments_to_retain  The minimum number of past log segments to keep at all times, regardless of what is required for durability. Must be at least 1. Default: 2
--log_max_segments_to_retain  The maximum number of past log segments to keep at all times for the purposes of catching up other peers. Default: 10
Looks like you have a minimum disk requirement of (2+1)x64 MB per tablet, for the WAL only. And it can grow up to 10x64 MB if some tablets are straggling and cannot catch up.
Plus some temporary disk space for compaction and so on.
[Edit] These default values changed in Kudu 1.4 (released in June 2017); quoting the release notes...
The default size for Write Ahead Log (WAL) segments has been reduced
from 64MB to 8MB. Additionally, in the case that all replicas of a
tablet are fully up to date and data has been flushed from memory,
servers will now retain only a single WAL segment rather than two.
These changes are expected to reduce the average consumption of disk
space on the configured WAL disk by 16x
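On 1.1.0 you could, in principle, approximate the 1.4 defaults by overriding those flags when the tablet server starts (they are marked unsupported, so test carefully, and unlocking non-default flag categories may be required). Paths are placeholders:
kudu-tserver --fs_wal_dir=/data/kudu/wal --fs_data_dirs=/data/kudu/data \
    --log_segment_size_mb=8 \
    --log_min_segments_to_retain=1 \
    --log_preallocate_segments=false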
I am seeing very strange PostgreSQL 9.4 behavior. When it runs an UPDATE on a large table, or performs a VACUUM or CLUSTER of a large table, it seems to hang for a very long time; in fact I just end up killing the process the following day. What's odd is that the CPU is idle while disk activity is at 100%, yet it reports only 4-5 MB/sec of reads and writes (see the screenshot of nmap & atop).
My server has 24 CPUs, 32 GB RAM and RAID1 (2 x 15K SAS drives). Normally when the disk is at 100% utilization it gives me 120-160 MB/s of combined reads/writes, and it can sustain >100 MB/s almost indefinitely.
The whole system becomes very sluggish, even the terminal command line. My guess is that it has something to do with shared memory and virtual memory; when this happens, PostgreSQL consumes the maximum configured shared memory.
I have disabled swapping with vm.swappiness=0. I didn't touch vm.dirty_ratio, vm.dirty_background_ratio and the like. Huge pages are disabled (vm.nr_hugepages=0).
The following are my postgresql.conf settings:
shared_buffers = 8200MB
temp_buffers = 12MB
work_mem = 32MB
maintenance_work_mem = 128MB
#-----------------------------------------------------
synchronous_commit = off
wal_sync_method = fdatasync
checkpoint_segments = 32
checkpoint_completion_target = 0.9
#-----------------------------------------------------
random_page_cost = 3.2 # RAIDed disk
effective_cache_size = 20000MB # 32GB RAM
geqo_effort = 10
#-----------------------------------------------------
autovacuum_max_workers = 4
autovacuum_naptime = 45s
autovacuum_vacuum_scale_factor = 0.16
autovacuum_analyze_scale_factor = 0.08
How can the disk be at 100% when it is only doing 5 MB/sec? Even the most exhausting random read/write workload should still be an order of magnitude faster. It must have something to do with the way PostgreSQL deals with mapped/shared memory. Also, this wasn't occurring with PostgreSQL 9.1.
I am trying to educate myself on disk/memory behavior, but at this point I need help from the pros.
After a lengthy investigation I found a correlation between disk saturation at low read/write speeds and the number of IOPS: the higher the number of IOPS, the lower the saturated I/O bandwidth. One of the screenshots in my question shows "Transfers/sec"; when that number goes up, the transfer rate falls.
Unfortunately there isn't much that can be done on the database configuration side. PostgreSQL relies heavily on shared memory for mapping files to memory pages. When the time comes to sync some or all of those pages back to disk, a database with large tables may have tens or hundreds of thousands of dirty pages to sync, which causes a lot of random disk access and a huge number of small atomic IOs.
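The question notes that vm.dirty_ratio and vm.dirty_background_ratio were left at their defaults. One OS-level mitigation sometimes tried for exactly this bursty-writeback pattern, not part of the fix described here and with purely illustrative values, is to cap the dirty-page backlog so flushes happen in smaller, more frequent batches (set in /etc/sysctl.conf and apply with sysctl -p):
vm.dirty_background_bytes = 67108864     # start background writeback once ~64 MB of pages are dirty
vm.dirty_bytes = 268435456               # force blocking writeback at ~256 MB of dirty pages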
Since neither installing SSDs nor enabling write-back caching was an option in my case, I had to approach the problem from a different angle and address each case individually.
The UPDATE statement I had was affecting more than half of the table's rows every time it ran. Instead of doing the update, I now recreate the table each time, which almost doubled performance.
CLUSTER-ing a table results in rebuilding all of the table's indexes except the one by which clustering is performed. For large tables with many indexes this is an important cost to keep in mind.
I also replaced VACUUM with ANALYSE, which didn't seem to affect table performance much but runs measurably faster than VACUUM.
Here is the situation:
I am trying to fetch around 10k keys from one column family (CF).
Size of cluster: 10 nodes
Data per node: 250 GB
Heap allotted: 12 GB
Snitch used: property snitch with 2 racks in the same data center
Number of SSTables for the CF per node: around 8 to 10
I am using the supercolumn approach. Each row contains around 300 supercolumns, which in turn contain 5-10 columns each. I am firing a multiget with the 10k row keys and 1 supercolumn.
When I fire the call the first time, it takes around 30 to 50 seconds to return the result. After that Cassandra serves the data from the key cache and returns the result in 2-4 seconds.
So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I get results faster?
Does the supercolumn approach affect read performance?
The use of supercolumns is best suited to use cases where the number of sub-columns is relatively small. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family
Just in case you haven't done this already: since you're using the phpcassa library, make sure that you've compiled the Thrift C extension. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so
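To confirm the extension is actually loaded afterwards (assuming the php CLI is on your PATH):
php -m | grep thrift_protocol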
After doing a lot of R&D on this we figured out that there is no way to get the first fetch to perform optimally.
When Cassandra fetches these 10k rows for the first time it is going to take time, and there is no way to optimize that.
1) However, in practice the probability of people accessing the same records is higher, so we take maximum advantage of the key cache. The default key cache size is 2 MB, and we could afford to increase it to 128 MB with no memory problems (a configuration sketch follows this list).
After data loading, run the expected queries to warm up the key cache.
2) The JVM works optimally with an 8-10 GB heap (we don't have numbers to prove it; just an observation).
3) Most importantly, if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using and set it to NOOP, which is good for Cassandra as it reads all keys from one section, reducing disk head movement (see the commands after this list).
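For reference, a sketch of what 1) and 3) look like in practice; the device name is a placeholder, and on Cassandra versions before 1.1 the key cache is configured per column family rather than globally in cassandra.yaml:
# cassandra.yaml (Cassandra 1.1+): size of the global key cache, in MB
key_cache_size_in_mb: 128
# check and change the disk scheduler (run as root; not persistent across reboots)
cat /sys/block/sda/queue/scheduler
echo noop > /sys/block/sda/queue/scheduler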
The above changes helped bring query times down to within acceptable limits.
Along with the above changes, if you have CFs that are small in size but frequently accessed, enable row caching for them.
Hope the above info is useful.