Cassandra Node Capacity Calculation

I am trying to find out what the data-holding capacity of each Cassandra node in a cluster is before it starts showing latency. Basically, I need to figure out the right time to start adding new nodes to the existing cluster. I am referring to this page.
We use VMs with a single 100 GB data disk. Here is how I calculated the usable disk space for each node.
raw_capacity = disk_size * number_of_data_disk
= 100 G * 1
= 100 G
formatted_disk_space = (raw_capacity * 0.9)
= 100 G * 0.9
= 90 G
usable_disk_space = formatted_disk_space * (0.5 to 0.8)
= 90 G * 0.5
= 45 G
So this means each node can hold up to 45 GB of data. Is this the correct understanding?
Also, if I need to compare this with the current data size, can I directly compare it with the nodetool status output? As per the above calculation each node can hold up to 45 GB, whereas my cluster is holding only around 11 GB per node. I have been trying to read up on this, but I have not been able to work it out.
Datacenter: prod_east
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
UN <IP_1> 11.17 GB NO TOKENS ? <token> rack1
UN <IP_2> 12.23 GB NO TOKENS ? <token> rack1
UN <IP_3> 10.72 GB NO TOKENS ? <token> rack1
Any help here is highly appreciated.

The Load reported by nodetool status takes the replication factor into account, so each node may be holding up to 100% of the data, or less. Try adding the name of your keyspace as an argument to the nodetool status command and it will show you the share of the data that each node owns.
Here is an example:
nodetool status your_keyspace_name
Datacenter: dc1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
Address Load Tokens Owns Host ID Rack
UN 127.0.0.1 47.66 MB 1 33.3% x rack1
UN 127.0.0.2 47.67 MB 1 33.3% x rack1
UN 127.0.0.3 47.67 MB 1 33.3% x rack1
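To tie this back to the sizing arithmetic in the question, here is a minimal sketch of the same calculation in Java. The 0.9 formatting overhead and the conservative 0.5 compaction-headroom factor are the figures from the question, and 11.17 GB is the Load of the first node from the question's nodetool status output; the variable names and the 70% threshold are illustrative only, not a Cassandra rule.
public class NodeCapacityEstimate {
    public static void main(String[] args) {
        double rawCapacityGb = 100 * 1;            // disk_size * number_of_data_disks
        double formattedGb = rawCapacityGb * 0.9;  // ~10% lost to file-system formatting
        double usableGb = formattedGb * 0.5;       // 0.5-0.8 factor for compaction/repair headroom

        // Load reported by "nodetool status" is per node and already includes the
        // replicas that node stores, so it can be compared against usableGb directly.
        double nodeLoadGb = 11.17;

        double percentFull = 100.0 * nodeLoadGb / usableGb;
        System.out.printf("usable per node: %.1f GB, current load: %.2f GB (%.0f%% full)%n",
                usableGb, nodeLoadGb, percentFull);
        if (percentFull > 70) {                    // illustrative threshold only
            System.out.println("Consider adding nodes before latency degrades.");
        }
    }
}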

Related

Understanding read from a single partition in Cassandra

I have a 3-node setup: Node1 (172.30.56.60), Node2 (172.30.56.61) and Node3 (172.30.56.62).
There is a single partition with 100K rows, and the partition key is nodeip.
Here is the token value for the partition key nodeip = 172.30.56.60:
cqlsh:qnapstat> SELECT token(nodeip) FROM nodedata WHERE nodeip = '172.30.56.60' LIMIT 5;
system.token(nodeip)
----------------------
222567180698744628
222567180698744628
222567180698744628
222567180698744628
222567180698744628
As per the ./nodetool ring output provided below, only '172.30.56.60' will return the data to the coordinator, since the token range from 173960939250606057 to 239923324758894350 is handled by the node 172.30.56.60. Note: this is my understanding.
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 173960939250606057
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 239923324758894351
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 253117576269706963
172.30.56.60 rack1 Up Normal 32.72 MiB 100.00% 273249439554531014
172.30.56.61 rack1 Up Normal 32.84 MiB 100.00% 295635292275517104
172.30.56.62 rack1 Up Normal 32.88 MiB 100.00% 301162927966816823
I have two questions here,
1) When I try to execute the following query, does it mean that the coordinator (say 172.30.56.61) reads all the data from 172.30.56.60?
2) After receiving all 100K entries, does the coordinator perform the aggregation over those 100K rows, and if so, does it keep all 100K entries in memory on 172.30.56.61?
SELECT Max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60';
There is a nice tool called CQL TRACING that can help you understand and see the flow of events once a SELECT query is executed.
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 10);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 20);
cqlsh> INSERT INTO test.nodedata (nodeip, readiops) VALUES (1, 30);
cqlsh> select * from test.nodedata ;
nodeip | readiops
--------+-----------
1 | 10
1 | 20
1 | 30
(3 rows)
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
-----------------------
30
(1 rows)
Now let's turn tracing on with TRACING ON in cqlsh and run the same query again.
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1;
system.max(readiops)
----------------------
30
(1 rows)
Tracing session: 4d7bf970-eada-11e7-a79d-000000000003
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------
Execute CQL3 query | 2017-12-27 07:48:44.404000 | 172.16.0.128 | 0
read_data: message received from /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385109 | 172.16.0.48 | 9
read_data handling is done, sending a response to /172.16.0.128 [shard 4] | 2017-12-27 07:48:44.385322 | 172.16.0.48 | 222
Parsing a statement [shard 1] | 2017-12-27 07:48:44.404821 | 172.16.0.128 | --
Processing a statement [shard 1] | 2017-12-27 07:48:44.404913 | 172.16.0.128 | 93
Creating read executor for token 6292367497774912474 with all: {172.16.0.128, 172.16.0.48, 172.16.0.115} targets: {172.16.0.48} repair decision: NONE [shard 1] | 2017-12-27 07:48:44.404966 | 172.16.0.128 | 146
read_data: sending a message to /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.404972 | 172.16.0.128 | 152
read_data: got response from /172.16.0.48 [shard 1] | 2017-12-27 07:48:44.405497 | 172.16.0.128 | 676
Done processing - preparing a result [shard 1] | 2017-12-27 07:48:44.405535 | 172.16.0.128 | 715
Request complete | 2017-12-27 07:48:44.404722 | 172.16.0.128 | 722
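The same kind of trace can also be captured from application code. Below is a minimal sketch using the DataStax Java driver 3.x; the contact point 127.0.0.1 and the test.nodedata table are taken from the example above, and the sketch is only an illustration, not part of the original answer.
import com.datastax.driver.core.*;

public class TraceExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // enableTracing() asks the coordinator to record a trace for this statement
            Statement stmt = new SimpleStatement(
                    "SELECT MAX(readiops) FROM test.nodedata WHERE nodeip = 1").enableTracing();
            ResultSet rs = session.execute(stmt);
            System.out.println("max readiops = " + rs.one().getObject(0));

            // Fetch the trace events recorded by the cluster for this query
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            for (QueryTrace.Event e : trace.getEvents()) {
                System.out.printf("%-80s %s %6d us%n",
                        e.getDescription(), e.getSource(), e.getSourceElapsedMicros());
            }
        }
    }
}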
As for your questions:
The coordinator passes the query to the replicas. If RF = 1, or RF > 1 and CL = ONE, it will receive the reply from one replica; but if RF > 1 and CL > 1, it needs to receive replies from multiple replicas and compare the answers, so there is also orchestration done on the coordinator side.
The way it is actually done is: a data request is sent to the fastest replica (as determined by the snitch) and digest requests are sent to the other replicas needed to satisfy the CL.
The coordinator then hashes the data response and the digest responses and compares them.
Since the partition key hashes to a specific node, the partition resides on that node (assuming RF = 1) and data will be read only from that node.
The client sends the page size along with the query, so the reply itself is returned in pages (5000 rows by default), which can be set from the client side; see the sketch below.
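For completeness, both the consistency level and the page size mentioned above can be set from the client. A minimal sketch with the DataStax Java driver 3.x follows; the qnapstat keyspace name and the node IP come from the question's cqlsh prompt and ring output, while the specific values chosen here are illustrative only.
import com.datastax.driver.core.*;

public class ReadSettings {
    public static void main(String[] args) {
        // Cluster-wide defaults: CL and page size apply to every statement unless overridden
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.30.56.60")
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.ONE)
                        .setFetchSize(5000))       // default page size is 5000 rows
                .build();
        Session session = cluster.connect("qnapstat");

        // Per-statement override: stronger CL and smaller pages for this one query
        Statement stmt = new SimpleStatement(
                "SELECT Max(readiops) FROM nodedata WHERE nodeip = '172.30.56.60'")
                .setConsistencyLevel(ConsistencyLevel.QUORUM)
                .setFetchSize(1000);
        Row row = session.execute(stmt).one();
        System.out.println("max readiops = " + row.getObject(0));

        cluster.close();
    }
}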
I recommend watching this YouTube clip on the Cassandra read path for more details.

Cassandra query fails for some WHERE clauses

I have a large table that looks essentially like:
CREATE TABLE keyspace.table (
    node bigint,
    time bigint,
    core bigint,
    set bigint,
    value1 bigint,
    value2 bigint,
    PRIMARY KEY (node, time, core)
);
And a secondary index (possibly irrelevant) on column set.
When I do a stupidly simple query:
SELECT * FROM keyspace.table LIMIT 10;
It succeeds.
However, for some WHERE clauses, it fails, e.g.:
SELECT * FROM keyspace.table WHERE node = 12 LIMIT 10;
Traceback (most recent call last):
File "/usr/bin/cqlsh.py", line 1297, in perform_simple_statement
result = future.result()
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", line 3122, in result
raise self._final_exception
Unavailable: code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
And nothing shows up in the cassandra system log.
Details
The datacenter looks like:
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.64.8 234.84 GB 256 ? 478f59e1-4797-45df-81d1-77559e393e8a RAC1
DN 192.168.64.9 217.85 GB 256 ? 84d4bfd6-054a-433e-b6d4-3eead77609ac RAC1
DN 192.168.64.10 208.99 GB 256 ? c6ac565e-5897-4205-9439-779becf7fafe RAC1
UN 192.168.64.11 225.69 GB 256 ? 4a941e8f-db80-48f3-8eca-c4430b795694 RAC1
DN 192.168.64.12 208.57 GB 256 ? 34e57e18-e8b9-40d6-85e8-40a309e91b49 RAC1
DN 192.168.64.13 240.22 GB 256 ? 7a312c4f-01c0-4ed4-be7f-417b8f14f940 RAC1
DN 192.168.64.4 208.5 GB 256 ? 41a49d5c-e569-46f3-9f0e-97de43a22690 RAC1
UN 192.168.64.5 213.19 GB 256 ? b5bfb4e3-30a2-46b5-ba41-cf1a58a7355d RAC1
UN 192.168.64.6 212.21 GB 256 ? 9f6781ca-09b7-4923-8fa1-5b91079e2a18 RAC1
UN 192.168.64.7 235.97 GB 256 ? 5f429e7b-2e16-4796-b746-834500aeb884 RAC1
The keyspace:
CREATE KEYSPACE keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '2'} AND durable_writes = true;
This table is continuously ingesting 2000 rows per second using cassandra-loader and is fairly large. nodetool tablestats output (EDIT: tombstones cleaned up):
Keyspace: keyspace
Read Count: 0
Read Latency: NaN ms.
Write Count: 31945
Write Latency: 7.547574268273595 ms.
Pending Flushes: 0
Table: table
SSTable count: 9
Space used (live): 15414599520
Space used (total): 15414599520
Space used by snapshots (total): 0
Off heap memory used (total): 4506364
SSTable Compression Ratio: 0.41688805047558564
Number of keys (estimate): 251
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 0
Local read latency: NaN ms
Local write count: 31945
Local write latency: 8.261 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1528
Bloom filter off heap memory used: 1456
Index summary off heap memory used: 260
Compression metadata off heap memory used: 4504648
Compacted partition minimum bytes: 2300
Compacted partition maximum bytes: 74975550
Compacted partition mean bytes: 36544449
Average live cells per slice (last five minutes): 1.0222222222222221
Maximum live cells per slice (last five minutes): 2
Average tombstones per slice (last five minutes): 1.0222222222222221
Maximum tombstones per slice (last five minutes): 2
Dropped Mutations: 0
Output with tracing enabled
cqlsh> TRACING ON;
Now Tracing is enabled
cqlsh> SELECT * FROM keyspace.table WHERE node = 1223 LIMIT 10;
Traceback (most recent call last):
File "/usr/bin/cqlsh.py", line 1297, in perform_simple_statement
result = future.result()
File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", line 3122, in result
raise self._final_exception
Unavailable: code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
Let me know any other details I can provide.

Cassandra NoHostAvailableException when deletes are executed with cqlsh

We have a cluster with 7 nodes and we use the DataStax Java driver to connect to the cluster. The problem is that I am getting constant NoHostAvailableExceptions like this:
Caused by:
com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: /172.31.7.243:9042
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /172.31.7.245:9042
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /172.31.7.246:9042
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)), /172.31.7.247:9042,
/172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only
showing errors of first 3 hosts, use getErrors() for more details])
All the nodes are up:
UN 172.31.7.244 152.21 GB 256 14.5% 58abea69-e7ba-4e57-9609-24f3673a7e58 RAC1
UN 172.31.7.245 168.4 GB 256 14.5% bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752 RAC1
UN 172.31.7.246 177.71 GB 256 13.7% 8dc7bb3d-38f7-49b9-b8db-a622cc80346c RAC1
UN 172.31.7.247 158.57 GB 256 14.1% 94022081-a563-4042-81ab-75ffe4d13194 RAC1
UN 172.31.7.243 176.83 GB 256 14.6% 0dda3410-db58-42f2-9351-068bdf68f530 RAC1
UN 172.31.7.233 159 GB 256 13.6% 01e013fb-2f57-44fb-b3c5-fd89d705bfdd RAC1
UN 172.31.7.232 166.05 GB 256 15.0% 4d009603-faa9-4add-b3a2-fe24ec16a7c1 RAC1
but two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node.
I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that all the hosts become inaccessible.
Our configuration for the Java connection is:
com.datastax.driver.core.Cluster cluster = null;
// Get contact points
String[] contactPoints = this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
cluster = com.datastax.driver.core.Cluster.builder()
        .addContactPoints(contactPoints)
        .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
                this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
        .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
        .build();
Metadata metadata = cluster.getMetadata();
for (Host host : metadata.getAllHosts()) {
    LOG.info("Datacenter: " + host.getDatacenter() + "; Host: " + host.getAddress() + "; DC: " + host.getDatacenter() + "\n");
}
and the contact points are:
172.31.7.244,172.31.7.243,172.31.7.245,172.31.7.246,172.31.7.247
Does anyone know how I can solve this problem, or at least have a hint about how to deal with this situation?
Update: If I get the error messages with e.getErrors() I obtain:
/172.31.7.243:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.243:9042] Operation timed out,
/172.31.7.244:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.244:9042] Operation timed out,
/172.31.7.245:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.245:9042] Operation timed out,
/172.31.7.246:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.246:9042] Operation timed out,
/172.31.7.247:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.247:9042] Operation timed out}
UPDATE:
The replication factor of the keyspace is 3.
For the deletes, I am running them using different files containing the CQL queries:
cqlsh ip_node_1 -f script-1.duplicates
cqlsh ip_node_1 -f script-2.duplicates
cqlsh ip_node_1 -f script-3.duplicates
...
I am not specifying any consistency level, so it is using the default one, which is ONE.
Each of the previous files contains deletes like this:
DELETE FROM keyspace_name.search WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;
And the column family is:
CREATE TABLE search (
idline1 bigint,
idline2 bigint,
partid int,
id uuid,
field3 int,
field4 int,
field5 int,
field6 int,
field7 int,
field8 int,
field9 double,
field10 bigint,
field11 bigint,
field12 bigint,
field13 boolean,
field14 boolean,
field15 int,
field16 bigint,
field17 int,
field18 int,
field19 int,
field20 int,
field21 uuid,
field22 boolean,
PRIMARY KEY ((idline1, idline2, partid), id)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='Table with the snp between lines' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=0 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX search_partid ON search (partid);
CREATE INDEX search_field8 ON search (field8);
UPDATE (18-03-2016):
After the deletes start to be executed, I found that the CPU usage of some of the nodes increases a lot:
I checked the processes on those nodes and only Cassandra is running, but it is consuming a lot of CPU. The rest of the nodes are using almost no CPU.
UPDATE (04-04-2016): I do not know if it is related. I checked the nodes with high CPU usage (near 96%) and the GC activity remains at 1.6% (using only 3 GB of the 10 GB assigned).
Checking the thread pool stats:
nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 20042001 0 0
RequestResponseStage 0 0 149365845 0 0
MutationStage 32 117720 181498576 0 0
ReadRepairStage 0 0 799373 0 0
ReplicateOnWriteStage 0 0 13624173 0 0
GossipStage 0 0 5580503 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 32173 0 0
MigrationStage 0 0 9 0 0
MemtablePostFlusher 0 0 45044 0 0
MemoryMeter 0 0 9553 0 0
FlushWriter 0 0 9425 0 18
ValidationExecutor 0 0 15980 0 0
MiscStage 0 0 0 0 0
PendingRangeCalculator 0 0 7 0 0
CompactionExecutor 0 0 1293147 0 0
commitlog_archiver 0 0 0 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 273 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
I notice that the pending count for the MutationStage is growing but the active value remains the same. Could this be the problem?
I see two problems with your data model.
You use two secondary indexes. One is on a column that is part of the partition key. I don't know how Cassandra behaves in this case. The worst case is that, even if you use the complete partition key (as you do in your example delete), Cassandra does a lookup in the secondary index. In that case this would mean a full cluster scan, because secondary indexes are only stored per partition. Since only a part of the partition key is indexed, Cassandra does not know on which partition the index information lies. This behaviour would at least explain the timeouts.
You said you delete a lot of rows in a specific partition. That is also a problem. For each deletion Cassandra creates a tombstone. The more tombstones there are, the slower reads become. This will sooner or later lead to timeouts or exceptions (I believe Cassandra logs warnings when 1,000 tombstones are encountered and throws exceptions when 10,000 tombstones are encountered). By the way, these tombstones are also created in the secondary index. By default Cassandra will remove tombstones after gc_grace_seconds (10 days by default) when a compaction is performed. You can change this property per table; a sketch of doing so follows. More information on these table properties can be found here: Table Properties
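As a concrete illustration of changing that property per table with the DataStax Java driver, here is a minimal sketch. The table name keyspace_name.search and the contact point are taken from the question; the one-day value is only an example, and whatever value you choose must be compatible with how often you run repair.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TuneGcGrace {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("172.31.7.244").build();
             Session session = cluster.connect()) {
            // 86400 s = 1 day; tombstones older than this become eligible for removal at compaction
            session.execute("ALTER TABLE keyspace_name.search WITH gc_grace_seconds = 86400");
        }
    }
}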
I believe the first point could be the reason for the timeouts.

Coordinator gets response from one node notably later than from other nodes

Please help me to understand what I missed.
I see strange behavior from one cluster node on a SELECT with LIMIT and ORDER BY DESC clauses:
SELECT cid FROM test_cf WHERE uid = 0x50236b6de695baa1140004bf ORDER BY tuuid DESC LIMIT 1000;
TRACING (only part):
…
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:25.117000 | 10.0.23.15 | 7862
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:25.136000 | 10.0.25.57 | 6283
Sending REQUEST_RESPONSE message to /10.0.25.56 [MessagingService-Outgoing-/10.0.25.56] | 2016-02-29 22:17:38.568000 | 10.0.24.51 | 457931
…
10.0.25.56 - coordinator node
10.0.23.15, 10.0.24.51, 10.0.25.57 - node with data
The coordinator gets the response from 10.0.24.51 13 seconds later than from the other nodes! Why is that? How can I fix it?
The number of rows for the partition key (uid = 0x50236b6de695baa1140004bf) is about 300.
Everything is fine if we use ORDER BY ASC (our clustering order) or a LIMIT value smaller than the number of rows for this partition key.
The Cassandra (v2.2.5) cluster contains 25 nodes.
Every node holds about 400 GB of data.
Cluster is placed in AWS. Nodes are evenly distributed over 3 subnets in VPC. Type of instance for nodes is c3.4xlarge (16 CPU cores, 30GB RAM). We use EBS-backed storages (1TB GP SSD).
Keyspace RF equals 3.
Column family:
CREATE TABLE test_cf (
uid blob,
tuuid timeuuid,
cid text,
cuid blob,
PRIMARY KEY (uid, tuuid)
) WITH CLUSTERING ORDER BY (tuuid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction ={'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression ={'sstable_compression':'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
nodetool gcstats (10.0.25.57):
Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
1208504  368  4559  73  553798792712  58  305691840
nodetool gcstats (10.0.23.15):
Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
1445602  369  3120  57  381929718000  38  277907601
nodetool gcstats (10.0.24.51):
Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
1174966  397  4137  69  1900387479552  45  304448986
This could be due to a number of factors both related and not related to Cassandra.
Non-Cassandra Specific
How does the hardware (CPU, RAM, disk type (SSD vs. rotational)) on this node compare to the other nodes?
How is the network configured? Is traffic to this node slower than other nodes? Do you have a routing issue between the nodes?
How does the load on this server compare to other nodes?
Cassandra Specific
Is the JVM properly configured? Is GC running significantly more frequently than the other nodes? Check nodetool gcstats on this and other nodes to compare.
Has compaction been run on this node recently? Check nodetool compactionhistory
Are there any issues with corrupted files on disk?
Have you checked the system.log to see if it contains any relevant information?
Besides general Linux troubleshooting I would suggest you compare some of the specific C* functionality using nodetool and look for differences:
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html

Linux: Cannot allocate more than 32 GB/64 GB of memory in a single process due to virtual memory limit

I have a computer with 128 GB of RAM, running Linux (3.19.5-200.fc21.x86_64). However, I cannot allocate more than ~30 GB of RAM in a single process. Beyond this, malloc fails:
#include <stdlib.h>
#include <iostream>

int main()
{
    size_t gb_in_bytes = size_t(1) << size_t(30); // 1 GB in bytes (2^30).
    // try to allocate 1 block of 'i' GB.
    for (size_t i = 25; i < 35; ++i) {
        size_t n = i * gb_in_bytes;
        void *p = ::malloc(n);
        std::cout << "allocation of 1 x " << (n / double(gb_in_bytes)) << " GB of data. Ok? " << ((p == 0) ? "nope" : "yes") << std::endl;
        ::free(p);
    }
}
This produces the following output:
/tmp> c++ mem_alloc.cpp && a.out
allocation of 1 x 25 GB of data. Ok? yes
allocation of 1 x 26 GB of data. Ok? yes
allocation of 1 x 27 GB of data. Ok? yes
allocation of 1 x 28 GB of data. Ok? yes
allocation of 1 x 29 GB of data. Ok? yes
allocation of 1 x 30 GB of data. Ok? yes
allocation of 1 x 31 GB of data. Ok? nope
allocation of 1 x 32 GB of data. Ok? nope
allocation of 1 x 33 GB of data. Ok? nope
allocation of 1 x 34 GB of data. Ok? nope
I searched for quite some time, and found that this is related to the maximum virtual memory size:
~> ulimit -all
[...]
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
virtual memory (kbytes, -v) 32505856
[...]
I can increase this limit to ~64 GB via ulimit -v 64000000, but not further. Beyond this, I get operation not permitted errors:
~> ulimit -v 64000000
~> ulimit -v 65000000
bash: ulimit: virtual memory: cannot modify limit: Operation not permitted
~> ulimit -v unlimited
bash: ulimit: virtual memory: cannot modify limit: Operation not permitted
Some more searching revealed that in principle it should be possible to set these limits via the "as" (address space) entry in /etc/security/limits.conf. However, by doing this, I could only reduce the maximum amount of virtual memory, not increase it.
Is there any way to either lift this limit of virtual memory per process completely, or to increase it beyond 64 GB? I would like to use all of the physical memory in a single application.
EDIT:
Following Ingo Leonhardt, I tried ulimit -v unlimited after logging in as root, not as a standard user. Doing this solves the problem for root (the program can then allocate all the physical memory while logged in as root). But this works only for root, not for other users. However, at least this means that in principle the kernel can handle this just fine, and that there is only a configuration problem.
Regarding limits.conf: I tried explicitly adding
hard as unlimited
soft as unlimited
to /etc/security/limits.conf, and rebooting. This had no effect. After login as a standard user, ulimit -v still returns about 32 GB, and ulimit -v 65000000 still says permission denied (while ulimit -v 64000000 works). The rest of limits.conf is commented out, and in /etc/security/limits.d there is only one other, unrelated entry (limiting nproc to 4096 for non-root users). That is, the virtual memory limit must be coming from somewhere other than limits.conf. Any ideas what else could lead to ulimit -v not being "unlimited"?
EDIT/RESOLUTION:
It was caused by my own stupidity. I had a (long forgotten) program in my user setup which used setrlimit to restrict the amount of memory per process to prevent Linux from swapping to death. It was unintentionally copied from a 32 GB machine to the 128 GB machine. Thanks to Paul and Andrew Janke and everyone else for helping to track it down. Sorry everyone :/.
If anyone else encounters this: search for ulimit/setrlimit in your bash and profile settings and in programs potentially calling those (both your own and the system-wide /etc settings), and make sure that /etc/security/limits.conf does not include this limit... (or at least try creating a new user, to see if this happens in your user setup or the system setup).
This is a ulimit and system setup problem, not a C++ problem.
I can run your appropriately modified code on an Amazon EC2 instance type r3.4xlarge with no problem. These cost less than $0.20/hour on the spot market, and so I suggest you rent one, and perhaps take a look around in /etc and compare to your own setup... or maybe you need to recompile a Linux kernel to use that much memory... but it is not a C++ or gcc problem.
Ubuntu on the EC2 machine was already set up for unlimited process memory.
$ sudo su
# ulimit -u
--> unlimited
This one has 125 GB of RAM:
# free
total used free shared buffers cached
Mem: 125903992 1371828 124532164 344 22156 502248
-/+ buffers/cache: 847424 125056568
Swap: 0 0 0
I modified the limits on your program to go up to 149GB.
Here's the output. Looks good up to 118GB.
root#ip-10-203-193-204:/home/ubuntu# ./memtest
allocation of 1 x 25 GB of data. Ok? yes
allocation of 1 x 26 GB of data. Ok? yes
allocation of 1 x 27 GB of data. Ok? yes
allocation of 1 x 28 GB of data. Ok? yes
allocation of 1 x 29 GB of data. Ok? yes
allocation of 1 x 30 GB of data. Ok? yes
allocation of 1 x 31 GB of data. Ok? yes
allocation of 1 x 32 GB of data. Ok? yes
allocation of 1 x 33 GB of data. Ok? yes
allocation of 1 x 34 GB of data. Ok? yes
allocation of 1 x 35 GB of data. Ok? yes
allocation of 1 x 36 GB of data. Ok? yes
allocation of 1 x 37 GB of data. Ok? yes
allocation of 1 x 38 GB of data. Ok? yes
allocation of 1 x 39 GB of data. Ok? yes
allocation of 1 x 40 GB of data. Ok? yes
allocation of 1 x 41 GB of data. Ok? yes
allocation of 1 x 42 GB of data. Ok? yes
allocation of 1 x 43 GB of data. Ok? yes
allocation of 1 x 44 GB of data. Ok? yes
allocation of 1 x 45 GB of data. Ok? yes
allocation of 1 x 46 GB of data. Ok? yes
allocation of 1 x 47 GB of data. Ok? yes
allocation of 1 x 48 GB of data. Ok? yes
allocation of 1 x 49 GB of data. Ok? yes
allocation of 1 x 50 GB of data. Ok? yes
allocation of 1 x 51 GB of data. Ok? yes
allocation of 1 x 52 GB of data. Ok? yes
allocation of 1 x 53 GB of data. Ok? yes
allocation of 1 x 54 GB of data. Ok? yes
allocation of 1 x 55 GB of data. Ok? yes
allocation of 1 x 56 GB of data. Ok? yes
allocation of 1 x 57 GB of data. Ok? yes
allocation of 1 x 58 GB of data. Ok? yes
allocation of 1 x 59 GB of data. Ok? yes
allocation of 1 x 60 GB of data. Ok? yes
allocation of 1 x 61 GB of data. Ok? yes
allocation of 1 x 62 GB of data. Ok? yes
allocation of 1 x 63 GB of data. Ok? yes
allocation of 1 x 64 GB of data. Ok? yes
allocation of 1 x 65 GB of data. Ok? yes
allocation of 1 x 66 GB of data. Ok? yes
allocation of 1 x 67 GB of data. Ok? yes
allocation of 1 x 68 GB of data. Ok? yes
allocation of 1 x 69 GB of data. Ok? yes
allocation of 1 x 70 GB of data. Ok? yes
allocation of 1 x 71 GB of data. Ok? yes
allocation of 1 x 72 GB of data. Ok? yes
allocation of 1 x 73 GB of data. Ok? yes
allocation of 1 x 74 GB of data. Ok? yes
allocation of 1 x 75 GB of data. Ok? yes
allocation of 1 x 76 GB of data. Ok? yes
allocation of 1 x 77 GB of data. Ok? yes
allocation of 1 x 78 GB of data. Ok? yes
allocation of 1 x 79 GB of data. Ok? yes
allocation of 1 x 80 GB of data. Ok? yes
allocation of 1 x 81 GB of data. Ok? yes
allocation of 1 x 82 GB of data. Ok? yes
allocation of 1 x 83 GB of data. Ok? yes
allocation of 1 x 84 GB of data. Ok? yes
allocation of 1 x 85 GB of data. Ok? yes
allocation of 1 x 86 GB of data. Ok? yes
allocation of 1 x 87 GB of data. Ok? yes
allocation of 1 x 88 GB of data. Ok? yes
allocation of 1 x 89 GB of data. Ok? yes
allocation of 1 x 90 GB of data. Ok? yes
allocation of 1 x 91 GB of data. Ok? yes
allocation of 1 x 92 GB of data. Ok? yes
allocation of 1 x 93 GB of data. Ok? yes
allocation of 1 x 94 GB of data. Ok? yes
allocation of 1 x 95 GB of data. Ok? yes
allocation of 1 x 96 GB of data. Ok? yes
allocation of 1 x 97 GB of data. Ok? yes
allocation of 1 x 98 GB of data. Ok? yes
allocation of 1 x 99 GB of data. Ok? yes
allocation of 1 x 100 GB of data. Ok? yes
allocation of 1 x 101 GB of data. Ok? yes
allocation of 1 x 102 GB of data. Ok? yes
allocation of 1 x 103 GB of data. Ok? yes
allocation of 1 x 104 GB of data. Ok? yes
allocation of 1 x 105 GB of data. Ok? yes
allocation of 1 x 106 GB of data. Ok? yes
allocation of 1 x 107 GB of data. Ok? yes
allocation of 1 x 108 GB of data. Ok? yes
allocation of 1 x 109 GB of data. Ok? yes
allocation of 1 x 110 GB of data. Ok? yes
allocation of 1 x 111 GB of data. Ok? yes
allocation of 1 x 112 GB of data. Ok? yes
allocation of 1 x 113 GB of data. Ok? yes
allocation of 1 x 114 GB of data. Ok? yes
allocation of 1 x 115 GB of data. Ok? yes
allocation of 1 x 116 GB of data. Ok? yes
allocation of 1 x 117 GB of data. Ok? yes
allocation of 1 x 118 GB of data. Ok? yes
allocation of 1 x 119 GB of data. Ok? nope
allocation of 1 x 120 GB of data. Ok? nope
allocation of 1 x 121 GB of data. Ok? nope
allocation of 1 x 122 GB of data. Ok? nope
allocation of 1 x 123 GB of data. Ok? nope
allocation of 1 x 124 GB of data. Ok? nope
allocation of 1 x 125 GB of data. Ok? nope
allocation of 1 x 126 GB of data. Ok? nope
allocation of 1 x 127 GB of data. Ok? nope
allocation of 1 x 128 GB of data. Ok? nope
allocation of 1 x 129 GB of data. Ok? nope
allocation of 1 x 130 GB of data. Ok? nope
allocation of 1 x 131 GB of data. Ok? nope
allocation of 1 x 132 GB of data. Ok? nope
allocation of 1 x 133 GB of data. Ok? nope
allocation of 1 x 134 GB of data. Ok? nope
allocation of 1 x 135 GB of data. Ok? nope
allocation of 1 x 136 GB of data. Ok? nope
allocation of 1 x 137 GB of data. Ok? nope
allocation of 1 x 138 GB of data. Ok? nope
allocation of 1 x 139 GB of data. Ok? nope
allocation of 1 x 140 GB of data. Ok? nope
allocation of 1 x 141 GB of data. Ok? nope
allocation of 1 x 142 GB of data. Ok? nope
allocation of 1 x 143 GB of data. Ok? nope
allocation of 1 x 144 GB of data. Ok? nope
allocation of 1 x 145 GB of data. Ok? nope
allocation of 1 x 146 GB of data. Ok? nope
allocation of 1 x 147 GB of data. Ok? nope
allocation of 1 x 148 GB of data. Ok? nope
allocation of 1 x 149 GB of data. Ok? nope
Now, about that US$0.17 I spent on this...
