How can I find the current memory size?
Does the ActivePivot cube expose its current memory size? Or is it possible to find the memory size for the current date?
Example: total used size = 400 GB
CobDate = 2014040, total used size = 200 GB
The large data structures of ActivePivot (datastore, aggregates, primary index and bitmap index) are allocated in large contiguous chunks; in general they are not split by date or any other member, so you cannot distinguish the memory used by one date from another.
That being said, there are many ways to get the total memory usage of your ActivePivot solution (jvisualvm, jconsole, the Quartet FS health check agent...). Then use your favorite frontend to run a simple contributors.COUNT query along your date dimension. That gives you the number of records for each date and the total number of records, which is enough to compute an estimate of the memory usage for each date.
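For example, a minimal sketch of that estimate: the total memory figure and the per-date record counts below are hypothetical placeholders for whatever your monitoring tool and the contributors.COUNT query return.

import java.util.LinkedHashMap;
import java.util.Map;

public class PerDateMemoryEstimate {
    public static void main(String[] args) {
        // Hypothetical total memory reported by jvisualvm / jconsole / the health check agent.
        long totalUsedBytes = 400L * 1024 * 1024 * 1024; // e.g. 400 GB

        // Hypothetical contributors.COUNT results per date.
        Map<String, Long> recordCountPerDate = new LinkedHashMap<>();
        recordCountPerDate.put("2014-04-01", 60_000_000L);
        recordCountPerDate.put("2014-04-02", 40_000_000L);

        long totalRecords = recordCountPerDate.values().stream().mapToLong(Long::longValue).sum();

        // Assume memory usage is roughly proportional to the record count of each date.
        recordCountPerDate.forEach((date, count) -> {
            long estimateBytes = (long) ((double) count / totalRecords * totalUsedBytes);
            System.out.printf("%s ~ %.1f GB%n", date, estimateBytes / (1024.0 * 1024 * 1024));
        });
    }
}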
My Cassandra table looks like this:
CREATE TABLE cs_readwrite.cs_rw_test (
    part_id bigint,
    s_id bigint,
    begin_ts bigint,
    end_ts bigint,
    blob_data blob,
    PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC);
When I insert 1 million rows per client, with an 8 KB blob per row, and test the insertion speed from different client hosts, the speed is almost constant at ~100 mbps. But with the same table definition, from the same client hosts, if I insert rows with 16 bytes of blob data, my speed numbers are dramatically lower, ~4 to 5 mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not raw speed (though some inputs would help): when I add more clients, the speed stays almost constant for the bigger blob size, but for the 16-byte blob the speed increases only by 10-20% per added client before it becomes constant.
I have also looked at the bin/nodetool tablehistograms output and adjusted the number of partitions in my test data so that no partition is > 100 MB.
Any insights/links to documentation would be helpful. Thanks!
I think you are measuring the throughput in the wrong way. Throughput should be measured in transactions per second, not in data written per second.
The amount of data written can play a role in determining the write throughput of a system, but it usually depends on many other factors:
- Compaction strategy: STCS is write-optimized whereas LCS is read-optimized.
- Connection speed and latency between the client and the cluster, and between machines in the cluster.
- CPU usage of the node that is processing data, sending it to other replicas and waiting for their acknowledgment.
Most writes are applied in memory first rather than written directly to disk, which makes the amount of data being written almost negligible for the final write throughput, whereas fixed costs such as network delay and the CPU needed to coordinate the processing of data across nodes have a bigger impact.
The way you should see it is that with an 8 KB payload you are getting X transactions per second and with 16 bytes you are getting Y transactions per second. Y will always be better than X, but it will not be linearly proportional to the size difference.
You can find how writes are handled in Cassandra explained in detail here.
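As a back-of-the-envelope illustration, here is what the question's own figures look like when expressed in rows per second (treating the reported "mbps" as MB/s; the comparison works the same way with Mbit/s, and 4.5 MB/s is just the midpoint of the reported 4-5 range):

public class RowsPerSecond {
    public static void main(String[] args) {
        double largeBlobBytes = 8 * 1024;   // 8 KB payload per row
        double largeThroughput = 100 * 1e6; // ~100 MB/s observed (figure from the question)

        double smallBlobBytes = 16;         // 16-byte payload per row
        double smallThroughput = 4.5 * 1e6; // ~4-5 MB/s observed (figure from the question)

        // Despite the much lower bytes/sec, the small-blob workload completes far more rows/sec.
        System.out.printf("8 KB rows : ~%.0f rows/sec%n", largeThroughput / largeBlobBytes); // ~12,000
        System.out.printf("16 B rows : ~%.0f rows/sec%n", smallThroughput / smallBlobBytes); // ~280,000
    }
}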
There's management overhead in Cassandra per row/partition; the more data (in bytes) you have in each row, the less that overhead impacts throughput in bytes/sec. The reverse is true if you look at rows per second as a metric of throughput: the larger the payloads, the worse your rows/sec throughput gets.
What's the best/most reliable method of estimating the space required in Cassandra? My cluster consists of 2 nodes (RHEL 6.5) on Cassandra 3.11.2. I want to estimate the average size each row in every table will take in my database so that I can plan accordingly. I know about some methods like the nodetool status command, du -sh on the data directory, nodetool cfstats, etc. However, each of these gives different values, so I'm not sure which one I should use in my calculations.
Also, I found out that apart from the actual data, Cassandra stores various metadata in system tables like size_estimates, sstable_activity, etc. Does this metadata also keep growing with the data? What's the ratio of the space occupied by such metadata to the space occupied by the actual data in the database? And which particular configurations in the YAML (if any) should I keep in mind that might affect the size of the data?
A similar question was asked before but I wasn't satisfied by the answer.
If you are expecting 20 GB of data per day, here is the calculation.
1 day = 20 GB, 1 month = 600 GB, 1 year = 7.2 TB, so your raw data size for one year is 7.2 TB; with a replication factor of 3 it would be around 21.6 TB of data for one year.
Taking compaction into consideration, and your use case being write-heavy, if you go with size-tiered compaction you would need twice the space of your raw data.
So you would need around 43 TB to 45 TB of disk space.
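The same arithmetic as a small sketch, using the figures above (the 30-day months and the 2x size-tiered compaction headroom are the assumptions of this answer, not universal constants):

public class DiskSpaceEstimate {
    public static void main(String[] args) {
        double gbPerDay = 20;
        double rawPerYearTb = gbPerDay * 30 * 12 / 1000.0; // ~7.2 TB of raw data per year
        int replicationFactor = 3;                         // each row stored on 3 nodes
        double compactionHeadroom = 2.0;                   // rule of thumb for size-tiered compaction

        double totalTb = rawPerYearTb * replicationFactor * compactionHeadroom;
        System.out.printf("Raw/year: %.1f TB, with RF and compaction headroom: ~%.1f TB%n",
                rawPerYearTb, totalTb); // ~43 TB
    }
}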
I've noticed that relationships cannot be properly stored in C* due to its 100 MB partition limit; denormalization doesn't help in this case, and neither does the fact that C* can hold 2B cells per partition, since 2B cells of just longs already take 16 GB ?!?!? Doesn't that cross the 100 MB partition size limit?
Which is what I don't understand in general: C* proclaims it can have 2B cells per partition, yet a partition's size should not cross 100 MB???
What is the idiomatic way to do this? People say that this is an ideal use case for TitanDB or JanusGraph, which scale well to billions of nodes and edges. How do these databases, which use C* under the hood, model this data?
My use case is described here: https://groups.google.com/forum/#!topic/janusgraph-users/kF2amGxBDCM
Note that I'm fully aware that the answer to this question is "use an extra partition key to decrease partition size", but honestly, which of us has that option? Especially when modeling relationships... I'm not interested in relationships that happened in a particular hour...
The maximum number of cells (rows x columns) in a partition is 2 billion, and a single column value can be at most 2 GB (1 MB is recommended).
Source : http://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
A partition size of 100 MB is not a hard upper limit. If you check the DataStax docs:
For efficient operation, partitions must be sized within certain limits in Apache Cassandra™. Two measures of partition size are the number of values in a partition and the partition size on disk. Sizing the disk space is more complex, and involves the number of rows and the number of columns, primary key columns and static columns in each table. Each application will have different efficiency parameters, but a good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB
You can see that, for efficient operation and low heap pressure, the documentation simply gives a rule of thumb: keep the number of rows below 100,000 and the disk size under 100 MB in a single partition.
TitanDB and JanusGraph store graphs in adjacency list format, which means a graph is stored as a collection of vertices, each with its adjacency list. The adjacency list of a vertex contains all of the vertex's incident edges (and properties).
When Cassandra is used as the storage backend, the VertexID is the partition key, the PropertyKeyID or EdgeID is the clustering key, and the property value or edge properties are regular columns.
In TitanDB or JanusGraph the same rule applies for efficient operation and low heap pressure: keep the number of edges and properties of a vertex below 100,000 and the partition size under 100 MB.
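A rough illustration of that layout (this is not JanusGraph's actual storage code, just the shape of the adjacency-list mapping described above: one partition per vertex, one clustering entry per edge or property):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class AdjacencyListModel {
    // partition key -> (clustering key -> value)
    // vertexId      -> (edgeId or propertyKeyId -> edge properties / property value)
    private final Map<Long, TreeMap<Long, String>> edgestore = new HashMap<>();

    public void addEdge(long vertexId, long edgeId, String edgeProperties) {
        edgestore.computeIfAbsent(vertexId, v -> new TreeMap<>()).put(edgeId, edgeProperties);
    }

    // The 100,000-rows / 100 MB rule of thumb applies to each vertex's partition,
    // i.e. to the size of this per-vertex map.
    public int adjacencySize(long vertexId) {
        TreeMap<Long, String> adjacency = edgestore.get(vertexId);
        return adjacency == null ? 0 : adjacency.size();
    }
}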
There is a configuration item (max-split-size) to set the maximum size of one split. In other words, I can change the value of this item to change the number of splits.
I understand that more splits will use more CPU at the same time, and the query will become faster.
If so, why does Presto set the default value of this item to 32M instead of something like 1M?
There's overhead for each split that is created, so you don't want them to be too small. Also, some file formats like ORC can't be split smaller than the size of an ORC stripe, which tends to be tens to hundreds of megabytes.
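For intuition, a tiny sketch of how the number of splits (each carrying a fixed scheduling and bookkeeping cost) grows as the split size shrinks, using a hypothetical 1 GB file:

public class SplitCount {
    public static void main(String[] args) {
        long fileBytes = 1L * 1024 * 1024 * 1024; // a hypothetical 1 GB file

        for (long splitMb : new long[] {32, 1}) {
            long splits = fileBytes / (splitMb * 1024 * 1024);
            // 32 MB splits -> 32 splits; 1 MB splits -> 1024 splits, i.e. 32x more fixed overhead.
            System.out.printf("%d MB splits -> %d splits to schedule and track%n", splitMb, splits);
        }
    }
}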
We are currently in the process of deploying a larger Cassandra cluster and looking for ways to estimate the best size of the key cache. Or, more accurately, looking for a way of finding out the size of one row in the key cache.
I have tried tying into the integrated metrics system using Graphite, but I wasn't able to get any clear answer. I also tried putting my own debugging code into org.cassandra.io.sstable, but this didn't yield any concrete results either.
We are using Cassandra 1.20.10; is there any foolproof way of getting the size of one row in the key cache?
With best regards,
Ben
Check out jamm. It's a library used for measuring the size of an object in memory.
You need to add -javaagent:"/path/to/jamm.jar" to your startup parameters, but Cassandra is already configured to start with jamm, so if you are changing internal Cassandra code this is already done for you.
To get the size of an object (in bytes):
import org.github.jamm.MemoryMeter;

MemoryMeter meter = new MemoryMeter();
long sizeInBytes = meter.measureDeep(object); // object is whatever you want to measure
measureDeep is more costly, but it gives a much more accurate measurement of an object's memory size because it follows references instead of measuring only the shallow size.
For an estimate of the key cache size, let's assume you intend to store 1 million keys in the cache, each key 60 bytes long on average. There will be some overhead to store each key; let's say it is 40 bytes, which means the key cache size per row = 100 bytes.
Since we need to cache 1 million keys:
total key cache = 1 million * 100 bytes = 100 MB
Perform this calculation for each CF (column family) in your keyspace.
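That calculation as a small sketch (the 60-byte average key length and 40-byte per-key overhead are the assumptions from above, not measured values; repeat it with each column family's own numbers):

public class KeyCacheEstimate {
    public static void main(String[] args) {
        long keysToCache = 1_000_000L; // keys you expect to cache for one column family
        long avgKeyBytes = 60;         // assumed average key length
        long overheadBytes = 40;       // assumed per-key bookkeeping overhead

        long totalBytes = keysToCache * (avgKeyBytes + overheadBytes);
        System.out.printf("~%.0f MB of key cache for this column family%n", totalBytes / 1e6); // ~100 MB
    }
}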