Maximum key size in Cassandra

I'm completely new to using Cassandra. Is there a maximum key size, and would that ever impact performance?
Thanks!

The key (and column names) must be under 64 KB.
Routing is O(N) in the key size, and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large "natural" keys use hashes of them instead to cut down the size.

http://en.wikipedia.org/wiki/Apache_Cassandra claims (apparently incorrectly!) that:
The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long
See also:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/value-size-is-there-a-suggested-limit-td4959690.html which suggests that there is some limit.
Clearly, very large keys could have some network performance impact if they need to be sent over the Thrift RPC interface - and they would cost storage. I'd suggest you try a quick benchmark to see what impact it has for your data.
One way to deal with this might be to pre-hash your keys and just use the hash value as the key, though this won't suit all use cases.
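If you do pre-hash, a minimal sketch of the idea in Python (SHA-1 is chosen here purely as an example digest) could be:

import hashlib

def row_key_for(natural_key: str) -> str:
    # Collapse an arbitrarily long natural key into a fixed 40-character hex string,
    # keeping the stored and transmitted key far below the 64 KB limit.
    return hashlib.sha1(natural_key.encode('utf-8')).hexdigest()

print(row_key_for('customer/region-eu/2011-07-04/' * 500))  # always 40 characters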

Related

Data storage parallelization in .NET Core

I am a little lost in this task. There is a requirement for our caching solution to split a large data dictionary into partitions and perform operations on them in separate threads.
The scenario is: we have a large pool of data that is to be kept in memory (40 million rows). The chosen strategy is first to have a Dictionary with an int key. This dictionary contains a set of 16 dictionaries that are keyed by Guid and contain a data class.
The number 16 is calculated on startup and indicates CPU core count * 4.
The data class contains a byte[] (which is basically a translated set of properties and their values), an int pointer into a metadata dictionary, and a checksum.
Then there is a set of control functions that takes care of locking and assigns/retrieves Guid-keyed data based on dividing the first segment of the Guid (8 hex digits) by a divider. This divider is simply FFFFFFFF / 16. This way each key has a corresponding partition assigned.
Now I need to figure out how to perform operations (key lookups, iterations, and writes) on these dictionaries in separate threads, in parallel. Should I just wrap these operations in Tasks? Or would it be better to load these behemoth dictionaries whole into separate threads?
I have a rough idea of how to implement the data collectors; that will be the easy part, I guess.
Also, is using Dictionaries a good approach? Their size is limited to 3 million rows per partition, and if one is full, the control mechanism tries to insert into another server that uses the exact same mechanism.
Is .NET actually a good platform for implementing this solution?
Any help will be extremely appreciated.
Okay, so I implemented ReaderWriterLockSlim and concurrent access through System.Threading.Tasks. I also managed to exclude the dataClass object from storage; now it is only a dictionary of byte[]s.
It's able to store all 40 million rows in just under 4 GB of RAM, and through some careful SIMD-optimized manipulations it performs EQUALS, <, > and SUM operation iterations in under 20 ms, so I guess this issue is solved.
Also the concurrency throughput is quite good.
I just wanted to post this in case anybody faces a similar issue in the future.
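For anyone reading later, here is a minimal sketch of the Guid-prefix partitioning described above, written in Python only to illustrate the idea; the partition count, lock type, and helper names are illustrative assumptions, not the poster's actual .NET code.

import threading
import uuid

PARTITIONS = 16                         # e.g. CPU core count * 4, computed at startup
DIVIDER = 0xFFFFFFFF // PARTITIONS      # first 8 hex digits of the Guid / divider -> partition index

shards = [dict() for _ in range(PARTITIONS)]
locks = [threading.Lock() for _ in range(PARTITIONS)]   # one lock per shard, so a write only blocks its own partition

def partition_of(key: uuid.UUID) -> int:
    first_segment = int(key.hex[:8], 16)                # first 8 hex digits of the Guid
    return min(first_segment // DIVIDER, PARTITIONS - 1)

def put(key: uuid.UUID, value: bytes) -> None:
    i = partition_of(key)
    with locks[i]:
        shards[i][key] = value

def get(key: uuid.UUID):
    i = partition_of(key)
    with locks[i]:
        return shards[i].get(key)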

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.
For starters:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded, you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limits keep increasing from version to version, but they are not fixed: they vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what's too large.
As of 2.1 and the early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb for reads to prevent unnecessary deserialization, but that reduces heap space and fills the old gen. You can increase the column index size, but that will increase worst-case IO costs on reads. There are also many different settings for CMS and G1 to tune the impact of the huge spike in object allocations when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.
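As a rough illustration of why that index deserialization hurts (the partition size below is a made-up example; 64 is the long-standing default for column_index_size_in_kb):

partition_size_bytes = 1 * 1024**3             # a hypothetical 1 GB partition
column_index_size_in_kb = 64                   # default index granularity
index_entries = partition_size_bytes // (column_index_size_in_kb * 1024)
print(index_entries)                           # ~16,384 index entries deserialized to read this partition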
Repairs also only go down to (in the best-case scenario) the partition level. So if, say, you are constantly appending to a partition, and the hashes of that partition on two nodes are compared at not exactly the same time (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you're still streaming massive amounts of data and churning disk significantly, which will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and problem scenarios to this list. Large partitions are often still readable, but the tuning and corner cases involved are not really worth it; it's better to design the data model to be friendly to how Cassandra expects it. I would recommend targeting 100 MB, but you can go beyond that comfortably. Once you get into the GBs you will need to start considering tuning for it (depending on data model, use case, etc.).

Estimating Cassandra KeyCache

We are currently in the process of deploying a larger Cassandra cluster and are looking for ways to estimate the best size of the key cache. Or, more accurately, we are looking for a way of finding out the size of one row in the key cache.
I have tried tapping into the integrated metrics system using Graphite, but I wasn't able to get any clear answer. I also tried putting my own debugging code into org.cassandra.io.sstable, but that didn't yield any concrete results either.
We are using Cassandra 1.20.10. Are there any foolproof ways of getting the size of one row in the key cache?
With best regards,
Ben
Check out jamm. It's a library for measuring the size of an object in memory.
You need to add -javaagent:"/path/to/jamm.jar" to your startup parameters, but Cassandra is already configured to start with jamm, so if you are changing internal Cassandra code this is already done for you.
To get the size of an object (in bytes):
MemoryMeter meter = new MemoryMeter();
long sizeInBytes = meter.measureDeep(object);
measureDeep is more costly than a shallow measure, but gives a much more accurate measurement of an object's memory size.
To estimate key size, let's assume you intend to store 1 million keys in the cache, each key 60 bytes long on average. There will be some overhead to store each key; let's say it is 40 bytes, which means the key size per row = 100 bytes.
Since we need to cache 1 million keys:
total key cache = 1 million * 100 bytes = 100 MB
Perform this calculation for each CF in your keyspace.
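The same estimate as a tiny script, if that helps; the 40-byte overhead is the assumption from above, not a measured constant:

def estimate_key_cache_bytes(cached_keys, avg_key_len, overhead=40):
    # cached keys * (average key length + per-entry bookkeeping overhead)
    return cached_keys * (avg_key_len + overhead)

print(estimate_key_cache_bytes(1_000_000, 60))  # 100,000,000 bytes, i.e. about 100 MB per CF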

Azure Table Storage - How fast can I table scan?

Everyone warns not to query against anything other than RowKey or PartitionKey in Azure Table Storage (ATS), lest you be forced to table scan. For a while, this has paralyzed me into trying to come up with exactly the right PK and RK and creating pseudo-secondary indexes in other tables when I needed to query something else.
However, it occurs to me that I would commonly table scan in SQL Server when I thought appropriate.
So the question becomes: how fast can I table scan an Azure Table? Is this a constant in entities/second, or does it depend on record size, etc.? Are there some rules of thumb as to how many records are too many to table scan if you want a responsive application?
The issue with a table scan has to do with crossing partition boundaries. The level of performance you are guaranteed is explicitly set at the partition level. Therefore, when you run a full table scan, it is a) not very efficient and b) has no guarantee of performance. This is because the partitions themselves are placed on separate storage nodes, and when you run a cross-partition scan you're consuming potentially massive amounts of resources (tying up multiple nodes simultaneously).
I believe that crossing these boundaries also results in continuation tokens, which require additional round trips to storage to retrieve the results. This in turn reduces performance and increases the transaction count (and subsequently cost).
If the number of partitions/nodes you're crossing is fairly small, you likely won't notice any issues.
But please don't quote me on this. I'm not an expert on Azure Storage; it's actually the area of Azure I'm the least knowledgeable about. :P
I think Brent is 100% on the money, but if you still feel you want to try it, I can only suggest running some tests to find out for yourself. Try to include the PartitionKey in your queries to avoid crossing partitions, because at the end of the day that's the performance killer.
Azure tables are not optimized for table scans. Scanning the table might be acceptable for a long-running background job, but I wouldn't do it when a quick response is needed. With a table of any reasonable size you will have to handle continuation tokens as the query reaches a partition boundary or obtains 1k results.
The Azure storage team has a great post which explains the scalability targets. The throughput target for a single table partition is 500 entities/sec. The overall target for a storage account is 5,000 transactions/sec.
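For a feel of what a full scan costs in round trips alone, here's a back-of-the-envelope estimate; the entity count and latency are assumptions, not Azure figures:

entities = 10_000_000               # hypothetical table size
page_size = 1000                    # max entities per response before a continuation token
latency = 0.05                      # assumed seconds per round trip
pages = -(-entities // page_size)   # ceiling division
print(pages, pages * latency)       # 10,000 serial round trips, roughly 500 s spent on latency alone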
The answer is pagination. Use top_size (the maximum number of results or records per result set) in conjunction with next_partition_key and next_row_key, the continuation tokens. That makes a significant difference in performance. For one, your results are statistically more likely to come from a single partition. Plain results show that result sets are grouped by the partition continuation key and not the row continuation key.
In other words, you also need to think about your UI or system output. Don't bother returning more than 10 to 20 results, 50 at most. The user probably won't utilize or examine any more.
Sounds foolish? Do a Google search for "dog" and notice that the search returns only 10 items, no more. The next records are available if you bother to hit 'continue'. Research has shown that almost no user ventures beyond that first page.
The select option (returning a subset of the key-values) may also make a difference; for example, use select = "PartitionKey,RowKey" or 'Name', or whatever minimum you need.
"I believe, that the effect of crossing these boundaries also results
in continuation tokens, which require additional round-trips to
storage to retrieve the results. This results then in reducing
performance, as well as an increase in transaction counts (and
subsequently cost)."
...is slightly incorrect. The continuation token is used not because of crossing boundaries but because Azure tables permit no more than 1,000 results per query; therefore the two continuation tokens are used to fetch the next set. The default top_size is essentially 1000.
For your viewing pleasure, here's the description of query_entities from the Azure Python API; others are much the same.
'''
Get entities in a table; includes the $filter and $select options.
table_name: Table to query.
filter:
Optional. Filter as described at
http://msdn.microsoft.com/en-us/library/windowsazure/dd894031.aspx
select: Optional. Property names to select from the entities.
top: Optional. Maximum number of entities to return.
next_partition_key:
Optional. When top is used, the next partition key is stored in
result.x_ms_continuation['NextPartitionKey']
next_row_key:
Optional. When top is used, the next row key is stored in
result.x_ms_continuation['NextRowKey']
'''
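Going by that docstring, a sketch of walking a table page by page might look like the following; table_service, the table name, and process() are placeholders, and the continuation handling assumes x_ms_continuation behaves like the dict described above:

next_pk, next_rk = None, None
while True:
    batch = table_service.query_entities('mytable', top=1000,
                                         next_partition_key=next_pk,
                                         next_row_key=next_rk)
    for entity in batch:
        process(entity)                      # your own handling here
    continuation = getattr(batch, 'x_ms_continuation', None) or {}
    next_pk = continuation.get('NextPartitionKey')
    next_rk = continuation.get('NextRowKey')
    if not next_pk and not next_rk:          # no token returned: the scan is finished
        break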

How much data (many MB) can I uniquely identify using MD5

I've got millions of data records that are each about 2 MB in size. Each of these pieces of data is stored in a file, and there is a set of other data associated with that record (stored in a database).
When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the memory, then using this hash as a key into the database. The key will help me locate the other data.
What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2 MB piece of data, meaning: can I use an MD5 hash without worrying too much about collisions?
I realize there is a chance of collision; my concern is how likely a collision is across millions of 2 MB data records. Is a collision a likely occurrence? What about when compared to hard disk failure or other computer failures? How much data can MD5 be used to safely identify? What about millions of GB-sized files?
I'm not worried about malice or data tampering. I've got protections in place such that I won't be receiving manipulated data.
This boils down to the so-called birthday paradox. The Wikipedia page has simplified formulas for evaluating the collision probability, and it will be a very small number.
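As a hedged back-of-the-envelope check using the standard approximation p ≈ n² / (2 · 2^128), treating MD5 as an ideal 128-bit hash and taking 10 million records as a stand-in for "millions":

n = 10_000_000              # hypothetical number of 2 MB records
p = n * n / (2 * 2**128)
print(p)                    # about 1.5e-25: vanishingly small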
The next question is how you deal with, say, a 10^-12 collision probability; see this very similar question.
