I have a Cassandra UDT column that has about 10 attributes and now we are planning to add 3 more attributes to it. I am wondering if it would behave well if I alter the UDT type in the higher environments which has very large volume of data.
Altering a UDT is the same as altering a table, except that you cannot drop an existing UDT unless you first drop all the dependent tables. You also cannot change the type of an existing field. The following query adds a new field to a UDT:
ALTER TYPE commentmetadata ADD columnname <type>;
It should be safe.
Just a few precautions:
Don't run it in a mixed-version Cassandra cluster.
Don't apply the same schema change concurrently from multiple clients (drivers).
Additional pointers when you alter a table that uses UDT or collection data types:
Never insert more than 2 billion items in a collection, as only that number can be queried.
The maximum number of keys for a map collection is 65,535.
The maximum size of an item in a list or a map collection is 2GB.
The maximum size of an item in a set collection is 65,535 bytes.
Keep collections small to prevent delays during querying.
Collections cannot be "sliced"; Cassandra reads a collection in its entirety, impacting performance.
[Alex Ott] MAP & LIST limits are version dependent.
65,535 bytes are supported by v3.0+ while lower versions are limited to 64,000 bytes. Fix version ticket.
I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is a (int: clientId, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find by PK it fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary-index, and the primary_key+clustering subphrases in a create table command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC);
insert into record (id,version,clientid,payload) values
(d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values
(d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point. If one was looking for a single row given the clientId, and UUID ---AND--- Cassandra allowed you to skip specifying the clientId so it wouldn't know which node(s) to search, then sure that find could be slow. But it doesn't:
select * from record where id=
d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: ... despite the performance unpredictability,
use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude Cassandra handles high cardinality tables searches that return "very few results" just fine?
Anything that requires reading the entire contents of the database won't work, which is the case when scanning on id, since any of your clientid partitions may contain a given id. Walking through potentially thousands of sstables per host, and through each partition in each of those, to check will not work. If you are having a hard time with the data model and don't fully understand the difference between partition keys and clustering keys, I would recommend walking through some introductory classes (e.g. DataStax Academy), YouTube videos, or a book before designing your schema. This is not a relational database, and designing around your data instead of your queries will get you into trouble. When moving from Oracle, you should not just copy your tables over and move the data, or it will not work well.
The clustering key defines the order in which the data for a partition is stored on disk, which is what it is referring to as the "built-in index". Each sstable has an index component that contains the partition key locations for that sstable. It also includes an index of the clustering keys for each partition every 64 KB (by default, at least) that can be searched. The clustering keys that fall between these indexed points are unknown, so they all have to be checked. A long time ago a bloom filter of clustering keys was kept as well, but the cases where it helped were so rare relative to its overhead that it was removed in 2.0.
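The sparse index described above can be illustrated with a small in-memory model. This is only a sketch, not Cassandra's actual on-disk format: the names ROWS, INDEX_INTERVAL, build_sparse_index and lookup are made up for illustration, and "every 8th row" stands in for "every 64 KB".

```python
from bisect import bisect_right

# Hypothetical model: one partition's rows, sorted by clustering key.
ROWS = [(key, f"payload-{key}") for key in range(0, 100, 3)]  # keys 0, 3, ..., 99
INDEX_INTERVAL = 8  # stand-in for the 64 KB interval: index every 8th row


def build_sparse_index(rows, interval):
    """Record (clustering key, row offset) for every `interval`-th row."""
    return [(rows[i][0], i) for i in range(0, len(rows), interval)]


def lookup(rows, index, key):
    """Jump to the nearest indexed point at or before `key`, then scan forward.

    Returns (payload or None, number of rows that had to be checked).
    """
    index_keys = [k for k, _ in index]
    slot = bisect_right(index_keys, key) - 1
    if slot < 0:
        return None, 0  # key sorts before the first row
    start = index[slot][1]
    end = index[slot + 1][1] if slot + 1 < len(index) else len(rows)
    scanned = 0
    for row_key, payload in rows[start:end]:
        scanned += 1
        if row_key == key:
            return payload, scanned
        if row_key > key:
            break  # rows are sorted, so the key is absent
    return None, scanned


INDEX = build_sparse_index(ROWS, INDEX_INTERVAL)
```

The index narrows the search to one block, but every row between two indexed points still has to be checked one by one, which is the cost the paragraph above refers to.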
Secondary indexes are difficult to scale well, which is where the warning about cardinality comes from. I would strongly recommend just denormalizing the data and not using an index in any form, as large scatter-gather queries across a distributed system are going to have availability and performance issues. If you really need one, check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data right (not worth it in my opinion).
Using Cassandra as db:
Say we have this schema
primary_key((id1),id2,type) with index on type, because we want to query by id1 and id2.
Is a query like
SELECT * FROM my_table WHERE id1=xxx AND type='some type'
going to perform well?
I wonder if we have to create and manage another table for this situation?
The way you are planning to use secondary index is ideal (which is rare). Here is why:
You specify the partition key (id1) in your query. This ensures that only the relevant partition (node) will be queried, instead of hitting all the nodes in the cluster (which is not scalable).
You are (presumably) indexing an attribute of low cardinality (I can imagine you have maybe a few hundred types?), which is the sweet spot when using secondary indexes.
Overall, your data model should perform well and scale. Yet, if you are looking for optimal performance, I would suggest you use an additional table ((id1), type, id2).
Final note: if you have a limited number of types, you might consider using solely ((id1), type, id2) as a single table. When querying by id1-id2, just issue a few parallel queries against the possible values of type.
The final decision needs to take into account your target latency, the disk usage (duplicating table with a different primary key is sometimes too expensive), and the frequency of each of your queries.
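The "few parallel queries" idea can be sketched as follows. This is a minimal model under stated assumptions: TABLE is a hypothetical in-memory stand-in for a table keyed ((id1), type, id2), KNOWN_TYPES is assumed to be small and known in advance, and query_one stands in for a point query that a real driver would execute.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a table keyed ((id1), type, id2).
TABLE = {
    ("a", "t1", 1): "row-a-t1-1",
    ("a", "t2", 1): "row-a-t2-1",
    ("a", "t2", 2): "row-a-t2-2",
    ("b", "t1", 1): "row-b-t1-1",
}
KNOWN_TYPES = ["t1", "t2", "t3"]  # assumption: small, known set of types


def query_one(id1, type_, id2):
    """Stand-in for: SELECT * FROM ... WHERE id1=? AND type=? AND id2=?"""
    return TABLE.get((id1, type_, id2))


def query_by_id1_id2(id1, id2):
    """Fan out one point query per possible type and keep the hits."""
    with ThreadPoolExecutor(max_workers=len(KNOWN_TYPES)) as pool:
        # pool.map returns results in the same order as KNOWN_TYPES.
        results = pool.map(lambda t: query_one(id1, t, id2), KNOWN_TYPES)
        return [row for row in results if row is not None]
```

The number of queries grows with the number of types, so this only makes sense while that set stays small.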
I see info here on collection size limits in cassandra, but it includes this note: "The limits specified for collections are for non-frozen collections." I can't find limits on frozen collections defined anywhere.
Frozen collections are treated as blobs so there is no imposed limit on them (other than the overall size that you would want to have in partitions etc).
Frozen collections are useful if you want to use them in the primary key. A frozen collection can only be replaced as a whole; you cannot, for example, add or remove elements in a frozen collection.
We're looking for a tool (preferably open source) that helps us perform complex queries (advanced filtering and joins, no need for full SQL) in real time.
Assume that all the data needed fits in memory, and we want to avoid, if possible, the overhead of map reduce tools.
To be more specific, we need to load n partitions of a single table, and join them by clustering column.
Variables Table:
Variable ID: Partition key
Person ID: Clustering key
Variable Value
Desired output columns:
Person ID, Variable 1 Value, Variable 2 Value, ..., Variable N Value
We can achieve it by an in-memory load-filter-join process, but we were wondering if there's any tool out there with this use case covered out of the box and with a fair performance.
We've tested Spark, but the partitioning of Spark C* connector is based on the primary key, so each Variable ID would be loaded in a different Spark node, and the join process would be really slow (all the data would travel all over the Spark cluster).
Any tips? Known tools?
I believe that you have a number of options to perform this task:
Rethink your database schema, denormalize it. var_id:person_id:value rows are not the best table schema if you want to query by person_id (and it smells really bad as an entity-attribute-value db antipattern):
EAV gives a flexibility to the developer to define the schema as needed and this is good in some circumstances. On the other hand it performs very poorly in the case of an ill-defined query and can support other bad practices. In other words, EAV gives you enough rope to hang yourself and in this industry, things should be designed to the lowest level of complexity because the guy replacing you on the project will likely be an idiot.
You can use a schema with multiple columns (Cassandra can handle a lot of them):
create table person_data (
person_id int primary key,
var1 text,
var2 text,
var3 text,
var4 text,
....
);
If you don't have a predefined set of variables, you can use cql3 collections like map for storing the data in a more flexible way.
Create a secondary index on person_id (even though it's already a clustering key). You can query for all data for a specific user without using joins, but with some issues:
As your query will hit multiple partitions, it will require not a single disk seek, but a series of them, so your query latency may be higher than you're expecting.
Secondary indexes are not free: C* must perform more work under the hood when you insert a row into a table with indexed columns.
Use an external index such as Elasticsearch/Solr if you plan to have a lot of complex queries that do not fit well into cql3.
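For reference, the in-memory load-filter-join that the question describes is small enough to sketch directly. This is only an illustration with made-up data: ROWS and pivot_by_person are hypothetical names, and each tuple models (Variable ID, Person ID, Variable Value) from the Variables table.

```python
from collections import defaultdict

# Hypothetical rows as (variable_id, person_id, value), i.e. the
# partition key, clustering key and value of the Variables table.
ROWS = [
    ("height", 1, "180"),
    ("height", 2, "165"),
    ("weight", 1, "75"),
    ("weight", 3, "60"),
]


def pivot_by_person(rows, variable_ids):
    """Join the selected partitions on person_id, one output row per person."""
    wanted = set(variable_ids)
    by_person = defaultdict(dict)
    for variable_id, person_id, value in rows:
        if variable_id in wanted:
            by_person[person_id][variable_id] = value
    # A missing value becomes None, as in an outer join.
    return [
        (person_id, *(values.get(v) for v in variable_ids))
        for person_id, values in sorted(by_person.items())
    ]
```

Calling pivot_by_person(ROWS, ["height", "weight"]) yields one tuple per person in the column order Person ID, Variable 1 Value, Variable 2 Value. A single pass over the loaded rows is enough, provided all the selected partitions fit in memory as the question assumes.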
Is it a performance issue if we have two or more secondary indexes on a column family? I have orderid, city and shipmenttype. So I thought I would create the primary key on orderid and secondary indexes on city and shipmenttype, and use a combination of the secondary index columns while querying. Is that bad modelling?
Consider the data that will be placed in the secondary index. Looking at the docs, you want to avoid columns with high cardinality. If your city and shipment type values vary greatly (or conversely, too similarly) then a secondary index may not be the right fit.
Look into potentially maintaining a separate table with this information. This would behave as a manual index of sorts, but have the additional benefit of behaving as you expect a Cassandra table should. When you create or update records, be sure to update this index table. Writes are cheap; performing multiple writes over the course of updating a record is not unheard of.
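The bookkeeping for such a manual index table can be sketched with two dictionaries. This is a hypothetical in-memory model, not driver code: orders and orders_by_city stand in for two Cassandra tables, and reading the previous value before writing is a simplification made for the sketch.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two Cassandra tables.
orders = {}                        # orderid -> {"city": ..., "shipmenttype": ...}
orders_by_city = defaultdict(set)  # city -> {orderid, ...}, the manual index


def save_order(orderid, city, shipmenttype):
    """Write the record and keep the index table in step with it."""
    previous = orders.get(orderid)
    if previous is not None and previous["city"] != city:
        # The indexed value changed: remove the stale index entry first.
        orders_by_city[previous["city"]].discard(orderid)
    orders[orderid] = {"city": city, "shipmenttype": shipmenttype}
    orders_by_city[city].add(orderid)


def find_by_city(city):
    """Read the index table, then fetch each record by its primary key."""
    return [
        {"orderid": oid, **orders[oid]}
        for oid in sorted(orders_by_city.get(city, ()))
    ]
```

Every create or update costs an extra write (plus a delete when the indexed value changes), which is the trade-off the paragraph above accepts in exchange for predictable reads.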
When looking at your access patterns, will you be using the partition key as part of the WHERE clause, or just the secondary indexes?
If you're performing a query against the secondary indexes along with the partition key you will achieve better performance than when you just query with secondary indexes.
For example, with WHERE orderid = 'foo' AND shipmenttype = 'bar' the request will only be sent to nodes responsible for the partition where foo is stored. Then the secondary index will be consulted for shipmenttype = 'bar' and your results will be returned.
When you run a query with just WHERE shipmenttype = 'bar' the query is sent to all nodes in the cluster before the secondary indexes are consulted for looking up rows. This is less than ideal.
Additionally, should you query against multiple secondary indexes in a single request, you must use ALLOW FILTERING. This will only consult ONE secondary index during your request, usually the more specific of the indexes referenced. This will cause a performance hit, as all records returned from checking the first index must then be checked against the other values listed in your WHERE clause.
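The "one index, then filter" behaviour described above can be modelled in a few lines. This is a simplified sketch, not Cassandra's actual query planner: ROWS, build_index and query are hypothetical, and "more specific" is approximated here by whichever index returns fewer candidates.

```python
# Hypothetical rows held by one node: (orderid, city, shipmenttype).
ROWS = [
    ("o1", "Paris", "air"),
    ("o2", "Paris", "sea"),
    ("o3", "Rome", "air"),
    ("o4", "Paris", "air"),
]


def build_index(rows, column):
    """Map each value of `column` to the rows that contain it."""
    index = {}
    for row in rows:
        index.setdefault(row[column], []).append(row)
    return index


CITY_INDEX = build_index(ROWS, 1)
SHIPMENT_INDEX = build_index(ROWS, 2)


def query(city, shipmenttype):
    """Consult one index only, then filter its candidates on the other column.

    Returns (matching orderids, number of candidate rows that were checked).
    """
    by_city = CITY_INDEX.get(city, [])
    by_shipment = SHIPMENT_INDEX.get(shipmenttype, [])
    # Assumption of this model: selectivity is just the candidate count.
    if len(by_city) <= len(by_shipment):
        candidates, matches = by_city, lambda r: r[2] == shipmenttype
    else:
        candidates, matches = by_shipment, lambda r: r[1] == city
    hits = [r[0] for r in candidates if matches(r)]
    return hits, len(candidates)
```

The second number in the result is the work done after the index lookup: every candidate from the chosen index is read and checked, whether or not it ends up in the result.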
Should you be using a secondary index, always strive to include the partition key in the query. Secondly, do NOT use multiple secondary indexes when querying a table; this will cause a major performance hit.
Ultimately your performance is determined by how you construct your queries against the partition and secondary indexes.