I am new to cassandra and my cassandra is giving lot of read timeout errors..tweaked timout but still problem may be problem with design (for my application cassandra expected to store trillions of data):
Question 1 : In an all my cassandra tables i use UUID as rowkey...but for few tables just for maintainence i break that rule like in user table i make email id as rowkey....so that looking at tables i can understand data stored...IS using UUID right approach for huge case and second approach for user table is right or not ???????????
Question 2 : i have one relations table with startNodeId, relationTypeId, endNodeId...rowkey for that is UUID which is relationId.....i define secondary indexes on startNode, relationType, endNode as i can have lookup by any of them by business case.........becuase of that for each new row i have to do get to check ALREADY existing relation or not....One approach to avoid existing check is : i take startNodeId, relationTypeId, endNodeId SORT them and create HASH CODE and use that as ROWKEY...so my already checking explicitly will be avoided here..........IS THIS RIGHT approach ???????
Please guide me i am stuck at these thoughts...any guidance will really help me
Answering to your first question, until and unless you are comfortable in handling the rowkey with non-uuid value, its great also easier to track else go for the UUID.
Regarding to your second question, why don't you try the compound key. You don't have to maintain hashcode like stuffs, leave it on Cassandra.
1) Better use natural keys not UUIDs. Email, timestamp, composite primary keys, and so on. Using UUID is an approach from RDBMS world, you should avoid it in Cassandra
2) Read-modify-update is wrong pattern for Cassandra. Try rewritng data, if your business case allows this. Or just use timestamp and get the row with latest timestamp (don't forget about TTL).
Related
As the title says, I'm trying to query all the data I got with no value stored in it. I've been searching for a while, and the only operation allowed that I've found is CONTAINS, which doesn't fit my need.
consider the following table:
CREATE TABLE environment(
id uuid,
name varchar,
message text,
public Boolean,
participants set<varchar>,
PRIMARY KEY (id)
)
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because there's no way it can be done without doing a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways that table will be queried, and building it to support those queries.
The other issue that you'll have to deal with, is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice...it's much easier to query for data that exists, rather than data that does not exist. Although, when writing with lightweight transactions, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT and check for null on the app-side. Although that approach will be slow (but it won't stress the cluster). Probably the better approach, is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but this is probably the best way to handle a situation like this.
How can I query based on only one field in UDT (Cassandra) ?
I have a UDT which contains marital status and I need data for all the married people
but when I query based on one field it gives empty output
How can I do this ?
short answer - NO, at least not in the stock Cassandra. It's possible to do using the DSE Search, but it adds its own constraints. Maybe when SAI will be implemented some day, it will support indexing of UDT fields. I don't remember if SASI supports this, but it's really not recommended to use. But even if indexing was supported, it's still bad case (maybe except SAI) for Cassandra because it will create big partitions to represent each status.
The general rule is that you need to always query by partition key, with possibility to use secondary indexes when you need to search for another field, but only inside partition. If you need to query by multiple fields that not primary keys, I would suggest to take another database, or use Elasticsearch/Solr (although they aren't very good in geo-distributed environment).
We are trying to store user answers on questions. Answer is id based. When user comes to site we need to get all question ids to know that he had answered. Question count can be about 50000 or more. Currently, we are storing data in sql server using appendable blob (varbinary(max)). But we are searching database that can handle that situation. Casandra can give us non blob scheme and master - master replication. What is right way to store it in casandra? I known lists are not recommended for such situation. Does separate table give us suitable performance (<20-30ms) with composite key?
I'm not 100% sure that I understand how you'll fetch the data, but usually, if you have only one collection-like column in your table, then you can always implement it as an additional column as clustering column - this will solve a problem with accessing the part of the answers, etc. - when you have answer ID as clustering column, you can fetch only range of the answers, but when you use set, then you need to fetch everything...
For example, if you have table like this:
create table answers (
user_id text primary key,
answers set<int>
)
it could be always converted from collection to following:
create table answers (
user_id text,
answer_id int,
primary key (user_id, answer_id));
but actual table structure will depend on how you're performing queries for this data - this is primary requirement for data modeling for Cassandra.
P.S. I recommend to take DS220 course on DataStax Academy, to learn more about data modellng, or grab 3rd edition of Cassandra book, that was released recently - it has good chapter on data modeling.
Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct - you need to know primary key of the record to perform an update on the data, or deletion of specific record. There are several options here, depending of your data model:
Perform full scan of the table using effective token range scan (Look to this answer for more details);
If this is required very often, you can create a materialized view, with User.email as partition key, and fetch all message IDs that you can update (but you'll need to do this from your application, there is no nested query support in CQL). But also be aware that materialized views are "experimental" feature in Cassandra, and may not work all the time (it's more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
Do like 2nd item with your code, by using an additional table
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.
Is there a way for Cassandra's Thrift interface to know in advance whether a particular client query will use a compound keys defined table (CQL3)? How can you know what the schema is for the table?
Cassandra stores the schema information in some system tables. You could query those to get the schema information that indicates that the rows have a compound primary key.
But you might want to reconsider why you want to do this. Your application program should know the schema of the tables it manipulates; it should already know what tables it uses and what their primary key is.
Check out this question and the answer for details on how to determine the schema from system tables.
Anyways as Raedwald already said, you should probably ask twice why you'd want to do this.