I am trying to design an application log table in Cassandra:
CREATE TABLE log(
yyyymmdd varchar,
created timeuuid,
logMessage text,
module text,
PRIMARY KEY(yyyymmdd, created)
);
Now when I perform the following query, it works as expected:
select * from log where yyyymmdd = '20180223' LIMIT 50;
The above query is without grouping, kind of global.
Currently I created a secondary index on 'module', so I am able to perform the following:
select * from log where yyyymmdd = '20180223' AND module LIKE 'test' LIMIT 50;
Now my concern is: without the secondary index, is there an efficient way to query based on the module and fetch the data, or is there a better design? Also, please point out any performance issues in the current design.
For fetching based on module and date, you can only use another table, like this:
CREATE TABLE module_log(
yyyymmdd varchar,
created timeuuid,
logMessage text,
module text,
PRIMARY KEY((module,yyyymmdd), created)
);
This gives you a single partition for every combination of the module and yyyymmdd values, so you won't have very wide partitions.
Also, take into account that if you create a secondary index only on the module field, you may get problems with too-big index partitions (I assume that you have a very limited number of module values?).
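If you go this route, your application writes every log entry to both tables. Here is a minimal sketch with the Java driver 3.x (assumed; session setup omitted, values illustrative), using a logged batch so the two copies stay consistent:
import java.util.UUID;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.utils.UUIDs;

PreparedStatement byDay = session.prepare(
    "INSERT INTO log (yyyymmdd, created, logMessage, module) VALUES (?, ?, ?, ?)");
PreparedStatement byModule = session.prepare(
    "INSERT INTO module_log (yyyymmdd, created, logMessage, module) VALUES (?, ?, ?, ?)");

// A logged batch guarantees that both tables eventually receive the write,
// at the cost of some extra coordination overhead.
UUID created = UUIDs.timeBased();
BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
batch.add(byDay.bind("20180223", created, "some log message", "test"));
batch.add(byModule.bind("20180223", created, "some log message", "test"));
session.execute(batch);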
P.S. Are you using pure Cassandra, or DSE?
For my chat table design in Cassandra I have the following schema:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId, fromUserId), date)
) WITH CLUSTERING ORDER BY (date ASC);
The following query:
SELECT * FROM public_messages WHERE chatroomid=? LIMIT 20
Results in the typical message:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING;
Obviously I'm doing something wrong with the partitioning here.
I'm not experienced with Cassandra, and I'm a bit confused by online suggestions that this query would make Cassandra scan the entire table, which I don't really get. Why would I want to fetch an entire table?
Another suggestion I read about is to partition by time, e.g. to fetch the latest messages per day. But this doesn't work for me: you don't know when the latest chat message occurred.
It could be in the last day, the last hour, or the last week or month for that matter.
I'm pretty much used to SQL, and to NoSQL stores like Mongo, but this simple use case seems to be a problem for Cassandra. So what is the recommended approach here?
Edit:
It seems that it is common practice to add a bucket integer.
Let's say I create a bucket per 50 messages; is there a way to auto-increment it when the bucket is full?
I would prefer not to fetch the MAX bucket and calculate when the bucket is full; that seems like bad performance on inserts.
It also seems like a bad idea to manage the buckets in Java: things like app restarts or load balancing would require extra logic.
(I currently use Java Spring JPA for Cassandra).
It works without bucketing using the following table design:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId), date)
) WITH CLUSTERING ORDER BY (date DESC);
I had to remove fromUserId from the partition key; I assume that, as part of the partition key, it would have to be included in the WHERE clause to avoid the error.
The JPA query:
publicMessageRepository.findFirst20ByPkChatRoomIdOrderByPkDateDesc(chatRoomId);
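For completeness, here is a minimal sketch of the Spring Data for Apache Cassandra mapping that such a derived query relies on (class and field names are assumptions made to match the query above):
import java.io.Serializable;
import java.util.Date;
import java.util.List;
import org.springframework.data.cassandra.core.cql.Ordering;
import org.springframework.data.cassandra.core.cql.PrimaryKeyType;
import org.springframework.data.cassandra.core.mapping.*;
import org.springframework.data.cassandra.repository.CassandraRepository;

@PrimaryKeyClass
public class PublicMessageKey implements Serializable {
    @PrimaryKeyColumn(name = "chatRoomId", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    private String chatRoomId;
    @PrimaryKeyColumn(name = "date", ordinal = 1, type = PrimaryKeyType.CLUSTERED,
                      ordering = Ordering.DESCENDING)
    private Date date;
    // getters, equals and hashCode omitted
}

@Table("public_messages")
public class PublicMessage {
    @PrimaryKey
    private PublicMessageKey pk;   // the "Pk" prefix in the query navigates into this field
    private String fromUserId;
    private String fromUserNickName;
    private String message;
}

public interface PublicMessageRepository extends CassandraRepository<PublicMessage, PublicMessageKey> {
    // "First20" caps the result size; the ordering matches the clustering order
    List<PublicMessage> findFirst20ByPkChatRoomIdOrderByPkDateDesc(String chatRoomId);
}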
CREATE TABLE feed (
identifier text,
post_id int,
score int,
reason text,
timestamp timeuuid,
PRIMARY KEY ((identifier, post_id), score, timestamp)
) WITH CLUSTERING ORDER BY (score DESC, timestamp DESC);
CREATE INDEX IF NOT EXISTS index_identifier ON feed ( identifier );
I want to run two types of queries: one with identifier = 'user_5' and post_id = 11, and one with just identifier = 'user_5'.
I want to paginate at 10 results per query. However, a few queries can have a variable result count, so it would be best if there were something like a column > last_record condition that I could use.
Please help. Thanks in advance.
P.S: Cassandra version - 3.11.6
First, and most important: you're approaching Cassandra like a traditional database that runs on a single node. Your data model doesn't support efficient retrieval of data for your queries, and secondary indexes don't help much, as the coordinator still needs to reach all nodes to fetch the data, because data is distributed between nodes based on the value of the partition key ((identifier, post_id) in your case). It may work with small data in a small cluster, but will fail miserably when you scale up.
In Cassandra, all data modelling starts from the queries, so if you're querying by identifier, then it should be the partition key (although you may get problems with big partitions if some users produce a lot of messages). Inside a partition you may use secondary indexes; that shouldn't be a problem. Also, inside a partition it's easier to organize paging. Cassandra natively supports forward paging, so you just need to keep the paging state between queries. In Java driver 4.6.0, a special helper class was added to support offset paging of results. It may not be very efficient, as it still needs to read data from Cassandra to skip ahead to the given page, but at least it's some help. Here is the example from the documentation:
String query = "SELECT ...";
// organize by 20 rows per page
OffsetPager pager = new OffsetPager(20);
// Get page 2: start from a fresh result set, throw away rows 1-20, then return rows 21-40
ResultSet rs = session.execute(query);
OffsetPager.Page<Row> page2 = pager.getPage(rs, 2);
// Get page 5: start from a fresh result set, throw away rows 1-80, then return rows 81-100
rs = session.execute(query);
OffsetPager.Page<Row> page5 = pager.getPage(rs, 5);
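If you only ever page forward, the native paging state is cheaper than OffsetPager, since nothing has to be re-read. A rough sketch with driver 4.x; feed_by_identifier is a hypothetical table remodeled with identifier as the sole partition key:
import java.nio.ByteBuffer;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

SimpleStatement stmt = SimpleStatement.newInstance(
        "SELECT * FROM feed_by_identifier WHERE identifier = ?", "user_5")
    .setPageSize(10);                                  // rows per page

ResultSet rs = session.execute(stmt);
// ... return the first page of rows to the client ...

// The paging state is an opaque token; resubmitting it with the same query
// resumes exactly where the previous page ended.
ByteBuffer pagingState = rs.getExecutionInfo().getPagingState();
rs = session.execute(stmt.setPagingState(pagingState));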
I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the job in your application: query all (id, secondary_id) pairs and combine the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries run in parallel, so many nodes in your cluster participate in processing your task (see the sketch below).
See also https://www.datastax.com/dev/blog/java-driver-async-queries
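A rough sketch of that approach with the Java driver 3.x (id and knownSecondaryIds are placeholders for values your application already tracks, e.g. from such a lookup table):
import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;

PreparedStatement stmt = session.prepare(
    "SELECT * FROM time_data WHERE id = ? AND secondary_id = ?");

// Fire one async query per known partition; they run in parallel across the cluster.
List<ResultSetFuture> futures = new ArrayList<>();
for (Integer secondaryId : knownSecondaryIds) {
    futures.add(session.executeAsync(stmt.bind(id, secondaryId)));
}

// Collect the rows as each query completes.
for (ResultSetFuture future : futures) {
    for (Row row : future.getUninterruptibly()) {
        // process row ...
    }
}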
I have a table with the following schema.
CREATE TABLE IF NOT EXISTS group_friends(
groupId timeuuid,
friendId bigint,
time bigint,
PRIMARY KEY(groupId,friendId));
I need to keep track of the time whenever any change happens in a group (such as changing the group name, or adding a new friend to the table, etc.). So I need to update the value of the time field by groupId every time there is a change in any related table.
Since an UPDATE in Cassandra requires all primary key columns in the WHERE clause, this query will not run:
update group_friends set time = 123456 where groupId = 100;
So I can do something like this:
update group_friends set time=123456 where groupId=100 and friendId in (...);
But it shows the following error:
[Invalid query] message="Invalid operator IN for PRIMARY KEY part friendid"
Is there any way to perform an update operation using IN operator in clustering column? If not then what are the possible ways to do this?
Thanks in advance.
Since friendId is a clustering column, a batch operation is probably a reasonable and well-performing choice in this case, since all updates would be made in the same partition (assuming you use the same groupId for the updates). For example, with the Java driver you could do the following:
import java.util.List;
import java.util.UUID;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import com.google.common.collect.Lists;

Cluster cluster = new Cluster.Builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("friends");
PreparedStatement updateStmt = session.prepare(
    "update group_friends set time = ? where groupId = ? and friendId = ?");
long time = 123456;
UUID groupId = UUIDs.startOf(0);
List<Long> friends = Lists.newArrayList(1L, 2L, 4L, 8L, 22L, 1002L);
// Every statement shares the same partition key (groupId), so the unlogged
// batch is applied as a single write without cross-partition overhead.
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
for (Long friendId : friends) {
    batch.add(updateStmt.bind(time, groupId, friendId));
}
session.execute(batch);
cluster.close();
The other advantage of this is that since the partition key can be inferred from the BatchStatement, the driver will use token-aware routing to send a request to a replica that would own this data, skipping a network hop.
Although this will effectively be a single write, be careful with the size of your batches; take care not to make them too large.
In the general case, you can't really go wrong by executing each statement individually instead of using a batch. The CQL transport allows many simultaneous requests on a single connection, and they are asynchronous in nature, so you can have many requests in flight at a time without the typical performance cost of one connection per request.
For more about writing data in batch see: Cassandra: Batch loading without the Batch keyword
Alternatively, there may be an even easier way to accomplish what you want. If what you are really trying to do is maintain a group update time that is the same for all friends in the group, you can make time a static column. This is a new feature in Cassandra 2.0.6. A static column's value is shared by all rows in the groupId partition. This way you would only have to update time once; you could even set the time in the query you use to add a friend to the group, so it's done as one write operation (see the sketch after the table definition below).
CREATE TABLE IF NOT EXISTS friends.group_friends(
groupId timeuuid,
friendId bigint,
time bigint static,
PRIMARY KEY(groupId,friendId)
);
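With the static column in place, adding a friend and refreshing the group's shared time can be one write. A small sketch, reusing the session from the batch example above (values are illustrative):
// One INSERT sets the regular columns for the new friend and the
// partition-wide static time column in a single write.
session.execute(
    "INSERT INTO friends.group_friends (groupId, friendId, time) VALUES (?, ?, ?)",
    groupId, 1002L, System.currentTimeMillis());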
If you can't use Cassandra 2.0.6+ just yet, you can create a separate table called group_metadata that maintains the time for a group, i.e.:
CREATE TABLE IF NOT EXISTS friends.group_metadata(
groupId timeuuid,
time bigint,
PRIMARY KEY(groupId)
);
The downside here is that whenever you want to get at this data you need to select from this separate table as well, but that seems manageable.
If I want to partition my primary key by time window, would it be better (for storage and retrieval efficiency) to use a textual representation of the time or a truncated native timestamp, i.e.:
CREATE TABLE user_data (
user_id TEXT,
log_day TEXT, -- store as 'yyyymmdd' string
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
or
CREATE TABLE user_data (
user_id TEXT,
log_day TIMESTAMP, -- store as (timestamp-in-millis - (timestamp-in-millis mod 86400000))
log_timestamp TIMESTAMP,
data_item TEXT,
PRIMARY KEY ((user_id, log_day), log_timestamp));
Regarding your column key "log_timestamp":
If you are working with multiple writing clients - which I suggest, since otherwise you probably won't get near the possible throughput of a distributed, write-optimized database like C* - you should consider using TimeUUIDs instead of timestamps, as they are conflict-free (assuming MAC addresses are unique). Otherwise you would have to guarantee that no two inserts happen at the same time, or you will lose data. You can do column slice queries on TimeUUIDs and other time-based operations.
I'd use unix time (i.e. 1234567890) over either of those formats - to point to an entire day, you'd just use the timestamp for 00:00.
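For illustration, deriving that day value from a millisecond timestamp is a small computation (a Java sketch, assuming UTC day boundaries):
long tsMillis = System.currentTimeMillis();
long logDay = tsMillis - (tsMillis % 86_400_000L); // midnight UTC; 86,400,000 ms per day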
However, I very much recommend reading Advanced Time Series with Cassandra on the DataStax dev blog. It covers some important things to consider in your model with regard to bucketing/splitting.