How to update data in Cassandra using the IN operator

I have a table with the following schema.
CREATE TABLE IF NOT EXISTS group_friends(
    groupId timeuuid,
    friendId bigint,
    time bigint,
    PRIMARY KEY(groupId, friendId));
I need to keep track of the time whenever anything changes in a group (such as changing the group name or adding a new friend, etc.), so I need to update the value of the time field by groupId every time there is a change in any related table.
Since an UPDATE in Cassandra requires all primary key columns in the WHERE clause, this query will not run:
update group_friends set time = 123456 where groupId = 100;
So I can do something like this.
update group_friends set time=123456 where groupId=100 and friendId in (...);
But it shows the following error:
[Invalid query] message="Invalid operator IN for PRIMARY KEY part friendid"
Is there any way to perform an update using the IN operator on a clustering column? If not, what are the possible ways to do this?
Thanks in advance.

Since friendId is a clustering column, a batch operation is probably a reasonable and well-performing choice here, since all updates are made in the same partition (assuming you use the same groupId for each update). For example, with the Java driver you could do the following:
Cluster cluster = new Cluster.Builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("friends");

PreparedStatement updateStmt = session.prepare(
    "update group_friends set time = ? where groupId = ? and friendId = ?");

long time = 123456;
UUID groupId = UUIDs.startOf(0);
List<Long> friends = Lists.newArrayList(1L, 2L, 4L, 8L, 22L, 1002L);

// Every statement targets the same partition (same groupId), so an
// unlogged batch is effectively a single write.
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
for (Long friendId : friends) {
    batch.add(updateStmt.bind(time, groupId, friendId));
}
session.execute(batch);
cluster.close();
The other advantage of this is that since the partition key can be inferred from the BatchStatement, the driver will use token-aware routing to send a request to a replica that would own this data, skipping a network hop.
Although this will effectively be a single write, be careful with the size of your batches; take care not to make them too large.
In the general case, you can't really go wrong by executing each statement individually instead of using a batch. The CQL transport allows many requests on a single connection, and requests are asynchronous in nature, so you can have many in flight at once without the typical performance cost of one request per connection.
For more about writing data in batch see: Cassandra: Batch loading without the Batch keyword
Alternatively, there may be an even easier way to accomplish what you want. If what you are really trying to do is maintain a group update time that is the same for all friends in the group, you can make time a static column (a feature added in Cassandra 2.0.6). A static column's value is shared by all rows in the groupId partition, so you only have to update time once; you could even set it in the same query you use to add a friend to the group, so it's done as one write operation.
CREATE TABLE IF NOT EXISTS friends.group_friends(
    groupId timeuuid,
    friendId bigint,
    time bigint static,
    PRIMARY KEY(groupId, friendId)
);
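Because time is static, updating it requires only the partition key; a single statement sets the shared value for every friend in the group (the timeuuid below is a placeholder):

```sql
-- One write updates the shared group time for all rows in the partition
UPDATE friends.group_friends SET time = 123456
WHERE groupId = 50554d6e-29bb-11e5-b345-feff819cdc9f;
```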
If you can't use Cassandra 2.0.6+ just yet, you can create a separate table called group_metadata that maintains the time for a group, i.e.:
CREATE TABLE IF NOT EXISTS friends.group_metadata(
    groupId timeuuid,
    time bigint,
    PRIMARY KEY(groupId)
);
The downside here is that whenever you want to get at this data you need to select from this separate table, but that seems manageable.
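Keeping the two tables in sync then means one extra write per group change; for example (the values below are illustrative):

```sql
-- Record the time of the latest change for the group
UPDATE friends.group_metadata SET time = 123456
WHERE groupId = 50554d6e-29bb-11e5-b345-feff819cdc9f;

-- Read it back when needed
SELECT time FROM friends.group_metadata
WHERE groupId = 50554d6e-29bb-11e5-b345-feff819cdc9f;
```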

Related

Need pagination for following Cassandra table

CREATE TABLE feed (
    identifier text,
    post_id int,
    score int,
    reason text,
    timestamp timeuuid,
    PRIMARY KEY ((identifier, post_id), score, id, timestamp)
) WITH CLUSTERING ORDER BY (score DESC, timestamp DESC);
CREATE INDEX IF NOT EXISTS index_identifier ON feed ( identifier );
I want to run two types of queries: where identifier = 'user_5' and post_id = 11; and where identifier = 'user_5';
I want to paginate at 10 results per query. However, some queries can return a variable number of results, so it would be best if there were something like a *column* > last_record that I could use.
Please help. Thanks in advance.
P.S: Cassandra version - 3.11.6
First, and most important: you're approaching Cassandra like a traditional database that runs on a single node. Your data model doesn't support efficient retrieval for your queries, and secondary indexes don't help much, since the query still needs to reach all nodes to fetch the data; data is distributed between nodes based on the value of the partition key ((identifier, post_id) in your case). It may work with a small amount of data in a small cluster, but will fail miserably when you scale up.
In Cassandra, all data modelling starts from the queries, so if you're querying by identifier, it should be the partition key (although you may get problems with big partitions if some users produce a lot of messages). Inside a partition you may use secondary indexes; that shouldn't be a problem. Plus, inside a partition it's easier to organize paging. Cassandra natively supports forward paging, so you just need to keep the paging state between queries. In Java driver 4.6.0, a helper class (OffsetPager) was added to support offset paging of results. It may not be very efficient, as it still needs to read the data from Cassandra and skip to the given page, but at least it's some help. Here is the example from the documentation:
String query = "SELECT ...";
// organize by 20 rows per page
OffsetPager pager = new OffsetPager(20);
// Get page 2: start from a fresh result set, throw away rows 1-20, then return rows 21-40
ResultSet rs = session.execute(query);
OffsetPager.Page<Row> page2 = pager.getPage(rs, 2);
// Get page 5: start from a fresh result set, throw away rows 1-80, then return rows 81-100
rs = session.execute(query);
OffsetPager.Page<Row> page5 = pager.getPage(rs, 5);

DataModel use case for logging in Cassandra

I am trying to design the application log table in Cassandra,
CREATE TABLE log(
    yyyymmdd varchar,
    created timeuuid,
    logMessage text,
    module text,
    PRIMARY KEY(yyyymmdd, created)
);
Now when I try to perform the following query, it works as expected:
select * from log where yyyymmdd = '20180223' LIMIT 50;
The above query is without grouping, kind of global.
Currently I created a secondary index on 'module', so I am able to perform the following:
select * from log where yyyymmdd = '20180223' AND module LIKE 'test' LIMIT 50;
Now my concern is: without the secondary index, is there an efficient way to query based on the module and fetch the data, or is there a better design?
Also, let me know about any performance issues in the current design.
For fetching based on module and date, you can only use another table, like this:
CREATE TABLE module_log(
    yyyymmdd varchar,
    created timeuuid,
    logMessage text,
    module text,
    PRIMARY KEY((module, yyyymmdd), created)
);
This gives a single partition for every combination of the module and yyyymmdd values, so you won't have very wide partitions.
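With this table, the query from the question hits a single partition (example values assumed):

```sql
SELECT * FROM module_log
WHERE module = 'test' AND yyyymmdd = '20180223'
LIMIT 50;
```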
Also, take into account that if you create a secondary index only on the module field, you may get problems with partitions that are too big (I assume you have a very limited number of module values?).
P.S. Are you using pure Cassandra, or DSE?

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
    id int,
    secondary_id int,
    timestamp timestamp,
    value bigint,
    PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in queries (ALLOW FILTERING will hurt performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with asynchronous queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
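A sketch of such a tracking table; the name secondary_ids_by_id and the exact layout are assumptions, not from the question:

```sql
CREATE TABLE IF NOT EXISTS secondary_ids_by_id (
    id int,
    secondary_id int,
    PRIMARY KEY (id, secondary_id)
);

-- Fetch the known secondary_ids for an id, then query
-- time_data for each (id, secondary_id) pair in parallel.
SELECT secondary_id FROM secondary_ids_by_id WHERE id = ?;
```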

Data Modelling in Cassandra for job queues

I am trying to store all the scheduler jobs in Cassandra.
I designed all the locking tables and they seem fine, but I am finding difficulty in creating a job queue table.
My requirement is:
1) I need to query all the jobs that are not completed.
CREATE TABLE jobs(
    jobId text,
    startTime timestamp,
    endTime timestamp,
    status text,
    state text,
    jobDetails text,
    PRIMARY KEY (X, X))
WITH CLUSTERING ORDER BY (X desc);
where, state - on / off
status - running / failed / completed
I am not sure which one to keep as the primary key (since it must be unique). Also, I need to query all the jobs in the 'on' state. Could somebody help me design this in Cassandra? Even if you propose something with a composite partition key, I am fine with it.
Edited:
I came up with a data model like this:
CREATE TABLE job(
    jobId text,
    startTime timestamp,
    endTime timestamp,
    state text,
    status text,
    jobDetails text,
    PRIMARY KEY (state, jobId, startTime))
WITH CLUSTERING ORDER BY (startTime DESC);
I am able to insert like this,
INSERT INTO job (jobId, startTime, endTime, status,state, jobDetails) VALUES('nodestat',toTimestamp(now()), 0,'running','on','{
"jobID": "job_0002",
"jobName": "Job 2",
"description": "This does job 2",
"taskHandler": require("./jobs/job2").runTask,
"intervalInMs": 1000
}');
Query like this,
SELECT * FROM job WHERE state = 'on';
Will this create any performance impact?
You may be implementing an antipattern for Cassandra.
See https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets for a blog post discussing what might be your problem when using Cassandra as a message queue.
Apart from that, there is some information on how to do it the "right way" in Cassandra on Slideshare: https://de.slideshare.net/alimenkou/high-performance-queues-with-cassandra
There are many projects out there which fit scheduling and/or messaging better, for example http://www.quartz-scheduler.org/overview/features.html.
Update for your edit above:
primary key (state,jobId, startTime)
This will create one partition for each state, resulting in huge partitions and hotspots. Transitioning a job's state will move it to a different partition; you will have deleted entries and possible compaction and performance issues (depending on your number of jobs).
All jobs with state='on' will be on one node (and its replicas), and all jobs with state='off' on another node: you will have only two partitions in your design.
Since you are open to changes to the model, see if the model below works for you:
CREATE TABLE job(
    partition_key text,
    jobId text,
    startTime timestamp,
    endTime timestamp,
    state text,
    status text,
    jobDetails text,
    PRIMARY KEY (partition_key, state, jobId, startTime))
WITH CLUSTERING ORDER BY (startTime DESC);
Here the partition_key column value can be calculated based on your volume of jobs.
For example:
If your job count is less than 100K jobs per day, you can keep the partition at the single-day level, i.e. YYYYMMDD (20180105); if it is 100K per hour, you can change it to YYYYMMDDHH (2018010518). Change the clustering columns depending on your filtering order.
This way you are able to query by state as long as you know which time bucket you want to query.
This avoids creating too many partitions or exploding a partition with too many rows, and it evenly distributes the load across partitions.
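With day-level buckets, for example, querying the 'on' jobs for a given day would look like this (values are illustrative):

```sql
SELECT * FROM job
WHERE partition_key = '20180105' AND state = 'on';
```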
It would help to design the model better if you could specify what adjustments/additions you can make to your query.
You need to include the equality columns in the partition key, so your equality columns are status and state. You need to check whether these two make a good partition key; if not, use either a custom column or another existing column as part of the partition key. As jobId only makes the record unique, you can keep it as a clustering column. I am assuming you are not querying the table by jobId.

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since the table is ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device ids and get back the latest entry for each. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of values for it: under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify a limit, it applies to the whole statement; you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN becomes one query) and put LIMIT 1 on every one of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, and might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
In Cassandra, data should be modeled to be ready for the query (query-first design). So you would have one additional table with the same partition key you have now, but without the clustering column activity_timestamp, i.e.:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double parentheses are intentional.
Every time you write to your table, you would also write the data to this latest_entry table (the one without activity_timestamp). Then you can run the query you need with IN; since this table contains only the latest entry, there is one row per partition key, so you don't need LIMIT 1. That is the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry; they are inexpensive and CPU-bound. With Cassandra it's always "bring on the writes", I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
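A sketch of the second option; the column types and the table name latest_entry are assumptions, since the question doesn't show the full table definition:

```sql
-- Mirror table: the newest write simply overwrites the previous row,
-- so there is exactly one row per key and no LIMIT 1 is needed.
-- Add the same payload columns as the DATA table.
CREATE TABLE IF NOT EXISTS latest_entry (
    application_id text,
    partner_id text,
    location_id text,
    device_id text,
    data_schema text,
    activity_timestamp timestamp,
    PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
);
```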
Your table definition is not suitable for such a use of the IN clause: it is only supported on the last column of the partition key or the last clustering column. So you can either:
swap the two last fields of the partition key, or
use one query for each device id.
