How do I update a tuple in Cassandra? - cassandra

There are literally no tutorials on Google about how to update tuples.
Can someone explain how tuples can be updated in Cassandra?

The CQL tuple data type is implicitly "frozen" without needing the CQL frozen keyword, so you can't update individual elements of a tuple column -- you have to replace the whole column.
To illustrate, here's my example CQL table:
CREATE TABLE sensors (
    id text PRIMARY KEY,
    location tuple<decimal, decimal>,
    temperature decimal,
    weight int
)
Here's an example where I insert a sensor with its location:
INSERT INTO sensors (id, location)
VALUES ('abc123', (50.4501, 30.5234));
Here's an example where I update the location of a sensor:
UPDATE sensors
SET location = (47.0971, 37.5434)
WHERE id = 'abc123';
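To verify that the whole column was replaced, you can read the tuple back:
SELECT id, location FROM sensors WHERE id = 'abc123';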
For details, see CQL tuple type. Cheers!

Related

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
    documentId text,
    sequenceNo bigint,
    messageData blob,
    clientId text,
    PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime that can be retrieved with the CQL writetime(column_name) function (warning: it doesn't work on collection columns, and returns null for UDTs!). But because CQL has no nested queries, you will need to write a program that fetches the data, filters entries by WriteTime, and deletes every entry whose WriteTime is older than your threshold. (Note that the writetime value is in microseconds, not milliseconds as in CQL's timestamp type.)
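For example, against the table above you could inspect write times via the regular clientId column (a sketch):
SELECT documentId, sequenceNo, writetime(clientId)
FROM TestTable;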
The easiest way is to use the Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._ // provides cassandraTable, writeTime, deleteFromCassandra
// threshold in microseconds, since writetime values are in microseconds
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
// read the primary key columns plus the write time of one regular column
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
  .select("prk1", "prk2", "reg_col".writeTime as "writetime")
  .filter(row => row.getLong("writetime") < timestamp)
// delete the matching rows by their full primary key
oldData.deleteFromCassandra(srcKeyspace, srcTable,
  keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any "regular" column of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns in select and in deleteFromCassandra is the same.
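Each delete the job performs is equivalent to a per-row CQL delete keyed on the full primary key, along these lines (the values are illustrative):
DELETE FROM TestTable
WHERE documentId = 'doc1' AND sequenceNo = 42;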

Querying Cassandra by column pairs in a list

I'm trying to query Cassandra (3.2) for any row where a pair of two of its columns equals any entry in a list of pairs, using the following table:
CREATE TABLE dataset (
    bucket int,
    namespace text,
    name text,
    data text,
    PRIMARY KEY ((bucket), namespace, name)
);
From what I can tell from the SELECT documentation, I should be able to use a multi-column relation on the clustering columns. Unfortunately the docs confuse things quite a bit by not properly matching up brackets and parentheses, but I believe they say the following should work:
SELECT namespace, name, data FROM dataset
WHERE id = 12345 AND (namespace, name) IN (namespace1, Peter),(namespace2, Clark)
It doesn't however, and neither does:
SELECT namespace, name, data FROM dataset
WHERE id = 12345 AND (namespace, name) IN ((namespace1, Peter),(namespace2, Clark))
The queries both fail with ...no viable alternative at input...
How can I write this query correctly?
After reading your comment under my answer, I also noticed in the official documentation that you can use tuples in IN clauses. But this applies only to clustering columns. This means that the following example will work, because the namespace and name columns are both clustering columns in my table definition (see the PRIMARY KEY clause):
CREATE TABLE dataset (
    bucket int,
    namespace text,
    name text,
    data text,
    PRIMARY KEY ((bucket), namespace, name)
);
SELECT * FROM dataset WHERE bucket = 1 AND (namespace, name) IN (('namespace1', 'Peter'), ('namespace2', 'Clark'));
If you change the table definition to the following, the query above will return an error:
CREATE TABLE dataset (
    namespace text,
    name text,
    data text,
    PRIMARY KEY (namespace, name)
);
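Against this second table the tuple query is rejected, because namespace is now the partition key and multi-column IN relations apply only to clustering columns; for example, this fails (a sketch):
SELECT * FROM dataset
WHERE (namespace, name) IN (('namespace1', 'Peter'), ('namespace2', 'Clark'));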
You can read more about clustering columns here.
Alternatively, if all of your pairs share a namespace, try the following CQL command:
SELECT namespace, name, data FROM dataset
WHERE bucket = 12345 AND namespace = 'dummy' AND name IN ('Peter', 'Clark')
Here is the official documentation for the IN filter.

How to model for word search in cassandra

My model is designed to save word searches from a checkbox; it must support updating a word search and its status, plus (fake) deletes. My old model sets the PK to a uuid (the id of the word search) and a secondary index on the status column (enable, disable, deleted).
But I don't want an index on the status column (I think it's very bad to index a frequently updated column), and I don't want to change the database.
Is there a better way to model this?
Sorry for my English grammar.
You should not create an index on a very-low-cardinality column like status:
Avoid very low cardinality indexes, e.g. an index where the number of distinct values is very low. A good example is an index on the gender of a user. On each node, the whole user population will be distributed over only 2 different partitions for the index: MALE & FEMALE. If the number of users per node is very dense (e.g. millions), we'll have very wide partitions for the MALE & FEMALE index, which is bad.
Source : https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
The best way to handle this kind of case is to either:
Create a separate table for each type of status, or
Use status plus a known parameter (year, month, etc.) as a composite partition key.
Example of the 2nd option:
CREATE TABLE save_search (
    year int,
    status int,
    uuid uuid,
    category text,
    word_search text,
    PRIMARY KEY((year, status), uuid)
);
Here you can see that I have made a composite partition key of year and status because of the low-cardinality issue. If you expect a huge amount of data within a single status, you should also add month to the composite partition key.
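With this key, every status lookup supplies both partition key components, for example (the year value is illustrative):
SELECT * FROM save_search WHERE year = 2018 AND status = 1;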
If your dataset is small, you can simply remove the year field:
CREATE TABLE save_search (
    status int,
    uuid uuid,
    category text,
    word_search text,
    PRIMARY KEY(status, uuid)
);
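Queries then hit a single partition per status (the value 0 is illustrative):
SELECT * FROM save_search WHERE status = 0;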
Or, if you are using Cassandra version 3.x or above, you can use a materialized view.
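For the view below, assume a base table along these lines (a sketch; the name your_main_table and its columns are assumptions that match the view definition):
CREATE TABLE your_main_table (
    uuid uuid PRIMARY KEY,
    status int,
    category text,
    word_search text
);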
CREATE MATERIALIZED VIEW search_by_status AS
    SELECT *
    FROM your_main_table
    WHERE uuid IS NOT NULL AND status IS NOT NULL
    PRIMARY KEY (status, uuid);
You can then query by status like this:
SELECT * FROM search_by_status WHERE status = 0;
Cassandra will keep the materialized view in sync with every delete, update, and insert you make on the main table.
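For example, changing a row's status in the base table moves it to the corresponding partition of the view automatically (the uuid value is illustrative):
UPDATE your_main_table SET status = 2
WHERE uuid = 5132b130-ae79-11e4-ab27-0800200c9a66;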

Cassandra: Is there a limit to amount of data that a collection column can hold?

In the table below, what is the maximum amount of data the phone_numbers column can accommodate?
Is it 2GB, like normal columns?
Or is it 64K*64K, as mentioned here?
CREATE TABLE d2.employee (
    id int PRIMARY KEY,
    doj timestamp,
    name text,
    phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you get one cell for each key of your phone_numbers map. So it is not a normal column but a set of cells. You can verify this by executing the following commands in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
So the correct answer is your point 2: a collection can hold up to 64K elements of up to 64K bytes each.

How to only return some map keys (aka, slice a range of map/set elements) in CQL 3?

I'm trying to build my own CF reverse index in Cassandra right now, for a geohash lookup implementation.
In CQL 2, I could do this:
CREATE COLUMNFAMILY song_tags (id uuid PRIMARY KEY) WITH comparator=text;
insert into song_tags ('id', 'blues', '1973') values ('a3e64f8f-bd44-4f28-b8d9-6938726e34d4', '', '');
insert into song_tags ('id', 'covers', '2007') values ('8a172618-b121-4136-bb10-f665cfc469eb', '', '');
SELECT * FROM song_tags;
Which resulted in:
id,8a172618-b121-4136-bb10-f665cfc469eb | 2007, | covers,
id,a3e64f8f-bd44-4f28-b8d9-6938726e34d4 | 1973, | blues,
This allowed returning 'covers' and 'blues' via:
SELECT 'a'..'f' FROM song_tags
Now I'm trying to use CQL 3, which has gotten rid of dynamic columns and suggests using a set or map column type instead. Sets and maps have their values/keys ordered alphabetically, and under the hood (IIRC) they are columns; hence, they should support the same type of range slicing... but how?
I suggest forgetting what you know about the 'under the hood' implementation details and focusing on what the query language lets you do.
The longer reason why: in CQL3, multiple rows map onto a single column family, even though the query language presents them as different rows. It's just a different way of querying the same data.
Range slicing does not exist; the query language is flexible enough to cover its use cases.
To do what you want, create an index on the genres so they are queryable without the primary key, and then select the genre values themselves.
The gotcha is that some functions, such as DISTINCT, can only be applied to partition keys, so in those cases you'll have to deduplicate client-side.
For example:
CREATE TABLE song_tags (
    id uuid PRIMARY KEY,
    year text,
    genre list<text>
);
CREATE INDEX ON song_tags(genre);
INSERT INTO song_tags (id, year, genre)
VALUES(8a172618-b121-4136-bb10-f665cfc469eb, '2007', ['covers']);
INSERT INTO song_tags (id, year, genre)
VALUES(a3e64f8f-bd44-4f28-b8d9-6938726e34d4, '1973', ['blues']);
You can then query:
SELECT genre FROM song_tags;
genre
------------
['blues']
['covers']
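With the index in place, rows can also be looked up by a single tag (assuming Cassandra 2.1 or later, where CONTAINS works on indexed collections):
SELECT id, year, genre FROM song_tags WHERE genre CONTAINS 'covers';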
