Cassandra Modelling for Date Range - cassandra

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling the Travellers Flight Checkin Data.
My Main Query Criteria is Search for travellers with a date range (max of 7 day window).
Here is what I've come up with with my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting upto a million records - resulting in the partition to have a million records.
Now my users want search in which the date window is mandatory (max a week window). In this case should I use a IN clause that spans across multiple partitions? Is this the correct way or should I think of re-modelling the data? Alternatively, I'm also wondering if issuing 7 queries (per day) and merging the responses would be efficient.

Your Data Model Seems Good.But If you could add more field to the partition key it will scale well. And you should use Separate Query with executeAsync
If you are using in clause, this means that you’re waiting on this single coordinator node to give you a response, it’s keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using IN clause, use separate query of each day and execute it with executeAsync.
Java Example :
PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin where checkinDay = ? and checkinTimestamp >= ? and checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 1; i < 4; i++) {
ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(i, i));
futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures){
ResultSet rows = future.getUninterruptibly();
//You get the result set of each query, merge them here
}

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
key text,
ckey text,
value text
PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
.mapPartitions { partition =>
connector.withSessionDo { session =>
partition.foreach { row =>
val key = row.getString("key")
val ckey = Random.nextString(42)
val value = row.getString("value")
session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
" VALUES ($key, $ckey, $value)")
}
}
}
Is it possible for a code like this to read an inserted value within a single application (Spark job) run? More generalized version of my question would be whether a token range scan CQL query can read newly inserted values while iterating over rows.
Yes, it is possible exactly as Alex wrote
but I don't think it's possible with above code
So per data model the table is ordered by ckey in ascending order
The funny part however is the page size and how many pages are prefetched and since this is by default 1000 (spark.cassandra.input.fetch.sizeInRows), then the only problem could occur, if you wouldn't use 42, but something bigger and/or the executor didn't page yet
Also I think you use unnecessary nesting, so the code to achieve what you want might be simplified (after all cassandraTable will give you a data frame).
(I hope I understand that you want to read per partition (note a partition in your case is all rows under one primary key - "key") and for every row (distinguished by ckey) in this partition generate new one (with new ckey that will just duplicate value with new ckey) - use case for such code is a mystery for me, but I hope it has some sense:-))

Data model in Cassandra and proper deletion Strategy

I have following table in cassandra:
CREATE TABLE article (
id text,
price int,
validFrom timestamp,
PRIMARY KEY (id, validFrom)
) WITH CLUSTERING ORDER BY (validFrom DESC);
With articles and historical price information (validFrom is a timestamp of new price). Article price changes often. I want to query for
Historic price for a particular article.
Last price for an article.
From my understanding, I can solve both problems with following query:
select id, price from article where id = X validFrom < Y limit 1;
This query uses article id as restriction, query uses the partition key. Since the clustering order is based on the validFrom timestamp in reversed order, cassandra can efficient perform this query.
Am I getting this right?
What is the best approach to delete old data (house-keeping). Let's assume, I want delete all articles with validFrom > 20150101 and validFrom < 20151231. Since I don't have a primary key, this would be inefficient, even if I use an index on validFrom, right? How can I achieve this?
You can use external tools for that:
Spark with Spark Cassandra Connector (even in the local mode). Code could look as following (note that I'm using validfrom as name, not validFrom, as it's not escaped in your schema):
import com.datastax.spark.connector._
val data = sc.cassandraTable("test", "article")
.where("validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z'")
.select("id", "validfrom")
data.deleteFromCassandra("test", "article", keyColumns=SomeColumns("id", "validfrom"))
use DSBulk to do find the matching entries and output them into the file (output.csv in my case), and then perform their deletion:
bin/dsbulk unload -url output.csv \
-query "SELECT id, validfrom FROM test.article WHERE token(id) > :start AND token(id) <= :end AND validFrom >= '2020-07-28T11:50:00Z' AND validFrom < '2020-07-28T12:50:00Z' ALLOW FILTERING"
bin/dsbulk load -query "DELETE from test.article WHERE id = :id and validfrom = :validfrom" \
-url output.csv
To add to Alex Ott's answer, this comment of yours is incorrect:
This query uses article id as restriction, query uses the partition key. Since the clustering order is based on price, cassandra can efficient perform this query.
The rows are not ordered by price. They are ordered by validFrom in reverse-chronological order. Cheers!

CQL table design for temporal data

As a Cassandra novice, I have a CQL design question. I want to re-use a concept which I've build before using RDBMS systems, to create history for customerData. The customer himself will only see the latest version, so that should be the fastest, but queries on whole history can be performed.
My suggested entity properties:
customerId text,
validFromDate date,
validUntilDate date,
customerData text
First save of customerData just INSERTs customerData with validFromDate=NOW and validUntilDate=31-12-9999
Subsequent saves of customerData changes the last record - setting validUntilDate=NOW - and INSERT new customerData with validFromDate=NOW and validUntilDate=31-12-9999
Result:
This way a query of (customerId, validUntilDate)=(id,31-12-9999) will give last saved version.
Query on (customerId) will give all history.
To query customerData at certain time t just use query with validFromDate < t < validUntilDate
My guess is PARTITION_KEY = customerId and CLUSTER_KEY can be validFromDate. Or use PRIMARY KEY = customerId. Or I could create two tables, one for fast querying of lastest version (has no history), and another for historical analyses.
How do you design this in CQL-way? I think I'm thinking too much RDBMish.
Use change timestamp as CLUSTERING KEY with DESC order, e.g
CREATE TABLE customer_data_versions (
id text,
change_time timestamp,
name text,
PRIMARY KEY (id, change_time)
) WITH CLUSTERING ORDER BY ( change_time DESC );
It will allow you to store data versions per customer id in descending order.
Insert two versions for the same id:
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John');
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John Doe');
Get last saved version:
SELECT * FROM customer_data_versions WHERE id='id1' LIMIT 1;
Get all versions for the id:
SELECT * FROM customer_data_versions WHERE id='id1';
Get versions between dates:
SELECT * FROM customer_data_versions WHERE id='id1' AND change_time <= before_date AND change_time >= after_date;
Please note, there are some limits for partition size (how much versions you will be able to store per customer id):
Cells in a partition: ~2 billion (231); single column value size: 2 GB ( 1 MB is recommended)

Cassandra read from large dataset

I need to get a count from a very large dataset in Cassandra, 100 million plus. I am worried about the memory hit cassandra would take if I just ran the following query.
select count(*) from conv_org where org_id = 'TEST_ORG'
I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?
Would the syntax look something like this?
Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);
I am unsure the above code will work as I do not need a result set back I just need a count.
Here is the data model.
CREATE TABLE ts.conv_org (
org_id text,
create_time timestamp,
test_id text,
org_type int,
PRIMARY KEY (org_id, create_time, conv_id)
)
If org_id isn't your primary key counting in cassandra in general is not a fast operation and can easily lead to a full scan of all sstables in your cluster and therefore be painfully slow.
In Java for example you can do something like this:
ResultSet rs = session.execute(...);
Iterator<Row> iter = rs.iterator();
while (iter.hasNext()) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults();
Row row = iter.next()
... process the row ...
}
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html
You could select a small colum and count your self. There is int getAvailableWithoutFetching() and isFullyFetched() that could help you.
In general if you really need a count - maintain it yourself.
On the other hand, if you have really many rows in one partition you can have also some other performance problems.
But that's hard to say without knowing the data model.
Maybe you want to use "counter table" in addition to your dataset.
Pros: get counter fast.
Cons: need to maintained that table.
Reference:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html

CQL no viable alternative at input '(' error

I have a issue with my CQL and cassandra is giving me no viable alternative at input '(' (...WHERE id = ? if [(]...) error message. I think there is a problem with my statement.
UPDATE <TABLE> USING TTL 300
SET <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE <attribute2> = dfa2efb0-7247-11e5-a9e5-0242ac110003
IF (<attribute1> = null OR <attribute1> = 13381990-735b-11e5-9bed-2ae6d3dfc201) AND <attribute3> = 0;
Any idea were the problem is in the statement about?
It would help to have your complete table structure, so to test your statement I made a couple of educated guesses.
With this table:
CREATE TABLE lwtTest (attribute1 timeuuid, attribute2 timeuuid PRIMARY KEY, attribute3 int);
This statement works, as long as I don't add the lightweight transaction on the end:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003;
Your lightweight transaction...
IF (attribute1=null OR attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201) AND attribute3 = 0;
...has a few issues.
"null" in Cassandra is not similar (at all) to its RDBMS counterpart. Not every row needs to have a value for every column. Those CQL rows without values for certain column values in a table will show "null." But you cannot query by "null" since it isn't really there.
The OR keyword does not exist in CQL.
You cannot use extra parenthesis to separate conditions in your WHERE clause or your lightweight transaction.
Bearing those points in mind, the following UPDATE and lightweight transaction runs without error:
UPDATE lwttest USING TTL 300 SET attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201
WHERE attribute2=dfa2efb0-7247-11e5-a9e5-0242ac110003
IF attribute1=13381990-735b-11e5-9bed-2ae6d3dfc201 AND attribute3=0;
[applied]
-----------
False

Resources