Cassandra DB Java: for each row returned by a query, fetch data stored in a separate table (e.g. a counter table)?

Does anyone know of an efficient way to fetch counter data stored in a separate table for each row gotten in a query?
The tables are defined as follows
TABLE person (
id timeuuid,
name text,
many other attributes... );
TABLE person_counts(
id timeuuid, //same id as person
count1 counter,
PRIMARY KEY (id));
The goal is that when one or more persons are fetched, the count is added to each of them before they are returned. Is iterating over each person and querying person_counts the only way to achieve this? The value needs to be a counter; however, since I need a specific primary key for the person table, it seems I cannot have a counter column directly there.
I am using DataStax Cassandra, if it makes a difference.

For insertion, you can use a batch operation, which the DataStax Cassandra driver supports. Whenever you insert a record into the persons table, create prepared statements for the tables involved, add them to a single batch and execute it. The advantage of a logged batch is that it is atomic, i.e. either all of its statements are applied or none at all. Note, however, that Cassandra does not accept counter updates and regular writes in the same batch, so the counter update on persons_count has to be sent as its own statement (or in a counter batch).
In the same way, whenever you want to delete from the persons table, delete from both the persons and persons_count tables. You can read more about batches here: https://datastax.github.io/cpp-driver/topics/basics/batches/
Note: the two tables are independent, and you can read the entries of the two tables separately. Writing to them together does not link them.
Now, for the requirement of fetching, you have to query the count from the counter table and the rest from the persons table. There is probably no other way, as Cassandra doesn't support joins. Moreover, the primary key and the other attributes of the persons table determine whether the count has to live in a separate table or can be kept in the same table. If you are fine with this implementation, you can use this:
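create table persons(
id uuid,
name text,
count counter,
primary key (id, name));
and an UPDATE statement for the counter column. This works because every non-counter column (id, name) is part of the primary key, which a table with a counter column requires. Then there is no need for the other table.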
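A minimal sketch of the two-step read described above, written in Python for brevity. It assumes the rows of both tables have already been fetched through the driver (the person rows first, then the person_counts rows for the same ids) and only shows the application-side merge. The function attach_counts and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List


def attach_counts(persons: Iterable[dict], counts_by_id: Dict[str, int]) -> List[dict]:
    """Return copies of the person rows with a 'count1' field merged in."""
    merged = []
    for person in persons:
        row = dict(person)  # copy, so the fetched rows stay untouched
        # A person without a row in person_counts has never been incremented.
        row["count1"] = counts_by_id.get(person["id"], 0)
        merged.append(row)
    return merged


# Hypothetical results of the two queries.
persons = [{"id": "a1", "name": "Ann"}, {"id": "b2", "name": "Bob"}]
counts_by_id = {"a1": 7}

result = attach_counts(persons, counts_by_id)
```

Whether the counts are read with one query per id or with a single IN query over the collected ids, the merge step stays the same.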

Related

YCQL Secondary indexes on tables with TTL in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I have a table with TTL and a secondary index, using YugabyteDB 2.9.0 and I’m getting the following error when I try to insert a row:
SyntaxException: Feature Not Supported
Below is my schema:
CREATE TABLE lists.list_table (
item_value text,
list_id uuid,
created_at timestamp,
updated_at timestamp,
is_deleted boolean,
valid_from timestamp,
valid_till timestamp,
metadata jsonb,
PRIMARY KEY ((item_value, list_id))
) WITH default_time_to_live = 0
AND transactions = {'enabled': 'true'};
CREATE INDEX list_created_at_idx ON lists.list_table (list_id, created_at)
WITH transactions = {'enabled': 'true'};
We have two types of queries (80% & 20% distribution):
select * from list_table where list_id= <id> and item_value = <value>
select * from list_table where list_id= <id> and created_at>= <created_at>
We expect per list_id there would be around 1000-10000 entries.
The TTL would be around 1 month.
This is a known restriction: transactionally expiring rows via TTL from a table that is indexed (i.e. atomic expiry of TTL entries in both the table and the index) is currently not supported. There are several workarounds:
a) In YCQL, we also support an index with weaker consistency. This is not well documented today, but you can see the details here: https://github.com/YugaByte/yugabyte-db/issues/1696
The main issue to call out when using this variant of index is error handling: on INSERT failure, it is the application's responsibility to retry the INSERT. As noted in the above issue << If an insert/update or batch of such operations fails, it is the app's responsibility to retry the operation so that the index is consistent. Much like in a 2-table case, it would have been the apps responsibility to retry (in case of a failure between the update to the two tables) to make sure both tables are in sync again. >>
This type of index supports a TTL at both the table and the index level (keeping the two the same is recommended): https://github.com/yugabyte/yugabyte-db/issues/2481#issuecomment-537177471
b) Another workaround is to use a background cleanup job to periodically delete stale records (instead of using TTL).
c) Avoid using indexes and store the data in two tables: one organized by the original primary key and one organized by the index columns you wanted (as the primary key). Both tables can have a TTL, but it is the application's responsibility to INSERT into both tables when data is added to the database.
d) Another workaround is to avoid the index and pick the PK to be ((list_id, item_value), created_at).
This would not affect the performance of Q1, because with list_id and item_value provided it can use the PK to find the rows. But it would be slower for Q2, where list_id and created_at are provided: while it can still use list_id, it must filter on the created_at value without the help of an index. So if Q2 is really 20% of your queries, you probably do not want to scan 1,000 to 10,000 items to find your matching rows.
To clarify option (c), with the example in mind:
The first table's PK would be ((list_id, item_value)); it is the same as your current main table. Instead of an index you'll have a second table; the second table's PK would be ((list_id), created_at).
Both tables would have a TTL.
The application would have to insert entries into both tables.
In the 2nd table you have a choice:
(option 1) Duplicate all the columns from the main table, including your JSON columns etc. This makes the Q2 lookup fast, since the row has everything it needs, but it increases your storage requirements.
(option 2) In addition to the primary key, just store the item_value column in the second table. For Q2, you must first look up the 2nd table and get the item_value, and then use list_id and item_value to retrieve the data from the main table (much like an index would do under the covers).
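The two-table approach with option 2 can be sketched as follows. This is only an illustration in Python using in-memory dictionaries in place of the two tables; it does not talk to YugabyteDB, it ignores TTL, and the names main_table, created_index, insert_item and items_created_since are all hypothetical.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for the two tables (TTL handling omitted).
main_table = {}     # (list_id, item_value) -> full row
created_index = {}  # list_id -> list of (created_at, item_value)


def insert_item(list_id, item_value, created_at, metadata):
    """Write the same item to both tables, as the application must do."""
    main_table[(list_id, item_value)] = {
        "list_id": list_id,
        "item_value": item_value,
        "created_at": created_at,
        "metadata": metadata,
    }
    created_index.setdefault(list_id, []).append((created_at, item_value))


def items_created_since(list_id, since):
    """Q2: find item_values in the second table, then read the main table."""
    hits = sorted(e for e in created_index.get(list_id, []) if e[0] >= since)
    return [main_table[(list_id, item_value)] for _, item_value in hits]


insert_item("list-1", "apple", datetime(2021, 9, 1), {"color": "red"})
insert_item("list-1", "pear", datetime(2021, 9, 20), {"color": "green"})
insert_item("list-2", "plum", datetime(2021, 9, 25), {"color": "purple"})

recent = items_created_since("list-1", datetime(2021, 9, 10))
```

The second lookup per item is the price of option 2; with option 1 the rows returned from the second table would already be complete.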

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples on the internet, I am more and more confused about how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as this is the part I do not understand):
CREATE TABLE documents (
uid text,
created text,
data text
);
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with:
SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz';
Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble);
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine as long as you don't have too many documents per day (so that your partitions, which each hold one month, don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need a range query across multiple months, you will need to issue multiple queries, one per month.
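As a rough sketch of the "multiple queries" case, the helper below splits a date range into one slice per (year, month) partition of documents_by_date. It is written in Python and only computes the bind values; the function name month_slices and the constant QUERY are hypothetical, and executing the statements through a driver is left out.

```python
import calendar
from datetime import date

# One statement per (year, month) partition; the values are bound per slice.
QUERY = (
    "SELECT * FROM documents_by_date "
    "WHERE year=? AND month=? AND day>=? AND day<=?"
)


def month_slices(start: date, end: date):
    """Split [start, end] into (year, month, first_day, last_day) tuples."""
    slices = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        is_first = (year, month) == (start.year, start.month)
        is_last = (year, month) == (end.year, end.month)
        first_day = start.day if is_first else 1
        last_day = end.day if is_last else calendar.monthrange(year, month)[1]
        slices.append((year, month, first_day, last_day))
        # Move on to the next calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return slices


slices = month_slices(date(2018, 11, 20), date(2019, 1, 5))
```

Each tuple supplies the four bind values for QUERY, and the per-month result sets are concatenated on the client.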
If your partition is too large because of the data field, you will need to remove it from documents_by_date, and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add the hour to the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene Cassandra plugin and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it is a document, I am assuming that one user creates one document at a given "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) returns the data for a given user ordered by created, newest first.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we have to create a separate table with the date as primary key, it becomes tricky if many documents are inserted in the very same second. Clustering order works best for requirements where the documents of a given user need to be sorted by time.

How to model for repeated information on many records on cassandra

I have a massively huge table with hundreds of billions of records, and I mean to add a field to this table whose value would be the same for millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
PRIMARY KEY ((key, key2), time)
)
This table has 700.000.000+ records.
I want to create a field in this table, named source. This field indicates where the record was gotten from (since the software has many ways of receiving the information in the readings table). Possible values for this field are "XML: path\to\file.xml", "Direct import from the X database" or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance of the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records on the readings table were gotten from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
which would allow me to execute the first query, but would also mean that I would create 700.000.000+ new records on my database with a lot of information, which would take a lot of unnecessary storage space since tens of millions of these records would have the same value for source.
If this were a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields. That would mean storing only an additional integer for each row of the readings table, plus a new table with as many records as there are different sources.
How does one go about modelling this in cassandra?
Your schema
CREATE TABLE readings_per_source(
source text,
key int,
key2 int,
time timestamp,
PRIMARY KEY (source, key, key2, time)
)
is a very bad idea, because source is the partition key and you can have millions of records sharing the same source, i.e. a very wide partition, which leads to hot spots.
Your second query, "What is the source of a given record?", is quite trivial if you access the data using the record's primary key columns (key, key2). The source column can simply be added as a regular column to the table.
The first query, "Which records on the readings table were gotten from a given source?", is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records?
If that is what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all details) and create an index on the source column:
CREATE TABLE readings (
key int,
key2 int,
time timestamp,
name text,
source text,
PRIMARY KEY ((key, key2), time)
);
CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'PREFIX',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};
Then, to fetch all records having the same source, use the server-side paging feature of the Java driver (or any other DataStax driver).
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
Normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key being used to join the tables together), but it requires joins, which take more horsepower to compute during queries.
There's always a trade-off. There's also a tradeoff concerning state with fully normalized data, one example being the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present show the new address. This isn't always desirable.
Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price/tax stored with the invoice so you can show what the customer paid at the time the invoice was created. That way, when accounting runs monthly, yearly and longer-term numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed. Otherwise you have an accounting nightmare!
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...

How to do Cassandra data modeling for aggregate counts?

Let's say I have customer orders data coming into my service and I would like to do some reporting on this data. All customer orders are saved in a Cassandra table so that I can get all orders for a given customer:
TABLE customer_orders
store_id uuid,
customer_id text,
order_id text,
order_amount int,
order_date timestamp,
PRIMARY KEY (store_id, customer_id)
But I would also like to find all the customers with a given number of orders. Ideally I would like to have this in a ready to query table in Cassandra. For example "get all customers who have 1 order".
Therefore I have a table like this:
TABLE order_count_to_customer
store_id uuid,
order_count int,
customer_id text,
PRIMARY KEY ((store_id, order_count), customer_id)
So the idea is that when an order arrives, both of these tables are updated.
So I create a third table:
TABLE customer_to_orders_count
store_id uuid,
customer_id text,
orders_count counter,
PRIMARY KEY (store_id, customer_id)
When an order arrives:
I save it in the first table
Then update the counter in the third table by incrementing it with 1.
Then I read the counter in the third table and insert a new record in the second table.
When I need to find all the customers with a given number of orders I just query the second table.
The problem with this is that counters are not atomic and consistent. If I update the counter say to 3 there is no guarantee that when I read it next in order to update the second table it would be 3. It could be 2. Even if I read the counter before I do the update of the counter it could be some value from several steps back. So no guarantee either.
Please note that I am aware of the limitations of the counters in Cassandra and I am not asking how to solve the issue with the counters.
I am rather giving this example, in order to ask for some general advice on how to model the data in order to be able to do aggregate counting on it. I can of course use Spark to do aggregate queries directly on the first table in my example. But it seems to me that there could be some more clever way to do this and also Spark would involve bringing the whole table data into memory.
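Have you thought about using the CQL BATCH command? https://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
You can put your write steps in a batch to keep them in one logical atomic transaction, where either all of them succeed or all of them fail. However, this functionality does have a performance penalty. Also note that a batch cannot contain reads, and counter updates cannot be mixed with regular writes in the same batch, so the counter step has to be handled separately.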

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add the 'birthday' to the primary key (lets say 'ID', 'birthday') I cannot perform this query because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This is not an accident: such a feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside the IN clause may cause performance issues on C*.
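One common way to keep the IN list small is to split the known ids into fixed-size chunks and issue one UPDATE per chunk. The sketch below, in Python, shows only the chunking; the helper name chunked and the chunk size of 4 are arbitrary assumptions, not recommendations from the Cassandra documentation.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of ids, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical ids of the users to update; each chunk feeds one IN (...) list.
user_ids = list(range(1, 11))
id_batches = list(chunked(user_ids, 4))
```

Each element of id_batches would then be bound to its own update statement.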
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is really ideal:
Create an index table like:
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound primary key (partition key plus clustering column), and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get a reasonably even distribution of partitions across the cluster. The general idea is that you really shouldn't have a small number of very large partitions. Prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic so as to avoid updating/deleting data at all:
Instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
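With such an append-only table, the current status of a user is simply the row with the newest updated_at. A minimal Python sketch of that rule, assuming the rows of user_statuses have already been read into dictionaries (the function latest_statuses and the sample rows are hypothetical):

```python
from datetime import datetime


def latest_statuses(rows):
    """Map each user id to the status of its most recent row."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # Keep the row with the greatest updated_at per id.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return {user_id: row["status"] for user_id, row in latest.items()}


# Hypothetical rows; nothing is ever updated in place, only appended.
rows = [
    {"id": "u1", "updated_at": datetime(2014, 5, 1), "status": "new"},
    {"id": "u2", "updated_at": datetime(2014, 5, 2), "status": "new"},
    {"id": "u1", "updated_at": datetime(2014, 6, 1), "status": "active"},
]

statuses = latest_statuses(rows)
```

In CQL the same result for a single user can be read directly, since updated_at is a clustering column: select the rows for that id ordered by updated_at descending and limit the result to one row.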
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

Resources