How to do a query with a Cassandra counter table - cassandra

I have a table in Cassandra as shown below:
CREATE TABLE remaining (owner varchar, buddy varchar, remain counter, PRIMARY KEY (owner, buddy));
Generally I do some increment/decrement operations on the remain field, using CQL like below:
update remaining set remain=remain + 1 where owner='userA' and buddy='userB';
update remaining set remain=remain + 1 where owner='userA' and buddy='userC';
....
Now I need to find all buddies for userA whose remain field is greater than 1. When I run:
select buddy,remain from remaining where owner='userA' and remain > 0;
it gives me an error:
No indexed columns present in by-columns clause with Equal operator
How do I do this the Cassandra way?

The short answer to this is that you cannot do queries with conditionals on counter columns in Cassandra.
The reason behind this is that all Cassandra queries need to be modeled around the primary key of that particular table. Counter columns are not allowed as part of the primary key of a table (their changing values would cause constant reorganization of the data on disk). Counter columns are better suited to tracking the state of a known piece of data, for example the number of times a photo has been up-voted. This can be recalled quickly as long as we know which photo we are interested in. To actually sort photos by number of votes you would need to perform an analytics-style query using Spark or Hadoop.
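If the number of buddies per owner is modest, one common workaround (just a sketch, not the only option) is to read the whole partition for the owner and do the comparison on the client, since owner is the partition key and the query below needs no filtering:

select buddy, remain from remaining where owner='userA';
-- keep only the rows with remain > 0 in application code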

Related

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain (
domain_name text,
status int,
last_scanned_date bigint,
PRIMARY KEY (domain_name, last_scanned_date)
);
My requirement is to get all the domains which have not been scanned in the last 24 hours. I wrote the following query, but it is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING:
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT domain_name FROM domain;
2nd query:
Use the IN operator to query domains which have not been scanned in the last 24 hours:
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your table definition. Currently you are using domain_name as your partition key, and you cannot have more than 2 billion records in a single Cassandra partition.
I would suggest using time as part of your partition key, assuming you are not going to receive more than 2 billion requests per day. Try using the day since epoch as the partition key. You can use composite partition keys, but they won't be helpful for your query.
While querying, you have to scan at most two partitions, with an additional filter in the query or in your application that drops results which do not belong to the range you have specified.
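As a rough sketch of that suggestion (the day_since_epoch column and the literal values in the queries are assumptions for illustration, not part of the original schema):

CREATE TABLE domain_by_day (
day_since_epoch int,
last_scanned_date bigint,
domain_name text,
status int,
PRIMARY KEY (day_since_epoch, last_scanned_date, domain_name)
);

-- query yesterday's and today's partitions; 17885/17886 and the millisecond cutoff are placeholders
SELECT domain_name, last_scanned_date FROM domain_by_day
WHERE day_since_epoch = 17885 AND last_scanned_date <= 1545264000000;
SELECT domain_name, last_scanned_date FROM domain_by_day
WHERE day_since_epoch = 17886 AND last_scanned_date <= 1545264000000;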
Go over the following concepts before finalizing your design.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can effectively perform range queries only inside one partition. The same goes for aggregations such as DISTINCT. So in your case you would need a single partition containing all the data, but that is bad design.
You may try to split this big partition into smaller ones by using TLDs as separate partition keys and perform the fetching in parallel from every partition - but this will also lead to imbalance, as some TLDs will have more sites than others.
Another issue with your schema is that you have last_scanned_date as a clustering column, which means that when you update last_scanned_date you are effectively inserting a new row into the database - you'll need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will keep fetching old rows that you have already scanned.
Your problem with the current design could be partially solved by using Spark, which is able to perform an effective scan of the full table via a token range scan plus a range scan for every individual row - this will return only the data in the given time range. Or, if you don't want to use Spark, you can perform the token range scan in your own code, something like this.
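A hedged sketch of what one step of such a token range scan could look like in CQL (the sub-range bounds are placeholders; the client walks the whole token ring in successive sub-ranges and filters on last_scanned_date client-side):

SELECT domain_name, status, last_scanned_date FROM domain
WHERE token(domain_name) > -9223372036854775808
AND token(domain_name) <= -9123372036854775808;
-- repeat with the next sub-range until the full ring is covered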

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
How you choose your partition key depends on your requirements and data model. If your partition key is the only primary key column, it has to be unique, otherwise data will be upserted (overwritten with new data). If you have a wide row (a clustering key), then making your partition key unique (a key that appears only once in the table) defeats the purpose of the wide row. In CQL "wide rows" just means that there can be more than one row per partition; with a unique partition key there will be exactly one row per partition. It would be better if you could provide the schema.
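For illustration only (the table and column names here are made up): a "wide row" table, where many clustering rows share one partition key, versus a table whose partition key is unique per row:

-- wide partition: one user_id groups many rows, ordered by event_time
CREATE TABLE events_by_user (
user_id uuid,
event_time timeuuid,
payload text,
PRIMARY KEY ((user_id), event_time)
);

-- unique partition key: exactly one row per partition
CREATE TABLE events_by_id (
event_id uuid PRIMARY KEY,
payload text
);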
Please follow the link below about pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use the pagingState object, which represents where you are in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem; let me add it here quickly.
First there is a table with two fields - just for illustration we use only a few fields.
1. Say we insert a million rows into this table.
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assume there are a hundred entries, making ten pages of ten each.
For this we add a column called page_no to the table.
Create a secondary index for this column.
Then do a one-time update of this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no set to 10.
Since we can query on a secondary index each page can be queried independently.
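A minimal CQL sketch of the idea (the table and column names are hypothetical, not taken from the linked repo):

CREATE TABLE test_data (
id uuid PRIMARY KEY,
payload text,
page_no int
);
CREATE INDEX test_data_page_idx ON test_data (page_no);

-- after the one-time backfill of page_no, each page is an independent query:
SELECT id, payload FROM test_data WHERE page_no = 10;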
The code is self-explanatory and is here - https://github.com/alexcpn/testgo
Note that cautions about using secondary indexes properly abound - please check them. In this use case I am hoping that I am using the index properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Primary key: query & updates

I have a little problem here with Cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that will query this data based on the status with an "IN" clause. So one scheduler will work with the data that is INITIALIZED, one with PERFORMED, some with both, etc...
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to be able to use the IN clause, the status has to be part of the primary key of my table. But when I update the status... it creates a new record in my table, since the UPSERT doesn't find any existing row with the primary key given...
How do I solve that?
Instead of including the status column in your primary key columns, you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up, you could use equality conditions in your WHERE clause and then merge the results client-side?
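For example (the table and index names below are assumptions, since the question does not show its schema):

CREATE INDEX task_status_idx ON tasks (status);

-- instead of WHERE status IN ('INITIALIZED', 'PERFORMED'), run one query per status
-- and merge the result sets in the application:
SELECT * FROM tasks WHERE status = 'INITIALIZED';
SELECT * FROM tasks WHERE status = 'PERFORMED';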
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
On a frequently updated or deleted column. See "Problems using an index on a frequently updated or deleted column" below.
To look for a row in a large partition unless narrowly queried. See "Problems using an index to look for a row in a large partition unless narrowly queried" below.

get_range_slices and CQL query handling, need for ALLOW FILTERING

I have a following CQL table (a bit simplified for clarity):
CREATE TABLE test_table (
user uuid,
app_id ascii,
domain_id ascii,
props map<ascii,blob>,
PRIMARY KEY ((user), app_id, domain_id)
)
The idea is that this table would contain many users (i.e. rows, say, dozens of millions). For each user there would be a few domains of interest and there would be a few apps per domain. And for each user/domain/app there would be a small set of properties.
I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:
SELECT props FROM test_table WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
I was assuming that this query would be efficient because I specify the range of the row keys and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".
I have limited experience with Cassandra and I assumed that this sort of query would map into get_range_slices() call, which accepts the slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems either I misunderstand how this sort of query is handled or maybe I misunderstand about the efficiency of get_range_slices() call.
To be more specific, my questions are:
- if this data model does make sense for the kind of query I have in mind
- if this query is expected to be efficient
- if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING
My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped from the result.
--- update ----
Thanks for all the comments. I have been doing more research on this and there is still something that I do not fully understand.
In the given structure, what I am trying to get is like a rectangular area of my data set (assuming that all rows have the same columns). The top and bottom of the rectangle are determined by the token range, and the left/right sides are defined by the column range (slice). So this should naturally translate into a get_range_slices request. My understanding (correct me if I am wrong) is that the reason CQL requires me to add the ALLOW FILTERING clause is that there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows whether it will have to skip every second row or the first million rows before finding one that fits my criteria (in the given range), this is what causes the unpredictable latency and possibly even a timeout. Am I right?
I have tried to write a test that does the same kind of query but using the low-level Astyanax API (over the same table; I had to read the data generated with CQL, which turned out to be quite simple) and this test does work - except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course I had to implement some kind of simple paging based on the starting token and a limit to fetch the data in small chunks.
Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:
Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user
For the "column partition no"...I am not sure if it is really needed. I assume that if I go with this model I will have relatively small number of rows (X=1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such a row in one request? It should created some memory pressure for Cassandra, I am sure. So maybe reading them in groups (say, Y=10..100) would be better?
I realize that what I am trying to do is not what Cassandra does well - reading "all" or large subset of CF data in chunks that can be pre-calculated (like token range or partition keys) for parallel fetching from different hosts. But I am trying to find a pattern that is the most efficient for such a use case.
By the way, the query like "select * from ... where TOKEN(user)>X and TOKEN(user)
Short answer
This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of the query, it will work; however, it will scan a lot of data:
SELECT props FROM test_table
WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;
Longer explanation
In your example the primary key consists of two parts: user is used as the partition key, and <app_id, domain_id> forms the remaining part. Rows for different users are distributed across the cluster, with each node responsible for a specific range of the token ring.
Rows on a single node are sorted by the hash of the partition key (token(user) in your example). The different rows for a single user are stored on a single node, sorted by the <app_id, domain_id> tuple.
So, the primary key forms a tree-like structure. The partition key adds one level of hierarchy, and each remaining field of the primary key adds another one. By default, Cassandra processes only queries that return all rows from a continuous range of the tree (or several ranges if you use the key IN (...) construct). If Cassandra would have to filter out some rows, ALLOW FILTERING must be specified.
Example queries that don't require ALLOW FILTERING:
SELECT * FROM test_table
WHERE user = 'user1';
//OK, returns all rows for a single partition key
SELECT * FROM test_table
WHERE TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
//OK, returns all rows for a continuous range of the token ring
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id='myapp1';
//OK, the rows for specific user/app combination
//are stored together, sorted by domain_id field
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz';
//OK, since rows for a single user are sorted by app
Example queries that do require ALLOW FILTERING:
SELECT props FROM test_table
WHERE app_id='myapp1';
//Must scan all the cluster for rows,
//but return only those with specific app_id
SELECT props FROM test_table
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids),
//but return only those having specific domain
SELECT props FROM test_table
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain
What to do?
In Cassandra it's not possible to create a secondary index on part of the primary key. There are a few options, each with its pros and cons:
Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data across the two tables. It will allow you to query the necessary data for a specific app_id or <app_id, domain_id> combination. If you need to query a specific domain and all apps, a third table is necessary. This approach is often called materialized views (see the sketch after this list).
Use some sort of parallel processing (hadoop, spark, etc) to perform necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from a single pair. If the result for other pairs might be cached for later use, it will probably save some time.
Just use ALLOW FILTERING if the query performance is acceptable for your needs. Dozens of millions of partition keys is probably not too much for Cassandra.
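A sketch of the first option, the duplicate query-specific table with primary key ((app_id), domain_id, user):

CREATE TABLE test_table_by_app (
app_id ascii,
domain_id ascii,
user uuid,
props map<ascii,blob>,
PRIMARY KEY ((app_id), domain_id, user)
);

-- the original query then needs neither token ranges nor ALLOW FILTERING:
SELECT props FROM test_table_by_app
WHERE app_id = 'myapp1' AND domain_id = 'mydomain1';

The application is responsible for writing every update to both tables.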
Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, and is therefore not stored in sorted order. Doing this kind of range query would therefore require a full scan.
If you want to do this query, you should store some known value as a sentinel for your row key, such that you can query for equality rather than range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.

Counting rows in a sqlite db

I have a sqlite db on an ARM embedded platform running Linux with somewhat limited resources. Storage device is a microSD card. Sqlite version is 3.7.7.1. The application accessing sqlite is written in C++.
I want to know the number of rows in several tables in regular intervals. I currently use
select count(*) from TABLENAME;
to get this information. I'm having trouble with the performance: When the table sizes reach a certain point (~200K lines), I have a lot of system and iowait load every time I check the table sizes.
When I wrote this, I thought looking up the number of rows in a table would be fast, as it is probably stored somewhere. But now I suspect that SQLite actually scans all rows, and when I pass the point where the data no longer fits into the disk cache I get a lot of I/O load. This would roughly fit, given the database size and available memory.
Can anyone tell me if sqlite behaves in the way I suspect?
Is there any way to get the number of table rows without producing this amount of load?
EDIT: plaes has asked about the table layout:
CREATE TABLE %s (timestamp INTEGER PRIMARY KEY, offset INTEGER, value NUMERIC);
Does this table have an integer index? If not, add one; otherwise SQLite has to scan the whole table to count the items.
This is an excerpt of comments from SQLite code that implements COUNT() parsing and execution:
/* If isSimpleCount() returns a pointer to a Table structure, then
** the SQL statement is of the form:
**
** SELECT count(*) FROM <tbl>
**
** where the Table structure returned represents table <tbl>.
**
** This statement is so common that it is optimized specially. The
** OP_Count instruction is executed either on the intkey table that
** contains the data for table <tbl> or on one of its indexes. It
** is better to execute the op on an index, as indexes are almost
** always spread across less pages than their corresponding tables.
*/
[...]
/* Search for the index that has the least amount of columns. If
** there is such an index, and it has less columns than the table
** does, then we can assume that it consumes less space on disk and
** will therefore be cheaper to scan to determine the query result.
** In this case set iRoot to the root page number of the index b-tree
** and pKeyInfo to the KeyInfo structure required to navigate the
** index.
**
** (2011-04-15) Do not do a full scan of an unordered index.
Also, you can get more information about your query with EXPLAIN QUERY PLAN.
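For example (using a hypothetical table name, since the question's table name is a placeholder):

EXPLAIN QUERY PLAN SELECT count(*) FROM samples;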
From all the information I gathered, count() apparently really needs to scan the table. As plaes has pointed out, this is faster if the count is done on an integer-indexed column, but scanning the index is still needed.
What I do now is store the row count somewhere and increment / decrement it manually in the same transactions I use to do inserts and deletes to keep it consistent.
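A minimal sketch of that bookkeeping (the table names are illustrative; the counter row is updated in the same transaction as the insert or delete, and the same could be done with triggers):

CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INTEGER);
INSERT INTO row_counts VALUES ('samples', (SELECT count(*) FROM samples));

BEGIN;
INSERT INTO samples VALUES (1700000000, 0, 42.0);
UPDATE row_counts SET n = n + 1 WHERE tbl = 'samples';
COMMIT;

-- reading the count is now a single-row lookup instead of a scan:
SELECT n FROM row_counts WHERE tbl = 'samples';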
Here are two possible row-count workarounds (with caveats) that do not cause a table or index scan:
For tables where you can use INTEGER PRIMARY KEY AUTOINCREMENT as the primary key, you can grab the count from the sqlite_sequence meta-table:
select name,seq from sqlite_sequence
seq will contain the last id assigned (not the next one).
"select max(pkid) from table", which will probably do an index search instead of a scan (and will also only be accurate for tables with no deletions).
Knowing this, if your use case only ever deletes unique rows from tables you can use AUTOINCREMENT on, you could do a hybrid of the trigger-based solution and only count deleted rows (which would arguably be less bookkeeping than counting the inserts in most scenarios). However, if you insert and delete the same row twice, this also won't work.
