How to model boolean flags in cassandra

I am running into a strange problem using Cassandra 1.2 (DSE 3.1.1). I have a table called JSESSION and here is the structure:
cqlsh> use recommender;
cqlsh:recommender> describe table jsession;
CREATE TABLE jsession (
sessionid text,
accessdate timestamp,
atompaths set<text>,
filename text,
processed boolean,
processedtime timestamp,
userid text,
usertag bigint,
PRIMARY KEY (sessionid, accessdate)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
CREATE INDEX processed_index ON jsession (processed);
You can see that the table is indexed on the field 'processed' which is boolean. When I started coding on this table, the following query used to work fine:
cqlsh:recommender> select * from jsession where processed = false limit 100;
But now that the table holds more than 100,000 rows (not a large number at all), the query has suddenly stopped working, and I haven't been able to find a workaround yet.
cqlsh:recommender> select count(*) from jsession limit 1000000;
count
--------
142320
cqlsh:recommender> select * from jsession where processed = false limit 100;
Request did not complete within rpc_timeout.
I tried several options, including increasing rpc_timeout to 60 seconds and starting Cassandra with more memory (it is 8 GB now), but I still have the same problem. Do you have any solution for this?
The deeper question is what is the right way to model a boolean field in CQL3 so that I can search for that field and update it as well. I need to set the field 'processed' to true after I have processed that session.

You don't have a boolean modeling problem. You just need to paginate the results.
select * from jsession where processed = false and token(sessionid) > token('ABC') limit 1000;
Here 'ABC' is the last session id you read (or '' for the first query). Keep feeding the last session id back into this query until you've read everything.
See also http://www.datastax.com/documentation/cql/3.1/webhelp/index.html#cql/cql_reference/../cql_using/paging_c.html
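The paging loop described above can be sketched in Python. This is an in-memory simulation under stated assumptions, not driver code: `token` uses CRC32 as a hypothetical stand-in for the cluster's partitioner hash, `fetch_page` imitates the paged SELECT, and the sample data assumes one row per session. With several rows per session, a LIMIT that cuts a partition in half would need extra handling that this sketch leaves out.

```python
import zlib


def token(key):
    # Hypothetical stand-in for the partitioner hash (a real cluster would
    # typically use Murmur3); it only needs to give a stable ordering here.
    return zlib.crc32(key.encode("utf-8"))


def fetch_page(rows, last_key, limit):
    # Imitates: SELECT * FROM jsession WHERE processed = false
    #           AND token(sessionid) > token(last_key) LIMIT limit
    floor = -1 if last_key is None else token(last_key)
    matching = [
        r for r in rows
        if not r["processed"] and token(r["sessionid"]) > floor
    ]
    matching.sort(key=lambda r: token(r["sessionid"]))
    return matching[:limit]


def read_all_unprocessed(rows, page_size):
    # Feed the last session id of each page back in until a page comes back empty.
    last_key = None
    result = []
    while True:
        page = fetch_page(rows, last_key, page_size)
        if not page:
            break
        result.extend(page)
        last_key = page[-1]["sessionid"]
    return result
```

Each call to `fetch_page` touches at most `page_size` rows, which is the point of the approach: no single request has to return the whole set of unprocessed sessions.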

Related

Cassandra CQL alternative to OR in WHERE clause

Here's the code I used to create the table:
CREATE TABLE test.packages (
packageuuid timeuuid,
ruserid text,
suserid text,
timestamp int,
PRIMARY KEY (ruserid, suserid, packageuuid, timestamp)
);
and then I create a materialized view:
CREATE MATERIALIZED VIEW test.packages_by_userid
AS SELECT * FROM test.packages
WHERE ruserid IS NOT NULL
AND suserid IS NOT NULL
AND TIMESTAMP IS NOT NULL
AND packageuuid IS NOT NULL
PRIMARY KEY (ruserid, suserid, timestamp, packageuuid)
WITH CLUSTERING ORDER BY (packageuuid DESC);
I want to be able to search for packages sent between two IDs
so I would need something like this:
SELECT * FROM test.packages_by_userid WHERE (ruserid = '1' AND suserid = '2' AND suserid = '1' AND ruserid = '2') AND timestamp > 1496601553;
How would I accomplish something like this with CQL?
I've searched a bit but I can't figure it out.
I'm willing to change the structure of the table if it will make something like this possible.
If it's doable without a materialized view that would also be good.

Use an IN clause:
SELECT * FROM test.packages_by_userid WHERE ruserid IN ( '1', '2') AND suserid IN ( '1','2') AND timestamp > 1496601553;
Note: keep the IN clause small. A large IN clause on the partition key can cause GC pauses and heap pressure, which leads to slower overall performance.
In practical terms, you are waiting on a single coordinator node to give you a response. It keeps all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
If the IN clause spans many partitions, run a separate query for each partition (ruserid) with executeAsync instead:
SELECT * FROM test.packages_by_userid WHERE ruserid = '1' AND suserid IN ( '1','2') AND timestamp > 1496601553;
SELECT * FROM test.packages_by_userid WHERE ruserid = '2' AND suserid IN ( '1','2') AND timestamp > 1496601553;
Learn More : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
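The "one query per partition, run concurrently" idea can be sketched in Python. This is a minimal illustration that assumes an in-memory list in place of the cluster: `query_partition` is a hypothetical stand-in for a single-partition SELECT, and a thread pool plays the role that executeAsync would play with a real driver.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partition(table, ruserid, suserids, min_ts):
    # Stand-in for: SELECT * FROM test.packages_by_userid
    #   WHERE ruserid = ? AND suserid IN (...) AND timestamp > ?
    return [
        row for row in table
        if row["ruserid"] == ruserid
        and row["suserid"] in suserids
        and row["timestamp"] > min_ts
    ]


def query_partitions_concurrently(table, ruserids, suserids, min_ts):
    # One independent request per partition key; a failure in one request
    # can be retried alone instead of repeating the whole multi-partition IN.
    with ThreadPoolExecutor(max_workers=max(1, len(ruserids))) as pool:
        futures = [
            pool.submit(query_partition, table, ruserid, suserids, min_ts)
            for ruserid in ruserids
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```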

Since you always search for both sender and receiver, I'd model this with the following table layout:
CREATE TABLE test.packages (
ruserid text,
suserid text,
timestamp int,
packageuuid timeuuid,
PRIMARY KEY ((ruserid, suserid), timestamp)
);
In this way, for each pair of sender/receiver you need to run two queries, one for each partition:
SELECT * FROM packages WHERE ruserid='1' AND suserid='2' AND timestamp > 1496601553;
SELECT * FROM packages WHERE ruserid='2' AND suserid='1' AND timestamp > 1496601553;
This is IMHO the best solution because, remember, in Cassandra you start from your queries and build your table models on that, never the reverse.
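As a rough illustration of this layout, the Python sketch below models the table as a dictionary keyed by the composite partition key (ruserid, suserid). The helper names `build_table` and `packages_between` are hypothetical; the sketch only shows that a conversation between two users is two partition lookups merged on the client.

```python
from collections import defaultdict


def build_table(rows):
    # Group rows by the composite partition key (ruserid, suserid) and keep
    # each partition ordered by the clustering column (timestamp).
    table = defaultdict(list)
    for row in rows:
        table[(row["ruserid"], row["suserid"])].append(row)
    for partition in table.values():
        partition.sort(key=lambda r: r["timestamp"])
    return table


def packages_between(table, user_a, user_b, min_ts):
    # One lookup per direction; the set avoids reading the same partition
    # twice when both ids are equal.
    merged = []
    for key in {(user_a, user_b), (user_b, user_a)}:
        merged.extend(r for r in table.get(key, []) if r["timestamp"] > min_ts)
    return sorted(merged, key=lambda r: r["timestamp"])
```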

Cassandra CQL range query rejected despite equality operator and secondary index

From the table schema below, I am trying to select all pH readings that are below 5.
I have followed these three pieces of advice:
Use ALLOW FILTERING
Include an equality comparison
Create a secondary index on the reading_value column.
Here is my query:
select * from todmorden_numeric where sensor_name = 'pHradio' and reading_value < 5 allow filtering;
Which is rejected with this message:
Bad Request: No indexed columns present in by-columns clause with Equal operator
I tried adding a secondary index to the sensor_name column and was told that it was already part of the key and therefore already indexed.
I created the index after the table had been in use for a while - could that be the problem? I ran "nodetool refresh" in the hope it would make the index available but this did not work. Here is the output of describe table todmorden_numeric :
CREATE TABLE todmorden_numeric (
sensor_name text,
reading_time timestamp,
reading_value float,
PRIMARY KEY ((sensor_name), reading_time)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='Data that suits being stored as floats' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX todmorden_numeric_reading_value_idx ON todmorden_numeric (reading_value);

Cassandra allows range search only on:
a) The partition key, and only if ByteOrderedPartitioner is used (the default is now Murmur3Partitioner).
b) any single clustering key ONLY IF any clustering keys defined BEFORE the target column in the primary key definition are already specified by an = operator in the predicate.
They don't work on secondary indices.
Consider the following table definition:
CREATE TABLE tod1 (name text, time timestamp,
val float, PRIMARY KEY (name, time));
You CAN'T do a range on the val in this case.
Consider this one:
CREATE TABLE tod2 (name text, time timestamp,
val float, PRIMARY KEY (name, time, val));
Then the following is valid:
SELECT * FROM tod2 WHERE name='X' AND time='timehere' AND val < 5;
Kinda pointless, but this is not valid:
SELECT * from tod2 WHERE name='X' AND val < 5;
It's not valid as you haven't filtered by a previous clustering key in the primary key def (in this case, time).
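The clustering-key rule from this answer can be written down as a small Python check. This is a simplified model of rule (b) only, with a hypothetical function name; it ignores ALLOW FILTERING, IN restrictions, and secondary indexes.

```python
def range_query_allowed(clustering_columns, restrictions):
    """Check a WHERE clause against the clustering-key rule.

    clustering_columns: column names in primary-key order (partition key excluded).
    restrictions: dict mapping a column name to its operator, e.g. {"val": "<"}.
    """
    closed = False  # True once a column was skipped or given a range restriction
    for column in clustering_columns:
        op = restrictions.get(column)
        if op is None:
            closed = True      # nothing after a skipped column may be restricted
        elif closed:
            return False       # restricted after a gap or after a range
        elif op != "=":
            closed = True      # a range must be the last restricted column
    return True
```

With the tod2 layout, `range_query_allowed(["time", "val"], {"time": "=", "val": "<"})` passes the check, while `range_query_allowed(["time", "val"], {"val": "<"})` fails it because `time` is skipped.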
For your query, you may want to do this:
CREATE TABLE tod3 (name text, time timestamp,
val float, PRIMARY KEY (name, val, time));
Note the order of columns in the primary key: val's before time.
This will allow you to do:
SELECT * from tod3 WHERE name='asd' AND val < 5;
On a different note, how long do you intend to hold data? How frequently do you get readings? This can cause your partition to grow quite large quite quickly. You may want to bucket the readings into multiple partitions (manual sharding). Perhaps one partition per day? Of course, such things would greatly depend on your access patterns.
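A minimal Python sketch of the per-day bucketing idea follows. It assumes the partition key would become (sensor name, day); the helper names and the day format are illustrative choices, not part of the original schema.

```python
from datetime import datetime, timedelta


def partition_key(sensor_name, reading_time):
    # One partition per sensor per day, e.g. ("pHradio", "2014-03-01").
    return (sensor_name, reading_time.strftime("%Y-%m-%d"))


def partitions_for_range(sensor_name, start, end):
    # A time-range read has to visit every day bucket between start and end.
    day = start.date()
    keys = []
    while day <= end.date():
        keys.append((sensor_name, day.isoformat()))
        day += timedelta(days=1)
    return keys
```

The trade-off is that a read spanning many days turns into one query per bucket, so the bucket size should follow the most common query window.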
Hope that helps.

Cassandra RPC TimeOut on Secondary Index

We are getting rpc_timeout when running query on a secondary index in cassandra.
The Secondary Index column holds only 2 values, either 'true' or 'false'.
The query has pagination built in to limit the number of records returned.
Here is the Query
Select id_firm, id_uuid from efstatus where isFinal='true' and TOKEN(id_firm) >= TOKEN(99625490-29b4-4474-a731-9b7664f642f8) LIMIT 25;
This is the Table Structure
CREATE TABLE efstatus (
id_firm uuid,
id_uuid uuid,
isfinal text,
json_data text,
type text,
year text,
PRIMARY KEY (id_firm, id_uuid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
CREATE INDEX efstatus_isfinal ON efstatus (isfinal);
CREATE INDEX efstatus_year ON efstatus (year);
Running Trace ON gives no relevant information. This is what I see
Request did not complete within rpc_timeout.
unsupported operand type(s) for /: 'NoneType' and 'float'
We are using DataStax Enterprise 3.1.4, which I believe ships Cassandra 1.2.10.1.
Any help would be appreciated.

cassandra cql query with in condition in rowkey and in clustering columns

I am a newbie to Cassandra, and I have a table with a composite primary key. The description of the table is:
CREATE TABLE testtable (
foid bigint,
id bigint,
severity int,
category int,
ack boolean,
PRIMARY KEY (foid, id, severity, category)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
My requirement is to query the table with an IN condition on foid, a range condition on id, and an IN condition on severity.
So I tried the following query:
select * from testtable where foid in (5,6) and id>10 and severity in (5);
i got the error message as
select * from testtable where foid in (5,6) and id>10 and severity in (5);
Even an equality condition on the severity column would suffice for me, but that doesn't work either.
Is there any way to accomplish this?
I also tried secondary indexes on severity and category, but that did not help.

You need to successively restrict the primary keys, so the following will work:
select * from testtable where foid in (1) and id=2 and severity<20 ;
but this won't:
select * from testtable where foid in (1) and id>10 and severity=3;
What about making the query less restrictive (as you suggested in your question) as follows
select * from testtable where foid in (5,6) and id>10
And sorting through the results at the client side?
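Client-side filtering of the less restrictive result set could look like the Python sketch below. It assumes the rows arrive as dictionaries, and `filter_by_severity` is a hypothetical helper.

```python
def filter_by_severity(rows, severities):
    # Apply the "severity IN (...)" condition on the client, after the
    # server has already applied the foid and id restrictions.
    wanted = set(severities)
    return [row for row in rows if row["severity"] in wanted]
```

This only pays off when the server-side restrictions already cut the result set down to a size the client can comfortably scan.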
An alternative (and probably more attractive) solution would be to order your keys according to how you are going to perform the query, e.g.,
CREATE TABLE testtable2 (
foid bigint,
severity int,
id bigint,
category int,
ack boolean,
PRIMARY KEY (foid, severity, id, category)
)
allowing you to make queries like this (note the equality operation on severity, an IN operation on severity won't work):
select * from testtable2 where foid in (5,6) and severity=5 and id>10;
(tested with cql [cqlsh 4.0.1 | Cassandra 2.0.1 | CQL spec 3.1.1 | Thrift protocol 19.37.0])

Cassandra/Hector: Add a counter on a composite primary key

I've created a table in CQL3 console (no single primary key constituent is unique, together they will be):
CREATE TABLE aggregate_logs (
bpid varchar,
jid int,
month int,
year int,
value counter,
PRIMARY KEY (bpid, jid, month, year));
and I have been able to update and query it using:
UPDATE aggregate_logs SET value = value + 1 WHERE bpid='1' and jid=1 and month=1 and year=2000;
This works as expected. I wanted to do the same update in Hector (in Scala):
val aggregateMutator:Mutator[Composite] = HFactory.createMutator(keyspace, compositeSerializer)
val compKey = new Composite()
compKey.addComponent(bpid, stringSerializer)
compKey.addComponent(new Integer(jid), intSerializer)
compKey.addComponent(new Integer(month), intSerializer)
compKey.addComponent(new Integer(year), intSerializer)
aggregateMutator.incrementCounter(compKey, LogsAggregateFamily, "value", 1)
but I get an error with the message:
...HInvalidRequestException: InvalidRequestException(why:String didn't validate.)
Running the query direct from hector with:
val query = new me.prettyprint.cassandra.model.CqlQuery(keyspace, compositeSerializer, stringSerializer, new IntegerSerializer())
query.setQuery("UPDATE aggregate_logs SET value = value + 1 WHERE 'bpid'=1 and jid=1 and month=1 and year=2000")
query.execute()
which gives me the error:
InvalidRequestException(why:line 1:59 mismatched input 'and' expecting EOF)
I've not seen any other examples which use a counter under a composite primary key. Is it even possible?

It's definitely possible using CQL directly (both via cqlsh and C++, at least):
cqlsh:goh_master> describe table daily_caps;
CREATE TABLE daily_caps (
caps_type ascii,
id ascii,
value counter,
PRIMARY KEY (caps_type, id)
) WITH COMPACT STORAGE AND
comment='' AND
caching='KEYS_ONLY' AND
read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
replicate_on_write='true' AND
compaction_strategy_class='SizeTieredCompactionStrategy' AND
compression_parameters:sstable_compression='SnappyCompressor';
cqlsh:goh_master> update daily_caps set value=value +1 where caps_type='xp' and id ='myid';
cqlsh:goh_master> select * from daily_caps;
caps_type | id | value
-----------+------+-------
xp | myid | 1

CQL3 and the Thrift API are not compatible, so a column family created with CQL3 cannot be accessed with Hector or another Thrift-based client. For more information see:
https://issues.apache.org/jira/browse/CASSANDRA-4377
