Cassandra CQL selecting rows with different values on two columns - cassandra

I create a table:
CREATE TABLE T (
I int PRIMARY KEY,
A text,
B text
);
Then I add two columns X and Y using:
ALTER TABLE T ADD X int;
CREATE INDEX ON T (X);
ALTER TABLE T ADD Y int;
CREATE INDEX ON T (Y);
I put in some data and now I would like to count rows which have different values in X and Y (even X < Y would be fine). I tried something like this:
select COUNT(*) from T where X < Y ;
But I'm getting the error: no viable alternative at input ';'
This also doesn't work without COUNT - with just a simple *.
Do you have any suggestions on how to overcome this error?
I tried using counters instead of integers, but they forced me to put all non-counter data into the primary key, which wasn't a good idea in my case...
I'm using Cassandra 1.2.6 and CQL 3.
PS: can I perform an UPDATE on all rows, without a WHERE clause or with some dummy one?

Since Cassandra favors simple reads, the Cassandra way to do this is to maintain a boolean flag column that you set on every update/insert. With a (secondary) index on that flag, the reads are fast as well.
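A minimal sketch of that idea, assuming a flag column named x_lt_y (the name is mine) and that the client computes the flag at write time, since CQL cannot compare two columns:
ALTER TABLE T ADD x_lt_y boolean;
CREATE INDEX ON T (x_lt_y);
-- the client sets the flag on every write, e.g. for a row with X=1, Y=2:
UPDATE T SET X = 1, Y = 2, x_lt_y = true WHERE I = 42;
-- counting then becomes a simple indexed lookup:
SELECT COUNT(*) FROM T WHERE x_lt_y = true;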

Related

How do I select all rows from two clustering columns in cassandra database

I have a partition key: A
Clustering columns: B, C
I understand I can query like this:
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
Now I have a scenario where I need to fetch results using only B and C. Is there a way to do this without using ALLOW FILTERING?
You cannot fetch on the basis of 'B' and 'C' (the clustering columns) without the partition key unless you use ALLOW FILTERING. You can, however, use Spark and the spark-cassandra-connector to filter the results on the basis of 'B' and 'C'. Behind the scenes it also uses ALLOW FILTERING, but it has an efficient mechanism for scanning the table the right way.
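For completeness, this is what the direct (but potentially expensive, full-scan) CQL version looks like, using the table and column names from the question:
SELECT * FROM table WHERE B = ? AND C = ? ALLOW FILTERING;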

Using a large lookup table

Problem statement:
I have two tables: Data (40 cols) and LookUp (2 cols). I need to match col10 in the Data table against the lookup table to extract the relevant value.
However, I cannot make an equi join. I need a join based on like/contains, as values in the lookup table contain only partial content of the value in the Data table, not the complete value. Hence some regex-based matching is required.
Data size:
Data table: approx. 2.3 billion entries (1 TB of data)
Lookup table: approx. 1.4 million entries (50 MB of data)
Approach 1: Using the database (I am using Google BigQuery). A join based on LIKE takes close to 3 hrs, yet it returns no result. I believe a regex-based join leads to a Cartesian join.
Approach 2: Using Apache Beam/Spark. I tried to construct a trie for the lookup table, which is then shared/broadcast to worker nodes. However, with this approach I am getting OOM as I am creating too many Strings. I tried increasing memory to 4 GB+ per worker node, but to no avail.
I am using the trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis, etc.
Please suggest how I can go about handling this problem.
This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.
"However I cannot make equi join"
Below is just an idea to explore for addressing your equi-join issue in pure BigQuery.
It is based on an assumption I derived from your comments, and it covers the use case where you are looking for the longest match from the very right toward the left; matches in the middle do not qualify.
The approach is to reverse both the url (col10) and shortened_url (col2) fields, then SPLIT() them and UNNEST() while preserving positions:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url, keeping only those groups HAVING a count of matches equal to the count of parts in shortened_url, and finally you GROUP BY url, keeping only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL
#standardSQL
WITH data_table AS (
  SELECT 'cn456.abcd.tech.com' url UNION ALL
  SELECT 'cn457.abc.tech.com' UNION ALL
  SELECT 'cn458.ab.com'
), lookup_table AS (
  SELECT 'tech.com' shortened_url, 1 val UNION ALL
  SELECT 'abcd.tech.com', 2
), data_table_parts AS (
  SELECT url, x, y
  FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
  SELECT shortened_url, a, b, val,
    ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
  FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
  ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
  FROM data_table_parts d
  JOIN lookup_table_parts l
    ON x = a AND y = b
  GROUP BY url, shortened_url
  HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with the result:
Row  url                  shortened_url  val
1    cn457.abc.tech.com   tech.com       1
2    cn456.abcd.tech.com  abcd.tech.com  2

Cassandra schema - select by frequently updated column

Given table:
CREATE TABLE T (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a)
);
I'm frequently updating records. With each update, last_modification_time is set to now() and other fields are set as well.
What is the right Cassandra approach to be able to query by last_modification_time range? I need to query like this:
select * from .. where a=Z and last_modification_time < X and last_modification_time > Y;
One way would be to create a materialized view with PRIMARY KEY (a, last_modification_time), but I want to avoid this since materialized views are buggy in 3.x Cassandra versions.
What would be an alternative way of querying by last_modification_time range, given that last_modification_time is frequently updated?
How about having two tables? One could hold the current snapshot, where you keep updating the last_modification_time field, and another could hold the changes over time (something like a history table). You could write to both of them using BATCH statements.
CREATE TABLE t_modifications (
a int,
last_modification_time timestamp,
b int,
PRIMARY KEY (a, last_modification_time)
) WITH CLUSTERING ORDER BY (last_modification_time DESC);
BEGIN BATCH
UPDATE T SET last_modification_time = 123, b = 4 WHERE a = 2;
INSERT INTO t_modifications (a, last_modification_time, b) values (2, 123, 4);
APPLY BATCH;
If you're interested in the latest snapshot within a given modification range, you can select from the t_modifications table with a limit:
SELECT * FROM t_modifications WHERE a = 2 AND last_modification_time < 136 LIMIT 1;
In general, to do range queries like this, the field you want to range on has to be part of the composite key, it has to be the right-most element of the composite key, and all other elements in the composite key have to be specified. In your case, you would modify your PRIMARY KEY to (a, last_modification_time). You can then:
SELECT * from t_modifications
WHERE a = aval
AND last_modification_time > beg
AND last_modification_time < end;
This will get you all records for aval between beg and end.

Cassandra slow SELECT MAX(x) query

I have a dev machine with Cassandra 3.9 and two tables; one has ~400,000 records, the other about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In Cassandra, running aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifying a partition key is bad practice. Instead, you can have another table that stores the max value of the x field for both tables.
However, you have to add some client-side logic to maintain this max value in the other table when you run INSERT or UPDATE statements (a sketch of this follows the table definitions below).
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
So with this structure you can get the max value for a table:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
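As for the client-side maintenance mentioned above, here is a minimal sketch; the row key and values are made up for illustration. Since CQL has no conditional greater-than update, the client first reads the current max and only overwrites it when the new value is larger:
SELECT max_value FROM agg_table WHERE table_name = 't1';
-- if the new x (say 99) is larger than the stored max_value,
-- write the data and the new aggregate together in a logged batch:
BEGIN BATCH
INSERT INTO t1 (pk, x) VALUES ('row42', 99);
UPDATE agg_table SET max_value = 99 WHERE table_name = 't1';
APPLY BATCH;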
Hope this can help you.

cassandra 2.0.9: query for undefined column

Using Cassandra 2.0.9 CQL, how does one query for rows that don't have a particular column defined? For example:
create table testtable ( id int primary key, thing int );
create index on testtable ( thing );
-- can now select rows by thing
insert into testtable ( id, thing ) values ( 100, 100 );
-- row values will be persistent
update testtable using TTL 30 set thing=1 where id=100;
-- wait 30 seconds, and the thing column will go away for the row
select * from testtable;
Ideally I'd like to be able to do something like this:
select * from testtable where NOT DEFINED thing;
or some such, and have the row with id==100 returned. Is there any way to search for rows that do not have a particular column value assigned?
I'm afraid I've been through the Datastax 2.0 manual, as well as the CQLSH help with no luck trying to find an operator or syntax for this. Thanks.
Doesn't appear to be available yet
https://issues.apache.org/jira/browse/CASSANDRA-3783
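A common workaround in the meantime (my suggestion, not something from the ticket) is to write an explicit sentinel value instead of letting the column expire away, so the "not set" state stays queryable through the index:
-- use a sentinel value (here -1) to mean "thing is not set":
update testtable set thing=-1 where id=100;
select * from testtable where thing=-1;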
