Does REPLACE work with columnstore tables in #memsql?
The MemSQL documentation does not mention this, but I am not able to replace a record in a columnstore. Is there any way to implement REPLACE in a columnstore?
Thanks
zebb
REPLACE is only useful with unique keys; otherwise it is equivalent to INSERT, and columnstore tables do not support unique keys.
There isn't a good way to implement it efficiently in the columnstore, since columnstores are generally not intended to perform well for single-row updates. See http://docs.memsql.com/docs/columnstore.
One way you can implement it (not very efficiently) is with multistatement transactions: run a SELECT to see whether a matching row is already present; if so, run an UPDATE, otherwise run an INSERT.
E.g. say we have
create table c(i int, a int, key using clustered columnstore(i));
We can do
memsql> begin;
Query OK, 0 rows affected (0.00 sec)
memsql> select count(*) from c where i = 4;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
memsql> insert into c values (4, 4);
Query OK, 1 row affected (0.00 sec)
memsql> commit;
Query OK, 0 rows affected (0.00 sec)
in the case where there is no match, and
memsql> begin;
Query OK, 0 rows affected (0.00 sec)
memsql> select count(*) from c where i = 4;
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.01 sec)
memsql> update c set a = 4 where i = 4;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
memsql> commit;
Query OK, 0 rows affected (0.00 sec)
in the case where there is a match.
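In application code, the pattern boils down to the following sketch; the branch between INSERT and UPDATE is chosen client-side from the SELECT result (table c is the example from above):
begin;
select count(*) from c where i = 4;
-- if the count is 0:
insert into c values (4, 4);
-- otherwise:
update c set a = 4 where i = 4;
commit;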
Use case: Find maximum counter value in a specific id range
I want to create a table with these columns: time_epoch int, t_counter counter
The frequent query is:
select time_epoch, max(t_counter) where time_epoch >= ... and time_epoch < ...
This is to find the counter in a specific time range. I am planning to make time_epoch the primary key, but I am not able to query the data: it always asks for ALLOW FILTERING. Since that is very costly, we don't want to use it.
How should I design the table and query for this use case?
Let's assume that we can "bucket" (partition) your data by day, on the assumption that a single day's writes won't make the partitions too large. Then we can cluster by time_epoch in DESCending order. With time-based data, storing rows in descending order often makes the most sense, as business requirements usually care more about the most recent data.
Therefore, I'd build a table like this:
CREATE TABLE event_counter (
day bigint,
time_epoch timestamp,
t_counter counter,
PRIMARY KEY(day,time_epoch))
WITH CLUSTERING ORDER BY (time_epoch DESC);
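Note that t_counter is a counter column, so rows are written with UPDATE rather than INSERT: incrementing the counter at a new (day, time_epoch) creates the row. A hypothetical write:
UPDATE event_counter SET t_counter = t_counter + 1
WHERE day=20210219 AND time_epoch='2021-02-19 14:08:05';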
After writing a few rows, the clustering order becomes evident:
> SELECT * FROM event_counter;
day | time_epoch | t_counter
----------+---------------------------------+-----------
20210219 | 2021-02-19 14:09:21.625000+0000 | 1
20210219 | 2021-02-19 14:08:32.913000+0000 | 2
20210219 | 2021-02-19 14:08:28.985000+0000 | 1
20210219 | 2021-02-19 14:08:05.389000+0000 | 1
(4 rows)
Now SELECTing the MAX t_counter in that range should work:
> SELECT day, max(t_counter) AS max
FROM event_counter
WHERE day=20210219
AND time_epoch>='2021-02-19 14:00'
AND time_epoch<'2021-02-19 15:00';
day | max
----------+-----
20210219 | 2
(1 rows)
Unfortunately there is no better way; think about how the data is laid out.
If you know Cassandra's architecture, you know that your data is spread across multiple nodes based on the partition key. The only way to filter on anything other than the partition key is to traverse each node, which is essentially what ALLOW FILTERING does.
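To make that concrete, here is a hypothetical pair of queries against the bucketed event_counter table above; the first is rejected because no partition key is given, and the second runs but must traverse every node:
SELECT day, max(t_counter) FROM event_counter
WHERE time_epoch >= '2021-02-19 14:00' AND time_epoch < '2021-02-19 15:00';
-- rejected: the restricted column is not the partition key
SELECT day, max(t_counter) FROM event_counter
WHERE time_epoch >= '2021-02-19 14:00' AND time_epoch < '2021-02-19 15:00'
ALLOW FILTERING; -- accepted, but scans the whole cluster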
I am trying to find a way to determine if the table is empty in Cassandra DB.
cqlsh> SELECT * from examples.basic ;
key | value
-----+-------
(0 rows)
I am running count(*) to get the number of rows, but I get a warning message, so I wanted to know whether there is a better way to check if the table is empty (zero rows).
cqlsh> SELECT count(*) from examples.basic ;
count
-------
0
(1 rows)
Warnings :
Aggregation query used without partition key
cqlsh>
Aggregations like count can be overkill for what you are trying to accomplish, especially with the star wildcard: if there is any data in your table, the query will need to do a full table scan. This can be quite expensive if you have many records.
One way to get the result you are looking for is the query
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
Empty table:
The resultset will be empty
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
key
-----
(0 rows)
Table with data:
The resultset will have a record
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
key
----------------------------------
uL24bhnsHYRX8wZItWM6xKdS0WLvDsgi
(1 rows)
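If you only need to know whether one specific partition has data, restricting by partition key avoids scanning across the cluster entirely ('known-key' is a hypothetical value):
cqlsh> SELECT key FROM keyspace1.table1 WHERE key = 'known-key' LIMIT 1;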
When I try to execute the query below, I always get a QueryTimeOutException.
Exception is,
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 0 replica responded)
Query is,
SELECT * FROM my_test.my_table WHERE key_1 = 101 ORDER BY key_2 ASC LIMIT 25;
I am using Cassandra version 2.1.0 with 3 nodes, a single DC with a replication factor of 3, cassandra.yaml has all default values, and I have the following keyspace and table schema:
CREATE KEYSPACE my_test
WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 3
};
CREATE TABLE my_test.my_table (
key_1 bigint,
key_2 bigint,
key_3 text,
key_4 text,
key_5 text,
key_6 text,
key_7 text,
key_8 text,
key_9 text,
key_10 text,
key_11 timestamp,
PRIMARY KEY (key_1, key_2)
);
Currently the table has around 39000 records, but initially it had 50000; 11000 records were deleted as part of some business logic.
One solution to avoid such exceptions is to increase the query read timeout, but my schema and query are straightforward, so why should I have to increase it?
Since I have given the partition key (key_1) in my query, the request should be routed straight to the right replicas, and after that I specified the start of the clustering key (key_2) range.
So it should respond within the read timeout of a couple of seconds, but it does not. Yet the query below works fine and returns results in less than a second (the only difference: ASC fails while DESC works):
SELECT * FROM my_test.my_table WHERE key_1 = 101 ORDER BY key_2 DESC LIMIT 25;
Also, per the schema the clustering key's default order is ASC, so retrieving the data in ASC order should be faster than DESC according to the Cassandra documentation. In my case it is the reverse.
Some more clues: the following queries were tried through cqlsh.
This query works and returns results in less than a second:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 > 1 AND key_2 < 132645 LIMIT 1;
But the following query does not work and throws a timeout exception:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 > 1 AND key_2 < 132646 LIMIT 1;
Yet the following queries work and return results in less than a second:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132644;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132645;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132646;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132647;
Strange behaviour; any help would be appreciated.
For each key_1 there will be around 1,000,000 key_2 values.
And this is what happens when you take the 2-billion-cells-per-partition limit and try to use all of it. I know I've answered plenty of posts here by acknowledging that there is a hard limit of 2 billion cells per partition, but your (very) wide row will become ungainly and will probably time out long before reaching it. This is what I believe you are seeing.
The solution here is a technique called "bucketing." Basically, you have to find an additional key to partition your data by, because too many CQL rows are being written to the same data partition; bucketing brings the ratio of partition keys to clustering keys back to a sane level.
The logical way to go about bucketing is with a time element. I see your last column is a timestamp. I don't know how many rows each key_1 gets in a day, but let's say you only get a few thousand every month. In that case, I would create an additional partition key of month_bucket:
CREATE TABLE my_test.my_table (
key_1 bigint,
key_2 bigint,
...
key_11 timestamp,
month_bucket text,
PRIMARY KEY ((key_1, month_bucket), key_2)
);
That would allow you to support a query like this:
SELECT * FROM my_test.my_table
WHERE key_1 = 101 AND month_bucket = '201603'
AND key_2 > 1 AND key_2 < 132646 LIMIT 1;
Again, bucketing on month is just an example. But basically, you need to find an additional column to partition your data on.
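As a sketch of what a write would look like under this scheme (the values are hypothetical, and month_bucket is derived from the row's timestamp in application code):
INSERT INTO my_test.my_table (key_1, month_bucket, key_2, key_11)
VALUES (101, '201603', 132646, '2016-03-15 10:00:00');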
The issue got resolved after restarting all 3 Cassandra servers. I don't know what caused the trouble; since it is a production server, I couldn't get to the exact root cause.
I have created a KEYSPACE and a TABLE with a uuid column as the primary key and an indexed timestamp column. All of this succeeded, as the following output shows:
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:10:30', '111' );
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:20:30', '222' );
cassandra#cqlsh:my_keyspace> select * from my_test;
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
69579f6f-bf88-493b-a1d6-2f89fac25650 | 2015-03-12 09:10:30+0000 | 111
(2 rows)
and now query
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time = '2015-03-12 09:20:30';
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
(1 rows)
and now query with less than:
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time < '2015-03-12 09:20:30';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_time < <value>'"
The first query is successful, so why did this happen? How can I make the second query succeed, since that is exactly what I want?
You can test all this on your own machine. Thanks
CREATE TABLE my_test (
id uuid PRIMARY KEY,
insert_time timestamp,
value text
) ;
CREATE INDEX my_test_insert_time_idx ON my_keyspace.my_test (insert_time);
Cassandra range queries are quite limited. It comes down to performance and data storage mechanics. A range query must:
Hit one partition key (or a few with IN), and include exact matches on all consecutive clustering keys except the last one in the query, on which you can apply the range restriction.
Say your PK is (a, b, c, d), then the following are allowed:
where a=a1 and b < b1
where a=a1 and b=b1 and c < c1
The following is not:
where a=a1 and c < c1
[I won't go into Allow Filtering here...avoid it.]
Secondary indexes must be exact matches. You can't have range queries on them.
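A minimal runnable sketch of these rules, using a hypothetical table:
CREATE TABLE ks.t (a int, b int, c int, d int, v text, PRIMARY KEY (a, b, c, d));
SELECT * FROM ks.t WHERE a=1 AND b<5; -- allowed: range on the first unbound clustering key
SELECT * FROM ks.t WHERE a=1 AND b=1 AND c<5; -- allowed
SELECT * FROM ks.t WHERE a=1 AND c<5; -- rejected: b is skipped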
CREATE TABLE test (
ck INT,
pk INT,
PRIMARY KEY (ck, pk)
);
// Write 9,999 CQL rows into the single partition ck = 1
for (int i = 1; i < 10000; i++) {
    sessionRW.execute(QueryBuilder.insertInto("test").value("ck", 1).value("pk", i));
}
root#cqlsh:ks> select * from test limit 5;
ck | pk
----+----
1 | 1
1 | 2
1 | 3
1 | 4
1 | 5
(5 rows)
root#cqlsh:ks> delete from test where ck = 1;
root#cqlsh:ks> insert into test(ck,pk) values (1, 0); -- new minimal value
root#cqlsh:ks> select * from test limit 1;
ck | pk
----+-------
1 | 0
(1 rows)
WARN 11:37:39 Read 1 live and 9999 tombstoned cells in ks.test (see tombstone_warn_threshold). 1 columns was reque
Why do I get the tombstone warning when I do a SELECT with LIMIT 1?
The rows are ordered by pk ASC, and the lowest pk value in this table (0) is the first row and is not deleted.
I don't understand why Cassandra keeps scanning my table for other results (and hence fetching a lot of tombstones) when the first row matches and I specified that I only want one row.
I could have understood the warning if I hadn't specified LIMIT. But what is the point of scanning the whole table when the first row matches with LIMIT 1?
Because of the way Cassandra stores data. The data is stored as a single wide row of columns, even though it looks like multiple rows through CQL. Therefore, in order to get to the newly inserted "row", it needs to read all the tombstoned columns as well.
Below is an illustration
     | 1 | 2 | 3 |...|9999| 0 |
-----+---+---+---+---+----+---+
ck=1 | T | T | T | T | T  |   |
As you can see, it is one wide row under the partition key ck=1. I marked tombstoned columns (or rows, if you prefer) with "T". Cassandra reads the entire row, and then, in order to find the first non-tombstoned column, it has to cycle through all 9999 tombstoned ones.
When you do "select * from test limit 1;", Cassandra has to go to all the nodes and filter the entire table to find the first live row. It needs to stream the tombstones to the coordinator, since other nodes may be out of sync and LIMIT 1 might otherwise match a row that had been deleted. You should be able to avoid this by writing the query so that the tombstones don't matter, such as "select * from test where ck=1 and pk < 1;"
OK, so I think I found the answer: Cassandra is doing one more lookup past LIMIT 1 (as if you did LIMIT 2).
Just insert one more row:
insert into test(ck,pk) values (1, 1);
and now "select * from test limit 1;" won't trigger a tombstone warning.
However, if you do LIMIT 2, it will trigger the tombstone warning even though we have 2 valid rows at the head of the table order, as shown below.
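Putting that together (same table as above; result sets elided):
root#cqlsh:ks> insert into test(ck,pk) values (1, 1); -- a second live row at the head
root#cqlsh:ks> select * from test limit 1; -- no tombstone warning
root#cqlsh:ks> select * from test limit 2; -- tombstone warning returns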
Why Cassandra does this (limit+1) lookup is the question.